
    CARE: Certifiably Robust Learning with Reasoning via Variational Inference. (arXiv:2209.05055v2 [cs.LG] UPDATED)
    Despite great recent advances achieved by deep neural networks (DNNs), they are often vulnerable to adversarial attacks. Intensive research efforts have been made to improve the robustness of DNNs; however, most empirical defenses can be adaptively attacked again, and the theoretically certified robustness is limited, especially on large-scale datasets. One potential root cause of such vulnerabilities is that although DNNs have demonstrated powerful expressiveness, they lack the reasoning ability to make robust and reliable predictions. In this paper, we aim to integrate domain knowledge to enable robust learning with the reasoning paradigm. In particular, we propose a certifiably robust learning with reasoning pipeline (CARE), which consists of a learning component and a reasoning component. Concretely, we use a set of standard DNNs as the learning component to make semantic predictions, and we leverage probabilistic graphical models, such as Markov logic networks (MLNs), as the reasoning component to enable knowledge/logic reasoning. However, exact inference in MLNs is #P-complete, which limits the scalability of the pipeline. To this end, we approximate MLN inference via variational inference based on an efficient expectation-maximization algorithm. In particular, we leverage graph convolutional networks (GCNs) to encode the posterior distribution during variational inference and iteratively update the parameters of the GCNs (E-step) and the weights of the knowledge rules in the MLN (M-step). We conduct extensive experiments on different datasets and show that CARE achieves significantly higher certified robustness than state-of-the-art baselines. We additionally conduct ablation studies to demonstrate the empirical robustness of CARE and the effectiveness of integrating different types of knowledge.
    Reconnoitering the class distinguishing abilities of the features, to know them better. (arXiv:2211.12771v2 [cs.LG] UPDATED)
    The relevance of machine learning (ML) in our daily lives is closely intertwined with its explainability. Explainability allows end-users to have a transparent and humane reckoning of an ML scheme's capability and utility, and fosters the user's confidence in the automated decisions of a system. Explaining the variables or features behind a model's decision is a pressing need. We could not find any prior work that explains features on the basis of their class-distinguishing abilities, especially since real-world data are mostly multi-class in nature. In any given dataset, a feature is not equally good at making distinctions between the different possible categorizations (or classes) of the data points. In this work, we explain the features on the basis of their class- or category-distinguishing capabilities. In particular, we estimate the class-distinguishing capabilities (scores) of the variables for pair-wise class combinations. We validate the explainability given by our scheme empirically on several real-world, multi-class datasets. We further utilize the class-distinguishing scores in a latent-feature context and propose a novel decision-making protocol. Another novelty of this work lies in a "refuse to render decision" option when the latent variable (of the test point) has a high class-distinguishing potential for the likely classes.
    Causal Graph Recovery for Sepsis-Associated Derangements via Interpretable Hawkes Networks. (arXiv:2106.02600v2 [cs.LG] UPDATED)
    Continuous, automated surveillance systems that incorporate machine learning models are becoming increasingly common in healthcare environments. These models can capture temporally dependent changes across multiple patient variables and can enhance a clinician's situational awareness by providing an early warning alarm of an impending adverse event such as sepsis. However, most commonly used methods, e.g., XGBoost, fail to provide an interpretable mechanism for understanding why a model produced a sepsis alarm at a given time. The "black box" nature of many models is a severe limitation, as it prevents clinicians from independently corroborating the physiologic features that have contributed to the sepsis alarm. To overcome this limitation, we propose a generalized linear model (GLM) approach to fit a Granger causal graph based on the physiology of several major sepsis-associated derangements (SADs). We adopt a recently developed stochastic monotone variational inequality (VI)-based estimator coupled with forward feature selection to learn the graph structure from continuous and discrete-valued as well as regularly and irregularly sampled time series. Theoretically, we develop a non-asymptotic upper bound on the estimation error for any monotone link function in the GLM. Using synthetic and real-data examples, we demonstrate that the proposed method enjoys result interpretability while achieving performance comparable to popular methods such as XGBoost.
    Dataset Distillation for Medical Dataset Sharing. (arXiv:2209.14603v4 [cs.CR] UPDATED)
    Sharing medical datasets between hospitals is challenging because of privacy concerns and the massive cost of transmitting and storing many high-resolution medical images. However, dataset distillation can synthesize a small dataset such that models trained on it achieve performance comparable to training on the original large dataset, which shows potential for solving these medical dataset sharing problems. Hence, this paper proposes a novel dataset distillation-based method for medical dataset sharing. Experimental results on a COVID-19 chest X-ray image dataset show that our method can achieve high detection performance even using scarce anonymized chest X-ray images.
    SAGDA: Achieving $\mathcal{O}(\epsilon^{-2})$ Communication Complexity in Federated Min-Max Learning. (arXiv:2210.00611v2 [cs.LG] UPDATED)
    To lower the communication complexity of federated min-max learning, a natural approach is to utilize the idea of infrequent communications (through multiple local updates), as in conventional federated learning. However, due to the more complicated inner-outer problem structure in federated min-max learning, theoretical understanding of communication complexity for federated min-max learning with infrequent communications remains very limited in the literature. This is particularly true for settings with non-i.i.d. datasets and partial client participation. To address this challenge, in this paper, we propose a new algorithmic framework called stochastic sampling averaging gradient descent ascent (SAGDA), which i) assembles stochastic gradient estimators from randomly sampled clients as control variates and ii) leverages two learning rates on both server and client sides. We show that SAGDA achieves a linear speedup in terms of both the number of clients and local update steps, which yields an $\mathcal{O}(\epsilon^{-2})$ communication complexity that is orders of magnitude lower than the state of the art. Interestingly, by noting that the standard federated stochastic gradient descent ascent (FSGDA) is in fact a control-variate-free special version of SAGDA, we immediately arrive at an $\mathcal{O}(\epsilon^{-2})$ communication complexity result for FSGDA. Therefore, through the lens of SAGDA, we also advance the current understanding of the communication complexity of the standard FSGDA method for federated min-max learning.
    Taming Fat-Tailed ("Heavier-Tailed" with Potentially Infinite Variance) Noise in Federated Learning. (arXiv:2210.00690v2 [cs.LG] UPDATED)
    A key assumption in most existing works on FL algorithms' convergence analysis is that the noise in stochastic first-order information has a finite variance. Although this assumption covers all light-tailed (i.e., sub-exponential) and some heavy-tailed noise distributions (e.g., log-normal, Weibull, and some Pareto distributions), it fails for many fat-tailed noise distributions (i.e., "heavier-tailed" with potentially infinite variance) that have been empirically observed in the FL literature. To date, it remains unclear whether one can design convergent algorithms for FL systems that experience fat-tailed noise. This motivates us to fill this gap by proposing an algorithmic framework called FAT-Clipping (federated averaging with two-sided learning rates and clipping), which contains two variants: FAT-Clipping per-round (FAT-Clipping-PR) and FAT-Clipping per-iteration (FAT-Clipping-PI). Specifically, for the largest $\alpha \in (1,2]$ such that the fat-tailed noise in FL still has a bounded $\alpha$-moment, we show that the two variants achieve $\mathcal{O}((mT)^{\frac{2-\alpha}{\alpha}})$ and $\mathcal{O}((mT)^{\frac{1-\alpha}{3\alpha-2}})$ convergence rates in the strongly-convex and general non-convex settings, respectively, where $m$ and $T$ are the numbers of clients and communication rounds. Moreover, at the expense of more clipping operations than FAT-Clipping-PR, FAT-Clipping-PI further enjoys a linear speedup effect with respect to the number of local updates at each client and is lower-bound-matching (i.e., order-optimal). Collectively, our results advance the understanding of designing efficient algorithms for FL systems that exhibit fat-tailed first-order oracle information.
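    To make the distinction between the two variants concrete, the sketch below shows where the clipping operator sits in each. It is an illustrative reconstruction from the abstract, not the authors' code: the function and variable names are our own, and the two-sided learning-rate machinery is omitted.

```python
import numpy as np

def clip(v, lam):
    """Standard clipping operator: scale v down to norm lam if it exceeds lam."""
    norm = np.linalg.norm(v)
    return v if norm <= lam else (lam / norm) * v

def fat_clipping_pr_round(model, client_updates, eta, lam):
    """FAT-Clipping-PR (sketch): average the client updates, then clip the
    aggregate once per communication round."""
    return model - eta * clip(np.mean(client_updates, axis=0), lam)

def fat_clipping_pi_round(model, client_grad_seqs, eta, lam):
    """FAT-Clipping-PI (sketch): clip every local gradient before averaging,
    trading extra clipping operations for a linear speedup in local steps."""
    per_client = [np.mean([clip(g, lam) for g in seq], axis=0)
                  for seq in client_grad_seqs]
    return model - eta * np.mean(per_client, axis=0)
```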
    Learning Social Navigation from Demonstrations with Conditional Neural Processes. (arXiv:2210.03582v2 [cs.RO] UPDATED)
    Sociability is essential for modern robots to increase their acceptability in human environments. Traditional techniques use manually engineered utility functions, inspired by observing pedestrian behaviors, to achieve social navigation. However, social aspects of navigation are diverse, changing across different types of environments, societies, and population densities, making hand-crafted techniques unrealistic to use in every domain. This paper presents a data-driven navigation architecture that uses state-of-the-art neural architectures, namely Conditional Neural Processes, to learn global and local controllers of the mobile robot from observations. Additionally, we leverage a state-of-the-art, deep prediction mechanism to detect situations unlike those seen during training, where reactive controllers step in to ensure safe navigation. Our results demonstrate that the proposed framework can successfully carry out navigation tasks in accordance with the social norms present in the data. Further, we show that our system produces fewer personal-zone violations, causing less discomfort.
    Why neural networks find simple solutions: the many regularizers of geometric complexity. (arXiv:2209.13083v2 [cs.LG] UPDATED)
    In many contexts, simpler models are preferable to more complex models, and controlling this model complexity is the goal of many methods in machine learning, such as regularization, hyperparameter tuning, and architecture design. In deep learning, it has been difficult to understand the underlying mechanisms of complexity control, since many traditional measures are not naturally suitable for deep neural networks. Here we develop the notion of geometric complexity, a measure of the variability of the model function computed using a discrete Dirichlet energy. Using a combination of theoretical arguments and empirical results, we show that many common training heuristics, such as parameter norm regularization, spectral norm regularization, flatness regularization, implicit gradient regularization, noise regularization, and the choice of parameter initialization, all act to control geometric complexity, providing a unifying framework in which to characterize the behavior of deep learning models.
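    A plausible reading of the measure is the discrete Dirichlet energy sketched below: the mean squared Frobenius norm of the network's input-output Jacobian over a data batch. This is an illustration under that assumption, not the paper's reference implementation.

```python
import torch

def geometric_complexity(model, x):
    """Discrete Dirichlet energy over a batch: mean squared Frobenius norm of
    the input-output Jacobian of the model function (assumes 2-D outputs)."""
    x = x.clone().requires_grad_(True)
    out = model(x)
    energy = torch.zeros(x.shape[0], device=x.device)
    for k in range(out.shape[1]):  # one backward pass per output unit
        g = torch.autograd.grad(out[:, k].sum(), x,
                                create_graph=True, retain_graph=True)[0]
        energy = energy + g.pow(2).flatten(1).sum(dim=1)
    return energy.mean()
```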
    Activation Learning by Local Competitions. (arXiv:2209.13400v2 [cs.NE] UPDATED)
    Despite its great success, backpropagation has certain limitations that necessitate the investigation of new learning methods. In this study, we present a biologically plausible local learning rule that improves upon Hebb's well-known proposal and discovers unsupervised features through local competitions among neurons. This simple learning rule enables the creation of a forward learning paradigm called activation learning, in which the output activation (the sum of the squared outputs) of the neural network estimates the likelihood of the input patterns, or "learn more, activate more" in simpler terms. For classification on a few small classical datasets, activation learning performs comparably to backpropagation using a fully connected network, and outperforms backpropagation when there are fewer training samples or unpredictable disturbances. Additionally, the same trained network can be used for a variety of tasks, including image generation and completion. Activation learning also achieves state-of-the-art performance on several real-world datasets for anomaly detection. This new learning paradigm, which has the potential to unify supervised, unsupervised, and semi-supervised learning and appears more resistant to adversarial attacks, deserves in-depth investigation.
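    The "learn more, activate more" principle suggests the simple scoring rule sketched below, where the sum of squared outputs serves as an unnormalized likelihood. The label-appending inference scheme is our own hypothetical illustration of how such a score could drive classification, not necessarily the paper's protocol.

```python
import numpy as np

def activation_score(network, x):
    """'Learn more, activate more': the sum of squared unit outputs is read as
    an unnormalized likelihood of the input pattern."""
    return float(np.sum(network(x) ** 2))

def classify(network, x, num_classes):
    """Hypothetical inference: score each candidate label by the activation of
    the input concatenated with that label's one-hot code; pick the largest."""
    scores = [activation_score(network,
                               np.concatenate([x, np.eye(num_classes)[c]]))
              for c in range(num_classes)]
    return int(np.argmax(scores))
```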
    PELICAN: Permutation Equivariant and Lorentz Invariant or Covariant Aggregator Network for Particle Physics. (arXiv:2211.00454v2 [hep-ph] UPDATED)
    Many current approaches to machine learning in particle physics use generic architectures that require large numbers of parameters and disregard underlying physics principles, limiting their applicability as scientific modeling tools. In this work, we present a machine learning architecture that uses a set of inputs maximally reduced with respect to the full 6-dimensional Lorentz symmetry, and is fully permutation-equivariant throughout. We study the application of this network architecture to the standard task of top quark tagging and show that the resulting network outperforms all existing competitors despite much lower model complexity. In addition, we present a Lorentz-covariant variant of the same network applied to a 4-momentum regression task.
    Efficient Learning of Decision-Making Models: A Penalty Block Coordinate Descent Algorithm for Data-Driven Inverse Optimization. (arXiv:2210.15393v2 [math.OC] UPDATED)
    Decision-making problems are commonly formulated as optimization problems, which are then solved to make optimal decisions. In this work, we consider the inverse problem where we use prior decision data to uncover the underlying decision-making process in the form of a mathematical optimization model. This statistical learning problem is referred to as data-driven inverse optimization. We focus on problems where the underlying decision-making process is modeled as a convex optimization problem whose parameters are unknown. We formulate the inverse optimization problem as a bilevel program and propose an efficient block coordinate descent-based algorithm to solve large problem instances. Numerical experiments on synthetic datasets demonstrate the computational advantage of our method compared to standard commercial solvers. Moreover, the real-world utility of the proposed approach is highlighted through two realistic case studies in which we consider estimating risk preferences and learning local constraint parameters of agents in a multiplayer Nash bargaining game.
    A Generalized EigenGame with Extensions to Multiview Representation Learning. (arXiv:2211.11323v2 [cs.LG] UPDATED)
    Generalized Eigenvalue Problems (GEPs) encompass a range of interesting dimensionality reduction methods. Developing efficient stochastic approaches to these problems would allow them to scale to larger datasets. Canonical Correlation Analysis (CCA) is one example of a GEP for dimensionality reduction which has found extensive use in problems with two or more views of the data. Deep learning extensions of CCA require large mini-batch sizes, and therefore large memory consumption, in the stochastic setting to achieve good performance; this has limited their application in practice. Inspired by the Generalized Hebbian Algorithm, we develop an approach to solving stochastic GEPs in which all constraints are softly enforced by Lagrange multipliers. By considering the integral of this Lagrangian function (its pseudo-utility), and inspired by recent formulations of Principal Components Analysis and GEPs as games with differentiable utilities, we develop a game-theory-inspired approach to solving GEPs. We show that our approach shares much of the theoretical grounding of the previous Hebbian and game-theoretic approaches in the linear case, but permits extension to general function approximators, such as neural networks, for certain GEPs for dimensionality reduction, including CCA; our method can therefore be used for deep multiview representation learning. We demonstrate the effectiveness of our method for solving GEPs in the stochastic setting using canonical multiview datasets, and demonstrate state-of-the-art performance for optimizing Deep CCA.
    Annealing Optimization for Progressive Learning with Stochastic Approximation. (arXiv:2209.02826v2 [eess.SY] UPDATED)
    In this work, we introduce a learning model designed to meet the needs of applications in which computational resources are limited, and robustness and interpretability are prioritized. Learning problems can be formulated as constrained stochastic optimization problems, with the constraints originating mainly from model assumptions that define a trade-off between complexity and performance. This trade-off is closely related to over-fitting, generalization capacity, and robustness to noise and adversarial attacks, and depends on both the structure and complexity of the model and the properties of the optimization methods used. We develop an online prototype-based learning algorithm based on annealing optimization, formulated as an online gradient-free stochastic approximation algorithm. The learning model can be viewed as an interpretable and progressively growing competitive-learning neural network model to be used for supervised, unsupervised, and reinforcement learning. The annealing nature of the algorithm contributes to minimal hyper-parameter tuning requirements, avoidance of poor local minima, and robustness with respect to the initial conditions. At the same time, it provides online control over the performance-complexity trade-off by progressively increasing the complexity of the learning model as needed, through an intuitive bifurcation phenomenon. Finally, the use of stochastic approximation enables the study of the convergence of the learning algorithm through mathematical tools from dynamical systems and control, and allows for its integration with reinforcement learning algorithms, constructing an adaptive state-action aggregation scheme.
    Distributionally Robust Model-Based Offline Reinforcement Learning with Near-Optimal Sample Complexity. (arXiv:2208.05767v3 [cs.LG] UPDATED)
    This paper concerns the central issues of model robustness and sample efficiency in offline reinforcement learning (RL), which aims to learn to perform decision making from history data without active exploration. Due to uncertainties and variabilities of the environment, it is critical to learn a robust policy, with as few samples as possible, that performs well even when the deployed environment deviates from the nominal one used to collect the history dataset. We consider a distributionally robust formulation of offline RL, focusing on tabular robust Markov decision processes with an uncertainty set specified by the Kullback-Leibler divergence, in both finite-horizon and infinite-horizon settings. To combat sample scarcity, we propose a model-based algorithm that combines distributionally robust value iteration with the principle of pessimism in the face of uncertainty, by penalizing the robust value estimates with a carefully designed data-driven penalty term. Under a mild and tailored assumption on the history dataset that measures distribution shift without requiring full coverage of the state-action space, we establish the finite-sample complexity of the proposed algorithm, and further show it is almost unimprovable in light of a nearly matching information-theoretic lower bound, up to a polynomial factor of the (effective) horizon length. To the best of our knowledge, this provides the first provably near-optimal robust offline RL algorithm that learns under model uncertainty and partial coverage.
    DeepMed: Semiparametric Causal Mediation Analysis with Debiased Deep Learning. (arXiv:2210.04389v2 [stat.ML] UPDATED)
    Causal mediation analysis can unpack the black box of causality and is therefore a powerful tool for disentangling causal pathways in biomedical and social sciences, and also for evaluating machine learning fairness. To reduce bias for estimating Natural Direct and Indirect Effects in mediation analysis, we propose a new method called DeepMed that uses deep neural networks (DNNs) to cross-fit the infinite-dimensional nuisance functions in the efficient influence functions. We obtain novel theoretical results that our DeepMed method (1) can achieve semiparametric efficiency bound without imposing sparsity constraints on the DNN architecture and (2) can adapt to certain low dimensional structures of the nuisance functions, significantly advancing the existing literature on DNN-based semiparametric causal inference. Extensive synthetic experiments are conducted to support our findings and also expose the gap between theory and practice. As a proof of concept, we apply DeepMed to analyze two real datasets on machine learning fairness and reach conclusions consistent with previous findings.
    Faster Randomized Methods for Orthogonality Constrained Problems. (arXiv:2106.12060v1 [math.NA] CROSS LISTED)
    Recent literature has advocated the use of randomized methods for accelerating the solution of various matrix problems arising throughout data science and computational science. One popular strategy for leveraging randomization is to use it as a way to reduce problem size. However, methods based on this strategy lack sufficient accuracy for some applications. Randomized preconditioning is another approach for leveraging randomization, which provides higher accuracy. The main challenge in using randomized preconditioning is the need for an underlying iterative method; thus, randomized preconditioning has so far been applied almost exclusively to solving regression problems and linear systems. In this article, we show how to expand the application of randomized preconditioning to another important set of problems prevalent across data science: optimization problems with (generalized) orthogonality constraints. We demonstrate our approach, which is based on the framework of Riemannian optimization and Riemannian preconditioning, on the problem of computing the dominant canonical correlations and on the Fisher linear discriminant analysis problem. For both problems, we evaluate the effect of preconditioning on the computational costs and asymptotic convergence, and demonstrate empirically the utility of our approach.
    When Do Curricula Work in Federated Learning? (arXiv:2212.12712v1 [cs.LG])
    An oft-cited open problem of federated learning is the existence of data heterogeneity at the clients. One pathway to understanding the drastic accuracy drop in federated learning is to scrutinize the behavior of the clients' deep models on data with different levels of "difficulty", which has been left unaddressed. In this paper, we investigate a different and rarely studied dimension of FL: ordered learning. Specifically, we aim to investigate how ordered learning principles can contribute to alleviating the heterogeneity effects in FL. We present theoretical analysis and conduct extensive empirical studies on the efficacy of orderings spanning three kinds of learning: curriculum, anti-curriculum, and random curriculum. We find that curriculum learning largely alleviates non-IIDness. Interestingly, the more disparate the data distributions across clients, the more they benefit from ordered learning. We provide analysis explaining this phenomenon, specifically indicating how curriculum training appears to make the objective landscape progressively less convex, suggesting fast-converging iterations at the beginning of the training procedure. We derive quantitative convergence results for both convex and nonconvex objectives by modeling the curriculum training on federated devices as local SGD with locally biased stochastic gradients. Also, inspired by ordered learning, we propose a novel client selection technique that benefits from the real-world disparity in the clients. Our proposed approach to client selection has a synergistic effect when applied together with ordered learning in FL.
    Linear Combinatorial Semi-Bandit with Causally Related Rewards. (arXiv:2212.12923v1 [cs.LG])
    In a sequential decision-making problem, a structural dependency amongst the reward distributions associated with the arms makes it challenging to identify a subset of alternatives that guarantees the optimal collective outcome. Thus, besides individual actions' rewards, learning the causal relations is essential to improve the decision-making strategy. To solve this two-fold learning problem, we develop the 'combinatorial semi-bandit framework with causally related rewards', where we model the causal relations by a directed graph in a stationary structural equation model. The nodal observation in the graph signal comprises the corresponding base arm's instantaneous reward and an additional term resulting from the causal influences of other base arms' rewards. The objective is to maximize the long-term average payoff, which is a linear function of the base arms' rewards and depends strongly on the network topology. To achieve this objective, we propose a policy that determines the causal relations by learning the network's topology and simultaneously exploits this knowledge to optimize the decision-making process. We establish a sublinear regret bound for the proposed algorithm. Numerical experiments using synthetic and real-world datasets demonstrate the superior performance of our proposed method compared to several benchmarks.
    GWO-FI: A novel machine learning framework by combining Gray Wolf Optimizer and Frequent Itemsets to diagnose and investigate effective factors on In-Hospital Mortality and Length of Stay among Kermanshahian Cardiovascular Disease patients. (arXiv:2212.13048v1 [cs.LG])
    Investigation and analysis of patient outcomes, including in-hospital mortality and length of stay, are crucial for assisting clinicians in determining a patient's result at the outset of their hospitalization and for assisting hospitals in allocating their resources. This paper proposes an approach based on combining the well-known gray wolf algorithm with frequent items extracted by association rule mining algorithms. First, the original features are combined with the discriminative extracted frequent items. The best subset of these features is then chosen, and the parameters of the classification algorithms are tuned, using the gray wolf algorithm. The framework was evaluated using a real dataset of 2816 patients from the Imam Ali Hospital in Kermanshah, Iran. The study's findings indicate that low ejection fraction, old age, high CPK values, and high creatinine levels are the main contributors to patients' mortality. Several significant and interesting rules related to in-hospital mortality and length of stay have also been extracted and presented. Additionally, the accuracy, sensitivity, specificity, and AUROC of the proposed framework for the diagnosis of in-hospital mortality using the SVM classifier were 0.9961, 0.9477, 0.9992, and 0.9734, respectively. According to the framework's findings, adding frequent items as features considerably improves classification accuracy.
    GAE-ISumm: Unsupervised Graph-Based Summarization of Indian Languages. (arXiv:2212.12937v1 [cs.CL])
    Document summarization aims to create a precise and coherent summary of a text document. Many deep learning summarization models have been developed mainly for English, and often require a large training corpus along with efficient pre-trained language models and tools. However, such models transfer poorly to low-resource Indian languages, which exhibit rich morphological variation and differ from English in syntax and semantics. In this paper, we propose GAE-ISumm, an unsupervised Indic summarization model that extracts summaries from text documents. In particular, our proposed model, GAE-ISumm, uses a Graph Autoencoder (GAE) to jointly learn text representations and a document summary. We also provide TELSUM, a manually annotated Telugu summarization dataset, to experiment with our model. Further, we experiment with publicly available Indian-language summarization datasets to investigate the effectiveness of GAE-ISumm on other Indian languages. Our experiments with GAE-ISumm in seven languages make the following observations: (i) it is competitive with or better than state-of-the-art results on all datasets, (ii) it reports benchmark results on TELSUM, and (iii) the inclusion of positional and cluster information in the proposed model improves the quality of summaries.
    Energy Efficiency Maximization in IRS-Aided Cell-Free Massive MIMO System. (arXiv:2212.12744v1 [eess.SP])
    In this paper, we consider an intelligent reflecting surface (IRS)-aided cell-free massive multiple-input multiple-output system, where the beamforming at the access points and the phase shifts at the IRSs are jointly optimized to maximize energy efficiency (EE). To solve the EE maximization problem, we propose an iterative optimization algorithm that uses the quadratic transform and the Lagrangian dual transform to find the optimal beamforming and phase shifts. However, the proposed algorithm suffers from high computational complexity, which hinders its application in some practical scenarios. To address this, we further propose a deep learning based approach for joint beamforming and phase-shift design. Specifically, a two-stage deep neural network is trained offline in an unsupervised manner and then deployed online to predict the beamforming and phase shifts. Simulation results show that, compared with the iterative optimization algorithm and the genetic algorithm, the unsupervised learning based approach achieves higher EE performance with lower running time.
    QuickNets: Saving Training and Preventing Overconfidence in Early-Exit Neural Architectures. (arXiv:2212.12866v1 [cs.LG])
    Deep neural networks have long training and processing times. Early exits added to neural networks allow the network to make early predictions using intermediate activations in time-sensitive applications. However, early exits increase the training time of neural networks. We introduce QuickNets: a novel cascaded training algorithm for faster training of neural networks. QuickNets are trained in a layer-wise manner such that each successive layer is only trained on samples that could not be correctly classified by the previous layers. We demonstrate that QuickNets can dynamically distribute learning and have reduced training and inference costs compared to standard backpropagation. Additionally, we introduce commitment layers that significantly improve the early exits by identifying over-confident predictions, and we demonstrate their success.
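    The cascaded recipe can be sketched in a few lines: fit a stage, forward the data through it, and hand only the misclassified samples to the next stage. This is a schematic reconstruction from the abstract (the `train_fn` callback is a placeholder, and the commitment layers are not shown), not the authors' implementation.

```python
import numpy as np

def train_quicknet(layers, exits, X, y, train_fn):
    """Layer-wise cascaded training (sketch): each (layer, exit) pair is fit
    only on the samples earlier exits got wrong, so later stages see the
    progressively harder remainder."""
    feats, idx = X, np.arange(len(X))
    for layer, exit_clf in zip(layers, exits):
        train_fn(layer, exit_clf, feats, y[idx])   # fit this stage
        feats = layer(feats)                       # features for the exit
        wrong = exit_clf(feats).argmax(axis=1) != y[idx]
        feats, idx = feats[wrong], idx[wrong]      # keep only the mistakes
        if idx.size == 0:
            break                                  # everything classified
```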
    Understanding the Complexity Gains of Single-Task RL with a Curriculum. (arXiv:2212.12809v1 [cs.LG])
    Reinforcement learning (RL) problems can be challenging without well-shaped rewards. Prior work on provably efficient RL methods generally proposes to address this issue with dedicated exploration strategies. However, another way to tackle this challenge is to reformulate it as a multi-task RL problem, where the task space contains not only the challenging task of interest but also easier tasks that implicitly function as a curriculum. Such a reformulation opens up the possibility of running existing multi-task RL methods as a more efficient alternative to solving a single challenging task from scratch. In this work, we provide a theoretical framework that reformulates a single-task RL problem as a multi-task RL problem defined by a curriculum. Under mild regularity conditions on the curriculum, we show that sequentially solving each task in the multi-task RL problem is more computationally efficient than solving the original single-task problem, without any explicit exploration bonuses or other exploration strategies. We also show that our theoretical insights can be translated into an effective practical learning algorithm that can accelerate curriculum learning on simulated robotic tasks.
    Convolutional Neural Networks on Graphs with Chebyshev Approximation, Revisited. (arXiv:2202.03580v4 [cs.LG] UPDATED)
    Designing spectral convolutional networks is a challenging problem in graph learning. ChebNet, one of the early attempts, approximates the spectral graph convolutions using Chebyshev polynomials. GCN simplifies ChebNet by utilizing only the first two Chebyshev polynomials while still outperforming it on real-world datasets. GPR-GNN and BernNet demonstrate that the Monomial and Bernstein bases also outperform the Chebyshev basis in terms of learning the spectral graph convolutions. Such conclusions are counter-intuitive in the field of approximation theory, where it is established that the Chebyshev polynomial achieves the optimal convergence rate for approximating a function. In this paper, we revisit the problem of approximating the spectral graph convolutions with Chebyshev polynomials. We show that ChebNet's inferior performance is primarily due to illegal coefficients learnt by ChebNet when approximating analytic filter functions, which leads to over-fitting. We then propose ChebNetII, a new GNN model based on Chebyshev interpolation, which enhances the original Chebyshev polynomial approximation while reducing the Runge phenomenon. We conducted an extensive experimental study to demonstrate that ChebNetII can learn arbitrary graph convolutions and achieve superior performance in both full- and semi-supervised node classification tasks. Most notably, we scale ChebNetII to the billion-scale graph ogbn-papers100M, showing that spectral-based GNNs can deliver superior performance at this scale. Our code is available at https://github.com/ivam-he/ChebNetII.
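    For reference, the Chebyshev approximation at the heart of ChebNet-style filters reduces to the textbook three-term recurrence below. This is a generic sketch; ChebNetII further constrains how the coefficients are obtained via Chebyshev interpolation, which is not shown.

```python
import numpy as np

def chebyshev_filter(L_norm, X, coeffs):
    """Spectral graph convolution via a Chebyshev expansion. L_norm is the
    graph Laplacian rescaled to eigenvalues in [-1, 1] (e.g. 2L/lmax - I);
    coeffs holds the K+1 filter weights (K >= 1 assumed)."""
    T_prev, T_curr = X, L_norm @ X                  # T0(L)X = X, T1(L)X = LX
    out = coeffs[0] * T_prev + coeffs[1] * T_curr
    for k in range(2, len(coeffs)):
        T_prev, T_curr = T_curr, 2 * (L_norm @ T_curr) - T_prev  # recurrence
        out = out + coeffs[k] * T_curr
    return out
```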
    How unfair is private learning? (arXiv:2206.03985v2 [cs.LG] UPDATED)
    As machine learning algorithms are deployed on sensitive data in critical decision-making processes, it is becoming increasingly important that they are also private and fair. In this paper, we show that, when the data has a long-tailed structure, it is not possible to build accurate learning algorithms that are both private and achieve high accuracy on minority subpopulations. We further show that relaxing overall accuracy can lead to good fairness even with strict privacy requirements. To corroborate our theoretical results in practice, we provide an extensive set of experimental results using a variety of synthetic, vision (CIFAR10 and CelebA), and tabular (Law School) datasets and learning algorithms.
    Causal Explanations of Structural Causal Models. (arXiv:2110.02395v3 [cs.LG] UPDATED)
    In explanatory interactive learning (XIL), the user queries the learner, the learner explains its answer to the user, and the loop repeats. XIL is attractive for two reasons: (1) the learner becomes better and (2) the user's trust increases. For both reasons to hold, the learner's explanations must be useful to the user, and the user must be allowed to ask useful questions. Ideally, both questions and explanations should be grounded in a causal model, since this avoids spurious fallacies. Ultimately, we seem to seek a causal variant of XIL. We believe the question part, on the user's end, to be solved, since the user's mental model can provide the causal model. But how would the learner provide causal explanations? In this work we show that existing explanation methods are not guaranteed to be causal, even when provided with a Structural Causal Model (SCM). Specifically, we use CXPlain, a popular explanation method proclaimed to be causal, to illustrate how the generated explanations leave open the question of truly causal explanations. Thus, as a step towards causal XIL, we propose a solution to this lack of causal explanations. We derive from first principles an explanation method that makes full use of a given SCM, which we refer to as SCE (E standing for Explanation). Since SCEs make use of structural information, any causal graph learner can now provide human-readable explanations. We conduct several experiments, including a user study with 22 participants, to investigate the virtue of SCEs as causal explanations of SCMs.
    Efficient Long-Text Understanding with Short-Text Models. (arXiv:2208.00748v2 [cs.CL] UPDATED)
    Transformer-based pretrained language models (LMs) are ubiquitous across natural language understanding, but cannot be applied to long sequences such as stories, scientific articles, and long documents due to their quadratic complexity. While a myriad of efficient transformer variants have been proposed, they are typically based on custom implementations that require expensive pretraining from scratch. In this work, we propose SLED: SLiding-Encoder and Decoder, a simple approach for processing long sequences that re-uses and leverages battle-tested short-text pretrained LMs. Specifically, we partition the input into overlapping chunks, encode each with a short-text LM encoder, and use the pretrained decoder to fuse information across chunks (fusion-in-decoder). We illustrate through controlled experiments that SLED offers a viable strategy for long text understanding and evaluate our approach on SCROLLS, a benchmark with seven datasets across a wide range of language understanding tasks. We find that SLED is competitive with specialized models that are up to 50x larger and require a dedicated and expensive pretraining step.
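    The chunking step SLED builds on is simple enough to sketch. The window and stride sizes below are arbitrary placeholders, and the encoder/decoder fusion is only described in the comment.

```python
def overlapping_chunks(token_ids, chunk_len=256, stride=128):
    """Split a long token sequence into overlapping windows (sketch of SLED's
    encoding stage). Each chunk would be encoded independently by the
    short-text encoder; the pretrained decoder then attends over all chunk
    encodings at once (fusion-in-decoder)."""
    chunks = []
    for start in range(0, max(len(token_ids) - chunk_len, 0) + 1, stride):
        chunks.append(token_ids[start:start + chunk_len])
    return chunks
```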
    Wastewater Pipe Rating Model Using Natural Language Processing. (arXiv:2202.13871v2 [cs.IR] UPDATED)
    Closed-circuit television (CCTV) inspection has been the most popular technique for visually evaluating the interior status of pipelines in recent decades. Certified inspectors prepare the pipe repair document based on the CCTV inspection. The traditional manual method of assessing sewage structural conditions from pipe repair documents takes a long time and is prone to human mistakes. The automatic identification of the necessary texts has received little attention. By building an automated framework employing Natural Language Processing (NLP), this study presents an effective technique to automate the identification of the pipe defect rating from pipe repair documents. NLP technologies are employed to break down the textual material into grammatical units. Further analysis entails using words to discover pipe defect symptoms and their frequency, and then combining that information into a single score. Our model achieves 95.0% accuracy, 94.9% sensitivity, 94.4% specificity, a 95.9% precision score, and a 95.7% F1 score, showing the potential of the proposed model to be used on large-scale pipe repair documents for accurate and efficient pipeline failure detection to improve the quality of the pipeline. Keywords: Sewer pipe inspection, Defect detection, Natural language processing, Text recognition
    Independent and Decentralized Learning in Markov Potential Games. (arXiv:2205.14590v3 [cs.LG] UPDATED)
    We propose a multi-agent reinforcement learning dynamics, and analyze its convergence properties in infinite-horizon discounted Markov potential games. We focus on the independent and decentralized setting, where players can only observe the realized state and their own reward in every stage. Players do not have knowledge of the game model, and cannot coordinate with each other. In each stage of our learning dynamics, players update their estimate of a perturbed Q-function that evaluates their total contingent payoff based on the realized one-stage reward in an asynchronous manner. Then, players independently update their policies by incorporating a smoothed optimal one-stage deviation strategy based on the estimated Q-function. A key feature of the learning dynamics is that the Q-function estimates are updated at a faster timescale than the policies. We prove that the policies induced by our learning dynamics converge to a stationary Nash equilibrium in Markov potential games with probability 1. Our results demonstrate that agents can reach a stationary Nash equilibrium in Markov potential games through simple learning dynamics under the minimum information environment.
    Demand Forecasting for Platelet Usage: from Univariate Time Series to Multivariate Models. (arXiv:2101.02305v2 [cs.LG] UPDATED)
    Platelet products are both expensive and have very short shelf lives. As usage rates for platelets are highly variable, the effective management of platelet demand and supply is very important yet challenging. The primary goal of this paper is to present an efficient forecasting model for platelet demand at Canadian Blood Services (CBS). To accomplish this goal, four different demand forecasting methods, ARIMA (AutoRegressive Integrated Moving Average), Prophet, lasso regression (least absolute shrinkage and selection operator), and LSTM (Long Short-Term Memory) networks, are utilized and evaluated. We use a large clinical dataset from a centralized blood distribution centre serving four hospitals in Hamilton, Ontario, spanning 2010 to 2018 and consisting of daily platelet transfusions along with information such as product specifications, the recipients' characteristics, and the recipients' laboratory test results. This study is the first to utilize methods ranging from statistical time series models to data-driven regression and machine learning techniques for platelet transfusion forecasting using clinical predictors and with different amounts of data. We find that the multivariate approaches have the highest accuracy in general; however, if sufficient data are available, a simpler time series approach such as ARIMA appears to be sufficient. We also comment on the approach to choosing clinical indicators (inputs) for the multivariate models.
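    As a reference point, a univariate ARIMA baseline of the kind discussed above takes only a few lines with statsmodels. The file name, column names, and order (p, d, q) here are placeholders, not values from the study.

```python
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Hypothetical daily platelet-usage series indexed by date.
demand = pd.read_csv("platelet_usage.csv", index_col="date",
                     parse_dates=True)["units_transfused"]

fit = ARIMA(demand, order=(7, 1, 2)).fit()   # (p, d, q) chosen for illustration
print(fit.forecast(steps=28))                # four weeks ahead
```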
    Domain-invariant Feature Exploration for Domain Generalization. (arXiv:2207.12020v2 [cs.LG] UPDATED)
    Deep learning has achieved great success in the past few years. However, the performance of deep learning is likely to degrade in the face of non-IID situations. Domain generalization (DG) enables a model to generalize to an unseen test distribution, i.e., to learn domain-invariant representations. In this paper, we argue that domain-invariant features should originate from both internal and mutual sides. Internal invariance means that the features can be learned within a single domain and capture the intrinsic semantics of the data, i.e., properties within a domain that are agnostic to other domains. Mutual invariance means that the features can be learned across multiple domains (cross-domain) and contain common information, i.e., features transferable w.r.t. other domains. We then propose DIFEX for Domain-Invariant Feature EXploration. DIFEX employs a knowledge distillation framework to capture the high-level Fourier phase as the internally invariant features and learns cross-domain correlation alignment as the mutually invariant features. We further design an exploration loss to increase feature diversity for better generalization. Extensive experiments on both time-series and visual benchmarks demonstrate that the proposed DIFEX achieves state-of-the-art performance.
    Accelerated Training of Physics-Informed Neural Networks (PINNs) using Meshless Discretizations. (arXiv:2205.09332v5 [cs.LG] UPDATED)
    We present a new technique for the accelerated training of physics-informed neural networks (PINNs): discretely-trained PINNs (DT-PINNs). The repeated computation of partial derivative terms in the PINN loss functions via automatic differentiation during training is known to be computationally expensive, especially for higher-order derivatives. DT-PINNs are trained by replacing these exact spatial derivatives with high-order accurate numerical discretizations computed using meshless radial basis function-finite differences (RBF-FD) and applied via sparse matrix-vector multiplication. The use of RBF-FD allows for DT-PINNs to be trained even on point cloud samples placed on irregular domain geometries. Additionally, though traditional PINNs (vanilla-PINNs) are typically stored and trained in 32-bit floating-point (fp32) on the GPU, we show that for DT-PINNs, using fp64 on the GPU leads to significantly faster training times than fp32 vanilla-PINNs with comparable accuracy. We demonstrate the efficiency and accuracy of DT-PINNs via a series of experiments. First, we explore the effect of network depth on both numerical and automatic differentiation of a neural network with random weights and show that RBF-FD approximations of third-order accuracy and above are more efficient while being sufficiently accurate. We then compare the DT-PINNs to vanilla-PINNs on both linear and nonlinear Poisson equations and show that DT-PINNs achieve similar losses with 2-4x faster training times on a consumer GPU. Finally, we also demonstrate that similar results can be obtained for the PINN solution to the heat equation (a space-time problem) by discretizing the spatial derivatives using RBF-FD and using automatic differentiation for the temporal derivative. Our results show that fp64 DT-PINNs offer a superior cost-accuracy profile to fp32 vanilla-PINNs.
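    The core substitution is easy to illustrate for a Poisson problem: the autograd Laplacian in the PINN residual is replaced by one sparse matrix-vector product with a precomputed RBF-FD operator. A minimal sketch, assuming the discrete Laplacian `L_sparse` has already been assembled on the point cloud:

```python
import torch

def dt_pinn_residual_loss(u_net, x_pts, L_sparse, f_rhs):
    """DT-PINN-style loss for -lap(u) = f (sketch): evaluate the network on a
    fixed point cloud, then apply a precomputed sparse RBF-FD Laplacian
    instead of differentiating through the network with autograd."""
    u_vals = u_net(x_pts).squeeze(-1)                       # u at the points
    lap_u = torch.sparse.mm(L_sparse, u_vals.unsqueeze(1)).squeeze(1)
    return torch.mean((-lap_u - f_rhs) ** 2)                # PDE residual
```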
    An Efficient and Reliable Asynchronous Federated Learning Scheme for Smart Public Transportation. (arXiv:2208.07194v4 [cs.LG] UPDATED)
    Since traffic conditions change over time, machine learning models that predict traffic flows must be updated continuously and efficiently in smart public transportation. Federated learning (FL) is a distributed machine learning scheme that allows buses to receive model updates without waiting for model training on the cloud. However, FL is vulnerable to poisoning or DDoS attacks since buses travel in public. Some work introduces blockchain to improve reliability, but the additional latency from the consensus process reduces the efficiency of FL. Asynchronous federated learning (AFL) is a scheme that reduces the latency of aggregation to improve efficiency, but its learning performance is unstable due to unreasonably weighted local models. To address the above challenges, this paper offers a blockchain-based asynchronous federated learning scheme with a dynamic scaling factor (DBAFL). Specifically, a novel committee-based consensus algorithm for the blockchain improves reliability at the lowest possible cost in time. Meanwhile, the devised dynamic scaling factor allows AFL to assign reasonable weights to stale local models. Extensive experiments conducted on heterogeneous devices validate the superior learning performance, efficiency, and reliability of DBAFL.
    Learning from Heterogeneous Data Based on Social Interactions over Graphs. (arXiv:2112.09483v2 [cs.LG] UPDATED)
    This work proposes a decentralized architecture, where individual agents aim at solving a classification problem while observing streaming features of different dimensions and arising from possibly different distributions. In the context of social learning, several useful strategies have been developed, which solve decision making problems through local cooperation across distributed agents and allow them to learn from streaming data. However, traditional social learning strategies rely on the fundamental assumption that each agent has significant prior knowledge of the underlying distribution of the observations. In this work we overcome this issue by introducing a machine learning framework that exploits social interactions over a graph, leading to a fully data-driven solution to the distributed classification problem. In the proposed social machine learning (SML) strategy, two phases are present: in the training phase, classifiers are independently trained to generate a belief over a set of hypotheses using a finite number of training samples; in the prediction phase, classifiers evaluate streaming unlabeled observations and share their instantaneous beliefs with neighboring classifiers. We show that the SML strategy enables the agents to learn consistently under this highly-heterogeneous setting and allows the network to continue learning even during the prediction phase when it is deciding on unlabeled samples. The prediction decisions are used to continually improve performance thereafter in a manner that is markedly different from most existing static classification schemes where, following training, the decisions on unlabeled data are not re-used to improve future performance.
    Indeterminacy and Strong Identifiability in Generative Models. (arXiv:2206.00801v3 [stat.ML] UPDATED)
    Most modern probabilistic generative models, such as the variational autoencoder (VAE), have certain indeterminacies that are unresolvable even with an infinite amount of data. Different tasks tolerate different indeterminacies; however, recent applications have indicated the need for strongly identifiable models, in which an observation corresponds to a unique latent code. Progress has been made towards reducing model indeterminacies while maintaining flexibility, and recent work excludes many, but not all, indeterminacies. In this work, we motivate model-identifiability in terms of task-identifiability, then construct a theoretical framework for analyzing the indeterminacies of latent variable models, which enables their precise characterization in terms of the generator function and prior distribution spaces. We reveal that strong identifiability is possible even with highly flexible nonlinear generators, and give two such examples. One is a straightforward modification of iVAE (arXiv:1907.04809 [stat.ML]); the other uses triangular monotonic maps, leading to novel connections between optimal transport and identifiability.
    ModelPred: A Framework for Predicting Trained Model from Training Data. (arXiv:2111.12545v4 [cs.LG] UPDATED)
    In this work, we propose ModelPred, a framework that helps to understand the impact of changes in training data on a trained model. This is critical for building trust in various stages of a machine learning pipeline: from cleaning poor-quality samples and tracking important ones to be collected during data preparation, to calibrating the uncertainty of model predictions, to interpreting why certain behaviors of a model emerge during deployment. Specifically, ModelPred learns a parameterized function that takes a dataset $S$ as input and predicts the model obtained by training on $S$. Our work differs from the recent work on Datamodels [1], as we aim to predict the trained model parameters directly instead of the trained model's behaviors. We demonstrate that a neural network-based set function class is capable of learning the complex relationships between the training data and model parameters. We introduce novel global and local regularization techniques to prevent overfitting, and we rigorously characterize the expressive power of neural networks (NNs) in approximating the end-to-end training process. Through extensive empirical investigations, we show that ModelPred enables a variety of applications that boost the interpretability and accountability of machine learning (ML), such as data valuation, data selection, memorization quantification, and model calibration.
    Learning k-Level Sparse Neural Networks Using a New Generalized Group Sparse Envelope Regularization. (arXiv:2212.12921v1 [cs.LG])
    We propose an efficient method to learn both unstructured and structured sparse neural networks during training, using a novel generalization of the sparse envelope function (SEF) as a regularizer, termed the group sparse envelope function (GSEF). The GSEF acts as a neuron group selector, which we leverage to induce structured pruning. Our method produces a hardware-friendly structured sparsity for a deep neural network (DNN), to efficiently accelerate the DNN's evaluation. The method is flexible in the sense that it allows any hardware to dictate the definition of a group, such as a filter, channel, filter shape, layer depth, or a single parameter (unstructured). By the nature of the GSEF, the proposed method is the first to make possible a pre-defined sparsity level that is achieved at training convergence, while maintaining negligible degradation of network accuracy. We propose an efficient method to calculate the exact value of the GSEF along with its proximal operator, with a worst-case complexity of $O(n)$, where $n$ is the total number of group variables. In addition, we propose a proximal-gradient-based optimization method to train the model, that is, the non-convex minimization of the sum of the neural network loss and the GSEF. Finally, we conduct experiments and illustrate the efficiency of our proposed technique in terms of completion ratio, accuracy, and inference latency.
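    The training loop is a standard proximal-gradient iteration. Since the GSEF proximal operator is the paper's own contribution, the sketch below substitutes the well-known group soft-threshold (the prox of the group-lasso penalty) as a stand-in, simply to show where such an operator plugs in.

```python
import torch

def group_soft_threshold(w, tau):
    """Prox of the group-lasso penalty: shrink the whole group toward zero.
    Used here as a stand-in for the GSEF prox (which instead enforces an
    exact number of surviving groups)."""
    return w * torch.clamp(1 - tau / (w.norm() + 1e-12), min=0.0)

def prox_gradient_step(group_params, loss, lr, tau):
    """One proximal-gradient step: gradient descent on the network loss, then
    the proximal operator applied group by group (e.g. per filter/channel)."""
    loss.backward()
    with torch.no_grad():
        for w in group_params:
            w -= lr * w.grad
            w.copy_(group_soft_threshold(w, lr * tau))
            w.grad.zero_()
```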
    Gaussian Pre-Activations in Neural Networks: Myth or Reality?. (arXiv:2205.12379v2 [cs.LG] UPDATED)
    The study of feature propagation at initialization in neural networks lies at the root of numerous initialization designs. An assumption very commonly made in the field states that the pre-activations are Gaussian. Although this convenient Gaussian hypothesis can be justified when the number of neurons per layer tends to infinity, it is challenged by both theoretical and experimental works for finite-width neural networks. Our major contribution is to construct a family of pairs of activation functions and initialization distributions that ensure that the pre-activations remain Gaussian throughout the network's depth, even in narrow neural networks. In the process, we discover a set of constraints that a neural network should fulfill to ensure Gaussian pre-activations. Additionally, we provide a critical review of the claims of the Edge of Chaos line of work and build an exact Edge of Chaos analysis. We also propose a unified view on pre-activation propagation, encompassing the framework of several well-known initialization procedures. Finally, our work provides a principled framework for answering the much-debated question: is it desirable to initialize the training of a neural network whose pre-activations are ensured to be Gaussian?
    FedEval: A Holistic Evaluation Framework for Federated Learning. (arXiv:2011.09655v3 [cs.LG] UPDATED)
    Federated Learning (FL) has been widely accepted as the solution for privacy-preserving machine learning without collecting raw data. While new technologies proposed in the past few years do evolve the FL area, unfortunately, the evaluation results presented in these works fall short in integrity and are hardly comparable because of inconsistent evaluation metrics and experimental settings. In this paper, we propose a holistic evaluation framework for FL called FedEval, and present a benchmarking study of seven state-of-the-art FL algorithms. Specifically, we first introduce the core evaluation taxonomy model, called FedEval-Core, which covers four essential evaluation aspects for FL: Privacy, Robustness, Effectiveness, and Efficiency, with various well-defined metrics and experimental settings. Based on the FedEval-Core, we further develop an FL evaluation platform with standardized evaluation settings and easy-to-use interfaces. We then provide an in-depth benchmarking study of the seven well-known FL algorithms, including FedSGD, FedAvg, FedProx, FedOpt, FedSTC, SecAgg, and HEAgg. We comprehensively analyze the advantages and disadvantages of these algorithms and further identify the suitable practical scenarios for each, which is rarely done in prior work. Lastly, we distill a set of take-away insights and future research directions, which should be very helpful for researchers in the FL area.
    Compositional optimization of quantum circuits for quantum kernels of support vector machines. (arXiv:2203.13848v2 [quant-ph] UPDATED)
    While quantum machine learning (ML) has been proposed to be one of the most promising applications of quantum computing, how to build quantum ML models that outperform classical ML remains a major open question. Here, we demonstrate a Bayesian algorithm for constructing quantum kernels for support vector machines that adapts quantum gate sequences to data. The algorithm increases the complexity of quantum circuits incrementally, appending quantum gates selected with the Bayesian information criterion as the circuit-selection metric and applying Bayesian optimization to the parameters of the identified locally optimal quantum circuits. The performance of the resulting quantum models for classification problems with a small number of training points significantly exceeds that of optimized classical models with conventional kernels.
    Learning-Based Client Selection for Federated Learning Services Over Wireless Networks with Constrained Monetary Budgets. (arXiv:2208.04322v2 [cs.LG] UPDATED)
    We investigate a data quality-aware dynamic client selection problem for multiple federated learning (FL) services in a wireless network, where each client offers dynamic datasets for the simultaneous training of multiple FL services, and each FL service demander has to pay for the clients under constrained monetary budgets. The problem is formalized as a non-cooperative Markov game over the training rounds. A multi-agent hybrid deep reinforcement learning-based algorithm is proposed to optimize the joint client selection and payment actions, while avoiding action conflicts. Simulation results indicate that our proposed algorithm can significantly improve training performance.
    MC-Nonlocal-PINNs: handling nonlocal operators in PINNs via Monte Carlo sampling. (arXiv:2212.12984v1 [math.NA])
    We propose Monte Carlo nonlocal physics-informed neural networks (MC-Nonlocal-PINNs), a generalization of MC-fPINNs in \cite{guo2022monte}, for solving general nonlocal models such as integral equations and nonlocal PDEs. As in MC-fPINNs, our MC-Nonlocal-PINNs handle the nonlocal operators in a Monte Carlo way, resulting in a very stable approach for high-dimensional problems. We present a variety of test problems, including high-dimensional Volterra-type integral equations, hypersingular integral equations, and nonlocal PDEs, to demonstrate the effectiveness of our approach.
    Robust computation of optimal transport by $\beta$-potential regularization. (arXiv:2212.13251v1 [cs.LG])
    Optimal transport (OT) has become a widely used tool in the machine learning field to measure the discrepancy between probability distributions. For instance, OT is a popular loss function that quantifies the discrepancy between an empirical distribution and a parametric model. Recently, an entropic penalty term and the celebrated Sinkhorn algorithm have been commonly used to approximate the original OT in a computationally efficient way. However, since the Sinkhorn algorithm runs a projection associated with the Kullback-Leibler divergence, it is often vulnerable to outliers. To overcome this problem, we propose regularizing OT with the $\beta$-potential term associated with the so-called $\beta$-divergence, which was developed in robust statistics. Our theoretical analysis reveals that the $\beta$-potential can prevent the mass from being transported to outliers. We experimentally demonstrate that the transport matrix computed with our algorithm helps estimate a probability distribution robustly even in the presence of outliers. In addition, our proposed method can successfully detect outliers from a contaminated dataset.
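    For reference, a minimal sketch of the standard entropic-OT Sinkhorn iteration that the abstract takes as its starting point (the paper's contribution is to replace the KL-based projection with a $\beta$-potential term; that modification is not shown here):

        import numpy as np

        def sinkhorn(C, a, b, eps=0.1, iters=200):
            # Standard entropic-regularized OT via Sinkhorn iterations.
            # C: (n, m) cost matrix; a, b: source/target marginals.
            K = np.exp(-C / eps)
            u = np.ones_like(a)
            v = np.ones_like(b)
            for _ in range(iters):
                v = b / (K.T @ u)
                u = a / (K @ v)
            return u[:, None] * K * v[None, :]  # transport plan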
    Online Active Learning for Soft Sensor Development using Semi-Supervised Autoencoders. (arXiv:2212.13067v1 [cs.LG])
    Data-driven soft sensors are extensively used in industrial and chemical processes to predict hard-to-measure process variables whose real value is difficult to track during routine operations. The regression models used by these sensors often require a large number of labeled examples, yet obtaining the label information can be very expensive given the high time and cost required by quality inspections. In this context, active learning methods can be highly beneficial as they can suggest the most informative labels to query. However, most of the active learning strategies proposed for regression focus on the offline setting. In this work, we adapt some of these approaches to the stream-based scenario and show how they can be used to select the most informative data points. We also demonstrate how to use a semi-supervised architecture based on orthogonal autoencoders to learn salient features in a lower dimensional space. The Tennessee Eastman Process is used to compare the predictive performance of the proposed approaches.
    Convergence of Batch Asynchronous Stochastic Approximation With Applications to Reinforcement Learning. (arXiv:2109.03445v3 [stat.ML] UPDATED)
    The stochastic approximation (SA) algorithm is a widely used probabilistic method for finding a zero or a fixed point of a vector-valued function, when only noisy measurements of the function are available. In the literature to date, one makes a distinction between ``synchronous'' updating, whereby every component of the current guess is updated at each time, and ``asynchronous'' updating, whereby only one component is updated. In this paper, we study an intermediate situation that we call ``batch asynchronous stochastic approximation'' (BASA), in which, at each time instant, \textit{some but not all} components of the current estimated solution are updated. BASA allows the user to trade off memory requirements against time complexity. We develop a general methodology for proving that such algorithms converge to the fixed point of the map under study. These convergence proofs make use of weaker hypotheses than existing results. Specifically, existing convergence proofs require that the measurement noise is a zero-mean i.i.d. sequence or a martingale difference sequence. In the present paper, we permit biased measurements, that is, measurement noises that have nonzero conditional mean. Also, all convergence results to date assume that the stochastic step sizes satisfy a probabilistic analog of the well-known Robbins-Monro conditions. We replace this assumption by a purely deterministic condition on the irreducibility of the underlying Markov processes. As specific applications to Reinforcement Learning, we analyze the temporal difference algorithm $TD(\lambda)$ for value iteration, and the $Q$-learning algorithm for finding the optimal action-value function. In both cases, we establish the convergence of these algorithms, under milder conditions than in the existing literature.
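    A minimal sketch of the batch-asynchronous update pattern described above (illustrative only; the step-size schedule and batch selection must satisfy the paper's conditions for its convergence guarantees to apply):

        import numpy as np

        def basa_step(theta, f_noisy, batch_idx, step):
            # One BASA update: only the components in batch_idx move toward
            # the fixed point of the map f, using a noisy evaluation f_noisy.
            residual = f_noisy(theta) - theta
            theta = theta.copy()
            theta[batch_idx] += step * residual[batch_idx]
            return theta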
    On Error and Compression Rates for Prototype Rules. (arXiv:2206.08014v2 [cs.LG] UPDATED)
    We study the close interplay between error and compression in the non-parametric multiclass classification setting in terms of prototype learning rules. We focus in particular on a recently proposed compression-based learning rule termed OptiNet (Kontorovich, Sabato, and Urner 2016; Kontorovich, Sabato, and Weiss 2017; Hanneke et al. 2021). Beyond its computational merits, this rule has been recently shown to be universally consistent in any metric instance space that admits a universally consistent rule--the first learning algorithm known to enjoy this property. However, its error and compression rates have been left open. Here we derive such rates in the case where instances reside in Euclidean space under commonly posed smoothness and tail conditions on the data distribution. We first show that OptiNet achieves non-trivial compression rates while enjoying near minimax-optimal error rates. We then proceed to study a novel general compression scheme for further compressing prototype rules that locally adapts to the noise level without sacrificing accuracy. Applying it to OptiNet, we show that under a geometric margin condition, further gain in the compression rate is achieved. Experimental results comparing the performance of the various methods are presented.
    SYMBA: Symbolic Computation of Squared Amplitudes in High Energy Physics with Machine Learning. (arXiv:2206.08901v2 [hep-ph] UPDATED)
    The cross section is one of the most important physical quantities in high-energy physics and the most time consuming to compute. While machine learning has proven to be highly successful in numerical calculations in high-energy physics, analytical calculations using machine learning are still in their infancy. In this work, we use a sequence-to-sequence model, specifically, a transformer, to compute a key element of the cross section calculation, namely, the squared amplitude of an interaction. We show that a transformer model is able to predict correctly 97.6% and 99% of squared amplitudes of QCD and QED processes, respectively, at a speed that is up to orders of magnitude faster than current symbolic computation frameworks. We discuss the performance of the current model, its limitations and possible future directions for this work.
    Bias Mitigation Framework for Intersectional Subgroups in Neural Networks. (arXiv:2212.13014v1 [cs.LG])
    We propose a fairness-aware learning framework that mitigates intersectional subgroup bias associated with protected attributes. Prior research has primarily focused on mitigating one kind of bias by incorporating complex fairness-driven constraints into optimization objectives or designing additional layers that focus on specific protected attributes. We introduce a simple and generic bias mitigation approach that prevents models from learning relationships between protected attributes and the output variable by reducing the mutual information between them. We demonstrate that our approach is effective in reducing bias with little or no drop in accuracy. We also show that the models trained with our learning framework become causally fair and insensitive to the values of protected attributes. Finally, we validate our approach by studying feature interactions between protected and non-protected attributes. We demonstrate that these interactions are significantly reduced when our bias mitigation is applied.
    Toward Efficient Automated Feature Engineering. (arXiv:2212.13152v1 [cs.LG])
    Automated Feature Engineering (AFE) refers to automatically generating and selecting optimal feature sets for downstream tasks, and has achieved great success in real-world applications. Current AFE methods mainly focus on improving the effectiveness of the produced features but ignore the low-efficiency issue for large-scale deployment. Therefore, in this work, we propose a generic framework to improve the efficiency of AFE. Specifically, we construct the AFE pipeline in a reinforcement learning setting, where each feature is assigned an agent to perform feature transformation and selection, and the evaluation score of the produced features in downstream tasks serves as the reward to update the policy. We improve the efficiency of AFE from two perspectives. On the one hand, we develop a Feature Pre-Evaluation (FPE) Model to reduce the sample size and feature size, two main factors undermining the efficiency of feature evaluation. On the other hand, we devise a two-stage policy training strategy by running FPE on the pre-evaluation task as the initialization of the policy, to avoid training the policy from scratch. We conduct comprehensive experiments on 36 datasets covering both classification and regression tasks. The results show $2.9\%$ higher performance on average and 2x higher computational efficiency compared to state-of-the-art AFE methods.
    Can Foundation Models Wrangle Your Data?. (arXiv:2205.09911v2 [cs.LG] UPDATED)
    Foundation Models (FMs) are models trained on large corpora of data that, at very large scale, can generalize to new tasks without any task-specific finetuning. As these models continue to grow in size, innovations continue to push the boundaries of what these models can do on language and image tasks. This paper aims to understand an underexplored area of FMs: classical data tasks like cleaning and integration. As a proof-of-concept, we cast five data cleaning and integration tasks as prompting tasks and evaluate the performance of FMs on these tasks. We find that large FMs generalize and achieve SoTA performance on data cleaning and integration tasks, even though they are not trained for these data tasks. We identify specific research challenges and opportunities that these models present, including challenges with private and domain specific data, and opportunities to make data management systems more accessible to non-experts. We make our code and experiments publicly available at: https://github.com/HazyResearch/fm_data_tasks.
    Skit-S2I: An Indian Accented Speech to Intent dataset. (arXiv:2212.13015v1 [cs.CL])
    Conventional conversation assistants extract text transcripts from the speech signal using automatic speech recognition (ASR) and then predict intent from the transcriptions. Using end-to-end spoken language understanding (SLU), the intents of the speaker are predicted directly from the speech signal without requiring intermediate text transcripts. As a result, the model can optimize directly for intent classification and avoid cascading errors from ASR. The end-to-end SLU system also helps in reducing the latency of the intent prediction model. Although many datasets are available publicly for text-to-intent tasks, the availability of labeled speech-to-intent datasets is limited, and none are available in an Indian accent. In this paper, we release the Skit-S2I dataset, the first publicly available Indian-accented SLU dataset, covering the banking domain in a conversational tonality. We experiment with multiple baselines, compare representations from different pretrained speech encoders, and find that SSL-pretrained representations perform slightly better than ASR-pretrained representations, which lack prosodic features, for speech-to-intent classification. The dataset and baseline code are available at \url{https://github.com/skit-ai/speech-to-intent-dataset}.
    A Universal Law of Robustness via Isoperimetry. (arXiv:2105.12806v4 [cs.LG] UPDATED)
    Classically, data interpolation with a parametrized model class is possible as long as the number of parameters is larger than the number of equations to be satisfied. A puzzling phenomenon in deep learning is that models are trained with many more parameters than what this classical theory would suggest. We propose a partial theoretical explanation for this phenomenon. We prove that for a broad class of data distributions and model classes, overparametrization is necessary if one wants to interpolate the data smoothly. Namely, we show that smooth interpolation requires $d$ times more parameters than mere interpolation, where $d$ is the ambient data dimension. We prove this universal law of robustness for any smoothly parametrized function class with polynomial size weights, and any covariate distribution verifying isoperimetry. In the case of two-layer neural networks and Gaussian covariates, this law was conjectured in prior work by Bubeck, Li and Nagaraj. We also give an interpretation of our result as an improved generalization bound for model classes consisting of smooth functions.
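    In rough form (a hedged paraphrase of the abstract, with constants and technical conditions omitted), the law says that any $f$ from a smoothly parametrized class with $p$ polynomially bounded parameters that fits $n$ isoperimetric samples in dimension $d$ below the noise level must satisfy
    $$\mathrm{Lip}(f) \;\gtrsim\; \sqrt{\frac{nd}{p}},$$
    so demanding an $O(1)$-Lipschitz (smooth) interpolant forces $p \gtrsim nd$, i.e., $d$ times more parameters than the $p \approx n$ needed for mere interpolation.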
    Policy Learning with Competing Agents. (arXiv:2204.01884v2 [stat.ML] UPDATED)
    Decision makers often aim to learn a treatment assignment policy under a capacity constraint on the number of agents that they can treat. When agents can respond strategically to such policies, competition arises, complicating the estimation of the effect of the policy. In this paper, we study capacity-constrained treatment assignment in the presence of such interference. We consider a dynamic model where the decision maker allocates treatments at each time step and heterogeneous agents myopically best respond to the previous treatment assignment policy. When the number of agents is large but finite, we show that the threshold for receiving treatment under a given policy converges to the policy's mean-field equilibrium threshold. Based on this result, we develop a consistent estimator for the policy effect. In simulations and a semi-synthetic experiment with data from the National Education Longitudinal Study of 1988, we demonstrate that this estimator can be used for learning capacity-constrained policies in the presence of strategic behavior.
    Deep Reinforcement Learning for Heat Pump Control. (arXiv:2212.12716v1 [cs.LG])
    Heating in private households is a major contributor to today's emissions. Heat pumps are a promising alternative for heat generation and a key technology for achieving the goals of the German energy transition and reducing dependence on fossil fuels. Today, the majority of heat pumps in the field are controlled by a simple heating curve, a naive mapping from the current outdoor temperature to a control action. A more advanced control approach is model predictive control (MPC), which has been applied to heat pump control in multiple research works. However, MPC is heavily dependent on the building model, which has several disadvantages. Motivated by this and by recent breakthroughs in the field, this work applies deep reinforcement learning (DRL) to heat pump control in a simulated environment. Through a comparison with MPC, we show that DRL can be applied in a model-free manner to achieve MPC-like performance. This work extends prior work applying DRL to building heating operation by performing an in-depth analysis of the learned control strategies and by giving a detailed comparison of the two state-of-the-art control methods.
    POLAR: A Polynomial Arithmetic Framework for Verifying Neural-Network Controlled Systems. (arXiv:2106.13867v5 [eess.SY] UPDATED)
    We present POLAR, a polynomial arithmetic-based framework for efficient bounded-time reachability analysis of neural-network controlled systems (NNCSs). Existing approaches that leverage the standard Taylor Model (TM) arithmetic for approximating the neural-network controller cannot deal with non-differentiable activation functions and suffer from rapid explosion of the remainder when propagating the TMs. POLAR overcomes these shortcomings by integrating TM arithmetic with \textbf{Bernstein B{\'e}zier Form} and \textbf{symbolic remainder}. The former enables TM propagation across non-differentiable activation functions and local refinement of TMs, and the latter reduces error accumulation in the TM remainder for linear mappings in the network. Experimental results show that POLAR significantly outperforms the current state-of-the-art tools in terms of both efficiency and tightness of the reachable set overapproximation. The source code can be found at https://github.com/ChaoHuang2018/POLAR_Tool.
    Data Redaction from Pre-trained GANs. (arXiv:2206.14389v2 [cs.LG] UPDATED)
    Large pre-trained generative models are known to occasionally output undesirable samples, which undermines their trustworthiness. The common way to mitigate this is to re-train them differently from scratch using different data or different regularization -- which uses a lot of computational resources and does not always fully address the problem. In this work, we take a different, more compute-friendly approach and investigate how to post-edit a model after training so that it ''redacts'', or refrains from outputting certain kinds of samples. We show that redaction is a fundamentally different task from data deletion, and data deletion may not always lead to redaction. We then consider Generative Adversarial Networks (GANs), and provide three different algorithms for data redaction that differ on how the samples to be redacted are described. Extensive evaluations on real-world image datasets show that our algorithms out-perform data deletion baselines, and are capable of redacting data while retaining high generation quality at a fraction of the cost of full re-training.
    Visualizing Information Bottleneck through Variational Inference. (arXiv:2212.12667v1 [cs.LG])
    The Information Bottleneck theory provides a theoretical and computational framework for finding approximate minimum sufficient statistics. Analysis of the Stochastic Gradient Descent (SGD) training of a neural network on a toy problem has shown the existence of two phases, fitting and compression. In this work, we analyze the SGD training process of a Deep Neural Network on MNIST classification and confirm the existence of two phases of SGD training. We also propose a setup for estimating the mutual information for a Deep Neural Network through Variational Inference.
    Sitting Posture Recognition Using a Spiking Neural Network. (arXiv:2212.12908v1 [eess.SP])
    To increase the quality of citizens' lives, we designed a personalized smart chair system to recognize sitting behaviors. The system receives surface pressure data from the designed sensor and provides feedback to guide the user towards proper sitting postures. We used a liquid state machine and a logistic regression classifier to construct a spiking neural network (SNN) for classifying 15 sitting postures. To allow this system to read our pressure data into the spiking neurons, we designed an algorithm that encodes map-like data into cosine-rank sparsity data. Experimental results on 15 sitting postures from 19 participants show that the prediction precision of our SNN is 88.52%.
    Packing Privacy Budget Efficiently. (arXiv:2212.13228v1 [cs.CR])
    Machine learning (ML) models can leak information about users, and differential privacy (DP) provides a rigorous way to bound that leakage under a given budget. This DP budget can be regarded as a new type of compute resource in workloads of multiple ML models training on user data. Once it is used, the DP budget is forever consumed. Therefore, it is crucial to allocate it most efficiently to train as many models as possible. This paper presents a scheduler for privacy budgets that optimizes for efficiency. We formulate privacy scheduling as a new type of multidimensional knapsack problem, called privacy knapsack, which maximizes DP budget efficiency. We show that privacy knapsack is NP-hard, hence practical algorithms are necessarily approximate. We develop an approximation algorithm for privacy knapsack, DPK, and evaluate it on microbenchmarks and on a new, synthetic private-ML workload we developed from the Alibaba ML cluster trace. We show that DPK: (1) often approaches the efficiency-optimal schedule, (2) consistently schedules more tasks compared to a state-of-the-art privacy scheduling algorithm that focused on fairness (1.3-1.7x in Alibaba, 1.0-2.6x in microbenchmarks), but (3) sacrifices some level of fairness for efficiency. Therefore, using DPK, DP ML operators should be able to train more models on the same amount of user data while offering the same privacy guarantee to their users.
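    To make the scheduling problem concrete, here is a minimal greedy heuristic for a privacy-knapsack-style allocation (an illustrative baseline, not the paper's DPK algorithm, and it treats the budget as one-dimensional rather than multidimensional):

        def greedy_privacy_schedule(tasks, budget):
            # tasks: list of (task_id, value, eps_demand); budget: total epsilon.
            # Schedule tasks greedily by value per unit of DP budget demanded.
            order = sorted(tasks, key=lambda t: t[1] / t[2], reverse=True)
            scheduled, remaining = [], budget
            for tid, value, eps in order:
                if eps <= remaining:
                    scheduled.append(tid)
                    remaining -= eps
            return scheduled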
    Linear convergence of a policy gradient method for some finite horizon continuous time control problems. (arXiv:2203.11758v3 [math.OC] UPDATED)
    Despite its popularity in the reinforcement learning community, a provably convergent policy gradient method for continuous space-time control problems with nonlinear state dynamics has been elusive. This paper proposes proximal gradient algorithms for feedback controls of finite-time horizon stochastic control problems. The state dynamics are nonlinear diffusions with control-affine drift, and the cost functions are nonconvex in the state and nonsmooth in the control. The system noise can degenerate, which allows for deterministic control problems as special cases. We prove under suitable conditions that the algorithm converges linearly to a stationary point of the control problem, and is stable with respect to policy updates by approximate gradient steps. The convergence result justifies the recent reinforcement learning heuristics that adding entropy regularization or a fictitious discount factor to the optimization objective accelerates the convergence of policy gradient methods. The proof exploits careful regularity estimates of backward stochastic differential equations.
    Quaternion Backpropagation. (arXiv:2212.13082v1 [cs.LG])
    Quaternion-valued neural networks have experienced rising popularity and interest from researchers in recent years, whereby the derivatives with respect to quaternions needed for optimization are calculated as the sum of the partial derivatives with respect to the real and imaginary parts. However, we show that the product and chain rules do not hold with this approach. We solve this by employing the GHR calculus and derive quaternion backpropagation based on it. Furthermore, we experimentally verify the functionality of the derived quaternion backpropagation.
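    For context, quaternion arithmetic is non-commutative, which is exactly where naive componentwise differentiation runs into trouble; below is a minimal sketch of the Hamilton product such networks build on (illustrative; the paper's GHR-calculus derivation itself is not reproduced here):

        import numpy as np

        def hamilton_product(p, q):
            # Quaternion multiplication for (w, x, y, z) components.
            # Note hamilton_product(p, q) != hamilton_product(q, p) in general.
            w1, x1, y1, z1 = p
            w2, x2, y2, z2 = q
            return np.array([
                w1*w2 - x1*x2 - y1*y2 - z1*z2,
                w1*x2 + x1*w2 + y1*z2 - z1*y2,
                w1*y2 - x1*z2 + y1*w2 + z1*x2,
                w1*z2 + x1*y2 - y1*x2 + z1*w2,
            ])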
    Saliency-Augmented Memory Completion for Continual Learning. (arXiv:2212.13242v1 [cs.LG])
    Continual Learning is considered a key step toward next-generation Artificial Intelligence. Among various methods, replay-based approaches that maintain and replay a small episodic memory of previous samples are one of the most successful strategies against catastrophic forgetting. However, since forgetting is inevitable given bounded memory and unbounded tasks, how to forget is a problem continual learning must address. Therefore, beyond simply avoiding catastrophic forgetting, an under-explored issue is how to reasonably forget while ensuring the merits of human memory, including (1) storage efficiency, (2) generalizability, and (3) some interpretability. To achieve these simultaneously, our paper proposes a new saliency-augmented memory completion framework for continual learning, inspired by recent discoveries in memory completion separation in cognitive neuroscience. Specifically, we innovatively propose to store the part of the image most important to the tasks in episodic memory by saliency map extraction and memory encoding. When learning new tasks, previous data from memory are inpainted by an adaptive data generation module, which is inspired by how humans complete episodic memory. The module's parameters are shared across all tasks, and it can be jointly trained with a continual learning classifier via bilevel optimization. Extensive experiments on several continual learning and image classification benchmarks demonstrate the proposed method's effectiveness and efficiency.
    A photonic chip-based machine learning approach for the prediction of molecular properties. (arXiv:2203.02285v2 [cs.ET] UPDATED)
    Machine learning methods have revolutionized the discovery process of new molecules and materials. However, the intensive training process of neural networks for molecules with ever-increasing complexity has resulted in exponential growth in computation cost, leading to long simulation time and high energy consumption. Photonic chip technology offers an alternative platform for implementing neural networks with faster data processing and lower energy usage compared to digital computers. Photonics technology is naturally capable of implementing complex-valued neural networks at no additional hardware cost. Here, we demonstrate the capability of photonic neural networks for predicting the quantum mechanical properties of molecules. To the best of our knowledge, this work is the first to harness photonic technology for machine learning applications in computational chemistry and molecular sciences, such as drug discovery and materials design. We further show that multiple properties can be learned simultaneously in a photonic chip via a multi-task regression learning algorithm, which is also a first of its kind, as most previous works focus on implementing a network for the classification task.
    Mining the Factor Zoo: Estimation of Latent Factor Models with Sufficient Proxies. (arXiv:2212.12845v1 [stat.ME])
    Latent factor model estimation typically relies on either using domain knowledge to manually pick several observed covariates as factor proxies, or purely conducting multivariate analysis such as principal component analysis. However, the former approach may suffer from bias, while the latter cannot incorporate additional information. We propose to bridge these two approaches while allowing the number of factor proxies to diverge, and hence make the latent factor model estimation robust, flexible, and statistically more accurate. As a bonus, the number of factors is also allowed to grow. At the heart of our method is a penalized reduced rank regression to combine information. To further deal with heavy-tailed data, a computationally attractive penalized robust reduced rank regression method is proposed. We establish faster rates of convergence compared with the benchmark. Extensive simulations and real examples are used to illustrate the advantages.
    Learning Generalizable Representations for Reinforcement Learning via Adaptive Meta-learner of Behavioral Similarities. (arXiv:2212.13088v1 [cs.LG])
    How to learn an effective reinforcement learning-based model for control tasks from high-level visual observations is a practical and challenging problem. A key to solving this problem is to learn low-dimensional state representations from observations, from which an effective policy can be learned. In order to boost the learning of state encoding, recent works focus on capturing behavioral similarities between state representations or applying data augmentation to visual observations. In this paper, we propose a novel meta-learner-based framework for representation learning regarding behavioral similarities for reinforcement learning. Specifically, our framework encodes the high-dimensional observations into two decomposed embeddings regarding reward and dynamics in a Markov Decision Process (MDP). A pair of meta-learners are developed, one of which quantifies the reward similarity and the other quantifies dynamics similarity over the correspondingly decomposed embeddings. The meta-learners are self-learned to update the state embeddings by approximating two disjoint terms in the on-policy bisimulation metric. To incorporate the reward and dynamics terms, we further develop a strategy to adaptively balance their impacts based on different tasks or environments. We empirically demonstrate that our proposed framework outperforms state-of-the-art baselines on several benchmarks, including the conventional DM Control Suite, the Distracting DM Control Suite, and a self-driving task in CARLA.
    Off-Policy Reinforcement Learning with Loss Function Weighted by Temporal Difference Error. (arXiv:2212.13175v1 [cs.LG])
    Training agents via off-policy deep reinforcement learning (RL) requires a large memory, named replay memory, that stores past experiences used for learning. These experiences are sampled, uniformly or non-uniformly, to create the batches used for training. When calculating the loss function, off-policy algorithms assume that all samples are of the same importance. In this paper, we hypothesize that training can be enhanced by assigning different importance to each experience based on its temporal-difference (TD) error, directly in the training objective. We propose a novel method that introduces a weighting factor for each experience when calculating the loss function at the learning stage. In addition to improving convergence speed when used with uniform sampling, the method can be combined with prioritization methods for non-uniform sampling. Combining the proposed method with prioritization methods improves sampling efficiency while increasing the performance of TD-based off-policy RL algorithms. The effectiveness of the proposed method is demonstrated by experiments in six environments of the OpenAI Gym suite. The experimental results demonstrate that the proposed method achieves a 33%-76% reduction in convergence time in three environments, and an 11% increase in returns together with a 3%-10% increase in success rate in the other three environments.
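    A minimal sketch of a TD-error-weighted loss of the kind described above (the weighting function here is an illustrative choice; the paper's exact weighting scheme may differ):

        import torch

        def td_weighted_loss(q_pred, q_target, alpha=0.5):
            # Weight each sample's squared TD error by a power of its own
            # |TD error|, so high-error transitions contribute more.
            td = q_target - q_pred
            w = (td.abs().detach() + 1e-6) ** alpha
            w = w / w.mean()              # keep the overall loss scale stable
            return (w * td.pow(2)).mean()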
    Designing Compact Features for Remote Stroke Rehabilitation Monitoring using Wearable Accelerometers. (arXiv:2009.08798v3 [eess.SP] UPDATED)
    Stroke is known as a major global health problem, and for stroke survivors it is key to monitor the recovery levels. However, traditional stroke rehabilitation assessment methods (such as the popular clinical assessment) can be subjective and expensive, and it is also less convenient for patients to visit clinics at a high frequency. To address this issue, in this work, based on wearable sensing and machine learning techniques, we develop an automated system that can predict the assessment score in an objective manner. With wrist-worn sensors, accelerometer data is collected from 59 stroke survivors in free-living environments for a duration of 8 weeks, and we map the week-wise accelerometer data (3 days per week) to the assessment score by developing a signal-processing and predictive-modeling pipeline. To achieve this, we propose two new types of features, which can encode the rehabilitation information from both the paralysed and non-paralysed sides while suppressing high-level noise such as irrelevant daily activities. Based on the proposed features, we further develop the longitudinal mixed-effects model with Gaussian process prior (LMGP), which can model the random effects caused by different subjects and time slots (during the 8 weeks). Comprehensive experiments are conducted to evaluate our system on both acute and chronic patients, and the promising results suggest its effectiveness.
    Boosting Urban Traffic Speed Prediction via Integrating Implicit Spatial Correlations. (arXiv:2212.12932v1 [cs.LG])
    Urban traffic speed prediction aims to estimate the future traffic speed for improving urban transportation services. Enormous efforts have been made to exploit the spatial correlations and temporal dependencies of traffic speed evolving patterns by leveraging explicit spatial relations (geographical proximity) through pre-defined geographical structures ({\it e.g.}, region grids or road networks). While achieving promising results, current traffic speed prediction methods still suffer from ignoring implicit spatial correlations (interactions), which cannot be captured by grid/graph convolutions. To tackle the challenge, we propose a generic model for enabling the current traffic speed prediction methods to preserve implicit spatial correlations. Specifically, we first develop a Dual-Transformer architecture, including a Spatial Transformer and a Temporal Transformer. The Spatial Transformer automatically learns the implicit spatial correlations across the road segments beyond the boundary of geographical structures, while the Temporal Transformer aims to capture the dynamic changing patterns of the implicit spatial correlations. Then, to further integrate both explicit and implicit spatial correlations, we propose a distillation-style learning framework, in which the existing traffic speed prediction methods are considered as the teacher model, and the proposed Dual-Transformer architecture is considered as the student model. Extensive experiments on three real-world datasets indicate significant improvements of our proposed framework over the existing methods.
    Neural Structure Fields with Application to Crystal Structure Autoencoders. (arXiv:2212.13120v1 [cond-mat.mtrl-sci])
    Representing crystal structures of materials to facilitate determining them via neural networks is crucial for enabling machine-learning applications involving crystal structure estimation. Among these applications, the inverse design of materials can contribute to next-generation methods that explore materials with desired properties without relying on luck or serendipity. We propose neural structure fields (NeSF) as an accurate and practical approach for representing crystal structures using neural networks. Inspired by the concepts of vector fields in physics and implicit neural representations in computer vision, the proposed NeSF considers a crystal structure as a continuous field rather than as a discrete set of atoms. Unlike existing grid-based discretized spatial representations, the NeSF overcomes the tradeoff between spatial resolution and computational complexity and can represent any crystal structure. To evaluate the NeSF, we propose an autoencoder of crystal structures that can recover various crystal structures, such as those of perovskite structure materials and cuprate superconductors. Extensive quantitative results demonstrate the superior performance of the NeSF compared with the existing grid-based approach.
    Higher order organizational features can distinguish protein interaction networks of disease classes: a case study of neoplasms and neurological diseases. (arXiv:2212.13171v1 [q-bio.MN])
    Neoplasms (NPs) and neurological diseases and disorders (NDDs) are amongst the major classes of diseases underlying deaths of a disproportionate number of people worldwide. To determine whether there exist distinctive features in the local wiring patterns of protein interactions emerging at the onset of a disease belonging to either of these two classes, we examined 112 and 175 protein interaction networks belonging to NPs and NDDs, respectively. Orbit usage profiles (OUPs) for each of these networks were enumerated by investigating the networks' local topology. 56 non-redundant OUPs (nrOUPs) were derived and used as network features for classification between these two disease classes. Four machine learning classifiers, namely k-nearest neighbour (KNN), support vector machine (SVM), deep neural network (DNN) and random forest (RF), were trained on these data. The DNN obtained the highest average AUPRC (0.988) among these classifiers. DNNs developed on node2vec embeddings and on the proposed nrOUP embeddings were compared using 5-fold cross-validation on the basis of the average values of six performance measures, viz., AUPRC, Accuracy, Sensitivity, Specificity, Precision and MCC. The nrOUP-based classifier performed better on all six of these performance measures.
    FMM-Net: neural network architecture based on the Fast Multipole Method. (arXiv:2212.12899v1 [math.NA])
    In this paper, we propose a new neural network architecture based on the H2 matrix. Even though networks with H2-inspired architecture already exist, our approach is designed to reduce memory costs and improve performance by taking into account the sparsity template of the H2 matrix. In numerical comparisons with alternative neural networks, including the known H2-based ones, our architecture proved beneficial in terms of performance, memory, and scalability.
    Assessing thermal imagery integration into object detection methods on ground-based and air-based collection platforms. (arXiv:2212.12616v1 [cs.CV])
    Object detection models commonly deployed on uncrewed aerial systems (UAS) focus on identifying objects in the visible spectrum using Red-Green-Blue (RGB) imagery. However, there is growing interest in fusing RGB with thermal long wave infrared (LWIR) images to increase the performance of object detection machine learning (ML) models. Currently, LWIR ML models have received less research attention, especially for both ground- and air-based platforms, leading to a lack of baseline performance metrics evaluating LWIR, RGB and LWIR-RGB fused object detection models. Therefore, this research contributes such quantitative metrics to the literature. The results show that the ground-based blended RGB-LWIR model exhibited superior performance compared to the RGB or LWIR approaches, achieving a mAP of 98.4%. Additionally, the blended RGB-LWIR model was also the only object detection model to work in both day and night conditions, providing superior operational capabilities. This research additionally contributes a novel labelled training dataset of 12,600 images for RGB, LWIR, and RGB-LWIR fused imagery, collected from ground-based and air-based platforms, enabling further multispectral machine-driven object detection research.
    Improved Kernel Alignment Regret Bound for Online Kernel Learning. (arXiv:2212.12989v1 [cs.LG])
    In this paper, we improve the kernel alignment regret bound for online kernel learning in the regime of the Hinge loss function. The previous algorithm achieves a regret of $O((\mathcal{A}_TT\ln{T})^{\frac{1}{4}})$ at a computational complexity (space and per-round time) of $O(\sqrt{\mathcal{A}_TT\ln{T}})$, where $\mathcal{A}_T$ is called \textit{kernel alignment}. We propose an algorithm whose regret bound and computational complexity are better than previous results. Our results depend on the decay rate of the eigenvalues of the kernel matrix. If the eigenvalues of the kernel matrix decay exponentially, then our algorithm enjoys a regret of $O(\sqrt{\mathcal{A}_T})$ at a computational complexity of $O(\ln^2{T})$. Otherwise, our algorithm enjoys a regret of $O((\mathcal{A}_TT)^{\frac{1}{4}})$ at a computational complexity of $O(\sqrt{\mathcal{A}_TT})$. We extend our algorithm to batch learning and obtain a $O(\frac{1}{T}\sqrt{\mathbb{E}[\mathcal{A}_T]})$ excess risk bound, which improves the previous $O(1/\sqrt{T})$ bound.
    Diagnosis of COVID-19 based on Chest Radiography. (arXiv:2212.13032v1 [eess.IV])
    The Coronavirus disease 2019 (COVID-19) was first identified in Wuhan, China, in early December 2019 and has since become a pandemic. When COVID-19 patients undergo radiography examination, radiologists can observe the presence of radiographic abnormalities in their chest X-ray (CXR) images. In this study, a deep convolutional neural network (CNN) model was proposed to aid radiologists in diagnosing COVID-19 patients. First, this work conducted a comparative study on the performance of modified VGG-16, ResNet-50 and DenseNet-121 in classifying CXR images into normal, COVID-19 and viral pneumonia. Then, the impact of image augmentation on the classification results was evaluated. The publicly available COVID-19 Radiography Database was used throughout this study. After comparison, ResNet-50 achieved the highest accuracy with 95.88%. Next, after training ResNet-50 with a rotation, translation, horizontal flip, intensity shift and zoom augmented dataset, the accuracy dropped to 80.95%. Furthermore, an ablation study on the effect of image augmentation on the classification results found that the combination of rotation and intensity-shift augmentation achieved an accuracy higher than the baseline, namely 96.14%. Finally, ResNet-50 with rotation and intensity-shift augmentation performed the best and was proposed as the final classification model in this work. These findings demonstrate that the proposed classification model can provide promising results for COVID-19 diagnosis.
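    For concreteness, the winning rotation-plus-intensity-shift augmentation could be expressed as follows (a hedged sketch using torchvision; the exact augmentation ranges and implementation used in the study are not specified in the abstract):

        from torchvision import transforms

        train_tfms = transforms.Compose([
            transforms.RandomRotation(degrees=15),     # rotation
            transforms.ColorJitter(brightness=0.2),    # stand-in for intensity shift
            transforms.ToTensor(),
        ])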
    Statistical Mechanics of Generalization In Graph Convolution Networks. (arXiv:2212.13069v1 [cs.LG])
    Graph neural networks (GNN) have become the default machine learning model for relational datasets, including protein interaction networks, biological neural networks, and scientific collaboration graphs. We use tools from statistical physics and random matrix theory to precisely characterize generalization in simple graph convolution networks on the contextual stochastic block model. The derived curves are phenomenologically rich: they explain the distinction between learning on homophilic and heterophilic graphs and they predict double descent whose existence in GNNs has been questioned by recent work. Our results are the first to accurately explain the behavior not only of a stylized graph learning model but also of complex GNNs on messy real-world datasets. To wit, we use our analytic insights about homophily and heterophily to improve performance of state-of-the-art graph neural networks on several heterophilic benchmarks by a simple addition of negative self-loop filters.
    Human Activity Recognition from Wi-Fi CSI Data Using Principal Component-Based Wavelet CNN. (arXiv:2212.13161v1 [cs.CV])
    Human Activity Recognition (HAR) is an emerging technology with several applications in surveillance, security, and healthcare sectors. Noninvasive HAR systems based on Wi-Fi Channel State Information (CSI) signals can be developed by leveraging the rapid growth of ubiquitous Wi-Fi technologies and the correlation between CSI dynamics and body motions. In this paper, we propose the Principal Component-based Wavelet Convolutional Neural Network (PCWCNN) -- a novel approach that offers robustness and efficiency for practical real-time applications. Our proposed method incorporates two efficient preprocessing algorithms -- Principal Component Analysis (PCA) and the Discrete Wavelet Transform (DWT). We employ an adaptive activity segmentation algorithm that is accurate and computationally light. Additionally, we use the Wavelet CNN for classification, a deep convolutional network analogous to the well-studied ResNet and DenseNet networks. We empirically show that our proposed PCWCNN model performs very well on a real dataset, outperforming existing approaches.
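    A minimal sketch of the PCA-plus-DWT preprocessing stage described above (parameter choices such as the number of components, wavelet family, and decomposition level are illustrative, not the paper's settings):

        import numpy as np
        import pywt
        from sklearn.decomposition import PCA

        def preprocess_csi(csi, n_components=5, wavelet="db4", level=3):
            # csi: (time, subcarriers). PCA reduces noise across subcarriers;
            # a discrete wavelet transform then extracts multi-scale features
            # from each retained principal component.
            comps = PCA(n_components=n_components).fit_transform(csi)
            feats = [np.concatenate(pywt.wavedec(comps[:, i], wavelet, level=level))
                     for i in range(n_components)]
            return np.stack(feats)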
    Application of Unsupervised Domain Adaptation for Structural MRI Analysis. (arXiv:2212.12986v1 [eess.IV])
    The primary goal of this work is to study the effectiveness of an unsupervised domain adaptation approach for various applications such as binary classification and anomaly detection in the context of Alzheimer's disease (AD) detection for the OASIS datasets. We also explore image reconstruction and image synthesis for analyzing and generating 3D structural MRI data to establish performance benchmarks for anomaly detection. We successfully demonstrate that domain adaptation improves the performance of AD detection when implemented in both supervised and unsupervised settings. Additionally, the proposed methodology achieves state-of-the-art performance for binary classification on the OASIS-1 dataset.
    Inverse Multiobjective Optimization Through Online Learning. (arXiv:2010.06140v2 [cs.LG] UPDATED)
    We study the problem of learning the objective functions or constraints of a multiobjective decision making model, based on a set of sequentially arrived decisions. In particular, these decisions might not be exact: they may carry measurement noise or be generated under the bounded rationality of decision makers. In this paper, we propose a general online learning framework to deal with this learning problem using inverse multiobjective optimization. More precisely, we develop two online learning algorithms with implicit update rules which can handle noisy data. Numerical results show that both algorithms can learn the parameters with great accuracy and are robust to noise.
    Towards Improved Prediction of Ship Performance: A Comparative Analysis on In-service Ship Monitoring Data for Modeling the Speed-Power Relation. (arXiv:2212.13061v1 [cs.LG])
    Accurate modeling of ship performance is crucial for the shipping industry to optimize fuel consumption and subsequently reduce emissions. However, predicting the speed-power relation in real-world conditions remains a challenge. In this study, we used in-service monitoring data from multiple vessels with different hull shapes to compare the accuracy of data-driven machine learning (ML) algorithms to traditional methods for assessing ship performance. Our analysis consists of two main parts: (1) a comparison of sea trial curves with calm-water curves fitted on operational data, and (2) a benchmark of multiple added wave resistance theories with an ML-based approach. Our results showed that a simple neural network outperformed established semi-empirical formulas following first principles. The neural network only required operational data as input, while the traditional methods required extensive ship particulars that are often unavailable. These findings suggest that data-driven algorithms may be more effective for predicting ship performance in practical applications.
    Doubly Smoothed GDA: Global Convergent Algorithm for Constrained Nonconvex-Nonconcave Minimax Optimization. (arXiv:2212.12978v1 [math.OC])
    Nonconvex-nonconcave minimax optimization has been the focus of intense research over the last decade due to its broad applications in machine learning and operations research. Unfortunately, most existing algorithms cannot be guaranteed to converge and always suffer from limit cycles. Their global convergence relies on certain conditions that are difficult to check, including but not limited to the global Polyak-\L{}ojasiewicz condition, the existence of a solution satisfying the weak Minty variational inequality, and the $\alpha$-interaction dominant condition. In this paper, we develop the first provably convergent algorithm, called the doubly smoothed gradient descent ascent method, which gets rid of the limit cycle without requiring any additional conditions. We further show that the algorithm has an iteration complexity of $\mathcal{O}(\epsilon^{-4})$ for finding a game stationary point, which matches the best iteration complexity of single-loop algorithms under nonconvex-concave settings. The algorithm presented here opens up a new path for designing provable algorithms for nonconvex-nonconcave minimax optimization problems.
    Modeling Nonlinear Dynamics in Continuous Time with Inductive Biases on Decay Rates and/or Frequencies. (arXiv:2212.13033v1 [stat.ML])
    We propose a neural network-based model for nonlinear dynamics in continuous time that can impose inductive biases on decay rates and/or frequencies. Inductive biases are helpful for training neural networks, especially when training data are small. The proposed model is based on Koopman operator theory, where the decay rate and frequency information is used by restricting the eigenvalues of the Koopman operator that describe linear evolution in a Koopman space. We use neural networks to find an appropriate Koopman space; they are trained by minimizing multi-step forecasting and backcasting errors using irregularly sampled time-series data. Experiments on various time-series datasets demonstrate that the proposed method achieves higher forecasting performance given a single short training sequence than the existing methods.
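    One way to realize such an eigenvalue restriction is sketched below (a hedged illustration, not the authors' code: the nonnegative-decay parametrization via softplus and the fixed time step are assumptions):

        import torch

        def constrained_eigenvalues(raw_decay, freq, dt):
            # Continuous-time Koopman eigenvalues mu = -decay + i*omega, with
            # decay kept nonnegative via softplus so modes cannot blow up;
            # exp(mu * dt) gives the discrete-time linear evolution factors.
            decay = torch.nn.functional.softplus(raw_decay)
            mu = torch.complex(-decay, freq)
            return torch.exp(mu * dt)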
    Rapid Extraction of Respiratory Waveforms from Photoplethysmography: A Deep Encoder Approach. (arXiv:2212.12578v1 [eess.IV])
    Much of the information of breathing is contained within the photoplethysmography (PPG) signal, through changes in venous blood flow, heart rate and stroke volume. We aim to leverage this fact by employing a novel deep learning framework based on a repurposed convolutional autoencoder. Our model aims to encode all of the relevant respiratory information contained within the photoplethysmography waveform, and decode it into a waveform that is similar to a gold-standard respiratory reference. The model is employed on two photoplethysmography datasets, namely Capnobase and BIDMC. We show that the model is capable of producing respiratory waveforms that approach the gold standard, while in turn producing state-of-the-art respiratory rate estimates. We also show that when it comes to capturing more advanced respiratory waveform characteristics such as duty cycle, our model is for the most part unsuccessful. A suggested reason for this, in light of a previous study on in-ear PPG, is that the respiratory variations in finger-PPG are far weaker compared with other recording locations. Importantly, our model can perform these waveform estimates in a fraction of a millisecond, giving it the capacity to produce over 6 hours of respiratory waveforms in a single second. Moreover, we attempt to interpret the behaviour of the kernel weights within the model, showing that in part our model intuitively selects different breathing frequencies. The model proposed in this work could help to improve the usefulness of consumer PPG-based wearables for medical applications, where detailed respiratory information is required.
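    In the same spirit, a compact 1D encoder-decoder that maps a PPG window to a respiratory waveform might look as follows (layer counts, kernel sizes, and channel widths here are illustrative, not the architecture used in the paper):

        import torch.nn as nn

        # Encoder halves the temporal resolution twice; the decoder mirrors it,
        # so an even-length PPG window maps to a respiratory waveform of the
        # same length.
        ppg2resp = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=15, stride=2, padding=7), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=15, stride=2, padding=7), nn.ReLU(),
            nn.ConvTranspose1d(32, 16, kernel_size=16, stride=2, padding=7), nn.ReLU(),
            nn.ConvTranspose1d(16, 1, kernel_size=16, stride=2, padding=7),
        )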
    Unsupervised Instance and Subnetwork Selection for Network Data. (arXiv:2212.12771v1 [cs.LG])
    Unlike tabular data, features in network data are interconnected within a domain-specific graph. Examples of this setting include gene expression overlaid on a protein interaction network (PPI) and user opinions in a social network. Network data is typically high-dimensional (large number of nodes) and often contains outlier snapshot instances and noise. In addition, it is often non-trivial and time-consuming to annotate instances with global labels (e.g., disease or normal). How can we jointly select discriminative subnetworks and representative instances for network data without supervision? We address these challenges within an unsupervised framework for joint subnetwork and instance selection in network data, called UISS, via a convex self-representation objective. Given an unlabeled network dataset, UISS identifies representative instances while ignoring outliers. It outperforms state-of-the-art baselines on both discriminative subnetwork selection and representative instance selection, achieving up to 10% accuracy improvement on all real-world datasets we use for evaluation. When employed for exploratory analysis of RNA-seq network samples from multiple studies, it produces interpretable and informative summaries.
    Author Name Disambiguation via Heterogeneous Network Embedding from Structural and Semantic Perspectives. (arXiv:2212.12715v1 [cs.LG])
    Name ambiguity is common in academic digital libraries, such as multiple authors having the same name. This creates challenges for academic data management and analysis; thus, name disambiguation becomes necessary. The procedure of name disambiguation is to divide publications with the same name into different groups, each group belonging to a unique author. The large amount of attribute information in publications makes traditional methods fall into the quagmire of feature selection. These methods always select attributes manually and weight them equally, which usually has a negative impact on accuracy. The proposed method is mainly based on representation learning for heterogeneous networks and clustering, and exploits self-attention technology to solve the problem. The representation of publications is a synthesis of structural and semantic representations. The structural representation is obtained by meta-path-based sampling and a skip-gram-based embedding method, and meta-path-level attention is introduced to automatically learn the weight of each feature. The semantic representation is generated using NLP tools. Our proposal performs better in terms of name disambiguation accuracy compared with baselines, and ablation experiments demonstrate the improvement from feature selection and meta-path-level attention in our method. The experimental results show the superiority of our new method in capturing the most attributes from publications and reducing the impact of redundant information.
    Reconstructing Kernel-based Machine Learning Force Fields with Super-linear Convergence. (arXiv:2212.12737v1 [physics.chem-ph])
    Kernel machines have sustained continuous progress in the field of quantum chemistry. In particular, they have proven to be successful in the low-data regime of force field reconstruction. This is because many physical invariances and symmetries can be incorporated into the kernel function to compensate for much larger datasets. So far, the scalability of this approach has however been hindered by its cubic runtime in the number of training points. While it is known that iterative Krylov subspace solvers can overcome these burdens, they crucially rely on effective preconditioners, which are elusive in practice. Practical preconditioners need to be computationally efficient and numerically robust at the same time. Here, we consider the broad class of Nystr\"om-type methods to construct preconditioners based on successively more sophisticated low-rank approximations of the original kernel matrix, each of which provides a different set of computational trade-offs. All considered methods estimate the relevant subspace spanned by the kernel matrix columns using different strategies to identify a representative set of inducing points. Our comprehensive study covers the full spectrum of approaches, starting from naive random sampling to leverage score estimates and incomplete Cholesky factorizations, up to exact SVD decompositions.
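    A minimal sketch of a Nystr\"om-type preconditioner for a regularized kernel solve $(K + \lambda I)x = y$, of the general flavor studied above (the inducing-point selection strategy is left abstract here, and this is an illustration rather than the paper's implementation):

        import numpy as np

        def nystrom_preconditioner(K, idx, lam):
            # Approximate K by C W^{-1} C^T with C = K[:, idx], W = K[idx, idx],
            # then invert the low-rank-plus-diagonal surrogate via the
            # Woodbury identity; the returned function applies P^{-1} to y.
            C = K[:, idx]                 # n x m
            W = K[np.ix_(idx, idx)]       # m x m
            M = lam * W + C.T @ C         # Woodbury core matrix
            def apply(y):
                return (y - C @ np.linalg.solve(M, C.T @ y)) / lam
            return apply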
    Streaming Traffic Flow Prediction Based on Continuous Reinforcement Learning. (arXiv:2212.12767v1 [stat.ML])
    Traffic flow prediction is an important part of smart transportation. The goal is to predict future traffic conditions based on historical data recorded by sensors and the traffic network. As cities continue to develop, parts of the transportation network are added or modified. How to accurately predict an expanding and evolving long-term streaming network is of great significance. To this end, we propose a new simulation-based criterion that considers teaching autonomous agents to mimic sensor patterns, planning their next visit based on the sensor's profile (e.g., traffic, speed, occupancy). The data recorded by the sensor is most accurate when the agent can perfectly simulate the sensor's activity pattern. We propose to formulate the problem as a continuous reinforcement learning task, where the agent is the next flow value predictor, the action is the next time-series flow value in the sensor, and the environment state is a dynamically fused representation of the sensor and transportation network. Actions taken by the agent change the environment, which in turn forces the agent's model to update, while the agent further explores changes in the dynamic traffic network, which helps the agent predict its next visit more accurately. Therefore, we develop a strategy in which sensors and traffic networks update each other and incorporate temporal context to quantify state representations evolving over time.
    Data Collection and Quality Challenges in Deep Learning: A Data-Centric AI Perspective. (arXiv:2112.06409v3 [cs.LG] UPDATED)
    Data-centric AI is at the center of a fundamental shift in software engineering where machine learning becomes the new software, powered by big data and computing infrastructure. Here, software engineering needs to be rethought, with data becoming a first-class citizen on par with code. One striking observation is that a significant portion of the machine learning process is spent on data preparation. Without good data, even the best machine learning algorithms cannot perform well. As a result, data-centric AI practices are now becoming mainstream. Unfortunately, many datasets in the real world are small, dirty, biased, and even poisoned. In this survey, we study the research landscape for data collection and data quality primarily for deep learning applications. Data collection is important because recent deep learning approaches need less feature engineering but instead require large amounts of data. For data quality, we study data validation, cleaning, and integration techniques. Even if the data cannot be fully cleaned, we can still cope with imperfect data during model training using robust model training techniques. In addition, while bias and fairness have been less studied in traditional data management research, these issues become essential topics in modern machine learning applications. We thus study fairness measures and unfairness mitigation techniques that can be applied before, during, or after model training. We believe that the data management community is well poised to solve these problems.
    A Novel SOC Estimation for Hybrid Energy Pack using Deep Learning. (arXiv:2212.12607v1 [cs.CE])
    Estimating the state of charge (SOC) of compound energy storage devices in the hybrid energy storage system (HESS) of electric vehicles (EVs) is vital for improving EV performance. The complex and variable charging and discharging currents of EVs make accurate SOC estimation a challenge. This paper proposes a novel deep learning-based SOC estimation method for a lithium-ion battery-supercapacitor HESS EV based on the nonlinear autoregressive with exogenous inputs neural network (NARXNN). The NARXNN is utilized to capture and overcome the complex nonlinear behaviors of lithium-ion batteries and supercapacitors in EVs. The results show that the proposed method improved SOC estimation accuracy by 91.5% on average, with error values below 0.1%, and reduced time consumption by 11.4%, validating both the effectiveness and robustness of the proposed method.
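    As a rough illustration of the NARX idea (not the paper's architecture or data), the sketch below builds a series-parallel NARX-style regressor: lagged exogenous inputs (current, voltage) and lagged SOC values are fed to a small MLP that predicts the next SOC. All signals, lag counts, and network sizes are synthetic placeholders.

        import numpy as np
        from sklearn.neural_network import MLPRegressor

        rng = np.random.default_rng(1)
        T = 5000
        current = rng.uniform(-2.0, 2.0, T)                        # synthetic drive-cycle current
        voltage = 3.7 + 0.1 * np.cumsum(rng.normal(0, 0.01, T))    # synthetic terminal voltage
        soc = np.clip(0.8 - 1e-4 * np.cumsum(current), 0.0, 1.0)   # toy coulomb-counting SOC

        def lag_features(series_list, target, n_lags=4):
            # stack the last n_lags values of each signal as the feature vector for time t
            rows = []
            for t in range(n_lags, len(target)):
                rows.append(np.concatenate([s[t - n_lags:t] for s in series_list]))
            return np.array(rows), target[n_lags:]

        # autoregressive (soc) + exogenous (current, voltage) lags -> next SOC
        X, y = lag_features([current, voltage, soc], soc, n_lags=4)
        model = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=500, random_state=0)
        model.fit(X[:4000], y[:4000])
        print("test MSE:", np.mean((model.predict(X[4000:]) - y[4000:]) ** 2))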
    A Fair Pricing Model via Adversarial Learning. (arXiv:2202.12008v3 [stat.ML] UPDATED)
    At the core of the insurance business lies the classification between risky and non-risky insureds, actuarial fairness meaning that risky insureds should contribute more and pay higher premiums than non-risky or less-risky ones. Actuaries, therefore, use econometric or machine learning techniques to classify, but the distinction between a fair actuarial classification and "discrimination" is subtle. For this reason, there is growing interest in fairness and discrimination in the actuarial community (Lindholm, Richman, Tsanakas, and Wuthrich, 2022). Presumably, non-sensitive characteristics can serve as substitutes or proxies for protected attributes. For example, the color and model of a car, combined with the driver's occupation, may lead to an undesirable gender bias in the prediction of car insurance prices. Surprisingly, we will show that (1) debiasing the predictor alone may be insufficient to maintain adequate accuracy. Indeed, the traditional pricing model is currently built in a two-stage structure that considers many potentially biased components such as car or geographic risks. We will show that this traditional structure has significant limitations in achieving fairness. For this reason, we have developed a novel pricing model approach. Recently, some approaches (Blier-Wong, Cossette, Lamontagne, and Marceau, 2021; Wuthrich and Merz, 2021) have shown the value of autoencoders in pricing. In this paper, we will show that (2) this can be generalized to multiple pricing factors (geographic, car type) and (3) it is perfectly adapted to a fairness context (since it allows debiasing the set of pricing components): we extend this main idea to a general framework in which a single whole pricing model is trained by generating the geographic and car pricing components needed to predict the pure premium while mitigating unwanted bias according to the desired metric.
    A Unified Hard-Constraint Framework for Solving Geometrically Complex PDEs. (arXiv:2210.03526v4 [cs.LG] UPDATED)
    We present a unified hard-constraint framework for solving geometrically complex PDEs with neural networks, where the most commonly used Dirichlet, Neumann, and Robin boundary conditions (BCs) are considered. Specifically, we first introduce the "extra fields" from the mixed finite element method to reformulate the PDEs so as to equivalently transform the three types of BCs into linear forms. Based on the reformulation, we derive the general solutions of the BCs analytically, which are employed to construct an ansatz that automatically satisfies the BCs. With such a framework, we can train the neural networks without adding extra loss terms and thus efficiently handle geometrically complex PDEs, alleviating the unbalanced competition between the loss terms corresponding to the BCs and PDEs. We theoretically demonstrate that the "extra fields" can stabilize the training process. Experimental results on real-world geometrically complex PDEs showcase the effectiveness of our method compared with state-of-the-art baselines.
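    For intuition, the snippet below shows the classic hard-constraint trick in its simplest Dirichlet-only form, u(x) = g(x) + l(x)N(x) with l vanishing on the boundary; the paper's "extra fields" reformulation generalizes this to Neumann and Robin BCs on complex geometries, which this sketch does not attempt. The PDE, network size, and training schedule are illustrative.

        import torch

        # u(x) = g(x) + l(x) * N(x): l vanishes exactly on the boundary, so the
        # Dirichlet BC holds by construction and no BC loss term is needed.
        net = torch.nn.Sequential(
            torch.nn.Linear(1, 32), torch.nn.Tanh(),
            torch.nn.Linear(32, 32), torch.nn.Tanh(),
            torch.nn.Linear(32, 1),
        )

        def g(x):                     # boundary data: u(0) = u(1) = 0 here
            return torch.zeros_like(x)

        def l(x):                     # distance-like function, zero on the boundary
            return x * (1.0 - x)

        def u(x):
            return g(x) + l(x) * net(x)

        # Train only on the PDE residual, e.g., -u'' = f with f = pi^2 sin(pi x).
        opt = torch.optim.Adam(net.parameters(), lr=1e-3)
        for step in range(2000):
            x = torch.rand(128, 1, requires_grad=True)
            ux = torch.autograd.grad(u(x).sum(), x, create_graph=True)[0]
            uxx = torch.autograd.grad(ux.sum(), x, create_graph=True)[0]
            f = torch.pi ** 2 * torch.sin(torch.pi * x)
            loss = ((-uxx - f) ** 2).mean()
            opt.zero_grad(); loss.backward(); opt.step()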
    Computation of conditional expectations with guarantees. (arXiv:2112.01804v2 [stat.CO] UPDATED)
    Theoretically, the conditional expectation of a square-integrable random variable $Y$ given a $d$-dimensional random vector $X$ can be obtained by minimizing the mean squared distance between $Y$ and $f(X)$ over all Borel measurable functions $f \colon \mathbb{R}^d \to \mathbb{R}$. However, in many applications this minimization problem cannot be solved exactly, and instead, a numerical method which computes an approximate minimum over a suitable subfamily of Borel functions has to be used. The quality of the result depends on the adequacy of the subfamily and the performance of the numerical method. In this paper, we derive an expected value representation of the minimal mean squared distance which in many applications can efficiently be approximated with a standard Monte Carlo average. This enables us to provide guarantees for the accuracy of any numerical approximation of a given conditional expectation. We illustrate the method by assessing the quality of approximate conditional expectations obtained by linear, polynomial and neural network regression in different concrete examples.
    Your diffusion model secretly knows the dimension of the data manifold. (arXiv:2212.12611v1 [cs.LG])
    In this work, we propose a novel framework for estimating the dimension of the data manifold using a trained diffusion model. A trained diffusion model approximates the gradient of the log density of a noise-corrupted version of the target distribution for varying levels of corruption. If the data concentrates around a manifold embedded in the high-dimensional ambient space, then as the level of corruption decreases, the score function points towards the manifold, as this direction becomes the direction of maximum likelihood increase. Therefore, for small levels of corruption, the diffusion model provides us with access to an approximation of the normal bundle of the data manifold. This allows us to estimate the dimension of the tangent space and, thus, the intrinsic dimension of the data manifold. Our method outperforms linear methods for dimensionality detection, such as PPCA, in controlled experiments.
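    The estimation recipe can be illustrated on a toy case where the smoothed score is available in closed form (a flat k-dimensional manifold); with a trained diffusion model, the score function below would be the score network evaluated at a small noise level, and the SVD threshold is a heuristic choice.

        import numpy as np

        # Data lies on a k-dim subspace of R^D; at small noise sigma the score
        # points along the normal space, so the rank of a matrix of score samples
        # around a point estimates the codimension D - k.
        rng = np.random.default_rng(2)
        D, k, sigma = 10, 3, 0.05
        basis = np.linalg.qr(rng.standard_normal((D, k)))[0]   # tangent basis
        P = basis @ basis.T                                    # projector onto the manifold

        def score(x):
            # closed-form score of the sigma-smoothed on-manifold distribution (flat case);
            # a trained score network would replace this
            return -(x - P @ x) / sigma ** 2

        x0 = basis @ rng.standard_normal(k)                    # a point on the manifold
        S = np.stack([score(x0 + sigma * rng.standard_normal(D)) for _ in range(200)])
        sv = np.linalg.svd(S, compute_uv=False)
        codim = int((sv > 0.1 * sv[0]).sum())                  # count dominant normal directions
        print("estimated intrinsic dimension:", D - codim)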
    Ask Question First for Enhancing Lifelong Language Learning. (arXiv:2208.08367v2 [cs.CL] UPDATED)
    Lifelong language learning aims to learn NLP tasks arriving in a stream while retaining knowledge of previous tasks. Previous works based on language models and following the data-free constraint have explored formatting all data as "begin token (\textit{B}) + context (\textit{C}) + question (\textit{Q}) + answer (\textit{A})" for different tasks. However, they still suffer from catastrophic forgetting, which is exacerbated when pseudo data for the previous tasks is insufficient, for the following reasons: (1) the model has difficulty generating task-corresponding pseudo data, and (2) \textit{A} is prone to error when \textit{A} and \textit{C} are separated by \textit{Q}, because the information of \textit{C} is diminished before generating \textit{A}. Therefore, we propose Ask Question First and Replay Question (AQF-RQ), including a novel data format "\textit{BQCA}" and a new training task to train pseudo questions of previous tasks. Experimental results demonstrate that AQF-RQ makes it easier for the model to generate more pseudo data that match the corresponding tasks, and is more robust to both sufficient and insufficient pseudo data, whether task boundaries are clear or unclear. AQF-RQ achieves only 0.36\% lower performance than multi-task learning.
    Concentration of the Langevin Algorithm's Stationary Distribution. (arXiv:2212.12629v1 [stat.ML])
    A canonical algorithm for log-concave sampling is the Langevin Algorithm, aka the Langevin Diffusion run with some discretization stepsize $\eta > 0$. This discretization leads the Langevin Algorithm to have a stationary distribution $\pi_{\eta}$ which differs from the stationary distribution $\pi$ of the Langevin Diffusion, and it is an important challenge to understand whether the well-known properties of $\pi$ extend to $\pi_{\eta}$. In particular, while concentration properties such as isoperimetry and rapidly decaying tails are classically known for $\pi$, the analogous properties for $\pi_{\eta}$ are open questions with direct algorithmic implications. This note provides a first step in this direction by establishing concentration results for $\pi_{\eta}$ that mirror classical results for $\pi$. Specifically, we show that for any nontrivial stepsize $\eta > 0$, $\pi_{\eta}$ is sub-exponential (respectively, sub-Gaussian) when the potential is convex (respectively, strongly convex). Moreover, the concentration bounds we show are essentially tight. Key to our analysis is the use of a rotation-invariant moment generating function (aka Bessel function) to study the stationary dynamics of the Langevin Algorithm. This technique may be of independent interest because it enables directly analyzing the discrete-time stationary distribution $\pi_{\eta}$ without going through the continuous-time stationary distribution $\pi$ as an intermediary.
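    For readers who want to see the object being analyzed, this short simulation runs the Langevin Algorithm on a 1-strongly-convex potential and prints empirical tail probabilities of the stationary distribution; roughly quadratic decay of the log-probabilities is consistent with the sub-Gaussian bound (the potential, step size, and thresholds are arbitrary choices).

        import numpy as np

        # Langevin Algorithm: x_{t+1} = x_t - eta * grad V(x_t) + sqrt(2 eta) * xi_t
        rng = np.random.default_rng(3)
        eta, d, n_steps, burn = 0.1, 2, 200_000, 10_000

        def grad_V(x):
            return x                      # V(x) = |x|^2 / 2 is 1-strongly convex

        x = np.zeros(d)
        samples = []
        for t in range(n_steps):
            x = x - eta * grad_V(x) + np.sqrt(2 * eta) * rng.standard_normal(d)
            if t >= burn:
                samples.append(np.linalg.norm(x))
        r = np.array(samples)
        for thresh in [2.0, 3.0, 4.0]:
            p = (r > thresh).mean()
            print(f"P(|x| > {thresh}) = {p:.2e}  (log p = {np.log(max(p, 1e-12)):.1f})")
        # for strongly convex V, pi_eta is sub-Gaussian, so log p should fall
        # roughly quadratically in the threshold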
    Structure-Enhanced DRL for Optimal Transmission Scheduling. (arXiv:2212.12704v1 [cs.IT])
    Remote state estimation of large-scale distributed dynamic processes plays an important role in Industry 4.0 applications. In this paper, we focus on the transmission scheduling problem of a remote estimation system. First, we derive some structural properties of the optimal sensor scheduling policy over fading channels. Then, building on these theoretical guidelines, we develop a structure-enhanced deep reinforcement learning (DRL) framework for optimal scheduling of the system to achieve the minimum overall estimation mean-square error (MSE). In particular, we propose a structure-enhanced action selection method, which tends to select actions that obey the policy structure. This explores the action space more effectively and enhances the learning efficiency of DRL agents. Furthermore, we introduce a structure-enhanced loss function to add penalties to actions that do not follow the policy structure. The new loss function guides the DRL to converge to the optimal policy structure quickly. Our numerical experiments illustrate that the proposed structure-enhanced DRL algorithms can save the training time by 50% and reduce the remote estimation MSE by 10% to 25% when compared to benchmark DRL algorithms. In addition, we show that the derived structural properties exist in a wide range of dynamic scheduling problems that go beyond remote state estimation.
    Out-of-Distribution Detection with Reconstruction Error and Typicality-based Penalty. (arXiv:2212.12641v1 [cs.LG])
    The task of out-of-distribution (OOD) detection is vital for safe and reliable operation in real-world applications. After the failure of likelihood-based detection in high dimensions was demonstrated, approaches based on the \emph{typical set} have been attracting attention; however, they have not yet achieved satisfactory performance. Beginning by presenting a failure case of the typicality-based approach, we propose a new reconstruction error-based approach that employs a normalizing flow (NF). We further introduce a typicality-based penalty, and by incorporating it into the reconstruction error in the NF, we propose a new OOD detection method, penalized reconstruction error (PRE). Because PRE detects test inputs that lie off the in-distribution manifold, it effectively detects adversarial examples as well as OOD examples. We show the effectiveness of our method through evaluations using natural image datasets: CIFAR-10, TinyImageNet, and ILSVRC2012.
    Regularization with Latent Space Virtual Adversarial Training. (arXiv:2011.13181v2 [cs.LG] UPDATED)
    Virtual Adversarial Training (VAT) has shown impressive results among recently developed regularization methods known as consistency regularization. VAT utilizes adversarial samples, generated by injecting perturbations in the input space, for training, thereby enhancing the generalization ability of a classifier. However, such adversarial samples can be generated only within a very small area around the input data point, which limits the adversarial effectiveness of such samples. To address this problem we propose LVAT (Latent space VAT), which injects perturbations in the latent space instead of the input space. LVAT can generate adversarial samples flexibly, resulting in more adverse effects and thus more effective regularization. The latent space is built by a generative model, and in this paper we examine two different types of models: a variational autoencoder and a normalizing flow, specifically Glow. We evaluated the performance of our method in both supervised and semi-supervised learning scenarios for an image classification task using the SVHN and CIFAR-10 datasets. In our evaluation, we found that our method outperforms VAT and other state-of-the-art methods.
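    A hedged sketch of the latent-space perturbation step follows, using a VAT-style single power iteration to find the adversarial latent direction; decoder and classifier below are untrained stand-ins for the pretrained generative model (VAE decoder or Glow inverse) and the classifier being regularized, and the radius eps is illustrative.

        import torch
        import torch.nn.functional as F

        torch.manual_seed(0)
        decoder = torch.nn.Sequential(torch.nn.Linear(16, 64), torch.nn.Tanh(),
                                      torch.nn.Linear(64, 784))
        classifier = torch.nn.Sequential(torch.nn.Linear(784, 128), torch.nn.ReLU(),
                                         torch.nn.Linear(128, 10))

        def lvat_loss(z, eps=1.0, xi=1e-6):
            # consistency between predictions on decoded z and on the decoded
            # adversarially perturbed latent z + r_adv
            p = F.softmax(classifier(decoder(z)), dim=1).detach()
            d = torch.randn_like(z, requires_grad=True)          # random start direction
            kl = F.kl_div(F.log_softmax(classifier(decoder(z + xi * d)), dim=1),
                          p, reduction="batchmean")
            grad = torch.autograd.grad(kl, d)[0]                 # one power-iteration step
            r_adv = eps * F.normalize(grad, dim=1)               # worst-case latent direction
            return F.kl_div(F.log_softmax(classifier(decoder(z + r_adv)), dim=1),
                            p, reduction="batchmean")

        z = torch.randn(32, 16)            # latent codes of a training batch
        loss = lvat_loss(z)                # added to the supervised loss during training
        loss.backward()
        print(float(loss))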
    Utilizing Priming to Identify Optimal Class Ordering to Alleviate Catastrophic Forgetting. (arXiv:2212.12643v1 [cs.LG])
    In order for artificial neural networks to begin accurately mimicking biological ones, they must be able to adapt to new exigencies without forgetting what they have learned from previous training. Lifelong learning approaches to artificial neural networks strive towards this goal, yet have not progressed far enough to be realistically deployed for natural language processing tasks. The proverbial roadblock of catastrophic forgetting still keeps researchers from an adequate lifelong learning model. While efforts are being made to quell catastrophic forgetting, there is a lack of research into the importance of class ordering when training on new classes for incremental learning. This is surprising, as the ordering of the "classes" that humans learn is heavily monitored and incredibly important. While heuristics for developing an ideal class order have been researched, this paper examines class ordering as it relates to priming as a scheme for incremental class learning. By examining the connections between various methods of priming found in humans and how those are mimicked yet remain unexplained in lifelong machine learning, this paper provides a better understanding of the similarities between biological and synthetic systems while simultaneously improving current practices to combat catastrophic forgetting. Through the merging of psychological priming practices with class ordering, this paper identifies a generalizable method for class ordering in NLP incremental learning tasks that consistently outperforms random class ordering.
    A Convergence Rate for Manifold Neural Networks. (arXiv:2212.12606v1 [cs.LG])
    High-dimensional data arises in numerous applications, and the rapidly developing field of geometric deep learning seeks to develop neural network architectures to analyze such data in non-Euclidean domains, such as graphs and manifolds. Recent work by Z. Wang, L. Ruiz, and A. Ribeiro has introduced a method for constructing manifold neural networks using the spectral decomposition of the Laplace-Beltrami operator. Moreover, the same authors provide a numerical scheme for implementing such neural networks when the manifold is unknown and one only has access to finitely many sample points. The authors show that this scheme, which relies upon building a data-driven graph, converges to the continuum limit as the number of sample points tends to infinity. Here, we build upon this result by establishing a rate of convergence that depends on the intrinsic dimension of the manifold but is independent of the ambient dimension. We also discuss how the rate of convergence depends on the depth of the network and the number of filters used in each layer.
    Data-Driven Linear Complexity Low-Rank Approximation of General Kernel Matrices: A Geometric Approach. (arXiv:2212.12674v1 [math.NA])
    A general, {\em rectangular} kernel matrix may be defined as $K_{ij} = \kappa(x_i,y_j)$, where $\kappa(x,y)$ is a kernel function and $X=\{x_i\}_{i=1}^m$ and $Y=\{y_i\}_{i=1}^n$ are two sets of points. In this paper, we seek a low-rank approximation to a kernel matrix where the sets of points $X$ and $Y$ are large and are not well-separated (e.g., the points in $X$ and $Y$ may be ``intermingled''). Such rectangular kernel matrices may arise, for example, in Gaussian process regression where $X$ corresponds to the training data and $Y$ corresponds to the test data. In this case, the points are often high-dimensional. Since the point sets are large, we must exploit the fact that the matrix arises from a kernel function and avoid forming the matrix, which rules out most algebraic techniques. In particular, we seek methods that can scale linearly, i.e., with computational complexity $O(m)$ or $O(n)$ for a fixed accuracy or rank. The main idea in this paper is to {\em geometrically} select appropriate subsets of points to construct a low-rank approximation. An analysis in this paper guides how this selection should be performed.
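    The flavor of geometric selection can be illustrated with farthest-point sampling, one simple geometric strategy (the paper's analysis-guided selection may differ): landmark rows and columns define a cross approximation K ~= C W^{-1} R that needs only O((m+n)r) kernel evaluations. Data, kernel, and rank r below are placeholders.

        import numpy as np

        rng = np.random.default_rng(4)
        X, Y = rng.standard_normal((1500, 5)), rng.standard_normal((1200, 5))

        def kappa(A, B):
            d2 = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
            return np.exp(-0.5 * d2)

        def farthest_point_sample(P, r):
            # greedy geometric landmark selection: always pick the point farthest
            # from the landmarks chosen so far
            idx = [0]
            dist = np.linalg.norm(P - P[0], axis=1)
            for _ in range(r - 1):
                idx.append(int(dist.argmax()))
                dist = np.minimum(dist, np.linalg.norm(P - P[idx[-1]], axis=1))
            return np.array(idx)

        r = 60
        I, J = farthest_point_sample(X, r), farthest_point_sample(Y, r)
        C = kappa(X, Y[J])                     # m x r, columns at landmark y's
        R = kappa(X[I], Y)                     # r x n, rows at landmark x's
        W = kappa(X[I], Y[J])                  # r x r core block
        K_approx = C @ np.linalg.solve(W + 1e-10 * np.eye(r), R)

        K = kappa(X, Y)                        # dense reference, only to check the error
        err = np.linalg.norm(K - K_approx) / np.linalg.norm(K)
        print(f"relative Frobenius error: {err:.3e}")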
    On Realization of Intelligent Decision-Making in the Real World: A Foundation Decision Model Perspective. (arXiv:2212.12669v1 [cs.AI])
    Our situated environment is full of uncertainty and highly dynamic, hindering the widespread adoption of machine-led Intelligent Decision-Making (IDM) in real-world scenarios. This means IDM should have the capability of continuously learning new skills and efficiently generalizing across wider applications. IDM benefits from any new approaches and theoretical breakthroughs that exhibit Artificial General Intelligence (AGI) breaking the barriers between tasks and applications. Recent research has thoroughly examined the Transformer neural architecture as a backbone foundation model and its generalization to various tasks, including computer vision, natural language processing, and reinforcement learning. We therefore argue that a foundation decision model (FDM) can be established by formulating various decision-making tasks as sequence decoding tasks using the Transformer architecture; this would be a promising solution to advance the applications of IDM in more complex real-world tasks. In this paper, we elaborate on how a foundation decision model improves the efficiency and generalization of IDM. We also discuss potential applications of an FDM in multi-agent game AI, production scheduling, and robotics tasks. Finally, through a case study, we demonstrate our realization of the FDM, DigitalBrain (DB1), with 1.2 billion parameters, which achieves human-level performance over 453 tasks, including text generation, image captioning, video game playing, robotic control, and traveling salesman problems. As a foundation decision model, DB1 is a baby step towards more autonomous and efficient real-world IDM applications.
    Parotid Gland MRI Segmentation Based on Swin-Unet and Multimodal Images. (arXiv:2206.03336v2 [eess.IV] UPDATED)
    Background and objective: Parotid gland tumors account for approximately 2% to 10% of head and neck tumors. Preoperative tumor localization, differential diagnosis, and subsequent selection of appropriate treatment for parotid gland tumors are critical. However, the relative rarity of these tumors and the highly dispersed tissue types have left an unmet need for subtle differential diagnosis of such neoplastic lesions based on preoperative radiomics. Recently, deep learning methods have developed rapidly, and in particular Transformers have surpassed traditional convolutional neural networks in computer vision. Many new Transformer-based networks have been proposed for computer vision tasks. Methods: In this study, multicenter multimodal parotid gland MR images were collected. Swin-Unet, a Transformer-based network, was used. MR images of short time inversion recovery, T1-weighted, and T2-weighted modalities were combined into three-channel data to train the network. We achieved segmentation of the regions of interest for the parotid gland and tumor. Results: The Dice similarity coefficient of the model on the test set was 88.63%, the mean pixel accuracy was 99.31%, the mean intersection over union was 83.99%, and the Hausdorff distance was 3.04. A series of comparison experiments was then designed to further validate the segmentation performance of the algorithm. Conclusions: Experimental results showed that our method performs well for parotid gland and tumor segmentation. The Transformer-based network outperforms traditional convolutional neural networks in the field of medical images.
    A Taxonomy for Inference in Causal Model Families. (arXiv:2110.12052v2 [cs.LG] UPDATED)
    Neurally-parameterized Structural Causal Models in the Pearlian notion of causality, referred to as NCM, were recently introduced as a step towards next-generation learning systems. However, said NCM are only concerned with the learning aspect of causal inference and totally miss out on the architecture aspect. That is, actual causal inference within NCM is intractable in that the NCM won't return an answer to a query in polynomial time. This insight follows as a corollary to the more general statement on the intractability of arbitrary SCM parameterizations, which we prove in this work through a classical 3-SAT reduction. Since future learning algorithms will be required to deal with both high-dimensional data and highly complex mechanisms governing the data, we ultimately believe work on tractable inference for causality to be decisive. We also show that not all ``causal'' models are created equal. More specifically, there are models capable of answering causal queries that are not SCM, which we refer to as \emph{partially causal models} (PCM). We provide a tabular taxonomy in terms of tractability properties for all of the different model families, namely correlation-based models, PCM, and SCM. To conclude our work, we also provide some initial ideas on how to overcome parts of the intractability of causal inference with SCM by showing an example of how parameterizing an SCM with SPN modules can at least allow for tractable mechanisms. We hope that our impossibility result alongside the taxonomy for tractability in causal models can raise awareness for this novel research direction, since achieving success with causality in real-world downstream tasks will not only depend on learning correct models; we also require the practical ability to gain access to model inferences.
    AttentionCode: Ultra-Reliable Feedback Codes for Short-Packet Communications. (arXiv:2205.14955v2 [cs.IT] UPDATED)
    Ultra-reliable short-packet communication is a major challenge in future wireless networks with critical applications. To achieve reliability beyond 99.999%, this paper envisions a new interaction-based communication paradigm that exploits feedback from the receiver. We present AttentionCode, a new class of feedback codes leveraging deep learning (DL) technologies. The underpinnings of AttentionCode are three architectural innovations: AttentionNet, input restructuring, and adaptation to fading channels, accompanied by several training methods, including large-batch training, distributed learning, look-ahead optimizer, training-test signal-to-noise ratio (SNR) mismatch, and curriculum learning. The training methods can potentially be generalized to other wireless communication applications with machine learning. Numerical experiments verify that AttentionCode establishes a new state of the art among all DL-based feedback codes in both additive white Gaussian noise (AWGN) channels and fading channels. In AWGN channels with noiseless feedback, for example, AttentionCode achieves a block error rate (BLER) of $10^{-7}$ when the forward channel SNR is 0 dB for a block size of 50 bits, demonstrating the potential of AttentionCode to provide ultra-reliable short-packet communications.
    Attentional-Biased Stochastic Gradient Descent. (arXiv:2012.06951v4 [cs.LG] UPDATED)
    In this paper, we present a simple yet effective method (ABSGD) for addressing the data imbalance issue in deep learning. Our method is a simple modification to momentum SGD where we leverage an attentional mechanism to assign an individual importance weight to each gradient in the mini-batch. Unlike many existing heuristic-driven methods for tackling data imbalance, our method is grounded in {\it theoretically justified distributionally robust optimization (DRO)}, which is guaranteed to converge to a stationary point of an information-regularized DRO problem. The individual-level weight of each sampled data point is proportional to the exponential of a scaled loss value of the data, where the scaling factor is interpreted as the regularization parameter in the framework of information-regularized DRO. Compared with existing class-level weighting schemes, our method can capture the diversity between individual examples within each class. Compared with existing individual-level weighting methods using meta-learning that require three backward propagations for computing mini-batch stochastic gradients, our method is more efficient, with only one backward propagation at each iteration as in standard deep learning methods. To balance between the learning of feature extraction layers and the learning of the classifier layer, we employ a two-stage method that uses SGD for pretraining, followed by ABSGD for learning a robust classifier and fine-tuning lower layers. Our empirical studies on several benchmark datasets demonstrate the effectiveness of the proposed method.
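    The weighting scheme described here is simple to implement; the sketch below applies it to one mini-batch step, with the temperature lam standing in for the DRO regularization parameter (model and data are placeholders).

        import torch

        # each example's gradient weight is proportional to exp(loss / lam),
        # normalized within the mini-batch; only one backward pass is needed
        torch.manual_seed(0)
        model = torch.nn.Linear(20, 5)
        opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
        lam = 0.5                               # DRO regularization / temperature

        x = torch.randn(64, 20)
        y = torch.randint(0, 5, (64,))
        losses = torch.nn.functional.cross_entropy(model(x), y, reduction="none")
        w = torch.softmax(losses.detach() / lam, dim=0)   # attentional weights
        loss = (w * losses).sum()               # weighted mini-batch loss
        opt.zero_grad(); loss.backward(); opt.step()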
    2-hop Neighbor Class Similarity (2NCS): A graph structural metric indicative of graph neural network performance. (arXiv:2212.13202v1 [cs.LG])
    Graph Neural Networks (GNNs) achieve state-of-the-art performance on graph-structured data across numerous domains. Their underlying ability to represent nodes as summaries of their vicinities has proven effective for homophilous graphs in particular, in which same-type nodes tend to connect. On heterophilous graphs, in which different-type nodes are likely connected, GNNs perform less consistently, as neighborhood information might be less representative or even misleading. On the other hand, GNN performance is not inferior on all heterophilous graphs, and there is a lack of understanding of what other graph properties affect GNN performance. In this work, we highlight the limitations of the widely used homophily ratio and the recent Cross-Class Neighborhood Similarity (CCNS) metric in estimating GNN performance. To overcome these limitations, we introduce 2-hop Neighbor Class Similarity (2NCS), a new quantitative graph structural property that correlates with GNN performance more strongly and consistently than alternative metrics. 2NCS considers two-hop neighborhoods as a theoretically derived consequence of the two-step label propagation process governing GCN's training-inference process. Experiments on one synthetic and eight real-world graph datasets confirm consistent improvements over existing metrics in estimating the accuracy of GCN- and GAT-based architectures on the node classification task.
    Improving SGD convergence by online linear regression of gradients in multiple statistically relevant directions. (arXiv:1901.11457v9 [cs.LG] UPDATED)
    Deep neural networks are usually trained with stochastic gradient descent (SGD), which minimizes the objective function using very rough approximations of the gradient that only average to the real gradient. Standard approaches like momentum or ADAM only consider a single direction and do not try to model the distance from an extremum, neglecting valuable information in the calculated sequence of gradients and often stagnating on a suboptimal plateau. Second-order methods could exploit these missed opportunities; however, besides suffering from very high cost and numerical instabilities, many of them are attracted to suboptimal points like saddles because they neglect the signs of curvatures (the eigenvalues of the Hessian). The saddle-free Newton (SFN) method is a rare example of addressing this issue: it changes saddle attraction into repulsion and was shown to provide an essential improvement in final values this way. However, it neglects noise while modelling second-order behavior, focuses on a Krylov subspace for numerical reasons, and requires a costly eigendecomposition. Maintaining the advantages of SFN, we propose inexpensive ways to exploit these opportunities. Second-order behavior is a linear dependence of the first derivative: we can optimally estimate it from a sequence of noisy gradients with least-squares linear regression, here in an online setting with weakening weights for old gradients. A statistically relevant subspace is suggested by PCA of recent noisy gradients; in the online setting it can be maintained by slowly rotating the considered directions toward new gradients, gradually replacing old directions with recent, statistically relevant ones. Eigendecomposition can also be performed online, with a regularly performed step of the QR method to maintain a diagonal Hessian. Outside the modeled second-order subspace, we can simultaneously perform gradient descent.
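    A one-dimensional toy version of the core mechanism is sketched below: noisy gradients are fit by exponentially weighted online least squares, the slope estimates the curvature, and the zero of the fitted line estimates the extremum position; the decay rate, thresholds, and damping are illustrative choices.

        import numpy as np

        rng = np.random.default_rng(5)
        beta = 0.9                                  # forgetting factor for old gradients
        Sw = St = Sg = Stt = Stg = 0.0              # exponentially weighted sufficient stats
        theta, lr = 3.0, 0.1

        def noisy_grad(t):                          # f(t) = (t - 1)^2, so f'(t) = 2(t - 1)
            return 2.0 * (t - 1.0) + rng.normal(0.0, 0.5)

        for step in range(100):
            g = noisy_grad(theta)
            Sw, St, Sg = beta * Sw + 1.0, beta * St + theta, beta * Sg + g
            Stt, Stg = beta * Stt + theta ** 2, beta * Stg + theta * g
            var = Stt - St ** 2 / Sw
            if var > 1e-8:
                lam = (Stg - St * Sg / Sw) / var    # weighted LS slope = curvature estimate
                if lam > 1e-3:                      # positive curvature: move toward the
                    p = St / Sw - Sg / (Sw * lam)   # zero of the fitted linear model of g
                    theta += 0.5 * (p - theta)      # damped jump toward the extremum
                    continue
            theta -= lr * g                         # fallback: plain SGD step
        print("final theta (true minimum at 1.0):", theta)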
    Gaussian Process Classification Bandits. (arXiv:2212.13157v1 [cs.LG])
    Classification bandits are multi-armed bandit problems whose task is to classify a given set of arms as positive or negative, depending on whether the rate of arms with expected reward at least h is no less than w, for given thresholds h and w. We study a special classification bandit problem in which arms correspond to points x in d-dimensional real space with expected rewards f(x) generated according to a Gaussian process prior. We develop a framework algorithm for the problem using various arm selection policies and propose policies called FCB and FTSV. We show a smaller sample complexity upper bound for FCB than that for the existing algorithm for level set estimation, in which whether f(x) is at least h or not must be decided for every arm's x. Arm selection policies depending on an estimated rate of arms with rewards of at least h are also proposed and shown to improve empirical sample complexity. According to our experimental results, the rate-estimation versions of FCB and FTSV, together with that of the popular active learning policy that selects the point with the maximum variance, outperform other policies for synthetic functions, and the version of FTSV is also the best performer on our real-world dataset.
    Mantis: Enabling Energy-Efficient Autonomous Mobile Agents with Spiking Neural Networks. (arXiv:2212.12620v1 [cs.RO])
    Autonomous mobile agents such as unmanned aerial vehicles (UAVs) and mobile robots have shown huge potential for improving human productivity. These mobile agents require low power/energy consumption to have a long lifespan, since they are usually powered by batteries. They also need to adapt to changing/dynamic environments, especially when deployed in remote or dangerous locations, and thus require efficient online learning capabilities. These requirements can be fulfilled by employing Spiking Neural Networks (SNNs), since SNNs offer low power/energy consumption due to sparse computations and efficient online learning due to bio-inspired learning mechanisms. However, a methodology is still required to employ appropriate SNN models on autonomous mobile agents. Towards this, we propose Mantis, a methodology to systematically employ SNNs on autonomous mobile agents to enable energy-efficient processing and adaptive capabilities in dynamic environments. The key ideas of Mantis include the optimization of SNN operations, the employment of a bio-plausible online learning mechanism, and SNN model selection. The experimental results demonstrate that our methodology maintains high accuracy with a significantly smaller memory footprint and energy consumption (i.e., 3.32x memory reduction and 2.9x energy saving for an SNN model with 8-bit weights) compared to the baseline network with 32-bit weights. In this manner, Mantis enables the employment of SNNs for resource- and energy-constrained mobile agents.
    Refined Edge Usage of Graph Neural Networks for Edge Prediction. (arXiv:2212.12970v1 [cs.LG])
    Graph Neural Networks (GNNs), originally proposed for node classification, have also motivated many recent works on edge prediction (a.k.a., link prediction). However, existing methods lack elaborate designs regarding two frequently overlooked distinctions between the tasks: (i) edges only constitute the topology in the node classification task, but can serve as both the topology and the supervision (i.e., labels) in the edge prediction task; (ii) node classification makes a prediction over each individual node, while edge prediction is determined by each pair of nodes. To this end, we propose a novel edge prediction paradigm named Edge-aware Message PassIng neuRal nEtworks (EMPIRE). Concretely, we first introduce an edge splitting technique to specify the use of each edge, where each edge is solely used as either topology or supervision (called a topology edge or a supervision edge). We then develop a new message passing mechanism that generates messages to source nodes (through topology edges) while being aware of target nodes (through supervision edges). In order to emphasize the differences between pairs connected by supervision edges and unconnected pairs, we further weight the messages to highlight the relative ones that can reflect the differences. In addition, we design a novel negative node-pair sampling trick that efficiently samples 'hard' negative instances among the supervision instances, which can significantly improve performance. Experimental results verify that the proposed method can significantly outperform existing state-of-the-art models on the edge prediction task on multiple homogeneous and heterogeneous graph datasets.
    A Close Look at Spatial Modeling: From Attention to Convolution. (arXiv:2212.12552v1 [cs.CV])
    Vision Transformers have shown great promise recently for many vision tasks due to their insightful architecture design and attention mechanism. By revisiting the self-attention responses in Transformers, we empirically observe two interesting issues. First, Vision Transformers present a query-irrelevant behavior at deep layers, where the attention maps exhibit nearly consistent contexts in global scope, regardless of the query patch position (they are also head-irrelevant). Second, the attention maps are intrinsically sparse: a few tokens dominate the attention weights, and introducing knowledge from ConvNets would largely smooth the attention and enhance the performance. Motivated by these observations, we generalize the self-attention formulation to abstract a query-irrelevant global context directly and further integrate the global context into convolutions. The resulting model, a Fully Convolutional Vision Transformer (i.e., FCViT), purely consists of convolutional layers and firmly inherits the merits of both the attention mechanism and convolutions, including dynamic properties, weight sharing, and short- and long-range feature modeling. Experimental results demonstrate the effectiveness of FCViT. With fewer than 14M parameters, our FCViT-S12 outperforms the related work ResT-Lite by 3.7% top-1 accuracy on ImageNet-1K. When scaling FCViT to larger models, we still perform better than the previous state-of-the-art ConvNeXt with even fewer parameters. FCViT-based models also demonstrate promising transferability to downstream tasks, like object detection, instance segmentation, and semantic segmentation. Codes and models are made available at: https://github.com/ma-xu/FCViT.
    HandsOff: Labeled Dataset Generation With No Additional Human Annotations. (arXiv:2212.12645v1 [cs.CV])
    Recent work leverages the expressive power of generative adversarial networks (GANs) to generate labeled synthetic datasets. These dataset generation methods often require new annotations of synthetic images, which forces practitioners to seek out annotators, curate a set of synthetic images, and ensure the quality of generated labels. We introduce the HandsOff framework, a technique capable of producing an unlimited number of synthetic images and corresponding labels after being trained on less than 50 pre-existing labeled images. Our framework avoids the practical drawbacks of prior work by unifying the field of GAN inversion with dataset generation. We generate datasets with rich pixel-wise labels in multiple challenging domains such as faces, cars, full-body human poses, and urban driving scenes. Our method achieves state-of-the-art performance in semantic segmentation, keypoint detection, and depth estimation compared to prior dataset generation approaches and transfer learning baselines. We additionally showcase its ability to address broad challenges in model development which stem from fixed, hand-annotated datasets, such as the long-tail problem in semantic segmentation.
    Nothing Stands Alone: Relational Fake News Detection with Hypergraph Neural Networks. (arXiv:2212.12621v1 [cs.SI])
    Nowadays, fake news easily propagates through online social networks and poses a grave threat to individuals and society. Assessing the authenticity of news is challenging due to its elaborately fabricated content, making it difficult to obtain large-scale annotations for fake news data. Due to such data scarcity issues, detecting fake news tends to fail and overfit in the supervised setting. Recently, graph neural networks (GNNs) have been adopted to leverage the richer relational information among both labeled and unlabeled instances. Despite their promising results, they are inherently focused on pairwise relations between news items, which can limit their expressive power for capturing fake news that spreads at the group level. For example, detecting fake news can be more effective when we better understand the relations between news pieces shared among susceptible users. To address those issues, we propose to leverage a hypergraph to represent group-wise interactions among news, while focusing on important news relations with a dual-level attention mechanism. Experiments on two benchmark datasets show that our approach yields remarkable performance and maintains high performance even with a small subset of labeled news data.
    SHIRO: Soft Hierarchical Reinforcement Learning. (arXiv:2212.12786v1 [cs.RO])
    Hierarchical Reinforcement Learning (HRL) algorithms have been demonstrated to perform well on high-dimensional decision making and robotic control tasks. However, because they solely optimize for rewards, the agent tends to search the same space redundantly. This problem reduces the speed of learning and the achieved reward. In this work, we present an off-policy HRL algorithm that maximizes entropy for efficient exploration. The algorithm learns a temporally abstracted low-level policy and is able to explore broadly through the addition of entropy to the high level. The novelty of this work is the theoretical motivation of adding entropy to the RL objective in the HRL setting. We empirically show that entropy can be added to both levels if the Kullback-Leibler (KL) divergence between consecutive updates of the low-level policy is sufficiently small. We performed an ablative study to analyze the effects of entropy on hierarchy, in which adding entropy to the high level emerged as the most desirable configuration. Furthermore, a higher temperature in the low level leads to Q-value overestimation and increases the stochasticity of the environment that the high level operates on, making learning more challenging. Our method, SHIRO, surpasses state-of-the-art performance on a range of simulated robotic control benchmark tasks and requires minimal tuning.
    Simultaneously Optimizing Perturbations and Positions for Black-box Adversarial Patch Attacks. (arXiv:2212.12995v1 [cs.CV])
    The adversarial patch is an important form of real-world adversarial attack that brings serious risks to the robustness of deep neural networks. Previous methods generate adversarial patches by either optimizing their perturbation values while fixing the pasting position or manipulating the position while fixing the patch's content. This reveals that both the position and the perturbations are important to the adversarial attack. Therefore, in this paper, we propose a novel method to simultaneously optimize the position and perturbation of an adversarial patch, and thus obtain a high attack success rate in the black-box setting. Technically, we regard the patch's position and the pre-designed hyper-parameters that determine the patch's perturbations as the variables, and utilize a reinforcement learning framework to simultaneously solve for the optimal solution based on the rewards obtained from the target model with a small number of queries. Extensive experiments are conducted on the Face Recognition (FR) task, and results on four representative FR models show that our method can significantly improve the attack success rate and query efficiency. Besides, experiments on a commercial FR service and in physical environments confirm its practical application value. We also extend our method to the traffic sign recognition task to verify its generalization ability.
    GraphCast: Learning skillful medium-range global weather forecasting. (arXiv:2212.12794v1 [cs.LG])
    We introduce a machine-learning (ML)-based weather simulator--called "GraphCast"--which outperforms the most accurate deterministic operational medium-range weather forecasting system in the world, as well as all previous ML baselines. GraphCast is an autoregressive model, based on graph neural networks and a novel high-resolution multi-scale mesh representation, which we trained on historical weather data from the European Centre for Medium-Range Weather Forecasts (ECMWF)'s ERA5 reanalysis archive. It can make 10-day forecasts, at 6-hour time intervals, of five surface variables and six atmospheric variables, each at 37 vertical pressure levels, on a 0.25-degree latitude-longitude grid, which corresponds to roughly 25 x 25 kilometer resolution at the equator. Our results show GraphCast is more accurate than ECMWF's deterministic operational forecasting system, HRES, on 90.0% of the 2760 variable and lead time combinations we evaluated. GraphCast also outperforms the most accurate previous ML-based weather forecasting model on 99.2% of the 252 targets it reported. GraphCast can generate a 10-day forecast (35 gigabytes of data) in under 60 seconds on Cloud TPU v4 hardware. Unlike traditional forecasting methods, ML-based forecasting scales well with data: by training on bigger, higher quality, and more recent data, the skill of the forecasts can improve. Together these results represent a key step forward in complementing and improving weather modeling with ML, open new opportunities for fast, accurate forecasting, and help realize the promise of ML-based simulation in the physical sciences.  ( 2 min )
    Multi-duplicated Characterization of Graph Structures using Information Gain Ratio for Graph Neural Networks. (arXiv:2212.12691v1 [cs.LG])
    Various graph neural networks (GNNs) have been proposed to solve node classification tasks in machine learning for graph data. GNNs use the structural information of graph data by aggregating the features of neighboring nodes. However, they fail to directly characterize and leverage the structural information. In this paper, we propose multi-duplicated characterization of graph structures using information gain ratio (IGR) for GNNs (MSI-GNN), which enhances the performance of node classification by using an i-hop adjacency matrix as the structural information of the graph data. In MSI-GNN, the i-hop adjacency matrix is adaptively adjusted by two methods: (i) structural features in the matrix are selected based on the IGR, and (ii) the selected features in (i) for each node are duplicated and combined flexibly. In an experiment, we show that our MSI-GNN outperforms GCN, H2GCN, and GCNII in terms of average accuracies in benchmark graph datasets.  ( 2 min )
    Inclusive Artificial Intelligence. (arXiv:2212.12633v1 [cs.LG])
    Prevailing methods for assessing and comparing generative AIs incentivize responses that serve a hypothetical representative individual. Evaluating models in these terms presumes homogeneous preferences across the population and engenders selection of agglomerative AIs, which fail to represent the diverse range of interests across individuals. We propose an alternative evaluation method that instead prioritizes inclusive AIs, which provably retain the requisite knowledge not only for subsequent response customization to particular segments of the population but also for utility-maximizing decisions.  ( 2 min )
    Automatic stabilization of finite-element simulations using neural networks and hierarchical matrices. (arXiv:2212.12695v1 [math.NA])
    Petrov-Galerkin formulations with optimal test functions allow for the stabilization of finite element simulations. In particular, given a discrete trial space, the optimal test space induces a numerical scheme delivering the best approximation in terms of a problem-dependent energy norm. This ideal approach has two shortcomings: first, we need to explicitly know the set of optimal test functions; and second, the optimal test functions may have large supports, inducing expensive dense linear systems. Nevertheless, parametric families of PDEs are an example where it is worth investing some (offline) computational effort to obtain stabilized linear systems that can be solved efficiently, for a given set of parameters, in an online stage. Therefore, as a remedy for the first shortcoming, we explicitly compute (offline) a function mapping any PDE parameter to the matrix of coefficients of the optimal test functions (in a basis expansion) associated with that parameter. Next, as a remedy for the second shortcoming, we use low-rank approximation to hierarchically compress the (non-square) matrix of coefficients of the optimal test functions. To accelerate this process, we train a neural network to learn a critical bottleneck of the compression algorithm (for a given set of PDE parameters). When solving the resulting (compressed) Petrov-Galerkin formulation online, we employ a GMRES iterative solver with inexpensive matrix-vector multiplications thanks to the low-rank features of the compressed matrix. We perform experiments showing that the full online procedure is as fast as the original (unstable) Galerkin approach. In other words, we get the stabilization with hierarchical matrices and neural networks practically for free. We illustrate our findings by means of 2D Eriksson-Johnson and Helmholtz model problems.  ( 2 min )
    Automated Gadget Discovery in Science. (arXiv:2212.12743v1 [quant-ph])
    In recent years, reinforcement learning (RL) has become increasingly successful in its application to science and the process of scientific discovery in general. However, while RL algorithms learn to solve increasingly complex problems, interpreting the solutions they provide becomes ever more challenging. In this work, we gain insights into an RL agent's learned behavior through a post-hoc analysis based on sequence mining and clustering. Specifically, frequent and compact subroutines, used by the agent to solve a given task, are distilled as gadgets and then grouped by various metrics. This process of gadget discovery proceeds in three stages: first, we use an RL agent to generate data; then, we employ a mining algorithm to extract gadgets; and finally, the obtained gadgets are grouped by a density-based clustering algorithm. We demonstrate our method by applying it to two quantum-inspired RL environments. First, we consider simulated quantum optics experiments for the design of high-dimensional multipartite entangled states, where the algorithm finds gadgets that correspond to modern interferometer setups. Second, we consider a circuit-based quantum computing environment, where the algorithm discovers various gadgets for quantum information processing, such as quantum teleportation. This approach for analyzing the policy of a learned agent is agent- and environment-agnostic and can yield interesting insights into any agent's policy.  ( 2 min )
    Boosting Out-of-Distribution Detection with Multiple Pre-trained Models. (arXiv:2212.12720v1 [cs.LG])
    Out-of-Distribution (OOD) detection, i.e., identifying whether an input is sampled from a novel distribution other than the training distribution, is a critical task for safely deploying machine learning systems in the open world. Recently, post hoc detection utilizing pre-trained models has shown promising performance and can be scaled to large-scale problems. This advance raises a natural question: Can we leverage the diversity of multiple pre-trained models to improve the performance of post hoc detection methods? In this work, we propose a detection enhancement method by ensembling multiple detection decisions derived from a zoo of pre-trained models. Our approach uses the p-value instead of the commonly used hard threshold and leverages a fundamental framework of multiple hypothesis testing to control the true positive rate of In-Distribution (ID) data. We focus on the usage of model zoos and provide systematic empirical comparisons with current state-of-the-art methods on various OOD detection benchmarks. The proposed ensemble scheme shows consistent improvement compared to single-model detectors and significantly outperforms the current competitive methods. Our method substantially improves the relative performance by 65.40% and 26.96% on the CIFAR10 and ImageNet benchmarks.  ( 2 min )
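    A hedged sketch of the calibration-and-combination step: each model's OOD score is converted into an empirical p-value against held-out in-distribution scores, and the p-values are combined across the zoo (Fisher's method here stands in for the paper's multiple-hypothesis-testing procedure; scores and thresholds are synthetic).

        import numpy as np
        from scipy.stats import combine_pvalues

        rng = np.random.default_rng(6)
        n_models = 5
        # held-out ID score samples for each pre-trained model (larger = more OOD-like)
        id_scores = [rng.normal(0.0, 1.0, 5000) for _ in range(n_models)]
        test_scores = rng.normal(1.5, 1.0, n_models)   # one test input, scored by each model

        def empirical_p(cal, s):
            # fraction of ID data scoring at least as OOD-like as the test input
            return (np.sum(cal >= s) + 1) / (len(cal) + 1)

        pvals = [empirical_p(c, s) for c, s in zip(id_scores, test_scores)]
        stat, p_combined = combine_pvalues(pvals, method="fisher")
        print("per-model p-values:", np.round(pvals, 4))
        print("combined p-value:", p_combined, "-> OOD" if p_combined < 0.05 else "-> ID")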
    Stochastic Methods for AUC Optimization subject to AUC-based Fairness Constraints. (arXiv:2212.12603v1 [cs.LG])
    As machine learning is increasingly used in making high-stakes decisions, an arising challenge is to avoid unfair AI systems that lead to discriminatory decisions for protected populations. A direct approach to obtaining a fair predictive model is to train the model by optimizing its prediction performance subject to fairness constraints, which achieves Pareto efficiency when trading off performance against fairness. Among various fairness metrics, the ones based on the area under the ROC curve (AUC) are emerging recently because they are threshold-agnostic and effective for unbalanced data. In this work, we formulate the training problem of a fairness-aware machine learning model as an AUC optimization problem subject to a class of AUC-based fairness constraints. This problem can be reformulated as a min-max optimization problem with min-max constraints, which we solve by stochastic first-order methods based on a new Bregman divergence designed for the special structure of the problem. We numerically demonstrate the effectiveness of our approach on real-world data under different fairness metrics.  ( 2 min )
    Adapting to game trees in zero-sum imperfect information games. (arXiv:2212.12567v1 [stat.ML])
    Imperfect information games (IIG) are games in which each player only partially observes the current game state. We study how to learn $\epsilon$-optimal strategies in a zero-sum IIG through self-play with trajectory feedback. We give a problem-independent lower bound $\mathcal{O}(H(A_{\mathcal{X}}+B_{\mathcal{Y}})/\epsilon^2)$ on the required number of realizations to learn these strategies with high probability, where $H$ is the length of the game and $A_{\mathcal{X}}$ and $B_{\mathcal{Y}}$ are the total numbers of actions for the two players. We also propose two Follow the Regularized Leader (FTRL) algorithms for this setting: Balanced-FTRL, which matches this lower bound but requires knowledge of the information set structure beforehand to define the regularization; and Adaptive-FTRL, which needs $\mathcal{O}(H^2(A_{\mathcal{X}}+B_{\mathcal{Y}})/\epsilon^2)$ plays without this requirement, by progressively adapting the regularization to the observations.
    A Labelled Sample Compression Scheme of Size at Most Quadratic in the VC Dimension. (arXiv:2212.12631v1 [cs.LG])
    This paper presents a construction of a proper and stable labelled sample compression scheme of size $O(\VCD^2)$ for any finite concept class, where $\VCD$ denotes the Vapnik-Chervonenkis Dimension. The construction is based on a well-known model of machine teaching, referred to as recursive teaching dimension. This substantially improves on the currently best known bound on the size of sample compression schemes (due to Moran and Yehudayoff), which is exponential in $\VCD$. The long-standing open question whether the smallest size of a sample compression scheme is in $O(\VCD)$ remains unresolved, but our results show that research on machine teaching is a promising avenue for the study of this open problem. As further evidence of the strong connections between machine teaching and sample compression, we prove that the model of no-clash teaching, introduced by Kirkpatrick et al., can be used to define a non-trivial lower bound on the size of stable sample compression schemes.
    A Lightweight Reconstruction Network for Surface Defect Inspection. (arXiv:2212.12878v1 [cs.CV])
    Currently, most deep learning methods cannot address the scarcity of industrial product defect samples or the significant differences in their characteristics. This paper proposes an unsupervised defect detection algorithm based on a reconstruction network, trained using only a large number of easily obtained defect-free samples. The network includes two parts: image reconstruction and surface defect area detection. The reconstruction network is designed as a fully convolutional autoencoder with a lightweight structure. Only a small number of normal samples are used for training, so that the reconstruction network can generate defect-free reconstructed images. A function combining structural loss and $L_1$ loss is proposed as the loss function of the reconstruction network, to address the poor detection of irregular texture surface defects. Further, the residual between the reconstructed image and the image under test is used as the candidate defect region, and conventional image operations can then localize the fault. The proposed unsupervised defect detection algorithm is evaluated on multiple defect image sample sets. Compared with other similar algorithms, the results show that the reconstruction-network-based algorithm achieves strong robustness and accuracy.  ( 2 min )
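    A minimal sketch of the reconstruct-then-threshold pipeline described above, assuming a PyTorch implementation; the architecture, threshold, and names are illustrative, and the paper's combined structural + $L_1$ training loss is not reproduced here.
    ```python
    import torch
    import torch.nn as nn

    class LightweightAE(nn.Module):
        """Minimal fully convolutional autoencoder for defect-free reconstruction."""
        def __init__(self):
            super().__init__()
            self.enc = nn.Sequential(
                nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU())
            self.dec = nn.Sequential(
                nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1), nn.Sigmoid())

        def forward(self, x):
            return self.dec(self.enc(x))

    def defect_map(model, image, threshold=0.1):
        """Residual between input and reconstruction marks candidate defect pixels."""
        with torch.no_grad():
            residual = (image - model(image)).abs()
        return residual > threshold  # boolean defect mask
    ```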
    Neural Networks beyond explainability: Selective inference for sequence motifs. (arXiv:2212.12542v1 [q-bio.GN])
    Over the past decade, neural networks have been successful at making predictions from biological sequences, especially in the context of regulatory genomics. As in other fields of deep learning, tools have been devised to extract features such as sequence motifs that can explain the predictions made by a trained network. Here we intend to go beyond explainable machine learning and introduce SEISM, a selective inference procedure to test the association between these extracted features and the predicted phenotype. In particular, we discuss how training a one-layer convolutional network is formally equivalent to selecting motifs maximizing some association score. We adapt existing sampling-based selective inference procedures by quantizing this selection over an infinite set to a large but finite grid. Finally, we show that sampling under a specific choice of parameters is sufficient to characterize the composite null hypothesis typically used for selective inference, a result that goes well beyond our particular framework. We illustrate the behavior of our method in terms of calibration, power and speed and discuss its power/speed trade-off with a simpler data-split strategy. SEISM paves the way to an easier analysis of neural networks used in regulatory genomics, and to more powerful methods for genome-wide association studies (GWAS).  ( 2 min )
    Rank-LIME: Local Model-Agnostic Feature Attribution for Learning to Rank. (arXiv:2212.12722v1 [cs.IR])
    Understanding why a model makes certain predictions is crucial when adapting it for real-world decision making. LIME is a popular model-agnostic feature attribution method for the tasks of classification and regression. However, the task of learning to rank in information retrieval is more complex than either classification or regression. In this work, we extend LIME to propose Rank-LIME, a model-agnostic, local, post-hoc linear feature attribution method for the task of learning to rank that generates explanations for ranked lists. We employ novel correlation-based perturbations and differentiable ranking loss functions, and introduce new metrics to evaluate ranking-based additive feature attribution models. We compare Rank-LIME with a variety of competing systems, with models trained on the MS MARCO datasets, and observe that Rank-LIME outperforms existing explanation algorithms in terms of Model Fidelity and Explain-NDCG. With this, we propose one of the first algorithms to generate additive feature attributions for explaining ranked lists.  ( 2 min )
    A Bayesian Robust Regression Method for Corrupted Data Reconstruction. (arXiv:2212.12787v1 [cs.LG])
    Because of the widespread existence of noise and data corruption, recovering the true regression parameters when a certain proportion of the response variables is corrupted is an essential task. Methods to overcome this problem often involve robust least-squares regression, but few perform well when confronted with severe adaptive adversarial attacks. In many applications, prior knowledge is often available from historical data or engineering experience, and by incorporating prior information into a robust regression method, we develop an effective robust regression method that can resist adaptive adversarial attacks. First, we propose the novel TRIP (hard Thresholding approach to Robust regression with sImple Prior) algorithm, which improves the breakdown point when facing adaptive adversarial attacks. Then, to improve robustness and reduce the estimation error caused by the inclusion of priors, we use the idea of Bayesian reweighting to construct the more robust BRHT (robust Bayesian Reweighting regression via Hard Thresholding) algorithm. We prove the theoretical convergence of the proposed algorithms under mild conditions, and extensive experiments show that under different types of dataset attacks, our algorithms outperform other benchmark methods. Finally, we apply our methods to a data-recovery problem in a real-world application involving a space solar array, demonstrating their good applicability.  ( 2 min )
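    A minimal sketch of the hard-thresholding idea behind such methods, without the prior-information and Bayesian reweighting components: alternately refit by least squares and discard the $k$ points with the largest residuals as presumed corruptions. Names and the fixed iteration count are assumptions.
    ```python
    import numpy as np

    def hard_threshold_regression(X, y, k, n_iter=50):
        """Alternating least squares / hard thresholding: treat the k largest
        residuals as corrupted and refit on the rest (simplified sketch; the
        paper's TRIP additionally incorporates prior information)."""
        n = len(y)
        keep = np.arange(n)
        beta = np.zeros(X.shape[1])
        for _ in range(n_iter):
            beta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
            residuals = np.abs(y - X @ beta)
            keep = np.argsort(residuals)[: n - k]  # drop k most-corrupted points
        return beta
    ```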
    Forecasting through deep learning and modal decomposition in multi-phase concentric jets. (arXiv:2212.12731v1 [cs.LG])
    This work presents a set of neural network (NN) models specifically designed for accurate and efficient fluid dynamics forecasting. We show how neural network training can be improved by reducing data complexity through a modal decomposition technique called higher order dynamic mode decomposition (HODMD), which identifies the main structures inside flow dynamics and reconstructs the original flow using only these main structures. This reconstruction has the same number of samples and spatial dimension as the original flow, but with less complex dynamics, while preserving its main features. We also show the low computational cost required by the proposed NN models, both in their training and inference phases. The core idea of this work is to test the limits of applicability of deep learning models to data forecasting in complex fluid dynamics problems. Generalization capabilities of the models are demonstrated by using the same neural network architectures to forecast the future dynamics of four different multi-phase flows. Data sets used to train and test these deep learning models come from Direct Numerical Simulations (DNS) of these flows.  ( 2 min )
    An Adaptive Deep RL Method for Non-Stationary Environments with Piecewise Stable Context. (arXiv:2212.12735v1 [cs.LG])
    One of the key challenges in deploying RL to real-world applications is adapting to variations of unknown environment contexts, such as changing terrains in robotic tasks and fluctuating bandwidth in congestion control. Existing works on adaptation to unknown environment contexts either assume the contexts are the same for the whole episode or assume the context variables are Markovian. However, in many real-world applications, the environment context usually stays stable for a stochastic period and then changes in an abrupt and unpredictable manner within an episode, resulting in a segment structure, which existing works fail to address. To leverage the segment structure of piecewise stable context in real-world applications, in this paper, we propose a \textit{\textbf{Se}gmented \textbf{C}ontext \textbf{B}elief \textbf{A}ugmented \textbf{D}eep~(SeCBAD)} RL method. Our method can jointly infer the belief distribution over the latent context with the posterior over segment length, and perform more accurate belief context inference using observed data within the current context segment. The inferred belief context can be leveraged to augment the state, leading to a policy that can adapt to abrupt variations in context. We demonstrate empirically that SeCBAD can infer context segment length accurately and outperforms existing methods on a toy grid world environment and MuJoCo tasks with piecewise-stable context.  ( 2 min )
    T2-GNN: Graph Neural Networks for Graphs with Incomplete Features and Structure via Teacher-Student Distillation. (arXiv:2212.12738v1 [cs.LG])
    Graph Neural Networks (GNNs) have been a prevailing technique for tackling various analysis tasks on graph data. A key premise for the remarkable performance of GNNs is complete and trustworthy initial graph descriptions (i.e., node features and graph structure), which is often not satisfied since real-world graphs are often incomplete due to various unavoidable factors. In particular, GNNs face greater challenges when both node features and graph structure are incomplete at the same time. Existing methods focus on either feature completion or structure completion. They usually rely on the matching relationship between features and structure, or employ joint learning of node representation and feature (or structure) completion in the hope of achieving mutual benefit. However, recent studies confirm that mutual interference between features and structure degrades GNN performance. When both features and structure are incomplete, the mismatch between features and structure caused by the missing randomness exacerbates this interference, which may trigger incorrect completions that negatively affect node representation. To this end, in this paper we propose a general GNN framework based on teacher-student distillation, namely T2-GNN, to improve the performance of GNNs on incomplete graphs. To avoid interference between features and structure, we separately design feature-level and structure-level teacher models to provide targeted guidance for the student model (base GNNs, such as GCN) through distillation. We then design two personalized methods to obtain well-trained feature and structure teachers. To ensure that the knowledge of the teachers is comprehensively and effectively distilled to the student, we further propose a dual distillation mode to enable the student to acquire as much expert knowledge as possible.  ( 2 min )
    A learning-based approach to multi-agent decision-making. (arXiv:2212.12561v1 [eess.SY])
    We propose a learning-based methodology to reconstruct private information held by a population of interacting agents in order to predict an exact outcome of the underlying multi-agent interaction process, here identified as a stationary action profile. We envision a scenario where an external observer, endowed with a learning procedure, is allowed to make queries and observe the agents' reactions through private action-reaction mappings, whose collective fixed point corresponds to a stationary profile. By adopting a smart query process to iteratively collect sensible data and update parametric estimates, we establish sufficient conditions to assess the asymptotic properties of the proposed learning-based methodology so that, if convergence happens, it can only be towards a stationary action profile. This fact yields two main consequences: i) learning locally-exact surrogates of the action-reaction mappings allows the external observer to succeed in its prediction task, and ii) working with assumptions so general that a stationary profile is not even guaranteed to exist, the established sufficient conditions hence act also as certificates for the existence of such a desirable profile. Extensive numerical simulations involving typical competitive multi-agent control and decision making problems illustrate the practical effectiveness of the proposed learning-based approach.  ( 2 min )
    Deep Latent State Space Models for Time-Series Generation. (arXiv:2212.12749v1 [stat.ML])
    Methods based on ordinary differential equations (ODEs) are widely used to build generative models of time-series. In addition to high computational overhead due to explicitly computing hidden-state recurrences, existing ODE-based models fall short in learning sequence data with sharp transitions - common in many real-world systems - due to numerical challenges during optimization. In this work, we propose LS4, a generative model for sequences with latent variables evolving according to a state space ODE to increase modeling capacity. Inspired by recent deep state space models (S4), we achieve speedups by leveraging a convolutional representation of LS4 which bypasses the explicit evaluation of hidden states. We show that LS4 significantly outperforms previous continuous-time generative models in terms of marginal distribution, classification, and prediction scores on real-world datasets in the Monash Forecasting Repository, and is capable of modeling highly stochastic data with sharp temporal transitions. LS4 sets the state of the art for continuous-time latent generative models, with significant improvements in mean squared error and tighter variational lower bounds on irregularly-sampled datasets, while also being 100x faster than other baselines on long sequences.  ( 2 min )
    Improving Uncertainty Quantification of Variance Networks by Tree-Structured Learning. (arXiv:2212.12658v1 [cs.LG])
    To improve uncertainty quantification of variance networks, we propose a novel tree-structured local neural network model that partitions the feature space into multiple regions based on uncertainty heterogeneity. A tree is built upon the training data, whose leaf nodes represent different regions where region-specific neural networks are trained to predict both the mean and the variance for quantifying uncertainty. The proposed Uncertainty-Splitting Neural Regression Tree (USNRT) employs novel splitting criteria. At each node, a neural network is first trained on the full data of the node, and a statistical test on the residuals is conducted to find the best split, corresponding to the two sub-regions with the most significant uncertainty heterogeneity. USNRT is computationally friendly because very few leaf nodes are sufficient and pruning is unnecessary. On extensive UCI datasets, in terms of both calibration and sharpness, USNRT shows superior performance compared to recent popular methods for variance prediction, including the vanilla variance network, deep ensembles, dropout-based methods, tree-based models, etc. Through comprehensive visualization and analysis, we uncover how USNRT works and show its merits.  ( 2 min )
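    A hedged sketch of one plausible splitting criterion in this spirit: after fitting a network at a node, test the residuals of the two candidate children for variance heterogeneity. Levene's test is used here as a stand-in; the paper's exact statistical test and the minimum-leaf rule are assumptions.
    ```python
    import numpy as np
    from scipy import stats

    def split_p_value(residuals, feature, threshold, min_leaf=30):
        """Significance of a candidate split: Levene's test for unequal residual
        variance between the two child regions (smaller p-value = better split)."""
        left = residuals[feature <= threshold]
        right = residuals[feature > threshold]
        if min(len(left), len(right)) < min_leaf:
            return 1.0                      # refuse splits with tiny children
        _, p_value = stats.levene(left, right)
        return p_value
    ```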
    Iterative regularization in classification via hinge loss diagonal descent. (arXiv:2212.12675v1 [stat.ML])
    Iterative regularization is a classic idea in regularization theory that has recently become popular in machine learning. On the one hand, it allows one to design efficient algorithms that control numerical and statistical accuracy at the same time. On the other hand, it sheds light on the learning curves observed while training neural networks. In this paper, we focus on iterative regularization in the context of classification. After contrasting this setting with that of regression and inverse problems, we develop an iterative regularization approach based on the hinge loss function. More precisely, we consider a diagonal approach for a family of algorithms for which we prove convergence as well as rates of convergence. Our approach compares favorably with other alternatives, as confirmed in numerical simulations.  ( 2 min )
    Beyond 5G Networks: Integration of Communication, Computing, Caching, and Control. (arXiv:2212.13141v1 [cs.NI])
    In recent years, the exponential proliferation of smart devices with their intelligent applications poses severe challenges to conventional cellular networks. Such challenges can be potentially overcome by integrating communication, computing, caching, and control (i4C) technologies. In this survey, we first give a snapshot of different aspects of the i4C, comprising background, motivation, leading technological enablers, potential applications, and use cases. Next, we describe different models of communication, computing, caching, and control (4C) to lay the foundation of the integration approach. We review current state-of-the-art research efforts related to the i4C, focusing on recent trends of both conventional and artificial intelligence (AI)-based integration approaches. We also highlight the need for intelligence in resource integration. Then, we discuss the integration of sensing and communication (ISAC) and classify the integration approaches into various classes. Finally, we outline open challenges and present future research directions for beyond-5G networks, such as 6G.
  • Open

    On Error and Compression Rates for Prototype Rules. (arXiv:2206.08014v2 [cs.LG] UPDATED)
    We study the close interplay between error and compression in the non-parametric multiclass classification setting in terms of prototype learning rules. We focus in particular on a recently proposed compression-based learning rule termed OptiNet (Kontorovich, Sabato, and Urner 2016; Kontorovich, Sabato, and Weiss 2017; Hanneke et al. 2021). Beyond its computational merits, this rule has been recently shown to be universally consistent in any metric instance space that admits a universally consistent rule--the first learning algorithm known to enjoy this property. However, its error and compression rates have been left open. Here we derive such rates in the case where instances reside in Euclidean space under commonly posed smoothness and tail conditions on the data distribution. We first show that OptiNet achieves non-trivial compression rates while enjoying near minimax-optimal error rates. We then proceed to study a novel general compression scheme for further compressing prototype rules that locally adapts to the noise level without sacrificing accuracy. Applying it to OptiNet, we show that under a geometric margin condition, further gain in the compression rate is achieved. Experimental results comparing the performance of the various methods are presented.
    Sliced gradient-enhanced Kriging for high-dimensional function approximation and aerodynamic modeling. (arXiv:2204.03562v2 [stat.ML] UPDATED)
    Gradient-enhanced Kriging (GE-Kriging) is a well-established surrogate modelling technique for approximating expensive computational models. However, it tends to become impractical for high-dimensional problems due to the large inherent correlation matrix and the associated high-dimensional hyper-parameter tuning problem. To address these issues, we propose a new method in this paper, called sliced GE-Kriging (SGE-Kriging), for reducing both the size of the correlation matrix and the number of hyper-parameters. Firstly, we perform a derivative-based global sensitivity analysis to detect the relative importance of each input variable with respect to the model response. Then, we propose to split the training sample set into multiple slices, and invoke Bayes' theorem to approximate the full likelihood function via a sliced likelihood function, in which multiple small correlation matrices are utilized to describe the correlation of the sample set. Additionally, we replace the original high-dimensional hyper-parameter tuning problem with a low-dimensional counterpart by learning the relationship between the hyper-parameters and the global sensitivity indices. Finally, we validate SGE-Kriging by means of numerical experiments on several benchmark problems. The results show that the SGE-Kriging model features accuracy and robustness comparable to the standard one, but at much lower training cost. The benefits are most evident in high-dimensional problems.
    Indeterminacy and Strong Identifiability in Generative Models. (arXiv:2206.00801v3 [stat.ML] UPDATED)
    Most modern probabilistic generative models, such as the variational autoencoder (VAE), have certain indeterminacies that are unresolvable even with an infinite amount of data. Different tasks tolerate different indeterminacies; however, recent applications have indicated the need for strongly identifiable models, in which an observation corresponds to a unique latent code. Progress has been made towards reducing model indeterminacies while maintaining flexibility, and recent work excludes many--but not all--indeterminacies. In this work, we motivate model-identifiability in terms of task-identifiability, then construct a theoretical framework for analyzing the indeterminacies of latent variable models, which enables their precise characterization in terms of the generator function and prior distribution spaces. We reveal that strong identifiability is possible even with highly flexible nonlinear generators, and give two such examples. One is a straightforward modification of iVAE (arXiv:1907.04809 [stat.ML]); the other uses triangular monotonic maps, leading to novel connections between optimal transport and identifiability.
    Granger Causal Chain Discovery for Sepsis-Associated Derangements via Multivariate Hawkes Processes. (arXiv:2209.04480v2 [stat.AP] UPDATED)
    Modern health care systems are conducting continuous, automated surveillance of the electronic medical record (EMR) to identify adverse events with increasing frequency; however, many events such as sepsis do not have clearly elucidated prodromes (i.e., event chains) that can be used to identify and intercept the adverse event early in its course. Currently there does not exist a reliable framework for discovering or describing causal chains that precede adverse hospital events. Clinically relevant and interpretable results require a framework that can (1) infer temporal interactions across multiple patient features found in EMR data (e.g., labs, vital signs, etc.) and (2) identify pattern(s) which precede and are specific to an impending adverse event (e.g., sepsis). In this work, we propose a linear multivariate Hawkes process model, coupled with a $g(x) = x^+$ link function to allow potential inhibition effects, in order to recover a Granger Causal (GC) graph. We develop a two-phase gradient-based scheme to maximize a surrogate of the likelihood to estimate the problem parameters. This two-phase algorithm is scalable and shown to be effective via our numerical simulation. It is subsequently applied to a data set of patients admitted to the Grady hospital system in Atlanta, GA, where the fitted Granger Causal graph identifies several highly interpretable chains that precede sepsis.
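    A minimal sketch of the model's intensity function, assuming exponential kernels and NumPy; the ReLU link $g(x) = x^+$ lets the excitation weights be negative (inhibition) while keeping intensities non-negative. The kernel choice and names are assumptions.
    ```python
    import numpy as np

    def intensity(t, events, mu, A, beta):
        """lambda_i(t) = (mu_i + sum_j sum_{t_jk < t} A[i, j] exp(-beta (t - t_jk)))^+
        events: list over dimensions j of arrays of past event times.
        A may contain negative entries (inhibition); the ReLU link keeps
        the returned intensity vector non-negative."""
        lam = mu.copy()
        for j, times in enumerate(events):
            past = times[times < t]
            lam += A[:, j] * np.exp(-beta * (t - past)).sum()
        return np.maximum(lam, 0.0)  # g(x) = x^+
    ```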
    Data Redaction from Pre-trained GANs. (arXiv:2206.14389v2 [cs.LG] UPDATED)
    Large pre-trained generative models are known to occasionally output undesirable samples, which undermines their trustworthiness. The common way to mitigate this is to re-train them differently from scratch using different data or different regularization -- which uses a lot of computational resources and does not always fully address the problem. In this work, we take a different, more compute-friendly approach and investigate how to post-edit a model after training so that it ''redacts'', or refrains from outputting, certain kinds of samples. We show that redaction is a fundamentally different task from data deletion, and data deletion may not always lead to redaction. We then consider Generative Adversarial Networks (GANs), and provide three different algorithms for data redaction that differ in how the samples to be redacted are described. Extensive evaluations on real-world image datasets show that our algorithms outperform data deletion baselines, and are capable of redacting data while retaining high generation quality at a fraction of the cost of full re-training.
    Tensor Principal Component Analysis. (arXiv:2212.12981v1 [econ.EM])
    In this paper, we develop new methods for analyzing high-dimensional tensor datasets. A tensor factor model describes a high-dimensional dataset as a sum of a low-rank component and idiosyncratic noise, generalizing traditional factor models for panel data. We propose an estimation algorithm, called tensor principal component analysis (PCA), which generalizes the traditional PCA applicable to panel data. The algorithm involves unfolding the tensor into a sequence of matrices along different dimensions and applying PCA to the unfolded matrices. We provide theoretical results on the consistency and asymptotic distribution of the tensor PCA estimators of loadings and factors. The algorithm demonstrates good performance in Monte Carlo experiments and is applied to sorted portfolios.
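    A minimal sketch of the unfold-then-PCA recipe described above, assuming NumPy: the loadings for each mode are taken as the top left singular vectors of the corresponding unfolding. This omits the paper's refinements and inference theory.
    ```python
    import numpy as np

    def unfold(tensor, mode):
        """Matricize a tensor along one mode (mode-n unfolding)."""
        return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

    def tensor_pca(tensor, ranks):
        """Per-mode loadings: the top-r left singular vectors of each
        mode-n unfolding (simplified sketch of the algorithm above)."""
        loadings = []
        for mode, r in enumerate(ranks):
            U, _, _ = np.linalg.svd(unfold(tensor, mode), full_matrices=False)
            loadings.append(U[:, :r])
        return loadings
    ```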
    Why neural networks find simple solutions: the many regularizers of geometric complexity. (arXiv:2209.13083v2 [cs.LG] UPDATED)
    In many contexts, simpler models are preferable to more complex models and the control of this model complexity is the goal for many methods in machine learning such as regularization, hyperparameter tuning and architecture design. In deep learning, it has been difficult to understand the underlying mechanisms of complexity control, since many traditional measures are not naturally suitable for deep neural networks. Here we develop the notion of geometric complexity, which is a measure of the variability of the model function, computed using a discrete Dirichlet energy. Using a combination of theoretical arguments and empirical results, we show that many common training heuristics such as parameter norm regularization, spectral norm regularization, flatness regularization, implicit gradient regularization, noise regularization and the choice of parameter initialization all act to control geometric complexity, providing a unifying framework in which to characterize the behavior of deep learning models.
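    A hedged sketch of how a discrete Dirichlet energy can be estimated in PyTorch for a scalar-output model: the mean squared norm of the input gradient over a batch. The exact estimator used in the paper may differ.
    ```python
    import torch

    def geometric_complexity(model, x):
        """Discrete Dirichlet energy over a batch: E[ ||d model(x) / dx||^2 ],
        assuming `model` has a scalar output per example."""
        x = x.clone().requires_grad_(True)
        out = model(x).squeeze(-1)                     # shape (batch,)
        grad, = torch.autograd.grad(out.sum(), x, create_graph=True)
        return grad.pow(2).flatten(1).sum(dim=1).mean()
    ```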
    Gaussian Pre-Activations in Neural Networks: Myth or Reality?. (arXiv:2205.12379v2 [cs.LG] UPDATED)
    The study of feature propagation at initialization in neural networks lies at the root of numerous initialization designs. An assumption very commonly made in the field states that the pre-activations are Gaussian. Although this convenient Gaussian hypothesis can be justified when the number of neurons per layer tends to infinity, it is challenged by both theoretical and experimental works for finite-width neural networks. Our major contribution is to construct a family of pairs of activation functions and initialization distributions that ensure that the pre-activations remain Gaussian throughout the network's depth, even in narrow neural networks. In the process, we discover a set of constraints that a neural network should fulfill to ensure Gaussian pre-activations. Additionally, we provide a critical review of the claims of the Edge of Chaos line of works and build an exact Edge of Chaos analysis. We also propose a unified view on pre-activations propagation, encompassing the framework of several well-known initialization procedures. Finally, our work provides a principled framework for answering the much-debated question: is it desirable to initialize the training of a neural network whose pre-activations are ensured to be Gaussian?
    Inference on Strongly Identified Functionals of Weakly Identified Functions. (arXiv:2208.08291v2 [stat.ME] UPDATED)
    In a variety of applications, including nonparametric instrumental variable (NPIV) analysis, proximal causal inference under unmeasured confounding, and missing-not-at-random data with shadow variables, we are interested in inference on a continuous linear functional (e.g., average causal effects) of nuisance function (e.g., NPIV regression) defined by conditional moment restrictions. These nuisance functions are generally weakly identified, in that the conditional moment restrictions can be severely ill-posed as well as admit multiple solutions. This is sometimes resolved by imposing strong conditions that imply the function can be estimated at rates that make inference on the functional possible. In this paper, we study a novel condition for the functional to be strongly identified even when the nuisance function is not; that is, the functional is amenable to asymptotically-normal estimation at $\sqrt{n}$-rates. The condition implies the existence of debiasing nuisance functions, and we propose penalized minimax estimators for both the primary and debiasing nuisance functions. The proposed nuisance estimators can accommodate flexible function classes, and importantly they can converge to fixed limits determined by the penalization regardless of the identifiability of the nuisances. We use the penalized nuisance estimators to form a debiased estimator for the functional of interest and prove its asymptotic normality under generic high-level conditions, which provide for asymptotically valid confidence intervals. We also illustrate our method in a novel partially linear proximal causal inference problem and a partially linear instrumental variable regression problem.
    How unfair is private learning?. (arXiv:2206.03985v2 [cs.LG] UPDATED)
    As machine learning algorithms are deployed on sensitive data in critical decision making processes, it is becoming increasingly important that they are also private and fair. In this paper, we show that, when the data has a long-tailed structure, it is not possible to build learning algorithms that are both private and accurate on minority subpopulations. We further show that relaxing overall accuracy can lead to good fairness even with strict privacy requirements. To corroborate our theoretical results in practice, we provide an extensive set of experimental results using a variety of synthetic, vision (CIFAR10 and CelebA), and tabular (Law School) datasets and learning algorithms.
    Convergence of Batch Asynchronous Stochastic Approximation With Applications to Reinforcement Learning. (arXiv:2109.03445v3 [stat.ML] UPDATED)
    The stochastic approximation (SA) algorithm is a widely used probabilistic method for finding a zero or a fixed point of a vector-valued function, when only noisy measurements of the function are available. In the literature to date, one makes a distinction between ``synchronous'' updating, whereby every component of the current guess is updated at each time, and ``asynchronous'' updating, whereby only one component is updated. In this paper, we study an intermediate situation that we call ``batch asynchronous stochastic approximation'' (BASA), in which, at each time instant, \textit{some but not all} components of the current estimated solution are updated. BASA allows the user to trade off memory requirements against time complexity. We develop a general methodology for proving that such algorithms converge to the fixed point of the map under study. These convergence proofs make use of weaker hypotheses than existing results. Specifically, existing convergence proofs require that the measurement noise is a zero-mean i.i.d. sequence or a martingale difference sequence. In the present paper, we permit biased measurements, that is, measurement noises that have nonzero conditional mean. Also, all convergence results to date assume that the stochastic step sizes satisfy a probabilistic analog of the well-known Robbins-Monro conditions. We replace this assumption by a purely deterministic condition on the irreducibility of the underlying Markov processes. As specific applications to Reinforcement Learning, we analyze the temporal difference algorithm $TD(\lambda)$ for value iteration, and the $Q$-learning algorithm for finding the optimal action-value function. In both cases, we establish the convergence of these algorithms, under milder conditions than in the existing literature.
    Attentional-Biased Stochastic Gradient Descent. (arXiv:2012.06951v4 [cs.LG] UPDATED)
    In this paper, we present a simple yet effective method (ABSGD) for addressing the data imbalance issue in deep learning. Our method is a simple modification to momentum SGD where we leverage an attentional mechanism to assign an individual importance weight to each gradient in the mini-batch. Unlike many existing heuristic-driven methods for tackling data imbalance, our method is grounded in {\it theoretically justified distributionally robust optimization (DRO)}, which is guaranteed to converge to a stationary point of an information-regularized DRO problem. The individual-level weight of a sampled data is systematically proportional to the exponential of a scaled loss value of the data, where the scaling factor is interpreted as the regularization parameter in the framework of information-regularized DRO. Compared with existing class-level weighting schemes, our method can capture the diversity between individual examples within each class. Compared with existing individual-level weighting methods using meta-learning that require three backward propagations for computing mini-batch stochastic gradients, our method is more efficient with only one backward propagation at each iteration as in standard deep learning methods. To balance between the learning of feature extraction layers and the learning of the classifier layer, we employ a two-stage method that uses SGD for pretraining followed by ABSGD for learning a robust classifier and finetuning lower layers. Our empirical studies on several benchmark datasets demonstrate the effectiveness of the proposed method.
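    A minimal sketch of the attentional weighting step, assuming PyTorch: per-example weights proportional to the exponential of the scaled loss, normalized within the mini-batch, so a single backward pass suffices. The helper name and rescaling are illustrative.
    ```python
    import torch

    def absgd_weights(losses, lam):
        """Attentional importance weights: proportional to exp(loss / lam),
        normalized within the mini-batch (lam plays the role of the DRO
        regularization parameter)."""
        w = torch.softmax(losses.detach() / lam, dim=0)
        return w * len(losses)  # rescale so the weights average to 1

    # usage inside a training step (criterion with reduction='none'):
    # losses = criterion(model(x), y)
    # loss = (absgd_weights(losses, lam=5.0) * losses).mean()
    # loss.backward(); optimizer.step()
    ```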
    Formalising the Use of the Activation Function in Neural Inference. (arXiv:2102.04896v3 [q-bio.NC] UPDATED)
    We investigate how the activation function can be used to describe neural firing in an abstract way, and in turn, why it works well in artificial neural networks. We discuss how a spike in a biological neurone belongs to a particular universality class of phase transitions in statistical physics. We then show that the artificial neurone is, mathematically, a mean field model of biological neural membrane dynamics, which arises from modelling spiking as a phase transition. This allows us to treat selective neural firing in an abstract way, and formalise the role of the activation function in perceptron learning. The resultant statistical physical model allows us to recover the expressions for some known activation functions as various special cases. Along with deriving this model and specifying the analogous neural case, we analyse the phase transition to understand the physics of neural network learning. Together, it is shown that there is not only a biological meaning, but a physical justification, for the emergence and performance of typical activation functions; implications for neural learning and inference are also discussed.
    Online Active Learning for Soft Sensor Development using Semi-Supervised Autoencoders. (arXiv:2212.13067v1 [cs.LG])
    Data-driven soft sensors are extensively used in industrial and chemical processes to predict hard-to-measure process variables whose real value is difficult to track during routine operations. The regression models used by these sensors often require a large number of labeled examples, yet obtaining the label information can be very expensive given the high time and cost required by quality inspections. In this context, active learning methods can be highly beneficial as they can suggest the most informative labels to query. However, most of the active learning strategies proposed for regression focus on the offline setting. In this work, we adapt some of these approaches to the stream-based scenario and show how they can be used to select the most informative data points. We also demonstrate how to use a semi-supervised architecture based on orthogonal autoencoders to learn salient features in a lower dimensional space. The Tennessee Eastman Process is used to compare the predictive performance of the proposed approaches.
    Demand Forecasting for Platelet Usage: from Univariate Time Series to Multivariate Models. (arXiv:2101.02305v2 [cs.LG] UPDATED)
    Platelet products are both expensive and have very short shelf lives. As usage rates for platelets are highly variable, the effective management of platelet demand and supply is very important yet challenging. The primary goal of this paper is to present an efficient forecasting model for platelet demand at Canadian Blood Services (CBS). To accomplish this goal, four different demand forecasting methods, ARIMA (AutoRegressive Integrated Moving Average), Prophet, lasso regression (least absolute shrinkage and selection operator) and LSTM (Long Short-Term Memory) networks, are utilized and evaluated. We use a large clinical dataset for a centralized blood distribution centre for four hospitals in Hamilton, Ontario, spanning from 2010 to 2018 and consisting of daily platelet transfusions along with information such as the product specifications, the recipients' characteristics, and the recipients' laboratory test results. This study is the first to utilize methods ranging from statistical time series models to data-driven regression and machine learning techniques for platelet transfusion using clinical predictors and with different amounts of data. We find that the multivariate approaches have the highest accuracy in general; however, if sufficient data are available, a simpler time series approach such as ARIMA appears to be sufficient. We also comment on how to choose clinical indicators (inputs) for the multivariate models.
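    For the univariate baseline, a minimal ARIMA usage sketch with statsmodels; the order (2, 1, 2) and function name are placeholders (in practice the order would be selected by information criteria or cross-validation).
    ```python
    import pandas as pd
    from statsmodels.tsa.arima.model import ARIMA

    def arima_forecast(daily_demand: pd.Series, horizon: int = 7):
        """Fit a univariate ARIMA on daily platelet demand and forecast
        `horizon` days ahead (order (2, 1, 2) is illustrative only)."""
        fitted = ARIMA(daily_demand, order=(2, 1, 2)).fit()
        return fitted.forecast(steps=horizon)
    ```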
    A Generalized EigenGame with Extensions to Multiview Representation Learning. (arXiv:2211.11323v2 [cs.LG] UPDATED)
    Generalized Eigenvalue Problems (GEPs) encompass a range of interesting dimensionality reduction methods. Development of efficient stochastic approaches to these problems would allow them to scale to larger datasets. Canonical Correlation Analysis (CCA) is one example of a GEP for dimensionality reduction which has found extensive use in problems with two or more views of the data. Deep learning extensions of CCA require large mini-batch sizes, and therefore large memory consumption, in the stochastic setting to achieve good performance, and this has limited its application in practice. Inspired by the Generalized Hebbian Algorithm, we develop an approach to solving stochastic GEPs in which all constraints are softly enforced by Lagrange multipliers. Then, by considering the integral of this Lagrangian function (its pseudo-utility) and inspired by recent formulations of Principal Components Analysis and GEPs as games with differentiable utilities, we develop a game-theory-inspired approach to solving GEPs. Our approach shares much of the theoretical grounding of the previous Hebbian and game-theoretic approaches in the linear case, but permits extension to general function approximators such as neural networks for certain GEPs for dimensionality reduction, including CCA; our method can therefore be used for deep multiview representation learning. We demonstrate the effectiveness of our method for solving GEPs in the stochastic setting using canonical multiview datasets, and demonstrate state-of-the-art performance for optimizing Deep CCA.
    Streaming Traffic Flow Prediction Based on Continuous Reinforcement Learning. (arXiv:2212.12767v1 [stat.ML])
    Traffic flow prediction is an important part of smart transportation. The goal is to predict future traffic conditions based on historical data recorded by sensors and the traffic network. As the city continues to build, parts of the transportation network will be added or modified. How to accurately predict expanding and evolving long-term streaming networks is of great significance. To this end, we propose a new simulation-based criterion that considers teaching autonomous agents to mimic sensor patterns, planning their next visit based on the sensor's profile (e.g., traffic, speed, occupancy). The data recorded by the sensor is most accurate when the agent can perfectly simulate the sensor's activity pattern. We propose to formulate the problem as a continuous reinforcement learning task, where the agent is the next flow value predictor, the action is the next time-series flow value in the sensor, and the environment state is a dynamically fused representation of the sensor and transportation network. Actions taken by the agent change the environment, which in turn forces the agent's model to update, while the agent further explores changes in the dynamic traffic network, which helps the agent predict its next visit more accurately. Therefore, we develop a strategy in which sensors and traffic networks update each other and incorporate temporal context to quantify state representations evolving over time.
    Distilling and Transferring Knowledge via cGAN-generated Samples for Image Classification and Regression. (arXiv:2104.03164v4 [cs.CV] UPDATED)
    Knowledge distillation (KD) has been actively studied for image classification tasks in deep learning, aiming to improve the performance of a student based on the knowledge from a teacher. However, applying KD in image regression with a scalar response variable has been rarely studied, and there exists no KD method applicable to both classification and regression tasks yet. Moreover, existing KD methods often require a practitioner to carefully select or adjust the teacher and student architectures, making these methods less flexible in practice. To address the above problems in a unified way, we propose a comprehensive KD framework based on cGANs, termed cGAN-KD. Fundamentally different from existing KD methods, cGAN-KD distills and transfers knowledge from a teacher model to a student model via cGAN-generated samples. This novel mechanism makes cGAN-KD suitable for both classification and regression tasks, compatible with other KD methods, and insensitive to the teacher and student architectures. An error bound for a student model trained in the cGAN-KD framework is derived in this work, providing a theory for why cGAN-KD is effective as well as guiding the practical implementation of cGAN-KD. Extensive experiments on CIFAR-100 and ImageNet-100 show that we can combine state-of-the-art KD methods with the cGAN-KD framework to yield a new state of the art. Moreover, experiments on Steering Angle and UTKFace demonstrate the effectiveness of cGAN-KD in image regression tasks, where existing KD methods are inapplicable.
    A Fair Pricing Model via Adversarial Learning. (arXiv:2202.12008v3 [stat.ML] UPDATED)
    At the core of the insurance business lies classification between risky and non-risky insureds, actuarial fairness meaning that risky insureds should contribute more and pay a higher premium than non-risky or less-risky ones. Actuaries, therefore, use econometric or machine learning techniques to classify, but the distinction between a fair actuarial classification and "discrimination" is subtle. For this reason, there is growing interest in fairness and discrimination in the actuarial community (Lindholm, Richman, Tsanakas, and Wuthrich 2022). Presumably, non-sensitive characteristics can serve as substitutes or proxies for protected attributes. For example, the color and model of a car, combined with the driver's occupation, may lead to an undesirable gender bias in the prediction of car insurance prices. Surprisingly, we will show that (1) debiasing the predictor alone may be insufficient to maintain adequate accuracy. Indeed, the traditional pricing model is currently built in a two-stage structure that considers many potentially biased components such as car or geographic risks. We will show that this traditional structure has significant limitations in achieving fairness. For this reason, we have developed a novel pricing model approach. Recently, some approaches (Blier-Wong, Cossette, Lamontagne, and Marceau 2021; Wuthrich and Merz 2021) have shown the value of autoencoders in pricing. In this paper, we will show that (2) this can be generalized to multiple pricing factors (geographic, car type), and (3) it is perfectly adapted to a fairness context (since it allows debiasing of the set of pricing components): we extend this main idea to a general framework in which a single whole pricing model is trained by generating the geographic and car pricing components needed to predict the pure premium while mitigating the unwanted bias according to the desired metric.
    DeepMed: Semiparametric Causal Mediation Analysis with Debiased Deep Learning. (arXiv:2210.04389v2 [stat.ML] UPDATED)
    Causal mediation analysis can unpack the black box of causality and is therefore a powerful tool for disentangling causal pathways in biomedical and social sciences, and also for evaluating machine learning fairness. To reduce bias for estimating Natural Direct and Indirect Effects in mediation analysis, we propose a new method called DeepMed that uses deep neural networks (DNNs) to cross-fit the infinite-dimensional nuisance functions in the efficient influence functions. We obtain novel theoretical results that our DeepMed method (1) can achieve semiparametric efficiency bound without imposing sparsity constraints on the DNN architecture and (2) can adapt to certain low dimensional structures of the nuisance functions, significantly advancing the existing literature on DNN-based semiparametric causal inference. Extensive synthetic experiments are conducted to support our findings and also expose the gap between theory and practice. As a proof of concept, we apply DeepMed to analyze two real datasets on machine learning fairness and reach conclusions consistent with previous findings.
    Doubly Smoothed GDA: Global Convergent Algorithm for Constrained Nonconvex-Nonconcave Minimax Optimization. (arXiv:2212.12978v1 [math.OC])
    Nonconvex-nonconcave minimax optimization has been the focus of intense research over the last decade due to its broad applications in machine learning and operation research. Unfortunately, most existing algorithms cannot be guaranteed to converge and always suffer from limit cycles. Their global convergence relies on certain conditions that are difficult to check, including but not limited to the global Polyak-\L{}ojasiewicz condition, the existence of a solution satisfying the weak Minty variational inequality and $\alpha$-interaction dominant condition. In this paper, we develop the first provably convergent algorithm called doubly smoothed gradient descent ascent method, which gets rid of the limit cycle without requiring any additional conditions. We further show that the algorithm has an iteration complexity of $\mathcal{O}(\epsilon^{-4})$ for finding a game stationary point, which matches the best iteration complexity of single-loop algorithms under nonconcave-concave settings. The algorithm presented here opens up a new path for designing provable algorithms for nonconvex-nonconcave minimax optimization problems.
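    A loose, schematic sketch of the doubly smoothed idea (an assumption for illustration, not the paper's exact algorithm): run gradient descent ascent on the objective augmented with proximal anchors, and move the anchors slowly; the slow anchor averaging is what suppresses limit cycles.
    ```python
    import torch

    def ds_gda_step(f, x, y, zx, zy, eta=1e-2, r=1.0, beta=0.1):
        """One schematic doubly-smoothed GDA step on a scalar-valued f(x, y);
        x and y must be leaf tensors with requires_grad=True."""
        obj = (f(x, y) + (r / 2) * ((x - zx) ** 2).sum()
               - (r / 2) * ((y - zy) ** 2).sum())
        gx, gy = torch.autograd.grad(obj, [x, y])
        x = (x - eta * gx).detach().requires_grad_(True)   # primal descent
        y = (y + eta * gy).detach().requires_grad_(True)   # dual ascent
        zx = zx + beta * (x.detach() - zx)                 # slow anchor updates
        zy = zy + beta * (y.detach() - zy)
        return x, y, zx, zy
    ```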
    Statistical Mechanics of Generalization In Graph Convolution Networks. (arXiv:2212.13069v1 [cs.LG])
    Graph neural networks (GNN) have become the default machine learning model for relational datasets, including protein interaction networks, biological neural networks, and scientific collaboration graphs. We use tools from statistical physics and random matrix theory to precisely characterize generalization in simple graph convolution networks on the contextual stochastic block model. The derived curves are phenomenologically rich: they explain the distinction between learning on homophilic and heterophilic graphs and they predict double descent whose existence in GNNs has been questioned by recent work. Our results are the first to accurately explain the behavior not only of a stylized graph learning model but also of complex GNNs on messy real-world datasets. To wit, we use our analytic insights about homophily and heterophily to improve performance of state-of-the-art graph neural networks on several heterophilic benchmarks by a simple addition of negative self-loop filters.
    Policy Learning with Competing Agents. (arXiv:2204.01884v2 [stat.ML] UPDATED)
    Decision makers often aim to learn a treatment assignment policy under a capacity constraint on the number of agents that they can treat. When agents can respond strategically to such policies, competition arises, complicating the estimation of the effect of the policy. In this paper, we study capacity-constrained treatment assignment in the presence of such interference. We consider a dynamic model where the decision maker allocates treatments at each time step and heterogeneous agents myopically best respond to the previous treatment assignment policy. When the number of agents is large but finite, we show that the threshold for receiving treatment under a given policy converges to the policy's mean-field equilibrium threshold. Based on this result, we develop a consistent estimator for the policy effect. In simulations and a semi-synthetic experiment with data from the National Education Longitudinal Study of 1988, we demonstrate that this estimator can be used for learning capacity-constrained policies in the presence of strategic behavior.
    Modeling Nonlinear Dynamics in Continuous Time with Inductive Biases on Decay Rates and/or Frequencies. (arXiv:2212.13033v1 [stat.ML])
    We propose a neural network-based model for nonlinear dynamics in continuous time that can impose inductive biases on decay rates and/or frequencies. Inductive biases are helpful for training neural networks, especially when training data are small. The proposed model is based on Koopman operator theory, where the decay rate and frequency information is used by restricting the eigenvalues of the Koopman operator that describe the linear evolution in a Koopman space. We use neural networks, trained by minimizing multi-step forecasting and backcasting errors on irregularly sampled time-series data, to find an appropriate Koopman space. Experiments on various time-series datasets demonstrate that the proposed method achieves higher forecasting performance given a single short training sequence than existing methods.  ( 2 min )
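    A minimal sketch of one way to impose such eigenvalue biases, assuming PyTorch: parameterize continuous-time Koopman eigenvalues $\lambda = -\gamma + i\omega$ with the decay rate $\gamma$ squashed into a prescribed range, and keep the frequencies $\omega$ fixed or trainable as desired. The parameterization is an assumption, not the paper's exact construction.
    ```python
    import torch

    def koopman_eigenvalues(raw_decay, freq, max_decay=1.0):
        """Eigenvalues lambda = -gamma + i*omega with gamma in (0, max_decay);
        freq can be a fixed buffer (hard bias) or a trainable parameter."""
        gamma = max_decay * torch.sigmoid(raw_decay)
        return torch.complex(-gamma, freq)

    def evolve(z0, eigvals, t):
        """Closed-form linear evolution in Koopman space at irregular times t:
        z(t) = z0 * exp(lambda * t); returns shape (len(t), dim)."""
        return z0 * torch.exp(eigvals.unsqueeze(0) * t.unsqueeze(1))
    ```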
    Improving SGD convergence by online linear regression of gradients in multiple statistically relevant directions. (arXiv:1901.11457v9 [cs.LG] UPDATED)
    Deep neural networks are usually trained with stochastic gradient descent (SGD), which minimizes the objective function using very rough approximations of the gradient that only average to the real gradient. Standard approaches like momentum or ADAM only consider a single direction and do not try to model the distance from an extremum, neglecting valuable information from the calculated sequence of gradients and often stagnating on some suboptimal plateau. Second-order methods could exploit these missed opportunities; however, besides suffering from very large cost and numerical instabilities, many of them are attracted to suboptimal points like saddles because they neglect the signs of curvatures (the eigenvalues of the Hessian). The saddle-free Newton (SFN) method is a rare example of addressing this issue: it turns saddle attraction into repulsion, and was shown to provide an essential improvement in final values this way. However, it neglects noise while modelling second-order behavior, focuses on the Krylov subspace for numerical reasons, and requires costly eigendecomposition. Maintaining the advantages of SFN, we propose inexpensive ways to exploit these opportunities. Second-order behavior is linear dependence of the first derivative: we can optimally estimate it from a sequence of noisy gradients with least-squares linear regression, here in an online setting with weakening weights for old gradients. A statistically relevant subspace is suggested by PCA of recent noisy gradients; in the online setting this can be done by slowly rotating the considered directions toward new gradients, gradually replacing old directions with recent, statistically relevant ones. The eigendecomposition can also be performed online, with regularly performed steps of the QR method to maintain a diagonal Hessian. Outside the modeled second-order subspace we can simultaneously perform gradient descent.
    A Universal Law of Robustness via Isoperimetry. (arXiv:2105.12806v4 [cs.LG] UPDATED)
    Classically, data interpolation with a parametrized model class is possible as long as the number of parameters is larger than the number of equations to be satisfied. A puzzling phenomenon in deep learning is that models are trained with many more parameters than what this classical theory would suggest. We propose a partial theoretical explanation for this phenomenon. We prove that for a broad class of data distributions and model classes, overparametrization is necessary if one wants to interpolate the data smoothly. Namely we show that smooth interpolation requires $d$ times more parameters than mere interpolation, where $d$ is the ambient data dimension. We prove this universal law of robustness for any smoothly parametrized function class with polynomial size weights, and any covariate distribution verifying isoperimetry. In the case of two-layers neural networks and Gaussian covariates, this law was conjectured in prior work by Bubeck, Li and Nagaraj. We also give an interpretation of our result as an improved generalization bound for model classes consisting of smooth functions.
    Reconstructing Kernel-based Machine Learning Force Fields with Super-linear Convergence. (arXiv:2212.12737v1 [physics.chem-ph])
    Kernel machines have sustained continuous progress in the field of quantum chemistry. In particular, they have proven to be successful in the low-data regime of force field reconstruction. This is because many physical invariances and symmetries can be incorporated into the kernel function to compensate for much larger datasets. So far, the scalability of this approach has however been hindered by its cubic runtime in the number of training points. While it is known that iterative Krylov subspace solvers can overcome these burdens, they crucially rely on effective preconditioners, which are elusive in practice. Practical preconditioners need to be computationally efficient and numerically robust at the same time. Here, we consider the broad class of Nystr\"om-type methods to construct preconditioners based on successively more sophisticated low-rank approximations of the original kernel matrix, each of which provides a different set of computational trade-offs. All considered methods estimate the relevant subspace spanned by the kernel matrix columns using different strategies to identify a representative set of inducing points. Our comprehensive study covers the full spectrum of approaches, starting from naive random sampling, to leverage score estimates and incomplete Cholesky factorizations, up to exact SVD decompositions.
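    A hedged sketch of a Nystr\"om preconditioner for conjugate gradients on $(K + \lambda I)x = y$, with the inducing columns chosen here by plain random sampling (leverage scores or incomplete Cholesky are drop-in alternatives); all names are illustrative.
    ```python
    import numpy as np
    from scipy.sparse.linalg import LinearOperator, cg

    def nystrom_pcg(K, y, idx, lam):
        """Solve (K + lam*I) x = y by CG with a Nystrom preconditioner built
        from the inducing columns `idx` of the kernel matrix K."""
        n = K.shape[0]
        W = K[np.ix_(idx, idx)]
        vals, vecs = np.linalg.eigh(W + 1e-10 * np.eye(len(idx)))
        L = K[:, idx] @ (vecs / np.sqrt(np.maximum(vals, 1e-10)))  # L L^T ~= K
        U, S, _ = np.linalg.svd(L, full_matrices=False)            # O(n m^2)
        def apply_pinv(v):  # Woodbury inverse of (U S^2 U^T + lam*I)
            t = U.T @ v
            return U @ (t / (S ** 2 + lam)) + (v - U @ t) / lam
        A = LinearOperator((n, n), matvec=lambda v: K @ v + lam * v)
        M = LinearOperator((n, n), matvec=apply_pinv)
        x, _ = cg(A, y, M=M)
        return x
    ```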
    Orthogonal Series Estimation for the Ratio of Conditional Expectation Functions. (arXiv:2212.13145v1 [econ.EM])
    In various fields of data science, researchers are often interested in estimating the ratio of conditional expectation functions (CEFR). Specifically in causal inference problems, it is sometimes natural to consider ratio-based treatment effects, such as odds ratios and hazard ratios, and even difference-based treatment effects are identified as CEFR in some empirically relevant settings. This chapter develops the general framework for estimation and inference on CEFR, which allows the use of flexible machine learning for infinite-dimensional nuisance parameters. In the first stage of the framework, the orthogonal signals are constructed using debiased machine learning techniques to mitigate the negative impacts of the regularization bias in the nuisance estimates on the target estimates. The signals are then combined with a novel series estimator tailored for CEFR. We derive the pointwise and uniform asymptotic results for estimation and inference on CEFR, including the validity of the Gaussian bootstrap, and provide low-level sufficient conditions to apply the proposed framework to some specific examples. We demonstrate the finite-sample performance of the series estimator constructed under the proposed framework by numerical simulations. Finally, we apply the proposed method to estimate the causal effect of the 401(k) program on household assets.  ( 2 min )
    Distributionally Robust Model-Based Offline Reinforcement Learning with Near-Optimal Sample Complexity. (arXiv:2208.05767v3 [cs.LG] UPDATED)
    This paper concerns the central issues of model robustness and sample efficiency in offline reinforcement learning (RL), which aims to learn to perform decision making from history data without active exploration. Due to uncertainties and variabilities of the environment, it is critical to learn a robust policy -- with as few samples as possible -- that performs well even when the deployed environment deviates from the nominal one used to collect the history dataset. We consider a distributionally robust formulation of offline RL, focusing on tabular robust Markov decision processes with an uncertainty set specified by the Kullback-Leibler divergence in both finite-horizon and infinite-horizon settings. To combat sample scarcity, a model-based algorithm that combines distributionally robust value iteration with the principle of pessimism in the face of uncertainty is proposed, by penalizing the robust value estimates with a carefully designed data-driven penalty term. Under a mild and tailored assumption on the history dataset that measures distribution shift without requiring full coverage of the state-action space, we establish the finite-sample complexity of the proposed algorithm, and further show it is almost unimprovable in light of a nearly-matching information-theoretic lower bound up to a polynomial factor of the (effective) horizon length. To the best of our knowledge, this provides the first provably near-optimal robust offline RL algorithm that learns under model uncertainty and partial coverage.
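    For intuition, the inner KL-robust expectation in such a robust Bellman backup admits a well-known scalar dual, which can be evaluated per state-action pair; a minimal sketch follows (not the paper's algorithm, which additionally subtracts a carefully designed data-driven pessimism penalty from the robust value estimates):

    ```python
    import numpy as np
    from scipy.optimize import minimize_scalar

    def kl_robust_expectation(V, P0, sigma):
        """inf_{P: KL(P||P0) <= sigma} E_P[V] via its dual
        sup_{lam > 0} -lam * log E_{P0}[exp(-V / lam)] - lam * sigma."""
        def neg_dual(lam):
            z = -V / lam
            zmax = z.max()
            lse = zmax + np.log(np.dot(P0, np.exp(z - zmax)))  # stable log-sum-exp
            return lam * lse + lam * sigma                     # negated dual objective
        res = minimize_scalar(neg_dual, bounds=(1e-6, 1e3), method="bounded")
        return -res.fun

    # e.g. one robust backup would use Q = r + gamma * kl_robust_expectation(V, P_hat, sigma)
    V = np.array([0.0, 1.0, 2.0])
    P0 = np.array([0.2, 0.5, 0.3])      # empirical transition estimate
    print(kl_robust_expectation(V, P0, sigma=0.1))  # <= E_{P0}[V] = 1.1
    ```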
    Learning k-Level Sparse Neural Networks Using a New Generalized Group Sparse Envelope Regularization. (arXiv:2212.12921v1 [cs.LG])
    We propose an efficient method to learn both unstructured and structured sparse neural networks during training, using a novel generalization of the sparse envelope function (SEF) used as a regularizer, termed {\itshape{group sparse envelope function}} (GSEF). The GSEF acts as a neuron group selector, which we leverage to induce structured pruning. Our method achieves a hardware-friendly structured sparsity of a deep neural network (DNN), efficiently accelerating the DNN's evaluation. This method is flexible in the sense that it allows any hardware to dictate the definition of a group, such as a filter, channel, filter shape, layer depth, a single parameter (unstructured), etc. By the nature of the GSEF, the proposed method is the first to make possible a pre-defined sparsity level that is achieved at training convergence, while maintaining negligible network accuracy degradation. We propose an efficient method to calculate the exact value of the GSEF along with its proximal operator, in a worst-case complexity of $O(n)$, where $n$ is the total number of group variables. In addition, we propose a proximal-gradient-based optimization method to train the model, that is, the non-convex minimization of the sum of the neural network loss and the GSEF. Finally, we conduct experiments and illustrate the efficiency of our proposed technique in terms of the completion ratio, accuracy, and inference latency.
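    The training loop is a standard proximal-gradient iteration; since the GSEF prox is the paper's own contribution, the sketch below substitutes the classical group-lasso prox (block soft-thresholding) purely to show the shape of the loop, on a toy quadratic loss standing in for the network loss:

    ```python
    import numpy as np

    def prox_group_l2(w, groups, t):
        """Block soft-thresholding: prox of t * sum_g ||w_g||_2."""
        out = w.copy()
        for g in groups:  # g: index array for one group (filter, channel, ...)
            norm = np.linalg.norm(w[g])
            out[g] = 0.0 if norm <= t else (1.0 - t / norm) * w[g]
        return out

    rng = np.random.default_rng(0)
    A, b = rng.normal(size=(50, 12)), rng.normal(size=50)
    loss_grad = lambda w: A.T @ (A @ w - b) / len(b)   # stand-in for the NN loss gradient

    w = np.zeros(12)
    groups = [np.arange(0, 4), np.arange(4, 8), np.arange(8, 12)]
    lr, mu = 0.1, 0.3
    for _ in range(500):
        w = prox_group_l2(w - lr * loss_grad(w), groups, lr * mu)
    print([round(np.linalg.norm(w[g]), 3) for g in groups])  # some groups hit exactly 0
    ```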
    Faster Randomized Methods for Orthogonality Constrained Problems. (arXiv:2106.12060v1 [math.NA] CROSS LISTED)
    Recent literature has advocated the use of randomized methods for accelerating the solution of various matrix problems arising throughout data science and computational science. One popular strategy for leveraging randomization is to use it as a way to reduce problem size. However, methods based on this strategy lack sufficient accuracy for some applications. Randomized preconditioning is another approach for leveraging randomization, which provides higher accuracy. The main challenge in using randomized preconditioning is the need for an underlying iterative method, thus randomized preconditioning has so far been applied almost exclusively to solving regression problems and linear systems. In this article, we show how to expand the application of randomized preconditioning to another important set of problems prevalent across data science: optimization problems with (generalized) orthogonality constraints. We demonstrate our approach, which is based on the framework of Riemannian optimization and Riemannian preconditioning, on the problem of computing the dominant canonical correlations and on the Fisher linear discriminant analysis problem. For both problems, we evaluate the effect of preconditioning on the computational costs and asymptotic convergence, and demonstrate empirically the utility of our approach.
    Exact Selective Inference with Randomization. (arXiv:2212.12940v1 [stat.ME])
    We introduce a pivot for exact selective inference with randomization. Not only does our pivot lead to exact inference in Gaussian regression models, but it is also available in closed form. We reduce the problem of exact selective inference to a bivariate truncated Gaussian distribution. By doing so, we give up some power that is achieved with approximate inference in Panigrahi and Taylor (2022). Yet we always produce narrower confidence intervals than a closely related data-splitting procedure. For popular instances of Gaussian regression, this price -- in terms of power -- in exchange for exact selective inference is demonstrated in simulated experiments and in an HIV drug resistance analysis.
    Improving Uncertainty Quantification of Variance Networks by Tree-Structured Learning. (arXiv:2212.12658v1 [cs.LG])
    To improve uncertainty quantification of variance networks, we propose a novel tree-structured local neural network model that partitions the feature space into multiple regions based on uncertainty heterogeneity. A tree is built on the training data; its leaf nodes represent different regions, where region-specific neural networks are trained to predict both the mean and the variance for quantifying uncertainty. The proposed Uncertainty-Splitting Neural Regression Tree (USNRT) employs novel splitting criteria. At each node, a neural network is trained on the full data first, and a statistical test on the residuals is conducted to find the best split, corresponding to the two sub-regions with the most significant uncertainty heterogeneity. USNRT is computationally friendly because very few leaf nodes are sufficient and pruning is unnecessary. On extensive UCI datasets, in terms of both calibration and sharpness, USNRT shows superior performance compared to some recent popular methods for variance prediction, including the vanilla variance network, deep ensembles, dropout-based methods, tree-based models, etc. Through comprehensive visualization and analysis, we uncover how USNRT works and show its merits.  ( 2 min )
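    A minimal sketch of the leaf-level ingredient, assuming each region-specific network is trained with a heteroscedastic Gaussian negative log-likelihood to predict both a mean and a log-variance (the class name and sizes are illustrative; the tree's splitting test is not reproduced here):

    ```python
    import torch

    class MeanVarNet(torch.nn.Module):
        """One leaf's variance network: predicts mean and log-variance."""
        def __init__(self, d, h=64):
            super().__init__()
            self.body = torch.nn.Sequential(torch.nn.Linear(d, h), torch.nn.ReLU())
            self.mean = torch.nn.Linear(h, 1)
            self.logvar = torch.nn.Linear(h, 1)

        def forward(self, x):
            z = self.body(x)
            return self.mean(z), self.logvar(z)

    def gaussian_nll(mu, logvar, y):
        # negative log-likelihood of N(mu, exp(logvar)), up to constants
        return 0.5 * (logvar + (y - mu) ** 2 / logvar.exp()).mean()

    net = MeanVarNet(d=8)
    x, y = torch.randn(32, 8), torch.randn(32, 1)
    mu, logvar = net(x)
    loss = gaussian_nll(mu, logvar, y)   # minimized per region after each split
    ```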
    Stochastic Methods for AUC Optimization subject to AUC-based Fairness Constraints. (arXiv:2212.12603v1 [cs.LG])
    As machine learning is used increasingly to make high-stakes decisions, an arising challenge is to avoid unfair AI systems that lead to discriminatory decisions for protected populations. A direct approach for obtaining a fair predictive model is to train the model through optimizing its prediction performance subject to fairness constraints, which achieves Pareto efficiency when trading off performance against fairness. Among various fairness metrics, the ones based on the area under the ROC curve (AUC) are emerging recently because they are threshold-agnostic and effective for unbalanced data. In this work, we formulate the training problem of a fairness-aware machine learning model as an AUC optimization problem subject to a class of AUC-based fairness constraints. This problem can be reformulated as a min-max optimization problem with min-max constraints, which we solve by stochastic first-order methods based on a new Bregman divergence designed for the special structure of the problem. We numerically demonstrate the effectiveness of our approach on real-world data under different fairness metrics.  ( 2 min )
    Your diffusion model secretly knows the dimension of the data manifold. (arXiv:2212.12611v1 [cs.LG])
    In this work, we propose a novel framework for estimating the dimension of the data manifold using a trained diffusion model. A trained diffusion model approximates the gradient of the log density of a noise-corrupted version of the target distribution for varying levels of corruption. If the data concentrates around a manifold embedded in the high-dimensional ambient space, then as the level of corruption decreases, the score function points towards the manifold, as this direction becomes the direction of maximum likelihood increase. Therefore, for small levels of corruption, the diffusion model provides us with access to an approximation of the normal bundle of the data manifold. This allows us to estimate the dimension of the tangent space, thus, the intrinsic dimension of the data manifold. Our method outperforms linear methods for dimensionality detection such as PPCA in controlled experiments.  ( 2 min )
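    An illustrative sketch of the resulting recipe (not the paper's exact estimator: the `score_fn` interface and the singular-value threshold are assumptions): stack score vectors at small corruption around a data point, and read the normal-space dimension off the singular-value spectrum.

    ```python
    import numpy as np

    def estimate_intrinsic_dim(score_fn, x0, sigma=1e-2, K=256,
                               rng=np.random.default_rng(0)):
        """score_fn(x, sigma): score of the sigma-corrupted density (assumed API)."""
        d = x0.shape[0]
        X = x0[None, :] + sigma * rng.normal(size=(K, d))    # points near x0
        S = np.stack([score_fn(x, sigma) for x in X])        # K x d score vectors
        sv = np.linalg.svd(S - S.mean(0), compute_uv=False)
        k = int((sv > 0.1 * sv.max()).sum())   # crude threshold: normal-space dimension
        return d - k                           # tangent space = intrinsic dimension
    ```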
    Adapting to game trees in zero-sum imperfect information games. (arXiv:2212.12567v1 [stat.ML])
    Imperfect information games (IIG) are games in which each player only partially observes the current game state. We study how to learn $\epsilon$-optimal strategies in a zero-sum IIG through self-play with trajectory feedback. We give a problem-independent lower bound $\mathcal{O}(H(A_{\mathcal{X}}+B_{\mathcal{Y}})/\epsilon^2)$ on the required number of realizations to learn these strategies with high probability, where $H$ is the length of the game, $A_{\mathcal{X}}$ and $B_{\mathcal{Y}}$ are the total number of actions for the two players. We also propose two Follow the Regularized Leader (FTRL) algorithms for this setting: Balanced-FTRL, which matches this lower bound but requires knowledge of the information set structure beforehand to define the regularization; and Adaptive-FTRL, which needs $\mathcal{O}(H^2(A_{\mathcal{X}}+B_{\mathcal{Y}})/\epsilon^2)$ plays without this requirement by progressively adapting the regularization to the observations.  ( 2 min )
    A Convergence Rate for Manifold Neural Networks. (arXiv:2212.12606v1 [cs.LG])
    High-dimensional data arises in numerous applications, and the rapidly developing field of geometric deep learning seeks to develop neural network architectures to analyze such data in non-Euclidean domains, such as graphs and manifolds. Recent work by Z. Wang, L. Ruiz, and A. Ribeiro has introduced a method for constructing manifold neural networks using the spectral decomposition of the Laplace Beltrami operator. Moreover, in this work, the authors provide a numerical scheme for implementing such neural networks when the manifold is unknown and one only has access to finitely many sample points. The authors show that this scheme, which relies upon building a data-driven graph, converges to the continuum limit as the number of sample points tends to infinity. Here, we build upon this result by establishing a rate of convergence that depends on the intrinsic dimension of the manifold but is independent of the ambient dimension. We also discuss how the rate of convergence depends on the depth of the network and the number of filters used in each layer.  ( 2 min )
    Concentration of the Langevin Algorithm's Stationary Distribution. (arXiv:2212.12629v1 [stat.ML])
    A canonical algorithm for log-concave sampling is the Langevin Algorithm, aka the Langevin Diffusion run with some discretization stepsize $\eta > 0$. This discretization leads the Langevin Algorithm to have a stationary distribution $\pi_{\eta}$ which differs from the stationary distribution $\pi$ of the Langevin Diffusion, and it is an important challenge to understand whether the well-known properties of $\pi$ extend to $\pi_{\eta}$. In particular, while concentration properties such as isoperimetry and rapidly decaying tails are classically known for $\pi$, the analogous properties for $\pi_{\eta}$ are open questions with direct algorithmic implications. This note provides a first step in this direction by establishing concentration results for $\pi_{\eta}$ that mirror classical results for $\pi$. Specifically, we show that for any nontrivial stepsize $\eta > 0$, $\pi_{\eta}$ is sub-exponential (respectively, sub-Gaussian) when the potential is convex (respectively, strongly convex). Moreover, the concentration bounds we show are essentially tight. Key to our analysis is the use of a rotation-invariant moment generating function (aka Bessel function) to study the stationary dynamics of the Langevin Algorithm. This technique may be of independent interest because it enables directly analyzing the discrete-time stationary distribution $\pi_{\eta}$ without going through the continuous-time stationary distribution $\pi$ as an intermediary.  ( 2 min )
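    For reference, the object under study in a few lines: the Langevin Algorithm with stepsize $\eta$ for a potential $V$ (so $\pi \propto e^{-V}$), whose long-run samples are draws from the discrete-time stationary distribution $\pi_{\eta}$; a minimal sketch.

    ```python
    import numpy as np

    def langevin(grad_V, x0, eta, n_steps, rng=np.random.default_rng(0)):
        """Langevin Algorithm: x <- x - eta * grad V(x) + sqrt(2 eta) * N(0, I)."""
        x = np.array(x0, dtype=float)
        traj = []
        for _ in range(n_steps):
            x = x - eta * grad_V(x) + np.sqrt(2.0 * eta) * rng.normal(size=x.shape)
            traj.append(x.copy())
        return np.array(traj)

    # strongly convex V(x) = ||x||^2 / 2: the note says pi_eta is then sub-Gaussian
    samples = langevin(lambda x: x, np.zeros(2), eta=0.1, n_steps=10_000)
    ```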
    Neural Networks beyond explainability: Selective inference for sequence motifs. (arXiv:2212.12542v1 [q-bio.GN])
    Over the past decade, neural networks have been successful at making predictions from biological sequences, especially in the context of regulatory genomics. As in other fields of deep learning, tools have been devised to extract features such as sequence motifs that can explain the predictions made by a trained network. Here we intend to go beyond explainable machine learning and introduce SEISM, a selective inference procedure to test the association between these extracted features and the predicted phenotype. In particular, we discuss how training a one-layer convolutional network is formally equivalent to selecting motifs maximizing some association score. We adapt existing sampling-based selective inference procedures by quantizing this selection over an infinite set to a large but finite grid. Finally, we show that sampling under a specific choice of parameters is sufficient to characterize the composite null hypothesis typically used for selective inference, a result that goes well beyond our particular framework. We illustrate the behavior of our method in terms of calibration, power and speed, and discuss its power/speed trade-off with a simpler data-split strategy. SEISM paves the way to an easier analysis of neural networks used in regulatory genomics, and to more powerful methods for genome-wide association studies (GWAS).  ( 2 min )

  • Open

    Which AI program and method was mostly likely used to make eyes just like this?
    submitted by /u/SurpriseTherapy [link] [comments]  ( 51 min )
    I curated some AI tools for 3D modeling, AR, and VR.
    7 AI tools for 3D modeling, AR, and VR: Point-E, Kaedim, Kinetix, Thishousedoesnotexist, Dpth, Dream Fusion, ChatARKit. What would you add? submitted by /u/TheVellerShow [link] [comments]  ( 51 min )
    New AI assistant steals fashion shows with its designs
    submitted by /u/Mk_Makanaki [link] [comments]  ( 54 min )
    AI triples stroke recovery in the UK
    In a press release by the NHS, they said “Use of cutting-edge AI technology is associated with tripling of patients recovering and able to perform daily activities from 16% to 48%” Now that's a tri-ing jump, get it? How does it work? I hear you ask The technology analyses the brain CT scans of stroke patients arriving at the hospital, taking less than a minute to identify the type and severity of the stroke and the most appropriate treatment. Doctors can then quickly offer drugs or surgery, with the technology shortening the average time between patients arriving at the hospital and starting treatment by one hour - from 140 minutes to 79 minutes. This one is most definitely a GAME CHANGER, saving time, money, and Lives A massive win for the AI Community. This is from the AI With Vibes Newsletter, read the full issue here: https://aiwithvibes.beehiiv.com/p/openai-dumbing-chatgpt submitted by /u/Mk_Makanaki [link] [comments]  ( 52 min )
    What are some of your favorite AI powered apps/use cases right now? Not ones that you think "oh this is neat" but ones that are genuinely helpful.
    I write a daily newsletter covering things in AI and am trying to find things that everyday people might want to use. Something outside of "this has helped me write code for my new software!" Etc. submitted by /u/LightPoleBoy [link] [comments]  ( 54 min )
    Can AI write good poetry? Putting ChatGPT to the test
    Hello! I approach this topic not as one who is passionately interested in AI as much as I do as someone who loves reading poetry. As such I really just try to evaluate it in its own terms, and hence it may be of some interest for you. I looked at three criteria: the music (how metrically correct it is), the language (the complexity and flair of the language) and finally how much it touches the reader. The article is linked below: https://www.lookingtoleeward.se/2022/12/26/can-ai-write-good-poetry-putting-chatgpt-to-the-test/ submitted by /u/Similar-Movie1663 [link] [comments]  ( 53 min )
    AI Dream 134 - Discovery of Zion Remastered - INCREDIBLE AI ANIMATION
    submitted by /u/LordPewPew777 [link] [comments]  ( 58 min )
    Responding To Sam Does Arts!
    submitted by /u/PuppetHere [link] [comments]  ( 73 min )
    Video Essay on Retroarch's Ai Translation Features (for retro game emulation)
    submitted by /u/anybutton2start [link] [comments]  ( 51 min )
    I built a web app tool to paraphrase, grammar check, and summarize text with OpenAI GPT-3. Details in the comment
    submitted by /u/Austin_Nguyen_2k [link] [comments]  ( 56 min )
    Simulating revolutions - ChatGPT and symbolic simulations
    Simulating revolutions - ChatGPT and symbolic simulations, an article. submitted by /u/goronmask [link] [comments]  ( 54 min )
    What ai should i use to enhance an old blurry picture of me?
    submitted by /u/moe_mel [link] [comments]  ( 51 min )
    Why applied artificial intelligence needs a major mind-shift
    submitted by /u/bendee983 [link] [comments]  ( 57 min )
    What if AI are other human's dreams, and we get the final renders?
    submitted by /u/KaviarNFT [link] [comments]  ( 59 min )
    If anyone needs this...
    submitted by /u/ampankajsharma [link] [comments]  ( 51 min )
    Can you guess the movie from an AI-generated image?
    submitted by /u/xavi160 [link] [comments]  ( 50 min )
    What are your thoughts on Generative AI?
    I recently read this article and thought of using ChatGPT. I've been chatting with ChatGPT all week, bouncing ideas off of it to get it to help me flesh out my thoughts. I found out that these technologies are iterative. One is built on top of the last one, and each new iteration is more powerful and increases the potential for discovery in some exponential way. It's like a whole new level for these machines to grow and improve, and it's opening up all kinds of possibilities for what we might find out. Also, something like this has been going on for a while now like (JasperAI, CopyAI, Copysmith… the list goes on… maybe Google is even going to join the bandwagon with Google Assistant? Who knows). These technologies are also seriously disruptive, like we've never seen before. If you don't believe me, just spend a week chatting with ChatGPT or something similar and see for yourself. It’s obvious that these tools (yes tools) are going to be like a boost to our own creative skills, not to take over or anything, just to make them even better. So for those creative workers out there like copywriters, graphic designers and web designers, instead of worrying that you might get replaced, you can instead use this technology to your own advantage. You can use it for ideas for blog topics. You can also use it for design ideas and templates for your graphics and website. And that’s just the tip of the iceberg. People are worried that these technologies might take the jobs of regular humans because they can help companies get stuff done with less people. But I think it's important to think about how these technologies are affecting us and to make sure they're used in a responsible and helpful way for everyone. But AI is changing fast, so it's tough to say for sure how these technologies will play out in the future. We’ll see in 5-10 years at least how much AI will improve. submitted by /u/According_Complex_74 [link] [comments]  ( 68 min )
    Landscape generator?
    Is there a landscape generator? I'm searching for something like thispersondoesnotexist.com, but one that generates original landscape images. submitted by /u/OwnCranberry4948 [link] [comments]  ( 51 min )
    I participated in the alpha test of AI, which creates various images of a character without losing its consistency of appearance. The result was stunning
    submitted by /u/blbird [link] [comments]  ( 51 min )
    Trippy Eye Animation using SD
    submitted by /u/oridnary_artist [link] [comments]  ( 48 min )
    AI In Education - A Teacher's Perspective
    I teach high school and this is the first time I've encountered anything like this. Multiple students submitting writing assignments that are clearly AI generated. The administration seems to want to punish the students and move on. However, there is a clear learning opportunity here. It's not going to go away. The education piece needs to go beyond "how do we catch them" and "how do we avoid it." We teach students how to use other assistive technologies, so why not AI? I know this is a vague and open question, but... What do you think we should be teaching our children around AI writing tech, or AI in general? Any specifics, resources or examples? submitted by /u/benny1872 [link] [comments]  ( 53 min )
  • Open

    [R] PyTorch | Budget GPU Benchmarking
    Greetings! Recently I was asked about a budget AI / ML workload, and decided to test it against some of my own lab GPUs. I'll be adding more tests and benchmarks over time, but below is a link to my website where I covered it, as well as the code I wrote to benchmark them. Hopefully this helps someone out there. :-) https://www.zb-c.tech/2022/12/26/pytorch-drag-race-tesla-k80-performance/ submitted by /u/zveroboy152 [link] [comments]  ( 65 min )
    [Research] Can you use GANs to boost YOLOv5 object detection dataset?
    I was building a YOLOv5 object detection model, and was looking into researching synthetic methods like GANs to increase the size of my training set in an unsupervised manner. I know few-shot GANs can be used to "hallucinate" images and labels for a classification task, but how can they be extended to hallucinate images and labels in YOLO format (which basically lists out each bounding box and class)? Is there some way that I can train a GAN on images / YOLO labels, and get it to hallucinate more images / labels? submitted by /u/WeAreNebula [link] [comments]  ( 70 min )
    [D] Focused training of AutoEncoder embeddings?
    I am trying to produce an AutoEncoder that has meaningful embeddings for dimensionality reduction. Additionally, I have a specific downstream task I have in mind to use the embeddings for, so I would like to know if it makes sense to write a loss function that considers both the reconstruction accuracy of the AutoEncoder, as well as prediction accuracy for the downstream task. If so, are there any relevant loss functions or articles I should refer to? Thanks! submitted by /u/austinv11 [link] [comments]  ( 66 min )
    [D] Taylor & Francis Article status stuck on pending editor decision for last 4 months?
    Dear fellows, I submitted my article to one of the Taylor & Francis journals in mid-2021. It received a reject-and-resubmit decision in early 2022. I undertook the major revisions and resubmitted my article in mid-2022. Its status went from under review to pending editor decision in September 2022. However, since then, there has been no update. I tried to contact the chief editor and the editor-in-command over the last month. However, I have yet to hear from them. My paper has already been significantly delayed, and this uncertain situation worsens my anxiety. What do you think I should do in this case? submitted by /u/HQ2020 [link] [comments]  ( 64 min )
    [P] I built a CLI helper integrating with GPT-3. It enables you to ask questions straight in your terminal
    Hi all! Like most of you here, I've played around a bit with ChatGPT, but felt it was annoying to always have to log into their GUI to ask questions. To scratch my own itch and at the same time learn more about how to write my own command line interface, I created 'askai': https://github.com/maxvfischer/askai It is a simple CLI integration with OpenAI’s GPT3 models. I’ve primarily used it to get quick answers to technical questions, like: askai "How to mock user input when writing a Python pytest test?" askai "How do I remove a conda environment?" As I've found it quite helpful, I decided to spend some time to package it in a nicer way to share it with you. I've also uploaded it to PyPI to simplify the installation process. 'askai' enables you to: Ask questions and get the answers straight into your terminal Configure which model and model parameters you want to use Overwrite saved configurations when you ask questions Currently, it only supports OpenAI’s models, but my plan is to integrate more endpoints as soon as new capable NLP endpoints pop up. I hope some of you find it useful :) submitted by /u/maktattengil [link] [comments]  ( 64 min )
    [Discussion] 2 discrimination mechanisms that should be provided with powerful generative models e.g. ChatGPT or DALL-E
    In the wake of all the questions and worries about models that can generate content nearing (or exceeding, in some cases) the quality of that made by humans, there are a couple mechanisms that companies should provide alongside their models. Both vary in feasibility, but in general, both are pretty doable, at least for what we've seen so far. A hashing-based system to check whether a given piece of content was generated by the model. This can be accomplished by hashing all of the outputs of the model, and storing them. If it doesn't pose some sort of security risk for the generator, it could also provide the date of generation. A model for discriminating whether a given piece of content was generated by the model, similar to this model for GPT-2. This is necessary in addition to the simpler hashing mechanism, since it's possible for only a portion of the media to be generated. This would be imperfect, of course, but if nothing else, we should press companies enough that they feel obligated to give it a dedicated try. These mechanisms need real support - an API for developers, and a UI for less sophisticated users. They should have decent latency, and hopefully be provided for free at some level of usage - I understand the compute required could be enormous. Curious what others think here :) submitted by /u/Exnur0 [link] [comments]  ( 74 min )
    [P] Can you distinguish AI-generated content from real art or literature? I made a little test!
    Hi everyone, I am no programmer, and I have a very basic knowledge of machine learning, but I am fascinated by the possibilities offered by all the new models we have seen so far. Some people around me say they are not that impressed by what AIs can do, so I built a small test (with a little help by chatGPT to code the whole thing): can you always 100% distinguish between AI art or text and old works of art or literature? Here is the site: http://aiorart.com/ I find that AI-generated text is still generally easy to spot, but of course it is very challenging to go against great literary works. AI images can sometimes be truly deceptive. I wonder what you will all think of it... and how all that will evolve in the coming months! PS: The site is very crude (again, I am no programmer!). It works though. submitted by /u/Dicitur [link] [comments]  ( 77 min )
    [D] Has any research been done to counteract the fact that each training datapoint "pulls the model in a different direction", partly undoing learning until shared features emerge?
    I don't remember where I've read about this, but it left a lasting impression on me as it feels intuitively true and impactful - in a manner, the learning on each datapoint pulls the network towards encoding that individual example, relying on stochastic emergence of shared features, which in turn relies on a dataset:model size ratio that prevents overfitting and a balanced dataset. Has there been any research into counteracting this phenomenon, such as more purposeful extraction of features, clever batching schemas, synthetic datapoints or anything else such? submitted by /u/derpderp3200 [link] [comments]  ( 69 min )
  • Open

    Airport abbreviation origins
    It doesn’t take much imagination to understand why DEN is the IATA abbreviation for the Denver airport, but the abbreviation MCO for the Orlando airport is more of a head scratcher. Here is a list of the busiest airports in the US along with a brief indication of the reason behind their abbreviations. Some require […] Airport abbreviation origins first appeared on John D. Cook.  ( 5 min )
    Visually symmetric words
    I recently ran into the following comic strip online: [Update: Thanks to Bryan Cantanzaro for letting me know via the comments that the image above was created by Hannah Hillam. The version I found had had her copyright information edited out. I will replace the image above with a legitimate version shortly.] [Update 2: I’m […] Visually symmetric words first appeared on John D. Cook.  ( 6 min )
  • Open

    3D Artist Zhelong Xu Revives Chinese Relics This Week ‘In the NVIDIA Studio’
    Artist Zhelong Xu, aka Uncle Light, brought to life Blood Moon — a 3D masterpiece combining imagination, craftsmanship and art styles from the Chinese Bronze Age — along with Kirin, a symbol of hope and good fortune, using NVIDIA technologies.  ( 7 min )
    11 Essential Explainers to Keep You in the Know in 2023
    These explainers will give you the scoop on the latest tech developments from AI models to green computing.  ( 4 min )
  • Open

    Conference on Robot Learning 2022
    The airplanes on display at the CoRL 2022 banquet. At the end of my last post which belatedly summarized RSS 2022, I mentioned I was also attending CoRL 2022 in a much farther away city: Auckland, New Zealand. That conference has now concluded and I thought it went well. I attended CoRL for a few reasons. I was presenting our recent ToolFlowNet paper, which is one of the major projects that I have worked on during my postdoc. I was part of the inclusion committee at CoRL, so I also got partial funding to attend. The conference is well aligned for my research interests. New Zealand is really nice at this time of the year. Unlike most of my prior conference reports where I write them as blog posts, here I have notes in this Google Doc. I was working on this while at CoRL, and it would take a lot of time to convert these to something that looks nice on the website, and Google Docs might be easier for me to do quick edits if needed. If robot learning is of interest to you, I hope you enjoy these conference notes. See you next year in Atlanta, Georgia, for CoRL 2023.  ( 1 min )
  • Open

    Procgen environments "easy" vs "hard" difficulty - what are they?
    Hello! The procgen environments have "easy" and "hard" mode. https://arxiv.org/pdf/1912.01588.pdf From the paper, this is the only paragraph about what "easy" means: that it is a slightly different distribution of levels than "hard". Does anyone know what "easy" precisely means, i.e. what kind of distribution of levels it is? Thanks so much in advance! (and happy holidays!) submitted by /u/sunchipsster [link] [comments]  ( 62 min )
  • Open

    Variants of SGD for Lipschitz Continuous Loss Functions in Low-Precision Environments. (arXiv:2211.04655v2 [math.OC] UPDATED)
    Motivated by neural network training in low-bit floating- and fixed-point environments, this work studies the convergence of variants of SGD with computational error. Considering a general stochastic Lipschitz continuous loss function, a novel convergence result to a Clarke stationary point is presented, assuming that only an approximation of the stochastic gradient can be computed and that there is error in computing the SGD step itself. Different variants of SGD are then tested empirically in a variety of low-precision arithmetic environments, where improved test set accuracy is observed compared to SGD for two image recognition tasks.
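    A minimal sketch of the kind of computational error being modelled (the rounding scheme and noise scale below are illustrative stand-ins, not the paper's exact arithmetic model): both the gradient and the update step itself pass through a low-precision quantizer.

    ```python
    import numpy as np

    def quantize(v, scale=2.0**-8):
        """Round-to-nearest fixed-point quantizer (illustrative)."""
        return np.round(v / scale) * scale

    def sgd_lowprec(grad_fn, w, lr, steps, rng=np.random.default_rng(0)):
        for _ in range(steps):
            g = quantize(grad_fn(w) + 0.01 * rng.normal(size=w.shape))  # approximate gradient
            w = quantize(w - lr * g)                                    # erroneous SGD step
        return w

    # e.g. V(w) = ||w||^2 / 2, so grad_fn(w) = w; iterates shrink toward 0
    print(sgd_lowprec(lambda w: w, np.ones(3), lr=0.1, steps=200))
    ```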
    SAVi++: Towards End-to-End Object-Centric Learning from Real-World Videos. (arXiv:2206.07764v2 [cs.CV] UPDATED)
    The visual world can be parsimoniously characterized in terms of distinct entities with sparse interactions. Discovering this compositional structure in dynamic visual scenes has proven challenging for end-to-end computer vision approaches unless explicit instance-level supervision is provided. Slot-based models leveraging motion cues have recently shown great promise in learning to represent, segment, and track objects without direct supervision, but they still fail to scale to complex real-world multi-object videos. In an effort to bridge this gap, we take inspiration from human development and hypothesize that information about scene geometry in the form of depth signals can facilitate object-centric learning. We introduce SAVi++, an object-centric video model which is trained to predict depth signals from a slot-based video representation. By further leveraging best practices for model scaling, we are able to train SAVi++ to segment complex dynamic scenes recorded with moving cameras, containing both static and moving objects of diverse appearance on naturalistic backgrounds, without the need for segmentation supervision. Finally, we demonstrate that by using sparse depth signals obtained from LiDAR, SAVi++ is able to learn emergent object segmentation and tracking from videos in the real-world Waymo Open dataset.
    Parallel Automatic History Matching Algorithm Using Reinforcement Learning. (arXiv:2211.07434v2 [cs.LG] UPDATED)
    Reformulating the history matching problem from a least-squares mathematical optimization problem into a Markov Decision Process introduces a method in which reinforcement learning can be utilized to solve the problem. This method provides a mechanism where an artificial deep neural network agent can interact with the reservoir simulator and find multiple different solutions to the problem. Such a formulation allows for solving the problem in parallel by launching multiple concurrent environments, enabling the agent to learn simultaneously from all the environments at once and achieving a significant speed-up.
    Generalization Bounds for Transfer Learning with Pretrained Classifiers. (arXiv:2212.12532v1 [cs.LG])
    We study the ability of foundation models to learn representations for classification that are transferable to new, unseen classes. Recent results in the literature show that representations learned by a single classifier over many classes are competitive on few-shot learning problems with representations learned by special-purpose algorithms designed for such problems. We offer an explanation for this phenomenon based on the concept of class-features variability collapse, which refers to the training dynamics of deep classification networks where the feature embeddings of samples belonging to the same class tend to concentrate around their class means. More specifically, we examine the few-shot error of the learned feature map, which is the classification error of the nearest class-center classifier using centers learned from a small number of random samples from each class. Assuming that the classes appearing in the data are selected independently from a distribution, we show that the few-shot error generalizes from the training data to unseen test data, and we provide an upper bound on the expected few-shot error for new classes (selected from the same distribution) using the average few-shot error for the source classes. Additionally, we show that the few-shot error on the training data can be upper bounded using the degree of class-features variability collapse. This suggests that foundation models can provide feature maps that are transferable to new downstream tasks even with limited data available.
    Experiments on Turkish ASR with Self-Supervised Speech Representation Learning. (arXiv:2210.07323v3 [cs.CL] UPDATED)
    While the Turkish language is listed among low-resource languages, literature on Turkish automatic speech recognition (ASR) is relatively old. In this report, we present our findings on Turkish ASR with speech representation learning using HUBERT. We investigate pre-training HUBERT for Turkish with large-scale data curated from online resources. We pre-train our model using 6,500 hours of speech data from YouTube. The results show that the models are not ready for commercial use since they are not robust against disturbances that typically occur in real-world settings such as variations in accents, slang, background noise and interference. We analyze typical errors and the limitations of the models for use in commercial settings.
    Bi-Stride Multi-Scale Graph Neural Network for Mesh-Based Physical Simulation. (arXiv:2210.02573v2 [cs.LG] UPDATED)
    Learning physical systems on unstructured meshes by flat graph neural networks (GNNs) faces the challenge of modeling long-range interactions due to the scaling complexity w.r.t. the number of nodes, limiting generalization under mesh refinement. On regular grids, convolutional neural networks (CNNs) with a U-net structure can resolve this challenge with efficient stride, pooling, and upsampling operations. Nonetheless, these tools are much less developed for GNNs, especially when GNNs are employed for learning large-scale mesh-based physics. The challenges arise from the highly irregular meshes and the lack of effective ways to construct the multi-level structure without losing connectivity. Inspired by the bipartite graph determination algorithm, we introduce Bi-Stride Multi-Scale Graph Neural Network (BSMS-GNN) by proposing \textit{bi-stride} as a simple pooling strategy for building the multi-level GNN. \textit{Bi-stride} pools nodes by striding every other BFS frontier; it 1) works robustly on any challenging mesh in the wild, 2) avoids using a mesh generator at coarser levels, 3) avoids relying on spatial proximity for building coarser levels, and 4) uses non-parametrized aggregating/returning instead of MLPs during pooling and unpooling. Experiments show that our framework significantly outperforms the state-of-the-art method in computational efficiency in representative physics-based simulation cases.
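    A minimal sketch of the pooling rule as described (keeping every other BFS frontier, so pooled nodes remain connected through the skipped frontier); the seed choice and single-component assumption are simplifications:

    ```python
    from collections import deque

    def bi_stride_pool(adj, seed=0):
        """adj: dict node -> list of neighbours. Keep nodes on even BFS frontiers."""
        level, queue, seen = {seed: 0}, deque([seed]), {seed}
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    level[v] = level[u] + 1
                    queue.append(v)
        return {u for u, l in level.items() if l % 2 == 0}  # stride every other frontier

    # path graph 0-1-2-3-4: pooling keeps {0, 2, 4}
    adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
    print(sorted(bi_stride_pool(adj)))
    ```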
    Robust Learning of Parsimonious Deep Neural Networks. (arXiv:2205.04650v2 [cs.LG] UPDATED)
    We propose a simultaneous learning and pruning algorithm capable of identifying and eliminating irrelevant structures in a neural network during the early stages of training. Thus, the computational cost of subsequent training iterations, besides that of inference, is considerably reduced. Our method, based on variational inference principles using Gaussian scale mixture priors on neural network weights, learns the variational posterior distribution of Bernoulli random variables multiplying the units/filters similarly to adaptive dropout. Our algorithm ensures that the Bernoulli parameters practically converge to either 0 or 1, establishing a deterministic final network. We analytically derive a novel hyper-prior distribution over the prior parameters that is crucial for their optimal selection and leads to consistent pruning levels and prediction accuracy regardless of weight initialization or the size of the starting network. We prove the convergence properties of our algorithm, establishing theoretical and practical pruning conditions. We evaluate the proposed algorithm on the MNIST and CIFAR-10 data sets and the commonly used fully connected and convolutional LeNet and VGG16 architectures. The simulations show that our method achieves pruning levels on par with state-of-the-art methods for structured pruning, while maintaining better test accuracy and, more importantly, doing so in a manner robust with respect to network initialization and initial size.
    Hierarchical Interdisciplinary Topic Detection Model for Research Proposal Classification. (arXiv:2209.13519v2 [cs.IR] UPDATED)
    The peer merit review of research proposals has been the major mechanism for deciding grant awards. However, research proposals have become increasingly interdisciplinary. It has been a longstanding challenge to assign interdisciplinary proposals to appropriate reviewers, so proposals are fairly evaluated. One of the critical steps in reviewer assignment is to generate accurate interdisciplinary topic labels for proposal-reviewer matching. Existing systems mainly collect topic labels manually generated by principal investigators. However, such human-reported labels can be inaccurate and incomplete, and collecting them is labor-intensive and time-consuming. What role can AI play in developing a fair and precise proposal reviewer assignment system? In this study, we collaborate with the National Science Foundation of China to address the task of automated interdisciplinary topic path detection. For this purpose, we develop a deep Hierarchical Interdisciplinary Research Proposal Classification Network (HIRPCN). Specifically, we first propose a hierarchical transformer to extract the textual semantic information of proposals. We then design an interdisciplinary graph and leverage GNNs for learning representations of each discipline in order to extract interdisciplinary knowledge. After extracting the semantic and interdisciplinary knowledge, we design a level-wise prediction component to fuse the two types of knowledge representations and detect interdisciplinary topic paths for each proposal. We conduct extensive experiments and expert evaluations on three real-world datasets to demonstrate the effectiveness of our proposed model.
    Optimizing Warfarin Dosing using Deep Reinforcement Learning. (arXiv:2202.03486v3 [cs.LG] UPDATED)
    Warfarin is a widely used anticoagulant, and has a narrow therapeutic range. Dosing of warfarin should be individualized, since slight overdosing or underdosing can have catastrophic or even fatal consequences. Despite much research on warfarin dosing, current dosing protocols do not live up to expectations, especially for patients sensitive to warfarin. We propose a deep reinforcement learning-based dosing model for warfarin. To overcome the issue of relatively small sample sizes in dosing trials, we use a Pharmacokinetic/ Pharmacodynamic (PK/PD) model of warfarin to simulate dose-responses of virtual patients. Applying the proposed algorithm on virtual test patients shows that this model outperforms a set of clinically accepted dosing protocols by a wide margin. We tested the robustness of our dosing protocol on a second PK/PD model and showed that its performance is comparable to the set of baseline protocols.
    Learning Latent Representations to Co-Adapt to Humans. (arXiv:2212.09586v2 [cs.RO] UPDATED)
    When robots interact with humans in homes, roads, or factories the human's behavior often changes in response to the robot. Non-stationary humans are challenging for robot learners: actions the robot has learned to coordinate with the original human may fail after the human adapts to the robot. In this paper we introduce an algorithmic formalism that enables robots (i.e., ego agents) to co-adapt alongside dynamic humans (i.e., other agents) using only the robot's low-level states, actions, and rewards. A core challenge is that humans not only react to the robot's behavior, but the way in which humans react inevitably changes both over time and between users. To deal with this challenge, our insight is that -- instead of building an exact model of the human -- robots can learn and reason over high-level representations of the human's policy and policy dynamics. Applying this insight we develop RILI: Robustly Influencing Latent Intent. RILI first embeds low-level robot observations into predictions of the human's latent strategy and strategy dynamics. Next, RILI harnesses these predictions to select actions that influence the adaptive human towards advantageous, high reward behaviors over repeated interactions. We demonstrate that -- given RILI's measured performance with users sampled from an underlying distribution -- we can probabilistically bound RILI's expected performance across new humans sampled from the same distribution. Our simulated experiments compare RILI to state-of-the-art representation and reinforcement learning baselines, and show that RILI better learns to coordinate with imperfect, noisy, and time-varying agents. Finally, we conduct two user studies where RILI co-adapts alongside actual humans in a game of tag and a tower-building task. See videos of our user studies here: https://youtu.be/WYGO5amDXbQ
    Statistical Efficiency of Score Matching: The View from Isoperimetry. (arXiv:2210.00726v2 [cs.LG] UPDATED)
    Deep generative models parametrized up to a normalizing constant (e.g. energy-based models) are difficult to train by maximizing the likelihood of the data because the likelihood and/or gradients thereof cannot be explicitly or efficiently written down. Score matching is a training method whereby, instead of fitting the likelihood $\log p(x)$ for the training data, we instead fit the score function $\nabla_x \log p(x)$ -- obviating the need to evaluate the partition function. Though this estimator is known to be consistent, it is unclear whether (and when) its statistical efficiency is comparable to that of maximum likelihood -- which is known to be (asymptotically) optimal. We initiate this line of inquiry in this paper, and show a tight connection between statistical efficiency of score matching and the isoperimetric properties of the distribution being estimated -- i.e. the Poincar\'e, log-Sobolev and isoperimetric constant -- quantities which govern the mixing time of Markov processes like Langevin dynamics. Roughly, we show that the score matching estimator is statistically comparable to the maximum likelihood when the distribution has a small isoperimetric constant. Conversely, if the distribution has a large isoperimetric constant -- even for simple families of distributions like exponential families with rich enough sufficient statistics -- score matching will be substantially less efficient than maximum likelihood. We suitably formalize these results both in the finite sample regime, and in the asymptotic regime. Finally, we identify a direct parallel in the discrete setting, where we connect the statistical properties of pseudolikelihood estimation with approximate tensorization of entropy and the Glauber dynamics.
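    For readers new to the method, the standard score-matching identity behind "obviating the partition function" (Hyv\"arinen's objective; integration by parts turns the unknown data score into model-only terms):

    ```latex
    J(\theta)
    = \mathbb{E}_{x \sim p}\Bigl[\tfrac{1}{2}\,\bigl\|\nabla_x \log p_\theta(x) - \nabla_x \log p(x)\bigr\|^2\Bigr]
    = \mathbb{E}_{x \sim p}\Bigl[\operatorname{tr}\bigl(\nabla_x^2 \log p_\theta(x)\bigr)
      + \tfrac{1}{2}\,\bigl\|\nabla_x \log p_\theta(x)\bigr\|^2\Bigr] + \text{const},
    % so the normalizing constant of p_theta never needs to be evaluated.
    ```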
    Generate synthetic samples from tabular data. (arXiv:2209.06113v2 [cs.LG] UPDATED)
    Generating new samples from existing data sets can reduce the need for additional expensive operations and invasive procedures, and can mitigate privacy issues. These novel, statistically robust samples can be used as a temporary and intermediate replacement when privacy is a concern. This method can enable better data-sharing practices without identification issues or biases that could be exploited in an adversarial attack.
    Polysemanticity and Capacity in Neural Networks. (arXiv:2210.01892v2 [cs.NE] UPDATED)
    Individual neurons in neural networks often represent a mixture of unrelated features. This phenomenon, called polysemanticity, can make interpreting neural networks more difficult and so we aim to understand its causes. We propose doing so through the lens of feature \emph{capacity}, which is the fractional dimension each feature consumes in the embedding space. We show that in a toy model the optimal capacity allocation tends to monosemantically represent the most important features, polysemantically represent less important features (in proportion to their impact on the loss), and entirely ignore the least important features. Polysemanticity is more prevalent when the inputs have higher kurtosis or sparsity and more prevalent in some architectures than others. Given an optimal allocation of capacity, we go on to study the geometry of the embedding space. We find a block-semi-orthogonal structure, with differing block sizes in different models, highlighting the impact of model architecture on the interpretability of its neurons.
    Deep learning in a bilateral brain with hemispheric specialization. (arXiv:2209.06862v4 [q-bio.NC] UPDATED)
    The brains of all bilaterally symmetric animals on Earth are divided into left and right hemispheres. The anatomy and functionality of the hemispheres have a large degree of overlap, but they specialize to possess different attributes. The left hemisphere is believed to specialize in specificity and routine, the right in generalities and novelty. In this study, we propose an artificial neural network that imitates that bilateral architecture using two convolutional neural networks with different training objectives, and test it on an image classification task. The bilateral architecture outperforms architectures of similar representational capacity that don't exploit differential specialization. It demonstrates the efficacy of bilateralism and constitutes a new principle that could be incorporated into other computational neuroscientific models and used as an inductive bias when designing new ML systems. An analysis of the model can help us to understand the human brain.
    Anisotropic, Sparse and Interpretable Physics-Informed Neural Networks for PDEs. (arXiv:2207.00377v3 [cs.LG] UPDATED)
    There has been a growing interest in the use of Deep Neural Networks (DNNs) to solve Partial Differential Equations (PDEs). Despite the promise that such approaches hold, there are various aspects where they could be improved. Two such shortcomings are (i) their computational inefficiency relative to classical numerical methods, and (ii) the non-interpretability of a trained DNN model. In this work we present ASPINN, an anisotropic extension of our earlier work called SPINN--Sparse, Physics-informed, and Interpretable Neural Networks--to solve PDEs that addresses both these issues. ASPINNs generalize radial basis function networks. We demonstrate using a variety of examples involving elliptic and hyperbolic PDEs that the special architecture we propose is more efficient than generic DNNs, while at the same time being directly interpretable. Further, they improve upon the SPINN models we proposed earlier in that fewer nodes are required to capture the solution using ASPINN than using SPINN, thanks to the anisotropy of the local zones of influence of each node. The interpretability of ASPINN translates to a ready visualization of their weights and biases, thereby yielding more insight into the nature of the trained model. This in turn provides a systematic procedure to improve the architecture based on the quality of the computed solution. ASPINNs thus serve as an effective bridge between classical numerical algorithms and modern DNN-based methods to solve PDEs. In the process, we also streamline the training of ASPINNs into a form that is closer to that of supervised learning algorithms.
    Towards a Solution to Bongard Problems: A Causal Approach. (arXiv:2206.07196v2 [cs.LG] UPDATED)
    Even though AI has advanced rapidly in recent years, displaying success in solving highly complex problems, the class of Bongard Problems (BPs) remains largely unsolved by modern ML techniques. In this paper, we propose a new approach in an attempt to not only solve BPs but also extract meaning out of learned representations. This includes the reformulation of the classical BP into a reinforcement learning (RL) setting, which allows the model to gain access to counterfactuals both to guide its decisions and to explain them. Since learning meaningful representations in BPs is an essential sub-problem, we further make use of contrastive learning for the extraction of low-level features from pixel data. Several experiments have been conducted to analyze the general BP-RL setup and the feature extraction methods, and to use the best combination for the feature-space analysis and its interpretation.
    Can Foundation Models Talk Causality?. (arXiv:2206.10591v2 [cs.AI] UPDATED)
    Foundation models are subject to an ongoing heated debate, leaving open the question of progress towards AGI and dividing the community into two camps: the ones who see the arguably impressive results as evidence for the scaling hypothesis, and the others who are worried about the lack of interpretability and reasoning capabilities. By investigating to what extent causal representations might be captured by these large-scale language models, we make a humble effort towards resolving the ongoing philosophical conflicts.
    How to Find Actionable Static Analysis Warnings: A Case Study with FindBugs. (arXiv:2205.10504v2 [cs.SE] UPDATED)
    Automatically generated static code warnings suffer from a large number of false alarms. Hence, developers only take action on a small percent of those warnings. To better predict which static code warnings should not be ignored, we suggest that analysts need to look deeper into their algorithms to find choices that better improve the particulars of their specific problem. Specifically, we show here that effective predictors of such warnings can be created by methods that locally adjust the decision boundary (between actionable warnings and others). These methods yield a new high water-mark for recognizing actionable static code warnings. For eight open-source Java projects (cassandra, jmeter, commons, lucene-solr, maven, ant, tomcat, derby) we achieve perfect test results on 4/8 datasets and, overall, a median AUC (area under the true negatives, true positives curve) of 92%.
    Semantic Information G Theory and Logical Bayesian Inference for Machine Learning. (arXiv:1809.01577v2 [cs.AI] UPDATED)
    An important problem in machine learning is that when the number of labels n>2, it is very difficult to construct and optimize a group of learning functions, and we wish optimized learning functions to remain useful when the prior distribution P(x) (where x is an instance) changes. To resolve this problem, the semantic information G theory, Logical Bayesian Inference (LBI), and a group of Channel Matching (CM) algorithms together form a systematic solution. A semantic channel in the G theory consists of a group of truth functions or membership functions. In comparison with the likelihood functions, Bayesian posteriors, and logistic functions used by popular methods, membership functions can be more conveniently used as learning functions without the above problem. In LBI, every label's learning is independent. For multilabel learning, we can directly obtain a group of optimized membership functions from a large enough labeled sample, without preparing different samples for different labels. A group of CM algorithms is developed for machine learning. For the Maximum Mutual Information (MMI) classification of three classes with Gaussian distributions on a two-dimensional feature space, 2-3 iterations can make the mutual information between the three classes and three labels surpass 99% of the MMI for most initial partitions. For mixture models, the Expectation-Maximization (EM) algorithm is improved into the CM-EM algorithm, which can outperform the EM algorithm when mixture ratios are imbalanced or local convergence occurs. The CM iteration algorithm needs to be combined with neural networks for MMI classification on high-dimensional feature spaces. LBI needs further study toward the unification of statistics and logic.
    Neonatal EEG graded for severity of background abnormalities in hypoxic-ischaemic encephalopathy. (arXiv:2206.04420v2 [physics.med-ph] UPDATED)
    This report describes a set of neonatal electroencephalogram (EEG) recordings graded according to the severity of abnormalities in the background pattern. The dataset consists of 169 hours of multichannel EEG from 53 neonates recorded in a neonatal intensive care unit. All neonates received a diagnosis of hypoxic-ischaemic encephalopathy (HIE), the most common cause of brain injury in full-term infants. For each neonate, multiple 1-hour epochs of good quality EEG were selected and then graded for background abnormalities. The grading system assesses EEG attributes such as amplitude and frequency, continuity, sleep-wake cycling, symmetry and synchrony, and abnormal waveforms. Background severity was then categorised into 4 grades: normal or mildly abnormal EEG, moderately abnormal EEG, severely abnormal EEG, and inactive EEG. The data can be used as a reference set of multi-channel EEG for neonates with HIE, for EEG training purposes, or for developing and evaluating automated grading algorithms.
    Attribute Inference Attack of Speech Emotion Recognition in Federated Learning Settings. (arXiv:2112.13416v3 [cs.CR] UPDATED)
    Speech emotion recognition (SER) processes speech signals to detect and characterize expressed perceived emotions. Many SER systems acquire and transmit speech data collected at the client side to remote cloud platforms for inference and decision making. However, speech data carry rich information not only about emotions conveyed in vocal expressions, but also about other sensitive demographic traits such as gender, age and language background. Consequently, it is desirable for SER systems to have the ability to classify emotion constructs while preventing unintended/improper inferences of sensitive and demographic information. Federated learning (FL) is a distributed machine learning paradigm that coordinates clients to train a model collaboratively without sharing their local data. This training approach appears secure and can improve privacy for SER. However, recent works have demonstrated that FL approaches are still vulnerable to various privacy attacks like reconstruction attacks and membership inference attacks. Although most such studies have focused on computer vision applications, similar information leakage exists in SER systems trained using FL. To assess the information leakage of SER systems trained using FL, we propose an attribute inference attack framework that infers sensitive attribute information of the clients from shared gradients or model parameters, corresponding to the FedSGD and FedAvg training algorithms, respectively. As a use case, we empirically evaluate our approach for predicting the client's gender information using three SER benchmark datasets: IEMOCAP, CREMA-D, and MSP-Improv. We show that the attribute inference attack is achievable for SER systems trained using FL. We further identify that most information leakage possibly comes from the first layer in the SER model.
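    As a hedged illustration of the attack setting (the paper's concrete features, models, and datasets are not reproduced here), the sketch below trains a shadow classifier to predict a binary sensitive attribute from flattened, synthetic stand-ins for the updates a client would share:
        import numpy as np
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import train_test_split

        rng = np.random.default_rng(0)
        n_rounds, dim = 200, 512
        updates = rng.normal(size=(n_rounds, dim))    # stand-in for shared gradients
        gender = rng.integers(0, 2, size=n_rounds)    # sensitive attribute labels
        updates[:, 0] += 0.8 * gender                 # plant a weak leakage signal

        X_tr, X_te, y_tr, y_te = train_test_split(updates, gender, random_state=0)
        attack = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        print("attribute inference accuracy:", attack.score(X_te, y_te))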
    A Review of Deep Transfer Learning and Recent Advancements. (arXiv:2201.09679v2 [cs.LG] UPDATED)
    Deep learning has been the answer to many machine learning problems during the past two decades. However, it comes with two major constraints: dependency on extensive labeled data and training costs. Transfer learning in deep learning, known as Deep Transfer Learning (DTL), attempts to reduce such dependency and costs by reusing knowledge obtained from a source data/task in training on a target data/task. Most applied DTL techniques are network/model-based approaches. These methods reduce the dependency of deep learning models on extensive training data and drastically decrease training costs. As a result, researchers were able to detect Covid-19 infection on chest X-rays with high accuracy at the beginning of the pandemic with minimal data using DTL techniques. Also, the training cost reduction makes DTL viable on edge devices with limited resources. Like any new advancement, DTL methods have their own limitations, and a successful transfer depends on certain adjustments for different scenarios. In this paper, we review the definition and taxonomy of deep transfer learning and well-known methods. We then investigate DTL approaches by reviewing applied DTL techniques from the past five years. Further, we review experimental analyses of DTL to identify best practices for applying DTL in different scenarios. Moreover, the limitations of DTL (the catastrophic forgetting dilemma and overly biased pre-trained models) are discussed, along with possible solutions and research trends.
    RetroComposer: Composing Templates for Template-Based Retrosynthesis Prediction. (arXiv:2112.11225v2 [physics.chem-ph] UPDATED)
    The main target of retrosynthesis is to recursively decompose desired molecules into available building blocks. Existing template-based retrosynthesis methods follow a template selection stereotype and suffer from limited training templates, which prevents them from discovering novel reactions. To overcome this limitation, we propose an innovative retrosynthesis prediction framework that can compose novel templates beyond training templates. As far as we know, this is the first method that uses machine learning to compose reaction templates for retrosynthesis prediction. Besides, we propose an effective reactant candidate scoring model that can capture atom-level transformations, which helps our method outperform previous methods on the USPTO-50K dataset. Experimental results show that our method can produce novel templates for 15 USPTO-50K test reactions that are not covered by training templates. We have released our source implementation.
    Neural network approach to reconstructing spectral functions and complex poles of confined particles. (arXiv:2203.03293v2 [hep-lat] UPDATED)
    Reconstructing spectral functions from propagator data is difficult as solving the analytic continuation problem or applying an inverse integral transformation are ill-conditioned problems. Recent work has proposed using neural networks to solve this problem and has shown promising results, either matching or improving upon the performance of other methods. We generalize this approach by not only reconstructing spectral functions, but also (possible) pairs of complex poles or an infrared (IR) cutoff. We train our network on physically motivated toy functions, examine the reconstruction accuracy and check its robustness to noise. Encouraging results are found on both toy functions and genuine lattice QCD data for the gluon propagator, suggesting that this approach may lead to significant improvements over current state-of-the-art methods.
    Simplex Neural Population Learning: Any-Mixture Bayes-Optimality in Symmetric Zero-sum Games. (arXiv:2205.15879v4 [cs.AI] UPDATED)
    Learning to play optimally against any mixture over a diverse set of strategies is of great practical interest in competitive games. In this paper, we propose simplex-NeuPL, which satisfies two desiderata simultaneously: i) learning a population of strategically diverse basis policies, represented by a single conditional network; ii) using the same network, learning best-responses to any mixture over the simplex of basis policies. We show that the resulting conditional policies incorporate prior information about their opponents effectively, enabling near-optimal returns against arbitrary mixture policies in a game with tractable best-responses. We verify that such policies behave Bayes-optimally under uncertainty and offer insights into using this flexibility at test time. Finally, we offer evidence that learning best-responses to any mixture policies is an effective auxiliary task for strategic exploration, which, by itself, can lead to more performant populations.
    Proximal Learning for Individualized Treatment Regimes Under Unmeasured Confounding. (arXiv:2105.01187v4 [stat.ME] UPDATED)
    Data-driven individualized decision making has recently received increasing research interest. Most existing methods rely on the assumption of no unmeasured confounding, which unfortunately cannot be ensured in practice, especially in observational studies. Motivated by the recently proposed proximal causal inference, we develop several proximal learning approaches to estimating optimal individualized treatment regimes (ITRs) in the presence of unmeasured confounding. In particular, we establish several identification results for different classes of ITRs, exhibiting the trade-off between the risk of making untestable assumptions and the value function improvement in decision making. Based on these results, we propose several classification-based approaches to finding a variety of restricted in-class optimal ITRs and develop their theoretical properties. The appealing numerical performance of our proposed methods is demonstrated via an extensive simulation study and one real data application.
    Label-Enhanced Graph Neural Network for Semi-supervised Node Classification. (arXiv:2205.15653v2 [cs.LG] UPDATED)
    Graph Neural Networks (GNNs) have been widely applied to the semi-supervised node classification task, where a key point lies in how to sufficiently leverage the limited but valuable label information. Most classical GNNs solely use the known labels for computing the classification loss at the output. In recent years, several methods have been designed to additionally utilize the labels at the input. Some of these methods augment the node features by concatenating or adding the one-hot encodings of labels, while others optimize the graph structure by assuming that neighboring nodes tend to have the same label. To fully exploit the rich information of labels, in this paper, we present a label-enhanced learning framework for GNNs, which first models each label as a virtual center for intra-class nodes and then jointly learns the representations of both nodes and labels. Our approach not only smooths the representations of nodes belonging to the same class, but also explicitly encodes the label semantics into the learning process of GNNs. Moreover, a training node selection technique is provided to eliminate the potential label leakage issue and guarantee the model's generalization ability. Finally, an adaptive self-training strategy is proposed to iteratively enlarge the training set with more reliable pseudo labels and to weight the importance of each pseudo-labeled node during model training. Experimental results on both real-world and synthetic datasets demonstrate that our approach not only consistently outperforms the state of the art, but also effectively smooths the representations of intra-class nodes.
    Data-driven Prediction of Relevant Scenarios for Robust Combinatorial Optimization. (arXiv:2203.16642v2 [math.OC] UPDATED)
    We study iterative methods for (two-stage) robust combinatorial optimization problems with discrete uncertainty. We propose a machine-learning-based heuristic to determine starting scenarios that provide strong lower bounds. To this end, we design dimension-independent features and train a Random Forest Classifier on small-dimensional instances. Experiments show that our method improves the solution process for instances larger than those contained in the training set and also provides a feature-importance score that gives insights into the role of scenario properties.
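    A minimal sketch of the recipe as described, with invented feature names and toy labels: train a Random Forest on scenario features from small instances, then rank the scenarios of a larger instance by predicted relevance.
        import numpy as np
        from sklearn.ensemble import RandomForestClassifier

        rng = np.random.default_rng(1)
        X_small = rng.random((500, 4))   # dimension-independent scenario features
        y_small = (X_small[:, 0] + X_small[:, 1] > 1).astype(int)  # toy "relevant" labels

        clf = RandomForestClassifier(n_estimators=200, random_state=1).fit(X_small, y_small)

        X_large = rng.random((50, 4))                  # features of a larger instance
        relevance = clf.predict_proba(X_large)[:, 1]   # rank candidate start scenarios
        start_scenarios = np.argsort(relevance)[::-1][:5]
        print("feature importances:", clf.feature_importances_)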
    Calibrated Multiple-Output Quantile Regression with Representation Learning. (arXiv:2110.00816v2 [cs.LG] UPDATED)
    We develop a method to generate predictive regions that cover a multivariate response variable with a user-specified probability. Our work is composed of two components. First, we use a deep generative model to learn a representation of the response that has a unimodal distribution. Existing multiple-output quantile regression approaches are effective in such cases, so we apply them on the learned representation, and then transform the solution to the original space of the response. This process results in a flexible and informative region that can have an arbitrary shape, a property that existing methods lack. Second, we propose an extension of conformal prediction to the multivariate response setting that modifies any method to return sets with a pre-specified coverage level. The desired coverage is theoretically guaranteed in the finite-sample case for any distribution. Experiments conducted on both real and synthetic data show that our method constructs regions that are significantly smaller compared to existing techniques.
    KenSwQuAD -- A Question Answering Dataset for Swahili Low Resource Language. (arXiv:2205.02364v2 [cs.CL] UPDATED)
    The need for Question Answering datasets in low-resource languages motivates this research, leading to the development of the Kencorpus Swahili Question Answering Dataset, KenSwQuAD. This dataset is annotated from raw story texts of Swahili, a low-resource language predominantly spoken in East Africa and in other parts of the world. Question Answering (QA) datasets are important for machine comprehension of natural language in tasks such as internet search and dialog systems. Machine learning systems need training data such as the gold-standard Question Answering set developed in this research. The research engaged annotators to formulate QA pairs from Swahili texts collected by the Kencorpus project, a Kenyan languages corpus. The project annotated 1,445 texts from the total of 2,585 texts with at least 5 QA pairs each, resulting in a final dataset of 7,526 QA pairs. A quality assurance set of 12.5% of the annotated texts confirmed that the QA pairs were all correctly annotated. A proof of concept applying the set to the QA task confirmed that the dataset is usable for such tasks. KenSwQuAD has also contributed to the resourcing of the Swahili language.
    Physics-Informed Gaussian Process Regression Generalizes Linear PDE Solvers. (arXiv:2212.12474v1 [cs.LG])
    Linear partial differential equations (PDEs) are an important, widely applied class of mechanistic models, describing physical processes such as heat transfer, electromagnetism, and wave propagation. In practice, specialized numerical methods based on discretization are used to solve PDEs. They generally use an estimate of the unknown model parameters and, if available, physical measurements for initialization. Such solvers are often embedded into larger scientific models or analyses with a downstream application such that error quantification plays a key role. However, by entirely ignoring parameter and measurement uncertainty, classical PDE solvers may fail to produce consistent estimates of their inherent approximation error. In this work, we approach this problem in a principled fashion by interpreting solving linear PDEs as physics-informed Gaussian process (GP) regression. Our framework is based on a key generalization of a widely-applied theorem for conditioning GPs on a finite number of direct observations to observations made via an arbitrary bounded linear operator. Crucially, this probabilistic viewpoint allows us to (1) quantify the inherent discretization error; (2) propagate uncertainty about the model parameters to the solution; and (3) condition on noisy measurements. Demonstrating the strength of this formulation, we prove that it strictly generalizes methods of weighted residuals, a central class of PDE solvers including collocation, finite volume, pseudospectral, and (generalized) Galerkin methods such as finite element and spectral methods. This class can thus be directly equipped with a structured error estimate and the capability to incorporate uncertain model parameters and observations. In summary, our results enable the seamless integration of mechanistic models as modular building blocks into probabilistic models.
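    For reference, the affine-conditioning identity the abstract generalizes can be stated compactly (notation assumed here): given a prior $u \sim \mathcal{GP}(m, k)$ and $n$ observations $y = L[u] + \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, \Sigma)$ for a bounded linear operator $L$, the posterior is again Gaussian with
    $$ \mathbb{E}[u(x) \mid y] = m(x) + (Lk(x,\cdot))^{\top} \big(LkL^{*} + \Sigma\big)^{-1} \big(y - L[m]\big), $$
    $$ \mathrm{Cov}[u(x), u(x') \mid y] = k(x,x') - (Lk(x,\cdot))^{\top} \big(LkL^{*} + \Sigma\big)^{-1} (Lk(x',\cdot)), $$
    where $LkL^{*}$ denotes the $n \times n$ Gram matrix obtained by applying the operator to both kernel arguments.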
    A Family of Pairwise Multi-Marginal Optimal Transports that Define a Generalized Metric. (arXiv:2001.11114v6 [cs.LG] UPDATED)
    The Optimal transport (OT) problem is rapidly finding its way into machine learning. Favoring its use are its metric properties. Many problems admit solutions with guarantees only for objects embedded in metric spaces, and the use of non-metrics can complicate solving them. Multi-marginal OT (MMOT) generalizes OT to simultaneously transporting multiple distributions. It captures important relations that are missed if the transport only involves two distributions. Research on MMOT, however, has been focused on its existence, uniqueness, practical algorithms, and the choice of cost functions. There is a lack of discussion on the metric properties of MMOT, which limits its theoretical and practical use. Here, we prove new generalized metric properties for a family of pairwise MMOTs. We first explain the difficulty of proving this via two negative results. Afterward, we prove the MMOTs' metric properties. Finally, we show that the generalized triangle inequality of this family of MMOTs cannot be improved. We illustrate the superiority of our MMOTs over other generalized metrics, and over non-metrics in both synthetic and real tasks.
    Using MM principles to deal with incomplete data in K-means clustering. (arXiv:2212.12379v1 [cs.LG])
    Among many clustering algorithms, the K-means clustering algorithm is widely used because of its simplicity and fast convergence. However, this algorithm suffers from incomplete data, where some samples are missing some of their attributes. To solve this problem, we apply MM principles to restore the symmetry of the data, so that K-means can work well. We give the pseudo-code of the algorithm and use standard datasets for experimental verification. The source code for the experiments is publicly available at the following link: \url{https://github.com/AliBeikmohammadi/MM-Optimization/blob/main/mini-project/MM%20K-means.ipynb}.
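    One plausible MM-style instantiation (assumed here for illustration; the authors' exact pseudo-code is in the paper): alternate between running K-means on completed data and re-imputing each missing entry with its assigned centroid's coordinate.
        import numpy as np
        from sklearn.cluster import KMeans

        def mm_kmeans(X, k, n_iter=20, seed=0):
            X = X.copy()
            miss = np.isnan(X)
            X[miss] = np.take(np.nanmean(X, axis=0), np.where(miss)[1])  # initial fill
            for _ in range(n_iter):
                km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
                # MM step: majorize missing entries by the assigned centroids
                X[miss] = km.cluster_centers_[km.labels_][miss]
            return km.labels_, X

        X = np.random.default_rng(0).normal(size=(100, 3))
        X[np.random.default_rng(1).random(X.shape) < 0.1] = np.nan  # 10% missing
        labels, X_completed = mm_kmeans(X, k=3)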
    An Exact Mapping From ReLU Networks to Spiking Neural Networks. (arXiv:2212.12522v1 [cs.NE])
    Deep spiking neural networks (SNNs) offer the promise of low-power artificial intelligence. However, training deep SNNs from scratch or converting deep artificial neural networks to SNNs without loss of performance has been a challenge. Here we propose an exact mapping from a network with Rectified Linear Units (ReLUs) to an SNN that fires exactly one spike per neuron. For our constructive proof, we assume that an arbitrary multi-layer ReLU network with or without convolutional layers, batch normalization and max pooling layers was trained to high performance on some training set. Furthermore, we assume that we have access to a representative example of input data used during training and to the exact parameters (weights and biases) of the trained ReLU network. The mapping from deep ReLU networks to SNNs causes zero percent drop in accuracy on CIFAR10, CIFAR100 and the ImageNet-like data sets Places365 and PASS. More generally our work shows that an arbitrary deep ReLU network can be replaced by an energy-efficient single-spike neural network without any loss of performance.
    Disentanglement and Generalization Under Correlation Shifts. (arXiv:2112.14754v2 [cs.LG] UPDATED)
    Correlations between factors of variation are prevalent in real-world data. Exploiting such correlations may increase predictive performance on noisy data; however, often correlations are not robust (e.g., they may change between domains, datasets, or applications) and models that exploit them do not generalize when correlations shift. Disentanglement methods aim to learn representations which capture different factors of variation in latent subspaces. A common approach involves minimizing the mutual information between latent subspaces, such that each encodes a single underlying attribute. However, this fails when attributes are correlated. We solve this problem by enforcing independence between subspaces conditioned on the available attributes, which allows us to remove only dependencies that are not due to the correlation structure present in the training data. We achieve this via an adversarial approach to minimize the conditional mutual information (CMI) between subspaces with respect to categorical variables. We first show theoretically that CMI minimization is a good objective for robust disentanglement on linear problems. We then apply our method on real-world datasets based on MNIST and CelebA, and show that it yields models that are disentangled and robust under correlation shift, including in weakly supervised settings.
    Robots with Different Embodiments Can Express and Influence Carefulness in Object Manipulation. (arXiv:2208.02058v2 [cs.RO] UPDATED)
    Humans have an extraordinary ability to communicate and read the properties of objects by simply watching them being carried by someone else. This level of communicative skill and interpretation, available to humans, is essential for collaborative robots if they are to interact naturally and effectively. For example, suppose a robot is handing over a fragile object. In that case, the human who receives it should be informed of its fragility in advance, through an immediate and implicit message, i.e., by the direct modulation of the robot's action. This work investigates the perception of object manipulations performed with a communicative intent by two robots with different embodiments (an iCub humanoid robot and a Baxter robot). We designed the robots' movements to communicate carefulness, or its absence, during the transportation of objects. We found that not only is this feature correctly perceived by human observers, but it can also elicit a form of motor adaptation in subsequent human object manipulations. In addition, we gain insight into which motion features may induce a person to manipulate an object more or less carefully.
    Self-Optimizing Feature Transformation. (arXiv:2209.08044v2 [cs.LG] UPDATED)
    Feature transformation aims to extract a good representation (feature) space by mathematically transforming existing features. It is crucial for addressing the curse of dimensionality, enhancing model generalization, overcoming data sparsity, and expanding the applicability of classic models. Current research focuses on domain-knowledge-based feature engineering or learning latent representations; nevertheless, these methods are not entirely automated and cannot produce a traceable and optimal representation space. Can these limitations be addressed concurrently when rebuilding a feature space for a machine learning task? In this extension study, we present a self-optimizing framework for feature transformation. To achieve better performance, we improve the preliminary work by (1) obtaining an advanced state representation that enables reinforced agents to better comprehend the current feature set; and (2) resolving Q-value overestimation in reinforced agents to learn unbiased and effective policies. Finally, to make the experiments more convincing than in the preliminary work, we add the outlier detection task with five datasets, evaluate various state representation approaches, and compare different training strategies. Extensive experiments and case studies show that our work is more effective and superior.
    Towards Scalable Physically Consistent Neural Networks: an Application to Data-driven Multi-zone Thermal Building Models. (arXiv:2212.12380v1 [cs.LG])
    With more and more data being collected, data-driven modeling methods have been gaining in popularity in recent years. While physically sound, classical gray-box models are often cumbersome to identify and scale, and their accuracy might be hindered by their limited expressiveness. On the other hand, classical black-box methods, typically relying on Neural Networks (NNs) nowadays, often achieve impressive performance, even at scale, by deriving statistical patterns from data. However, they remain completely oblivious to the underlying physical laws, which may lead to potentially catastrophic failures if decisions for real-world physical systems are based on them. Physically Consistent Neural Networks (PCNNs) were recently developed to address these issues, ensuring physical consistency while still leveraging NNs to attain state-of-the-art accuracy. In this work, we scale PCNNs to model building temperature dynamics and propose a thorough comparison with classical gray-box and black-box methods. More precisely, we design three distinct PCNN extensions, thereby exemplifying the modularity and flexibility of the architecture, and formally prove their physical consistency. In the presented case study, PCNNs are shown to achieve state-of-the-art accuracy, even outperforming classical NN-based models despite their constrained structure. Our investigations furthermore provide a clear illustration of NNs achieving seemingly good performance while remaining completely physics-agnostic, which can be misleading in practice. While this performance comes at the cost of computational complexity, PCNNs on the other hand show accuracy improvements of 17-35% compared to all other physically consistent methods, paving the way for scalable physically consistent models with state-of-the-art performance.
    HAC-Net: A Hybrid Attention-Based Convolutional Neural Network for Highly Accurate Protein-Ligand Binding Affinity Prediction. (arXiv:2212.12440v1 [q-bio.BM])
    Applying deep learning concepts from image detection and graph theory has greatly advanced protein-ligand binding affinity prediction, a challenge with enormous ramifications for both drug discovery and protein engineering. We build upon these advances by designing a novel deep learning architecture consisting of a 3-dimensional convolutional neural network utilizing channel-wise attention and two graph convolutional networks utilizing attention-based aggregation of node features. HAC-Net (Hybrid Attention-Based Convolutional Neural Network) obtains state-of-the-art results on the PDBbind v.2016 core set, the most widely recognized benchmark in the field. We extensively assess the generalizability of our model using multiple train-test splits, each of which maximizes differences between either protein structures, protein sequences, or ligand extended-connectivity fingerprints. Furthermore, we perform 10-fold cross-validation with a similarity cutoff between SMILES strings of ligands in the training and test sets, and also evaluate the performance of HAC-Net on lower-quality data. We envision that this model can be extended to a broad range of supervised learning problems related to structure-based biomolecular property prediction. All of our software is available as open source at https://github.com/gregory-kyro/HAC-Net/.
    FFNeRV: Flow-Guided Frame-Wise Neural Representations for Videos. (arXiv:2212.12294v1 [cs.CV])
    Neural fields, also known as coordinate-based or implicit neural representations, have shown a remarkable capability of representing, generating, and manipulating various forms of signals. For video representations, however, mapping pixel-wise coordinates to RGB colors has shown relatively low compression performance and slow convergence and inference speed. Frame-wise video representation, which maps a temporal coordinate to its entire frame, has recently emerged as an alternative method to represent videos, improving compression rates and encoding speed. While promising, it has still failed to reach the performance of state-of-the-art video compression algorithms. In this work, we propose FFNeRV, a novel method for incorporating flow information into frame-wise representations to exploit the temporal redundancy across the frames in videos inspired by the standard video codecs. Furthermore, we introduce a fully convolutional architecture, enabled by one-dimensional temporal grids, improving the continuity of spatial features. Experimental results show that FFNeRV yields the best performance for video compression and frame interpolation among the methods using frame-wise representations or neural fields. To reduce the model size even further, we devise a more compact convolutional architecture using the group and pointwise convolutions. With model compression techniques, including quantization-aware training and entropy coding, FFNeRV outperforms widely-used standard video codecs (H.264 and HEVC) and performs on par with state-of-the-art video compression algorithms.
    Bring Your Own View: Graph Neural Networks for Link Prediction with Personalized Subgraph Selection. (arXiv:2212.12488v1 [cs.IR])
    Graph neural networks (GNNs) have achieved remarkable success in link prediction (GNNLP) tasks. Existing efforts first predefine the subgraph for the whole dataset and then apply GNNs to encode edge representations by leveraging the neighborhood structure induced by the fixed subgraph. The prominence of GNNLP methods significantly relies on this ad-hoc subgraph. Since node connectivity in real-world graphs is complex, one shared subgraph is limiting for all edges. Thus, the choice of subgraph should be personalized to different edges. However, performing personalized subgraph selection is nontrivial since the potential selection space grows exponentially with the number of edges. Besides, the inference edges are not available during training in link prediction scenarios, so the selection process needs to be inductive. To bridge the gap, we introduce a Personalized Subgraph Selector (PS2) as a plug-and-play framework to automatically, personally, and inductively identify optimal subgraphs for different edges when performing GNNLP. PS2 is instantiated as a bi-level optimization problem that can be efficiently solved differentiably. Coupling GNNLP models with PS2, we suggest a brand-new angle towards GNNLP training: first identifying the optimal subgraphs for edges, and then focusing on training the inference model using the sampled subgraphs. Comprehensive experiments endorse the effectiveness of our proposed method across various GNNLP backbones (GCN, GraphSage, NGCF, LightGCN, and SEAL) and diverse benchmarks (Planetoid, OGB, and Recommendation datasets). Our code is publicly available at \url{https://github.com/qiaoyu-tan/PS2}.
    Introduction to Machine Learning for Physicians: A Survival Guide for Data Deluge. (arXiv:2212.12303v1 [cs.LG])
    Many modern research fields increasingly rely on collecting and analysing massive, often unstructured, and unwieldy datasets. Consequently, there is growing interest in machine learning and artificial intelligence applications that can harness this `data deluge'. This broad nontechnical overview provides a gentle introduction to machine learning with a specific focus on medical and biological applications. We explain the common types of machine learning algorithms and typical tasks that can be solved, illustrating the basics with concrete examples from healthcare. Lastly, we provide an outlook on open challenges, limitations, and potential impacts of machine-learning-powered medicine.
    Approaching Globally Optimal Energy Efficiency in Interference Networks via Machine Learning. (arXiv:2212.12329v1 [eess.SP])
    This work presents a machine learning approach to optimizing the energy efficiency (EE) in a multi-cell wireless network. This optimization problem is non-convex and its global optimum is difficult to find. In the literature, either simple but suboptimal approaches or optimal methods with high complexity and poor scalability have been proposed. In contrast, we propose a machine learning framework to approach the global optimum. While the neural network (NN) training takes moderate time, applying the trained model requires very low computational complexity. In particular, we introduce a novel objective function based on stochastic actions to solve the non-convex optimization problem. Besides, we design a dedicated NN architecture for multi-cell network optimization problems that is permutation-equivariant. It classifies channels according to their roles in the EE computation. In this way, we encode our domain knowledge into the NN design and shed light on the black box of machine learning. Training and testing results show that the proposed method, without supervision and with reasonable computational effort, achieves an EE close to the global optimum found by the branch-and-bound algorithm. Hence, the proposed approach balances computational complexity and performance.
    NARS vs. Reinforcement learning: ONA vs. Q-Learning. (arXiv:2212.12517v1 [cs.LG])
    A realistic scenario in many applications is taking a sequence of optimal actions to perform a task. Reinforcement learning (RL) is the most well-known approach to this kind of task in the machine learning community, and finding a suitable alternative is always an interesting, out-of-the-box question. In this project, we therefore investigate the capability of NARS and ask whether NARS has the potential to substitute for RL. In particular, we compare $Q$-Learning and ONA on several environments built with OpenAI Gym. The source code for the experiments is publicly available at the following link: \url{https://github.com/AliBeikmohammadi/OpenNARS-for-Applications/tree/master/misc/Python}.
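    For concreteness, a minimal tabular $Q$-Learning loop of the kind compared above; the environment choice and hyperparameters are illustrative, and the Gym >= 0.26 reset/step API is assumed.
        import numpy as np
        import gym

        env = gym.make("FrozenLake-v1", is_slippery=False)
        Q = np.zeros((env.observation_space.n, env.action_space.n))
        alpha, gamma, eps = 0.1, 0.99, 0.1   # step size, discount, exploration rate

        for episode in range(2000):
            s, _ = env.reset()
            done = False
            while not done:
                # epsilon-greedy action selection
                a = env.action_space.sample() if np.random.rand() < eps else int(np.argmax(Q[s]))
                s2, r, terminated, truncated, _ = env.step(a)
                done = terminated or truncated
                # Q-Learning update: bootstrap from the greedy value of the next state
                Q[s, a] += alpha * (r + gamma * np.max(Q[s2]) - Q[s, a])
                s = s2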
    The choice of scaling technique matters for classification performance. (arXiv:2212.12343v1 [cs.LG])
    Dataset scaling, also known as normalization, is an essential preprocessing step in a machine learning pipeline. It aims to adjust attribute scales so that they all vary within the same range. This transformation is known to improve the performance of classification models, but there are several scaling techniques to choose from, and the choice is generally not made carefully. In this paper, we execute a broad experiment comparing the impact of 5 scaling techniques on the performance of 20 classification algorithms, among monolithic and ensemble models, applying them to 82 publicly available datasets with varying imbalance ratios. Results show that the choice of scaling technique matters for classification performance, and the performance difference between the best and the worst scaling technique is relevant and statistically significant in most cases. They also indicate that choosing an inadequate technique can be more detrimental to classification performance than not scaling the data at all. We also show how the performance variation of an ensemble model across scaling techniques tends to be dictated by that of its base model. Finally, we discuss the relationship between a model's sensitivity to the choice of scaling technique and its performance, and provide insights into its applicability in different model deployment scenarios. Full results and source code for the experiments in this paper are available in a GitHub repository.\footnote{https://github.com/amorimlb/scaling\_matters}
    Statistical Distance Based Deterministic Offspring Selection in SMC Methods. (arXiv:2212.12290v1 [stat.ML])
    Over the years, sequential Monte Carlo (SMC) and, equivalently, particle filter (PF) theory has gained substantial attention from researchers. However, the performance of the resampling methodology, also known as offspring selection, has not advanced recently. We propose two deterministic offspring selection methods, which strive to minimize the Kullback-Leibler (KL) divergence and the total variation (TV) distance, respectively, between the particle distributions before and after offspring selection. By reducing the statistical distance between the selected offspring and the joint distribution, we obtain a heuristic search procedure that outperforms a maximum likelihood search in precisely those contexts where the latter performs better than an SMC. For SMC and particle Markov chain Monte Carlo (pMCMC), our proposed offspring selection methods always outperform or compare favorably with the two state-of-the-art resampling schemes on two models commonly used as benchmarks in the literature.
    Text classification in shipping industry using unsupervised models and Transformer based supervised models. (arXiv:2212.12407v1 [cs.CL])
    Obtaining labelled data in a particular context can be expensive and time consuming. Although different algorithms, including unsupervised learning, semi-supervised learning, and self-learning, have been adopted, the performance of text classification varies with context. Given the lack of a labelled dataset, we propose a novel and simple unsupervised text classification model to classify cargo content in the international shipping industry using the Standard International Trade Classification (SITC) codes. Our method represents words using pretrained GloVe word embeddings and finds the most likely label using cosine similarity. To compare the unsupervised text classification model with supervised classification, we also applied several Transformer models to classify cargo content. Due to the lack of training data, the SITC numerical codes and the corresponding textual descriptions were used as training data. A small number of manually labelled cargo content data was used to evaluate the classification performance of the unsupervised classification and the Transformer-based supervised classification. The comparison reveals that unsupervised classification significantly outperforms Transformer-based supervised classification, even after increasing the size of the training dataset by 30%. Lack of training data is a key bottleneck that prevents deep learning models (such as Transformers) from being applied successfully in practice. Unsupervised classification can provide an alternative, efficient and effective method to classify text when training data is scarce.
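    A sketch of the described unsupervised classifier: embed a cargo description by averaging pretrained GloVe vectors, embed each SITC label description the same way, and pick the label with the highest cosine similarity. The tiny embedding table and SITC entries below are stand-ins for the real resources.
        import numpy as np

        glove = {  # stand-in; in practice, load vectors from a GloVe file
            "frozen": np.array([0.9, 0.1]), "fish": np.array([0.8, 0.2]),
            "steel":  np.array([0.1, 0.9]), "pipes": np.array([0.2, 0.8]),
        }

        def embed(text):
            vecs = [glove[w] for w in text.lower().split() if w in glove]
            return np.mean(vecs, axis=0) if vecs else np.zeros(2)

        def cosine(a, b):
            return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

        sitc = {"034": "fish frozen", "679": "steel pipes"}  # code: description
        cargo = "frozen fish fillets"
        label = max(sitc, key=lambda c: cosine(embed(cargo), embed(sitc[c])))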
    Principled and Efficient Transfer Learning of Deep Models via Neural Collapse. (arXiv:2212.12206v1 [cs.LG])
    With the ever-growing model size and the limited availability of labeled training data, transfer learning has become an increasingly popular approach in many science and engineering domains. For classification problems, this work delves into the mystery of transfer learning through an intriguing phenomenon termed neural collapse (NC), where the last-layer features and classifiers of learned deep networks satisfy: (i) the within-class variability of the features collapses to zero, and (ii) the between-class feature means are maximally and equally separated. Through the lens of NC, our findings for transfer learning are the following: (i) when pre-training models, preventing intra-class variability collapse (to a certain extent) better preserves the intrinsic structures of the input data, so that it leads to better model transferability; (ii) when fine-tuning models on downstream tasks, obtaining features with more NC on downstream data results in better test accuracy on the given task. The above results not only demystify many widely used heuristics in model pre-training (e.g., data augmentation, projection head, self-supervised learning), but also lead to a more efficient and principled fine-tuning method on downstream tasks, which we demonstrate through extensive experimental results.
    Channel charting based beamforming. (arXiv:2212.12340v1 [cs.NI])
    Channel charting (CC) is an unsupervised learning method that locates users relative to each other without reference. From a broader perspective, it can be viewed as a way to discover a low-dimensional latent space charting the channel manifold. In this paper, this latent modeling vision is leveraged together with a recently proposed location-based beamforming (LBB) method to show that channel charting can be used for mapping channels in space or frequency. Combining CC and LBB yields a neural network resembling an autoencoder. The proposed method is empirically assessed on a channel mapping task whose objective is to predict downlink channels from uplink channels.
    Multi-objective and multi-fidelity Bayesian optimization of laser-plasma acceleration. (arXiv:2210.03484v2 [physics.acc-ph] UPDATED)
    Beam parameter optimization in accelerators involves multiple, sometimes competing objectives. Condensing these individual objectives into a single figure of merit unavoidably results in a bias towards particular outcomes, in absence of prior knowledge often in a non-desired way. Finding an optimal objective definition then requires operators to iterate over many possible objective weights and definitions, a process that can take many times longer than the optimization itself. A more versatile approach is multi-objective optimization, which establishes the trade-off curve or Pareto front between objectives. Here we present the first results on multi-objective Bayesian optimization of a simulated laser-plasma accelerator. We find that multi-objective optimization reaches comparable performance to its single-objective counterparts while allowing for instant evaluation of entirely new objectives. This dramatically reduces the time required to find appropriate objective definitions for new problems. Additionally, our multi-objective, multi-fidelity method reduces the time required for an optimization run by an order of magnitude. It does so by dynamically choosing simulation resolution and box size, requiring fewer slow and expensive simulations as it learns about the Pareto-optimal solutions from fast low-resolution runs. The techniques demonstrated in this paper can easily be translated into many different computational and experimental use cases beyond accelerator optimization.
    Investigation of reinforcement learning for shape optimization of profile extrusion dies. (arXiv:2212.12207v1 [cs.CE])
    Profile extrusion is a continuous production process for manufacturing plastic profiles from molten polymer. Especially interesting is the design of the die, through which the melt is pressed to attain the desired shape. However, due to an inhomogeneous velocity distribution at the die exit or residual stresses inside the extrudate, the final shape of the manufactured part often deviates from the desired one. To avoid these deviations, the shape of the die can be computationally optimized, which has already been investigated in the literature using classical optimization approaches. A new approach in the field of shape optimization is the utilization of Reinforcement Learning (RL) as a learning-based optimization algorithm. RL is based on trial-and-error interactions of an agent with an environment. For each action, the agent is rewarded and informed about the subsequent state of the environment. While not necessarily superior to classical, e.g., gradient-based or evolutionary, optimization algorithms for one single problem, RL techniques are expected to perform especially well when similar optimization tasks are repeated since the agent learns a more general strategy for generating optimal shapes instead of concentrating on just one single problem. In this work, we investigate this approach by applying it to two 2D test cases. The flow-channel geometry can be modified by the RL agent using so-called Free-Form Deformation, a method where the computational mesh is embedded into a transformation spline, which is then manipulated based on the control-point positions. In particular, we investigate the impact of utilizing different agents on the training progress and the potential of wall time saving by utilizing multiple environments during training.
    MN-DS: A Multilabeled News Dataset for News Articles Hierarchical Classification. (arXiv:2212.12061v1 [cs.CL])
    This article presents a dataset of 10,917 news articles with hierarchical news categories, collected between January 1st and December 31st, 2019. We manually labelled the articles based on a hierarchical taxonomy with 17 first-level and 109 second-level categories. The dataset can be used to train machine learning models for automatically classifying news articles by topic. It can be helpful for researchers working on news structuring, classification, and the prediction of future events based on released news.
    Adaptive Risk-Aware Bidding with Budget Constraint in Display Advertising. (arXiv:2212.12533v1 [cs.IR])
    Real-time bidding (RTB) has become a major paradigm of display advertising. Each ad impression generated from a user visit is auctioned in real time, where a demand-side platform (DSP) automatically provides a bid price, usually relying on estimating the ad impression value and determining the optimal bid price. However, current bid strategies overlook the large randomness of user behaviors (e.g., clicks) and the cost uncertainty caused by auction competition. In this work, we explicitly factor in the uncertainty of estimated ad impression values and model the risk preference of a DSP under a specific state and market environment via a sequential decision process. Specifically, we propose a novel adaptive risk-aware bidding algorithm with a budget constraint via reinforcement learning, which is the first to simultaneously consider estimation uncertainty and the dynamic risk tendency of a DSP. We theoretically unveil the intrinsic relation between the uncertainty and the risk tendency based on value at risk (VaR). Consequently, we propose two instantiations for modeling risk tendency: an expert-knowledge-based formulation embracing three essential properties, and an adaptive learning method based on self-supervised reinforcement learning. We conduct extensive experiments on public datasets and show that the proposed framework outperforms state-of-the-art methods in practical settings.
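    For reference, the standard definition underlying the VaR-based analysis (the paper's exact formulation may differ) is
    $$ \mathrm{VaR}_{\alpha}(X) = \inf\{\, x \in \mathbb{R} : \mathbb{P}(X \le x) \ge \alpha \,\}, $$
    i.e., the $\alpha$-quantile of the loss (or value) distribution $X$.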
    Benchmark for Uncertainty & Robustness in Self-Supervised Learning. (arXiv:2212.12411v1 [cs.CV])
    Self-Supervised Learning (SSL) is crucial for real-world applications, especially in data-hungry domains such as healthcare and self-driving cars. In addition to a lack of labeled data, these applications also suffer from distributional shifts. Therefore, an SSL method should provide robust generalization and uncertainty estimation on the test dataset to be considered a reliable model in such high-stakes domains. However, existing approaches often focus on generalization without evaluating the model's uncertainty. The ability to compare SSL techniques for improving these estimates is therefore critical for research on the reliability of self-supervision models. In this paper, we explore variants of SSL methods, including Jigsaw Puzzles, Context, Rotation, and Geometric Transformations Prediction for vision, as well as BERT and GPT for language tasks. We train SSL via auxiliary learning for vision and via pre-training for language models, then evaluate generalization (in-out classification accuracy) and uncertainty (expected calibration error) across different distribution covariate shift datasets, including MNIST-C, CIFAR-10-C, CIFAR-10.1, and MNLI. Our goal is to create a benchmark with outputs from our experiments, providing a starting point for new SSL methods in Reliable Machine Learning. All source code to reproduce the results is available at https://github.com/hamanhbui/reliable_ssl_baselines.
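    The reported uncertainty metric, expected calibration error (ECE), has a standard equal-width-bin form; a minimal sketch follows (the 15-bin variant is an assumption, not necessarily the benchmark's setting).
        import numpy as np

        def expected_calibration_error(conf, pred, labels, n_bins=15):
            """conf: max predicted probabilities; pred: argmax classes; labels: truth."""
            ece, edges = 0.0, np.linspace(0.0, 1.0, n_bins + 1)
            for lo, hi in zip(edges[:-1], edges[1:]):
                in_bin = (conf > lo) & (conf <= hi)
                if in_bin.any():
                    acc = (pred[in_bin] == labels[in_bin]).mean()
                    # weight each bin's |accuracy - confidence| gap by its mass
                    ece += in_bin.mean() * abs(acc - conf[in_bin].mean())
            return ece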
    Alignment Entropy Regularization. (arXiv:2212.12442v1 [cs.CL])
    Existing training criteria in automatic speech recognition (ASR) permit the model to freely explore more than one time alignment between the feature and label sequences. In this paper, we use entropy to measure a model's uncertainty, i.e., how it chooses to distribute the probability mass over the set of allowed alignments. Furthermore, we evaluate the effect of entropy regularization in encouraging the model to distribute the probability mass over only a smaller subset of allowed alignments. Experiments show that entropy regularization enables a much simpler decoding method without sacrificing word error rate, and provides better time alignment quality.
    Relational Local Explanations. (arXiv:2212.12374v1 [cs.LG])
    The majority of existing post-hoc explanation approaches for machine learning models produce independent, per-variable feature attribution scores, ignoring a critical characteristic: the inter-variable relationships between features that naturally occur in visual and textual data. In response, we develop a novel model-agnostic and permutation-based feature attribution algorithm based on the relational analysis between input variables. As a result, we are able to gain a broader insight into machine learning model decisions and data. This type of local explanation measures the effects of interrelationships between local features, which provides another critical aspect of explanations. Experimental evaluations of our framework using setups involving both image and text data modalities demonstrate its effectiveness and validity.
    A-NeSI: A Scalable Approximate Method for Probabilistic Neurosymbolic Inference. (arXiv:2212.12393v1 [cs.LG])
    We study the problem of combining neural networks with symbolic reasoning. Recently introduced frameworks for Probabilistic Neurosymbolic Learning (PNL), such as DeepProbLog, perform exponential-time exact inference, limiting the scalability of PNL solutions. We introduce Approximate Neurosymbolic Inference (A-NeSI): a new framework for PNL that uses neural networks for scalable approximate inference. A-NeSI 1) performs approximate inference in polynomial time without changing the semantics of probabilistic logics; 2) is trained using data generated by the background knowledge; 3) can generate symbolic explanations of predictions; and 4) can guarantee the satisfaction of logical constraints at test time, which is vital in safety-critical applications. Our experiments show that A-NeSI is the first end-to-end method to scale the Multi-digit MNISTAdd benchmark to sums of 15 MNIST digits, up from 4 in competing systems. Finally, our experiments show that A-NeSI achieves explainability and safety without a penalty in performance.
    Networked Federated Learning. (arXiv:2105.12769v3 [cs.LG] UPDATED)
    We develop the theory and algorithmic toolbox for networked federated learning in decentralized collections of local datasets with an intrinsic network structure. This network structure arises from domain-specific notions of similarity between local datasets. Different notions of similarity are induced by spatio-temporal proximity, statistical dependencies or functional relations. Our main conceptual contribution is to formulate networked federated learning using a generalized total variation minimization. This formulation unifies and considerably extends existing federated multi-task learning methods. It is highly flexible and can be combined with a broad range of parametric models including Lasso or deep neural networks. Our main algorithmic contribution is a novel networked federated learning algorithm which is well-suited for distributed computing environments such as edge computing over wireless networks. This algorithm is robust against inexact computations due to limited computational resources. For local models resulting in convex problems, we derive precise conditions on the local models and their network structure such that our algorithm learns nearly optimal local models. Our analysis reveals an interesting interplay between the convex geometry of local models and the (cluster-) geometry of their network structure.
    Look Around! A Neighbor Relation Graph Learning Framework for Real Estate Appraisal. (arXiv:2212.12190v1 [cs.LG])
    Real estate appraisal, which aims to value properties on the market, is a crucial issue for urban applications. Traditional methods perform appraisal based on domain knowledge, but require laborious hand-crafted design. Recently, several methods have been developed to automate the valuation process by taking property trading transactions into account when estimating property value. However, existing methods only consider the real estate itself, ignoring the relations between properties. Moreover, naively aggregating the information of neighbors fails to model the relationships between transactions. To tackle these limitations, we propose a novel Neighbor Relation Graph Learning Framework (ReGram) that incorporates the relations between the target transaction and surrounding neighbors with an attention mechanism. To model the influence between communities, we integrate the environmental information and past prices of transactions from other communities. Moreover, since target transactions in different regions share some similarities and differences in characteristics, we introduce a dynamic adapter to model the different distributions of target transactions based on input-related kernel weights. Extensive experiments on a real-world dataset with various scenarios demonstrate that ReGram robustly outperforms the state-of-the-art methods. Furthermore, comprehensive ablation studies examine the effectiveness of each component in ReGram.
    Stop using the elbow criterion for k-means and how to choose the number of clusters instead. (arXiv:2212.12189v1 [stat.ML])
    A major challenge when using k-means clustering is often how to choose the parameter k, the number of clusters. In this letter, we want to point out that it is very easy to draw poor conclusions from a common heuristic, the "elbow method". Better alternatives have been known in the literature for a long time, and we want to draw attention to some of these easy-to-use options, which often perform better. This letter is a call to stop using the elbow method altogether, because it severely lacks theoretical support, and we want to encourage educators to discuss the problems of the method -- if introducing it in class at all -- and to teach alternatives instead, while researchers and reviewers should reject conclusions drawn from the elbow method.
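    As one long-known alternative (silhouette analysis is our illustrative pick here; the letter discusses several options), choose k by maximizing the mean silhouette score:
        import numpy as np
        from sklearn.cluster import KMeans
        from sklearn.datasets import make_blobs
        from sklearn.metrics import silhouette_score

        X, _ = make_blobs(n_samples=500, centers=4, random_state=0)
        scores = {k: silhouette_score(X, KMeans(n_clusters=k, n_init=10,
                                                random_state=0).fit_predict(X))
                  for k in range(2, 10)}
        best_k = max(scores, key=scores.get)  # typically recovers k = 4 here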
    Infrared Image Super-Resolution: Systematic Review, and Future Trends. (arXiv:2212.12322v1 [eess.IV])
    Image Super-Resolution (SR) is essential for a wide range of computer vision and image processing tasks. Investigating infrared (IR) image (or thermal image) super-resolution is a continuing concern within the development of deep learning. This survey aims to provide a comprehensive perspective on IR image super-resolution, including its applications, hardware imaging system dilemmas, and a taxonomy of image processing methodologies. In addition, the datasets and evaluation metrics in IR image super-resolution tasks are also discussed. Furthermore, the deficiencies in current technologies and possible promising directions for the community to explore are highlighted. To cope with the rapid development in this field, we intend to regularly update the relevant excellent work at \url{https://github.com/yongsongH/Infrared_Image_SR_Survey}.  ( 2 min )
    On Calibrating Semantic Segmentation Models: Analysis and An Algorithm. (arXiv:2212.12053v1 [cs.CV])
    We study the problem of semantic segmentation calibration. For image classification, many existing solutions have been proposed to alleviate model miscalibration of confidence. However, to date, confidence calibration research on semantic segmentation is still limited. We provide a systematic study of the calibration of semantic segmentation models and propose a simple yet effective approach. First, we find that model capacity, crop size, multi-scale testing, and prediction correctness all affect calibration. Among them, prediction correctness, especially misprediction, matters more for miscalibration due to over-confidence. Next, we propose a simple, unifying, and effective approach, namely selective scaling, which separates correct and incorrect predictions for scaling and focuses more on smoothing misprediction logits. Then, we study popular existing calibration methods and compare them with selective scaling on semantic segmentation calibration. We conduct extensive experiments on a variety of benchmarks, covering both in-domain and domain-shift calibration, and show that selective scaling consistently outperforms other methods.  ( 2 min )
    Exploring the Optimized Value of Each Hyperparameter in Various Gradient Descent Algorithms. (arXiv:2212.12279v1 [cs.LG])
    In recent years, various gradient descent algorithms, including plain gradient descent, gradient descent with momentum, adaptive gradient (AdaGrad), root-mean-square propagation (RMSProp), and adaptive moment estimation (Adam), have been applied to the parameter optimization of several deep learning models, yielding higher accuracies or lower errors. These optimization algorithms may require setting the values of several hyperparameters, including a learning rate, momentum coefficients, etc. Furthermore, the convergence speed and solution accuracy may be influenced by the hyperparameter values. Therefore, this study proposes an analytical framework that uses mathematical models to analyze the mean error of each objective function under the various gradient descent algorithms. Moreover, a suitable value for each hyperparameter can be determined by minimizing this mean error. Principles for setting hyperparameter values are generalized from the analysis results for model optimization. The experimental results show that faster convergence and lower errors can be obtained with the proposed method.  ( 2 min )
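    For reference, the update rules under analysis, written as plain NumPy steps with common default hyperparameters (the paper's tuned values may differ):
        import numpy as np

        def sgd(w, g, lr=0.01):
            return w - lr * g

        def momentum(w, g, v, lr=0.01, beta=0.9):
            v = beta * v + g                   # accumulate a velocity term
            return w - lr * v, v

        def adagrad(w, g, s, lr=0.01, eps=1e-8):
            s = s + g**2                       # accumulate squared gradients
            return w - lr * g / (np.sqrt(s) + eps), s

        def rmsprop(w, g, s, lr=0.001, rho=0.9, eps=1e-8):
            s = rho * s + (1 - rho) * g**2     # exponential moving average
            return w - lr * g / (np.sqrt(s) + eps), s

        def adam(w, g, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
            m = b1 * m + (1 - b1) * g
            v = b2 * v + (1 - b2) * g**2
            m_hat, v_hat = m / (1 - b1**t), v / (1 - b2**t)  # bias correction
            return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v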
    DAS: Neural Architecture Search via Distinguishing Activation Score. (arXiv:2212.12132v1 [cs.LG])
    Neural Architecture Search (NAS) is an automatic technique that can search for well-performing architectures for a specific task. Although NAS surpasses human-designed architectures in many fields, the high computational cost of architecture evaluation hinders its development. A feasible solution is to evaluate some metric of an architecture directly at initialization, without any training. The NAS-without-training (WOT) score is such a metric; it estimates the final trained accuracy of an architecture through its ability to distinguish different inputs at the activation layers. However, the WOT score is not an atomic metric, meaning that it does not represent a fundamental property of the architecture. The contributions of this paper are threefold. First, we decouple WOT into two atomic metrics that represent the distinguishing ability of the network and the number of activation units, and we explore better combination rules, named the Distinguishing Activation Score (DAS). We prove the correctness of this decoupling theoretically and confirm the effectiveness of the rules experimentally. Second, to improve the prediction accuracy of DAS so that it meets practical search requirements, we propose a fast training strategy; when DAS is combined with this strategy, it yields further improvements. Third, we propose a dataset called Darts-training-bench (DTB), which fills the gap that existing datasets lack training states of architectures. Our proposed method achieves 1.04$\times$ - 1.56$\times$ improvements on NAS-Bench-101, Network Design Spaces, and the proposed DTB.  ( 2 min )
    Piecewise-Velocity Model for Learning Continuous-time Dynamic Node Representations. (arXiv:2212.12345v1 [cs.LG])
    Networks have become indispensable and ubiquitous structures in many fields to model the interactions among different entities, such as friendship in social networks or protein interactions in biological graphs. A major challenge is to understand the structure and dynamics of these systems. Although networks evolve through time, most existing graph representation learning methods target only static networks. While approaches have been developed for the modeling of dynamic networks, there is a lack of efficient continuous-time dynamic graph representation learning methods that can provide accurate network characterization and visualization in low dimensions while explicitly accounting for prominent network characteristics such as homophily and transitivity. In this paper, we propose the Piecewise-Velocity Model (PiVeM) for the representation of continuous-time dynamic networks. It learns dynamic embeddings in which the temporal evolution of nodes is approximated by piecewise linear interpolations based on a latent distance model with piecewise constant node-specific velocities. The model allows for analytically tractable expressions of the associated Poisson process likelihood, with scalable inference invariant to the number of events. We further impose a scalable Kronecker-structured Gaussian process prior on the dynamics, accounting for community structure, temporal smoothness, and disentangled (uncorrelated) latent embedding dimensions optimally learned to characterize the network dynamics. We show that PiVeM can successfully represent network structure and dynamics in ultra-low two-dimensional spaces. It outperforms relevant state-of-the-art methods in downstream tasks such as link prediction. In summary, PiVeM enables easily interpretable dynamic network visualizations and characterizations that can further improve our understanding of the intrinsic dynamics of time-evolving networks.  ( 2 min )
    Rule Learning by Modularity. (arXiv:2212.12335v1 [cs.LG])
    In this paper, we present a modular methodology that combines state-of-the-art methods in (stochastic) machine learning with traditional methods in rule learning to provide efficient and scalable algorithms for the classification of vast data sets, while remaining explainable. Apart from evaluating our approach on the common large-scale data sets MNIST, Fashion-MNIST and IMDB, we present novel results on explainable classifications of dental bills. The latter case study stems from an industrial collaboration with Allianz Private Krankenversicherungs-Aktiengesellschaft, an insurance company offering diverse services in Germany.  ( 2 min )
    Do DALL-E and Flamingo Understand Each Other?. (arXiv:2212.12249v1 [cs.CV])
    A major goal of multimodal research is to improve machine understanding of images and text. Tasks include image captioning, text-to-image generation, and vision-language representation learning. So far, research has focused on the relationships between images and text. For example, captioning models attempt to understand the semantics of images which are then transformed into text. An important question is: which annotation best reflects a deep understanding of image content? Similarly, given a text, what is the best image that can present the semantics of the text? In this work, we argue that the best text or caption for a given image is the text that would generate the image most similar to that image. Likewise, the best image for a given text is the image that results in the caption best aligned with the original text. To this end, we propose a unified framework that includes both a text-to-image generative model and an image-to-text generative model. Extensive experiments validate our approach.  ( 2 min )
    Federated PCA on Grassmann Manifold for Anomaly Detection in IoT Networks. (arXiv:2212.12121v1 [cs.LG])
    In the era of the Internet of Things (IoT), network-wide anomaly detection is a crucial part of monitoring IoT networks due to the inherent security vulnerabilities of most IoT devices. Principal Components Analysis (PCA) has been proposed to separate network traffic into two disjoint subspaces corresponding to normal and malicious behaviors for anomaly detection. However, privacy concerns and the limitations of devices' computing resources compromise the practical effectiveness of PCA. We propose a federated PCA-based Grassmannian optimization framework that coordinates IoT devices to aggregate a joint profile of normal network behaviors for anomaly detection. First, we introduce a privacy-preserving federated PCA framework to simultaneously capture the traffic profiles of various IoT devices. Then, we investigate alternating-direction-method-of-multipliers gradient-based learning on the Grassmann manifold to guarantee fast training and avoid detection latency under limited computational resources. Empirical results on the NSL-KDD dataset demonstrate that our method outperforms baseline approaches. Finally, we show that the Grassmann manifold algorithm is highly suited to IoT anomaly detection, permitting a drastic reduction in the analysis time of the system. To the best of our knowledge, this is the first federated PCA algorithm for anomaly detection meeting the requirements of IoT networks.  ( 2 min )
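    As background for the subspace idea, a centralized PCA detector projects each sample onto the residual (minor) subspace and thresholds the squared prediction error; the paper's federated, Grassmannian-optimized version goes well beyond this sketch, and the data below are synthetic stand-ins:

        import numpy as np

        rng = np.random.default_rng(0)
        train = 0.1 * rng.normal(size=(500, 20)) @ rng.normal(size=(20, 20))  # correlated "normal" traffic
        mu = train.mean(axis=0)
        _, _, Vt = np.linalg.svd(train - mu, full_matrices=False)
        P = Vt[:5].T                                  # top-5 principal subspace

        def spe(x):                                   # squared prediction error
            r = (x - mu) - P @ (P.T @ (x - mu))       # residual-subspace component
            return float(r @ r)

        threshold = np.quantile([spe(x) for x in train], 0.99)
        anomaly = train[0] + 5.0                      # inject an abnormal shift
        print(spe(anomaly) > threshold)               # True: flagged as anomalous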
    Deep Unfolding-based Weighted Averaging for Federated Learning under Heterogeneous Environments. (arXiv:2212.12191v1 [cs.LG])
    Federated learning is a collaborative model training method that iterates model updates at multiple clients and aggregates the updates at a central server. Device and statistical heterogeneity of the participating clients causes performance degradation, so an appropriate weight should be assigned to each client in the server's aggregation phase. This paper employs deep unfolding to learn weights that adapt to the heterogeneity, which yields a model with high accuracy on uniform test data. The results of numerical experiments indicate the high performance of the proposed method and the interpretable behavior of the learned weights.  ( 2 min )
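    The aggregation step itself is a weighted average of client parameters; in the paper the weights are learned by deep unfolding, whereas the sketch below fixes them by hand purely to show the mechanism:

        import numpy as np

        def weighted_aggregate(client_params, weights):
            # client_params: list of flattened model parameter vectors
            w = np.asarray(weights, dtype=float)
            w = w / w.sum()                           # normalize to a convex combination
            return sum(wi * p for wi, p in zip(w, client_params))

        clients = [np.array([1.0, 2.0]), np.array([1.2, 1.8]), np.array([5.0, 5.0])]
        # the third client is a statistical outlier; a learned scheme would
        # down-weight it, emulated here with hand-picked weights
        print(weighted_aggregate(clients, [0.45, 0.45, 0.10]))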
    Offline Reinforcement Learning for Human-Guided Human-Machine Interaction with Private Information. (arXiv:2212.12167v1 [stat.ML])
    Motivated by human-machine interactions such as training chatbots for improving customer satisfaction, we study human-guided human-machine interaction involving private information. We model this interaction as a two-player turn-based game, where one player (Alice, a human) guides the other player (Bob, a machine) towards a common goal. Specifically, we focus on offline reinforcement learning (RL) in this game, where the goal is to find a policy pair for Alice and Bob that maximizes their expected total rewards based on an offline dataset collected a priori. The offline setting presents two challenges: (i) We cannot collect Bob's private information, leading to a confounding bias when using standard RL methods, and (ii) a distributional mismatch between the behavior policy used to collect data and the desired policy we aim to learn. To tackle the confounding bias, we treat Bob's previous action as an instrumental variable for Alice's current decision making so as to adjust for the unmeasured confounding. We develop a novel identification result and use it to propose a new off-policy evaluation (OPE) method for evaluating policy pairs in this two-player turn-based game. To tackle the distributional mismatch, we leverage the idea of pessimism and use our OPE method to develop an off-policy learning algorithm for finding a desirable policy pair for both Alice and Bob. Finally, we prove that under mild assumptions such as partial coverage of the offline data, the policy pair obtained through our method converges to the optimal one at a satisfactory rate.  ( 2 min )
    Anomaly Detection using Ensemble Classification and Evidence Theory. (arXiv:2212.12092v1 [cs.LG])
    Multi-class ensemble classification remains a popular focus of investigation within the research community. The popularization of cloud services has sped up its adoption due to the ease of deploying large-scale machine-learning models. It has also drawn the attention of the industrial sector because of its ability to identify common problems in production. However, there are challenges in constructing an ensemble classifier, namely the proper selection and effective training of the pool of classifiers, the definition of a suitable architecture for multi-class classification, and uncertainty quantification of the ensemble classifier. The robustness and effectiveness of the ensemble classifier lie in the selection of the pool of classifiers, as well as in the learning process. Hence, the selection and the training procedure of the pool of classifiers play a crucial role. An (ensemble) classifier learns to detect the classes that were used during supervised training. However, when presented with data from unknown conditions, the trained classifier will still predict one of the classes learned during training. To this end, the uncertainty of the individual classifiers and of the ensemble can be used to assess the learning capability. We present a novel approach for novelty detection using ensemble classification and evidence theory. A pool selection strategy is presented to build a solid ensemble classifier. We present an architecture for multi-class ensemble classification and an approach to quantify the uncertainty of the individual classifiers and the ensemble classifier. We use this uncertainty for anomaly detection. Finally, we use the Tennessee Eastman benchmark to perform experiments testing the ensemble classifier's prediction and anomaly detection capabilities.  ( 2 min )
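    The paper quantifies uncertainty with evidence (Dempster-Shafer) theory; as a simpler stand-in for the same idea, the sketch below decomposes ensemble predictive entropy and uses the disagreement (epistemic) part to flag inputs from conditions the pool never learned:

        import numpy as np

        def ensemble_uncertainty(prob_list):
            probs = np.stack(prob_list)                               # (n_models, n_classes)
            mean_p = probs.mean(axis=0)
            total = -(mean_p * np.log(mean_p + 1e-12)).sum()          # predictive entropy
            aleatoric = -(probs * np.log(probs + 1e-12)).sum(axis=1).mean()
            return total, total - aleatoric                           # (total, epistemic part)

        agree = [np.array([0.9, 0.05, 0.05])] * 3
        disagree = [np.array([0.9, 0.05, 0.05]),
                    np.array([0.05, 0.9, 0.05]),
                    np.array([0.05, 0.05, 0.9])]
        print(ensemble_uncertainty(agree))        # low epistemic part: known condition
        print(ensemble_uncertainty(disagree))     # high epistemic part: flag as anomalous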
    Predicting Survival of Tongue Cancer Patients by Machine Learning Models. (arXiv:2212.12114v1 [q-bio.QM])
    Tongue cancer is a common oral cavity malignancy that originates in the mouth and throat. Much effort has been invested in improving its diagnosis, treatment, and management. Surgical removal, chemotherapy, and radiation therapy remain the major treatments for tongue cancer. The survival of patients determines the treatment effect. Previous studies have identified certain survival and risk factors based on descriptive statistics, ignoring the complex, nonlinear relationships among clinical and demographic variables. In this study, we utilize five cutting-edge machine learning models and clinical data to predict the survival of tongue cancer patients after treatment. Five-fold cross-validation, bootstrap analysis, and permutation feature importance are applied to estimate and interpret model performance. The prognostic factors identified by our method are consistent with previous clinical studies. Our method is accurate and interpretable, and thus usable as additional evidence in tongue cancer treatment and management.  ( 2 min )
    A Topic Modeling Approach to Classifying Open Street Map Health Clinics and Schools in Sub-Saharan Africa. (arXiv:2212.12084v1 [cs.LG])
    Data deprivation, or the lack of easily available and actionable information on the well-being of individuals, is a significant challenge for the developing world and an impediment to the design and operationalization of policies intended to alleviate poverty. In this paper we explore the suitability of data derived from OpenStreetMap to proxy for the location of two crucial public services: schools and health clinics. Thanks to the efforts of thousands of digital humanitarians, online mapping repositories such as OpenStreetMap contain millions of records on buildings and other structures, delineating both their location and often their use. Unfortunately, much of this data is locked in complex, unstructured text, rendering it seemingly unsuitable for classifying schools or clinics. We apply a scalable, unsupervised learning method to unlabeled OpenStreetMap building data to extract the locations of schools and health clinics in ten countries in Africa. We find that the topic modeling approach greatly improves performance versus reliance on structured keys alone. We validate our results by comparing schools and clinics identified by our OSM method against those identified by the WHO, and describe OSM coverage gaps more broadly.  ( 2 min )
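    A minimal sketch of the general approach -- unsupervised topic modeling over unstructured building-tag text -- using scikit-learn's LDA; the tag strings below are invented stand-ins, and the paper's exact preprocessing and model may differ:

        from sklearn.decomposition import LatentDirichletAllocation
        from sklearn.feature_extraction.text import CountVectorizer

        docs = [  # invented stand-ins for unstructured OSM building records
            "building=school amenity=school name=st mary primary school",
            "building=yes amenity=clinic healthcare=clinic name=village health post",
            "building=school education name=community secondary school",
            "building=hospital healthcare name=district clinic",
        ]
        X = CountVectorizer(token_pattern=r"[a-z_]+").fit_transform(docs)
        lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
        print(lda.transform(X).argmax(axis=1))    # school-like vs clinic-like topic per record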
    Eigenvalue initialisation and regularisation for Koopman autoencoders. (arXiv:2212.12086v1 [cs.LG])
    Regularising the parameter matrices of neural networks is ubiquitous in training deep models. Typical regularisation approaches suggest initialising weights using small random values and penalising weights to promote sparsity. However, these widely used techniques may be less effective in certain scenarios. Here, we study the Koopman autoencoder model, which includes an encoder, a Koopman operator layer, and a decoder. These models have been designed and dedicated to tackle physics-related problems with interpretable dynamics and an ability to incorporate physics-related constraints. However, the majority of existing work employs standard regularisation practices. In our work, we take a step toward augmenting Koopman autoencoders with initialisation and penalty schemes tailored for physics-related settings. Specifically, we propose the "eigeninit" initialisation scheme that samples initial Koopman operators from specific eigenvalue distributions. In addition, we suggest the "eigenloss" penalty scheme that penalises the eigenvalues of the Koopman operator during training. We demonstrate the utility of these schemes on two synthetic data sets: a driven pendulum and flow past a cylinder; and two real-world problems: ocean surface temperatures and cyclone wind fields. We find on these datasets that eigenloss and eigeninit improve the convergence rate by up to a factor of 5, and that they reduce the cumulative long-term prediction error by up to a factor of 3. Such a finding points to the utility of incorporating similar schemes as an inductive bias in other physics-related deep learning approaches.  ( 2 min )
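    The abstract does not spell out the eigenvalue distributions, so the numpy sketch below only illustrates the general recipe, under the assumption that eigenvalues are sampled as conjugate pairs with moduli near the unit circle (slowly decaying dynamics) and realified as 2x2 rotation-scaling blocks:

        import numpy as np

        def eigeninit(dim, rng, r_low=0.9, r_high=1.0):
            # assumed distribution: moduli ~ U[r_low, r_high], angles ~ U[0, pi]
            K = np.zeros((dim, dim))
            for i in range(dim // 2):
                r, th = rng.uniform(r_low, r_high), rng.uniform(0, np.pi)
                K[2*i:2*i+2, 2*i:2*i+2] = r * np.array([[np.cos(th), -np.sin(th)],
                                                        [np.sin(th),  np.cos(th)]])
            Q, _ = np.linalg.qr(rng.normal(size=(dim, dim)))   # random orthogonal basis
            return Q @ K @ Q.T                                 # same spectrum, dense operator

        K0 = eigeninit(8, np.random.default_rng(0))
        print(np.abs(np.linalg.eigvals(K0)))       # all moduli fall in [0.9, 1.0]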
    RouteNet-Fermi: Network Modeling with Graph Neural Networks. (arXiv:2212.12070v1 [cs.NI])
    Network models are an essential building block of modern networks. For example, they are widely used in network planning and optimization. However, as networks increase in scale and complexity, some models present limitations, such as the assumption of Markovian traffic in queuing theory models, or the high computational cost of network simulators. Recent advances in machine learning, such as Graph Neural Networks (GNN), are enabling a new generation of network models that are data-driven and can learn complex non-linear behaviors. In this paper, we present RouteNet-Fermi, a custom GNN model that shares the same goals as queuing theory while being considerably more accurate in the presence of realistic traffic models. The proposed model accurately predicts the delay, jitter, and loss in networks. We have tested RouteNet-Fermi in networks of increasing size (up to 300 nodes), including samples with mixed traffic profiles -- e.g., with complex non-Markovian models -- and arbitrary routing and queue scheduling configurations. Our experimental results show that RouteNet-Fermi achieves similar accuracy as computationally-expensive packet-level simulators and is able to scale accurately to large networks. For example, the model produces delay estimates with a mean relative error of 6.24% when applied to a test dataset with 1,000 samples, including network topologies one order of magnitude larger than those seen during training.  ( 2 min )
    Autothrottle: A Practical Framework for Harvesting CPUs from SLO-Targeted Microservices. (arXiv:2212.12180v1 [cs.DC])
    As the number of distributed services (or microservices) of cloud-native applications grows, resource management becomes a challenging task. These applications tend to be user-facing and latency-sensitive, and our goal is to continuously minimize the amount of CPU resources allocated while still satisfying the application latency SLO. Although previous efforts have proposed simple heuristics and sophisticated ML-based techniques, we believe that a practical resource manager should accurately scale CPU resources for diverse applications, with minimal human effort and operational overhead. To this end, we ask: can we systematically break resource management down into subproblems solvable by practical policies? Based on the notion of a CPU-throttle-based performance target, we decouple the mechanisms of SLO feedback and resource control, and implement a two-level framework -- Autothrottle. It combines a lightweight learned controller at the global level and agile per-microservice controllers at the local level. We evaluate Autothrottle on three microservice applications, with both short-term and 21-day production workload traces. Empirical results show Autothrottle's superior CPU core savings of up to 26.21% over the best-performing baselines across applications, while maintaining the latency SLO.  ( 2 min )
    Semantically-consistent Landsat 8 image to Sentinel-2 image translation for alpine areas. (arXiv:2212.12056v1 [cs.CV])
    Frequent and cost-free satellite images are in growing demand in the research world. Satellite constellations such as Landsat 8 and Sentinel-2 provide a massive amount of valuable data daily. However, the discrepancy between the sensors' characteristics of these satellites makes it impractical to apply a segmentation model trained on one dataset to the other, which is why domain adaptation techniques have recently become an active research area in remote sensing. In this paper, an experiment in domain adaptation through style transfer is conducted using the HRSemI2I model to narrow the sensor discrepancy between Landsat 8 and Sentinel-2. This paper's main contribution is an analysis of the expediency of that approach, comparing segmentation results on domain-adapted images with those on images without adaptation. The HRSemI2I model, adjusted to work with 6-band imagery, shows significant intersection-over-union performance improvement for both mean and per-class metrics. A second contribution is providing different generalization schemes between two label schemes -- NALCMS 2015 and CORINE: the first is standardization through higher-level land cover classes, and the second is harmonization through validation in the field.  ( 2 min )
    Graph Federated Learning with Hidden Representation Sharing. (arXiv:2212.12158v1 [cs.LG])
    Learning on Graphs (LoG) is widely used in multi-client systems when each client has insufficient local data and multiple clients have to share their raw data to learn a model of good quality. One scenario is recommending items to clients who have limited historical data but share similar preferences with other clients in a social network. On the other hand, due to the increasing demand for protecting clients' data privacy, Federated Learning (FL) has been widely adopted: FL requires models to be trained in a multi-client system while restricting the sharing of raw data among clients. The underlying data-sharing conflict between LoG and FL is under-explored, and how to benefit from both sides is a promising problem. In this work, we first formulate the Graph Federated Learning (GFL) problem that unifies LoG and FL in multi-client systems, and then propose sharing hidden representations instead of the raw data of neighbors to protect data privacy. To overcome the biased gradient problem in GFL, we provide a gradient estimation method and its convergence analysis under a non-convex objective. In experiments, we evaluate our method on classification tasks on graphs. Our experiments show a good match between theory and practice.  ( 2 min )
    Bengali Handwritten Digit Recognition using CNN with Explainable AI. (arXiv:2212.12146v1 [cs.CV])
    Handwritten character recognition is an active research topic. If we can convert a handwritten document into a text-searchable one using the Optical Character Recognition (OCR) technique, we can easily search and process its content without reading the handwriting. OCR for the English language is very common, but for the Bengali language it is hard to find a good-quality OCR application. Merging machine learning and deep learning with OCR could therefore be a significant contribution to this field. Various researchers have proposed a number of strategies for recognizing Bengali handwritten characters; many ML algorithms and deep neural networks were used in their work, but explanations of their models are not available. In our work, we use various machine learning algorithms and a CNN to recognize handwritten Bengali digits. Some ML models achieve acceptable accuracy, and the CNN gives high testing accuracy. Grad-CAM was applied as an XAI method to our CNN model, which gave us insights into the model and helped us locate the regions of interest for recognizing a digit in an image.  ( 2 min )
    The Onset of Variance-Limited Behavior for Networks in the Lazy and Rich Regimes. (arXiv:2212.12147v1 [stat.ML])
    For small training set sizes $P$, the generalization error of wide neural networks is well-approximated by the error of an infinite width neural network (NN), either in the kernel or mean-field/feature-learning regime. However, after a critical sample size $P^*$, we empirically find the finite-width network generalization becomes worse than that of the infinite width network. In this work, we empirically study the transition from infinite-width behavior to this variance limited regime as a function of sample size $P$ and network width $N$. We find that finite-size effects can become relevant for very small dataset sizes on the order of $P^* \sim \sqrt{N}$ for polynomial regression with ReLU networks. We discuss the source of these effects using an argument based on the variance of the NN's final neural tangent kernel (NTK). This transition can be pushed to larger $P$ by enhancing feature learning or by ensemble averaging the networks. We find that the learning curve for regression with the final NTK is an accurate approximation of the NN learning curve. Using this, we provide a toy model which also exhibits $P^* \sim \sqrt{N}$ scaling and has $P$-dependent benefits from feature learning.  ( 2 min )
    Benchmarking Machine Learning Models to Predict Corporate Bankruptcy. (arXiv:2212.12051v1 [q-fin.CP])
    Using a comprehensive sample of 2,585 bankruptcies from 1990 to 2019, we benchmark the performance of various machine learning models in predicting financial distress of publicly traded U.S. firms. We find that gradient boosted trees outperform other models in one-year-ahead forecasts. Variable permutation tests show that excess stock returns, idiosyncratic risk, and relative size are the more important variables for predictions. Textual features derived from corporate filings do not improve performance materially. In a credit competition model that accounts for the asymmetric cost of default misclassification, the survival random forest is able to capture large dollar profits.  ( 2 min )
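    A schematic scikit-learn version of such a benchmark on synthetic, heavily imbalanced data (the paper's bankruptcy sample and feature set are not reproduced here), including a permutation-importance check in the spirit of the variable permutation tests:

        from sklearn.datasets import make_classification
        from sklearn.ensemble import GradientBoostingClassifier
        from sklearn.inspection import permutation_importance
        from sklearn.metrics import roc_auc_score
        from sklearn.model_selection import train_test_split

        X, y = make_classification(n_samples=3000, n_features=12, weights=[0.97],
                                   random_state=0)   # ~3% positives, like rare defaults
        Xtr, Xte, ytr, yte = train_test_split(X, y, stratify=y, random_state=0)
        gbt = GradientBoostingClassifier(random_state=0).fit(Xtr, ytr)
        print(roc_auc_score(yte, gbt.predict_proba(Xte)[:, 1]))          # out-of-sample AUC
        imp = permutation_importance(gbt, Xte, yte, n_repeats=10, random_state=0)
        print(imp.importances_mean.argsort()[::-1][:3])                  # top-3 features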
    Deep Learning of Semi-Competing Risk Data via a New Neural Expectation-Maximization Algorithm. (arXiv:2212.12028v1 [stat.ML])
    Prognostication for lung cancer, a leading cause of mortality, remains a complex task, as it needs to quantify the associations of risk factors and health events spanning a patient's entire life. One challenge is that an individual's disease course involves non-terminal (e.g., disease progression) and terminal (e.g., death) events, which form semi-competing relationships. Our motivation comes from the Boston Lung Cancer Study, a large lung cancer survival cohort, which investigates how risk factors influence a patient's disease trajectory. Following developments in the prediction of time-to-event outcomes with neural networks, deep learning has become a focal area for the development of risk prediction methods in survival analysis. However, limited work has been done to predict multi-state or semi-competing risk outcomes, where a patient may experience adverse events such as disease progression prior to death. We propose a novel neural expectation-maximization algorithm to bridge the gap between classical statistical approaches and machine learning. Our algorithm enables estimation of the non-parametric baseline hazards of each state transition, risk functions of predictors, and the degree of dependence among different transitions, via a multi-task deep neural network with transition-specific sub-architectures. We apply our method to the Boston Lung Cancer Study and investigate the impact of clinical and genetic predictors on disease progression and mortality.  ( 2 min )
    Langevin algorithms for Markovian Neural Networks and Deep Stochastic control. (arXiv:2212.12018v1 [q-fin.CP])
    Stochastic Gradient Descent Langevin Dynamics (SGLD) algorithms, which add noise to the classic gradient descent, are known to improve the training of neural networks in some cases where the neural network is very deep. In this paper we study the possibilities of training acceleration for the numerical resolution of stochastic control problems through gradient descent, where the control is parametrized by a neural network. If the control is applied at many discretization times then solving the stochastic control problem reduces to minimizing the loss of a very deep neural network. We numerically show that Langevin algorithms improve the training on various stochastic control problems like hedging and resource management, and for different choices of gradient descent methods.  ( 2 min )
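    The core SGLD modification is a single line: each gradient step adds Gaussian noise with variance proportional to the step size. A toy numpy sketch on a quadratic loss (the paper applies this to deep stochastic-control networks, not to this toy problem):

        import numpy as np

        def sgld_step(theta, grad, lr, rng):
            # gradient descent plus exploration noise scaled as sqrt(2 * lr)
            return theta - lr * grad + np.sqrt(2 * lr) * rng.normal(size=theta.shape)

        rng = np.random.default_rng(0)
        theta = np.array([4.0])
        for t in range(1, 2001):
            lr = 0.1 / t**0.6                              # decaying step size
            theta = sgld_step(theta, 2 * theta, lr, rng)   # loss ||theta||^2
        print(theta)                                       # noisy iterates concentrate near 0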
    Deep learning for size-agnostic inverse design of random-network 3D printed mechanical metamaterials. (arXiv:2212.12047v1 [physics.app-ph])
    Practical applications of mechanical metamaterials often involve solving inverse problems where the objective is to find the (multiple) microarchitectures that give rise to a given set of properties. The limited resolution of additive manufacturing techniques often requires solving such inverse problems for specific sizes. One should, therefore, find multiple microarchitectural designs that exhibit the desired properties for a specimen with given dimensions. Moreover, the candidate microarchitectures should be resistant to fatigue and fracture, meaning that peak stresses should be minimized as well. Such a multi-objective inverse design problem is formidably difficult to solve but its solution is the key to real-world applications of mechanical metamaterials. Here, we propose a modular approach titled 'Deep-DRAM' that combines four decoupled models, including two deep learning models (DLM), a deep generative model (DGM) based on conditional variational autoencoders (CVAE), and direct finite element (FE) simulations. Deep-DRAM (deep learning for the design of random-network metamaterials) integrates these models into a unified framework capable of finding many solutions to the multi-objective inverse design problem posed here. The integrated framework first introduces the desired elastic properties to the DGM, which returns a set of candidate designs. The candidate designs, together with the target specimen dimensions are then passed to the DLM which predicts their actual elastic properties considering the specimen size. After a filtering step based on the closeness of the actual properties to the desired ones, the last step uses direct FE simulations to identify the designs with the minimum peak stresses.  ( 2 min )
    Enhancing the prediction of disease outcomes using electronic health records and pretrained deep learning models. (arXiv:2212.12067v1 [cs.AI])
    Question: Can an encoder-decoder architecture pretrained on a large dataset of longitudinal electronic health records improve patient outcome predictions? Findings: In this prognostic study of 6.8 million patients, our denoising sequence-to-sequence prediction model of multiple outcomes outperformed state-of-the-art models such as pretrained BERT on a broad range of patient outcomes, including intentional self-harm and pancreatic cancer. Meaning: Deep bidirectional and autoregressive representations improve patient outcome prediction.  ( 2 min )
    Graph Learning with Localized Neighborhood Fairness. (arXiv:2212.12040v1 [cs.SI])
    Learning fair graph representations for downstream applications is becoming increasingly important, but existing work has mostly focused on improving fairness at the global level by modifying either the graph structure or the objective function, without taking into account the local neighborhood of a node. In this work, we formally introduce the notion of neighborhood fairness and develop a computational framework for learning such locally fair embeddings. We argue that the notion of neighborhood fairness is more appropriate since GNN-based models operate at the local neighborhood level of a node. Our neighborhood fairness framework has two main components that are flexible for learning fair graph representations from arbitrary data: the first aims to construct fair neighborhoods for any arbitrary node in a graph, and the second enables adaptation of these fair neighborhoods to better capture certain application- or data-dependent constraints, such as allowing neighborhoods to be more biased towards certain attributes or neighbors in the graph. Furthermore, while link prediction has been extensively studied, we are the first to investigate the graph representation learning task of fair link classification. We demonstrate the effectiveness of the proposed neighborhood fairness framework for a variety of graph machine learning tasks including fair link prediction, link classification, and learning fair graph embeddings. Notably, our approach achieves not only better fairness but also higher accuracy in the majority of cases across a wide variety of graphs, problem settings, and metrics.  ( 2 min )
    When are Lemons Purple? The Concept Association Bias of CLIP. (arXiv:2212.12043v1 [cs.CV])
    Large-scale vision-language models such as CLIP have shown impressive performance on zero-shot image classification and image-to-text retrieval. However, such zero-shot performance of CLIP-based models does not realize in tasks that require a finer-grained correspondence between vision and language, such as Visual Question Answering (VQA). We investigate why this is the case, and report an interesting phenomenon of CLIP, which we call the Concept Association Bias (CAB), as a potential cause of the difficulty of applying CLIP to VQA and similar tasks. CAB is especially apparent when two concepts are present in the given image while a text prompt only contains a single concept. In such a case, we find that CLIP tends to treat input as a bag of concepts and attempts to fill in the other missing concept crossmodally, leading to an unexpected zero-shot prediction. For example, when asked for the color of a lemon in an image, CLIP predicts ``purple'' if the image contains a lemon and an eggplant. We demonstrate the Concept Association Bias of CLIP by showing that CLIP's zero-shot classification performance greatly suffers when there is a strong concept association between an object (e.g. lemon) and an attribute (e.g. its color). On the other hand, when the association between object and attribute is weak, we do not see this phenomenon. Furthermore, we show that CAB is significantly mitigated when we enable CLIP to learn deeper structure across image and text embeddings by adding an additional Transformer on top of CLIP and fine-tuning it on VQA. We find that across such fine-tuned variants of CLIP, the strength of CAB in a model predicts how well it performs on VQA.  ( 2 min )
    A comprehensive analysis of the Elo rating algorithm: Stochastic model, convergence characteristics, design guidelines, and experimental results. (arXiv:2212.12015v1 [cs.LG])
    The Elo algorithm, due to its simplicity, is widely used for rating in sports competitions as well as in other applications where the rating/ranking is a useful tool for predicting future results. However, despite its widespread use, a detailed understanding of the convergence properties of the Elo algorithm is still lacking. Aiming to fill this gap, this paper presents a comprehensive (stochastic) analysis of the Elo algorithm, considering round-robin (one-on-one) competitions. Specifically, analytical expressions are derived characterizing the behavior/evolution of the skills and of important performance metrics. Then, taking into account the relationship between the behavior of the algorithm and the step-size value, which is a hyperparameter that can be controlled, some design guidelines as well as discussions about the performance of the algorithm are provided. To illustrate the applicability of the theoretical findings, experimental results are shown, corroborating the very good match between analytical predictions and those obtained from the algorithm using real-world data (from the Italian SuperLega, Volleyball League).  ( 2 min )
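    For reference, the round-robin Elo update the analysis concerns takes the standard form below, where the step size k is the hyperparameter whose choice the paper's design guidelines address:

        def elo_update(r_a, r_b, score_a, k=20.0, s=400.0):
            # expected score of A under the logistic model, then a step toward the outcome
            e_a = 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / s))
            return r_a + k * (score_a - e_a), r_b - k * (score_a - e_a)

        print(elo_update(1500.0, 1600.0, 1.0))   # the underdog wins and gains > k/2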
    ML-powered KQI estimation for XR services. A case study on 360-Video. (arXiv:2212.12002v1 [cs.NI])
    The rise of cutting-edge technologies and services such as XR promises to change how day-to-day things are done. At the same time, the appearance of modern, decentralized architectural approaches has given birth to a new generation of mobile networks such as 5G, as well as outlining the roadmap for B5G and beyond. These networks are expected to be the enablers for bringing the Metaverse and other futuristic approaches to life. In this sense, this work presents an ML-based (Machine Learning) framework that allows the estimation of service Key Quality Indicators (KQIs). For this, only information available to operators is required, such as statistics and configuration parameters from these networks. This strategy avoids intrusion into user data and guarantees privacy. To test this proposal, 360-Video has been selected as a Virtual Reality (VR) use case, for which specific KQIs are estimated, such as video resolution, frame rate, initial startup time, throughput, and latency, among others. To select the best model for each KQI, a grid search with a cross-validation strategy has been used to determine the best hyperparameter tuning. To boost the creation of each KQI model, feature engineering techniques together with cross-validation strategies have been used. Performance is assessed using the MAE (Mean Absolute Error) and the prediction time. The outcomes point out that KNR (K-Nearest Neighbors Regression) and RF (Random Forest) are the best algorithms in combination with feature selection techniques. Likewise, this work will serve as a baseline for E2E-Quality-of-Experience-based network management working in conjunction with network slicing, virtualization, and MEC, among other enabling technologies.  ( 2 min )
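    A schematic version of the model-selection loop described above -- grid search with cross-validation over a KQI regressor -- using scikit-learn on synthetic stand-in features (the real inputs would be the operator-side network statistics):

        from sklearn.datasets import make_regression
        from sklearn.ensemble import RandomForestRegressor
        from sklearn.model_selection import GridSearchCV

        # synthetic stand-in: network statistics -> one KQI (e.g., initial startup time)
        X, y = make_regression(n_samples=500, n_features=8, noise=5.0, random_state=0)
        grid = GridSearchCV(RandomForestRegressor(random_state=0),
                            param_grid={"n_estimators": [100, 300],
                                        "max_depth": [None, 10]},
                            scoring="neg_mean_absolute_error", cv=5)
        grid.fit(X, y)
        print(grid.best_params_, -grid.best_score_)   # best config and its MAE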
  • Open

    Target Conditioned Representation Independence (TCRI); From Domain-Invariant to Domain-General Representations. (arXiv:2212.11342v1 [cs.LG] CROSS LISTED)
    We propose a Target Conditioned Representation Independence (TCRI) objective for domain generalization. TCRI addresses the limitations of existing domain generalization methods due to incomplete constraints. Specifically, TCRI implements regularizers motivated by conditional independence constraints that are sufficient to strictly learn complete sets of invariant mechanisms, which we show are necessary and sufficient for domain generalization. Empirically, we show that TCRI is effective on both synthetic and real-world data. TCRI is competitive with baselines in average accuracy while outperforming them in worst-domain accuracy, indicating desired cross-domain stability.  ( 2 min )
    Statistical Efficiency of Score Matching: The View from Isoperimetry. (arXiv:2210.00726v2 [cs.LG] UPDATED)
    Deep generative models parametrized up to a normalizing constant (e.g. energy-based models) are difficult to train by maximizing the likelihood of the data because the likelihood and/or gradients thereof cannot be explicitly or efficiently written down. Score matching is a training method whereby, instead of fitting the likelihood $\log p(x)$ of the training data, we fit the score function $\nabla_x \log p(x)$ -- obviating the need to evaluate the partition function. Though this estimator is known to be consistent, it is unclear whether (and when) its statistical efficiency is comparable to that of maximum likelihood -- which is known to be (asymptotically) optimal. We initiate this line of inquiry in this paper, and show a tight connection between the statistical efficiency of score matching and the isoperimetric properties of the distribution being estimated -- i.e. the Poincar\'e, log-Sobolev and isoperimetric constants -- quantities which govern the mixing time of Markov processes like Langevin dynamics. Roughly, we show that the score matching estimator is statistically comparable to maximum likelihood when the distribution has a small isoperimetric constant. Conversely, if the distribution has a large isoperimetric constant -- even for simple families of distributions like exponential families with rich enough sufficient statistics -- score matching will be substantially less efficient than maximum likelihood. We suitably formalize these results both in the finite sample regime and in the asymptotic regime. Finally, we identify a direct parallel in the discrete setting, where we connect the statistical properties of pseudolikelihood estimation with approximate tensorization of entropy and the Glauber dynamics.  ( 2 min )
    A Non-Asymptotic Analysis of Oversmoothing in Graph Neural Networks. (arXiv:2212.10701v1 [cs.LG] CROSS LISTED)
    A central challenge of building more powerful Graph Neural Networks (GNNs) is the oversmoothing phenomenon, where increasing the network depth leads to homogeneous node representations and thus worse classification performance. While previous works have only demonstrated that oversmoothing is inevitable when the number of graph convolutions tends to infinity, in this paper, we precisely characterize the mechanism behind the phenomenon via a non-asymptotic analysis. Specifically, we distinguish between two different effects when applying graph convolutions -- an undesirable mixing effect that homogenizes node representations in different classes, and a desirable denoising effect that homogenizes node representations in the same class. By quantifying these two effects on random graphs sampled from the Contextual Stochastic Block Model (CSBM), we show that oversmoothing happens once the mixing effect starts to dominate the denoising effect, and the number of layers required for this transition is $O(\log N/\log (\log N))$ for sufficiently dense graphs with $N$ nodes. We also extend our analysis to study the effects of Personalized PageRank (PPR) on oversmoothing. Our results suggest that while PPR mitigates oversmoothing at deeper layers, PPR-based architectures still achieve their best performance at a shallow depth and are outperformed by the graph convolution approach on certain graphs. Finally, we support our theoretical results with numerical experiments, which further suggest that the oversmoothing phenomenon observed in practice may be exacerbated by the difficulty of optimizing deep GNN models.  ( 2 min )
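    The mixing effect is easy to reproduce numerically: repeatedly applying a normalized graph convolution drives node features toward a common, degree-scaled vector. A small numpy illustration on a random graph (an Erdős-Rényi stand-in rather than the paper's CSBM):

        import numpy as np

        rng = np.random.default_rng(0)
        n = 50
        A = (rng.random((n, n)) < 0.1).astype(float)
        A = np.triu(A, 1); A = A + A.T + np.eye(n)              # symmetric adjacency, self-loops
        d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
        A_hat = A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]   # D^{-1/2} A D^{-1/2}
        X = rng.normal(size=(n, 4))                             # node features
        for depth in (1, 4, 16, 64):
            H = np.linalg.matrix_power(A_hat, depth) @ X
            print(depth, H.std(axis=0).mean())                  # cross-node spread shrinks with depth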
    Networked Federated Learning. (arXiv:2105.12769v3 [cs.LG] UPDATED)
    We develop the theory and algorithmic toolbox for networked federated learning in decentralized collections of local datasets with an intrinsic network structure. This network structure arises from domain-specific notions of similarity between local datasets. Different notions of similarity are induced by spatio-temporal proximity, statistical dependencies or functional relations. Our main conceptual contribution is to formulate networked federated learning using a generalized total variation minimization. This formulation unifies and considerably extends existing federated multi-task learning methods. It is highly flexible and can be combined with a broad range of parametric models including Lasso or deep neural networks. Our main algorithmic contribution is a novel networked federated learning algorithm which is well-suited for distributed computing environments such as edge computing over wireless networks. This algorithm is robust against inexact computations due to limited computational resources. For local models resulting in convex problems, we derive precise conditions on the local models and their network structure such that our algorithm learns nearly optimal local models. Our analysis reveals an interesting interplay between the convex geometry of local models and the (cluster-) geometry of their network structure.  ( 2 min )
    Proximal Learning for Individualized Treatment Regimes Under Unmeasured Confounding. (arXiv:2105.01187v4 [stat.ME] UPDATED)
    Data-driven individualized decision making has recently received increasing research interest. Most existing methods rely on the assumption of no unmeasured confounding, which unfortunately cannot be ensured in practice, especially in observational studies. Motivated by the recently proposed proximal causal inference framework, we develop several proximal learning approaches to estimating optimal individualized treatment regimes (ITRs) in the presence of unmeasured confounding. In particular, we establish several identification results for different classes of ITRs, exhibiting the trade-off between the risk of making untestable assumptions and the potential improvement of the value function in decision making. Based on these results, we propose several classification-based approaches to finding a variety of restricted in-class optimal ITRs and develop their theoretical properties. The appealing numerical performance of our proposed methods is demonstrated via an extensive simulation study and one real data application.  ( 2 min )
    Statistical Distance Based Deterministic Offspring Selection in SMC Methods. (arXiv:2212.12290v1 [stat.ML])
    Over the years, sequential Monte Carlo (SMC) and, equivalently, particle filter (PF) theory has gained substantial attention from researchers. However, the performance of the resampling methodology, also known as offspring selection, has not advanced recently. We propose two deterministic offspring selection methods, which strive to minimize the Kullback-Leibler (KL) divergence and the total variation (TV) distance, respectively, between the particle distributions prior and subsequent to the offspring selection. By reducing the statistical distance between the selected offspring and the joint distribution, we obtain a heuristic search procedure that outperforms a maximum likelihood search in precisely those contexts where the latter performs better than an SMC. For SMC and particle Markov chain Monte Carlo (pMCMC), our proposed offspring selection methods always outperform or compare favorably with the two state-of-the-art resampling schemes on two models commonly used as benchmarks in the literature.  ( 2 min )
    A Family of Pairwise Multi-Marginal Optimal Transports that Define a Generalized Metric. (arXiv:2001.11114v6 [cs.LG] UPDATED)
    The optimal transport (OT) problem is rapidly finding its way into machine learning. Favoring its use are its metric properties. Many problems admit solutions with guarantees only for objects embedded in metric spaces, and the use of non-metrics can complicate solving them. Multi-marginal OT (MMOT) generalizes OT to simultaneously transporting multiple distributions. It captures important relations that are missed if the transport only involves two distributions. Research on MMOT, however, has been focused on its existence, uniqueness, practical algorithms, and the choice of cost functions. There is a lack of discussion of the metric properties of MMOT, which limits its theoretical and practical use. Here, we prove new generalized metric properties for a family of pairwise MMOTs. We first explain the difficulty of proving this via two negative results. Afterward, we prove the MMOTs' metric properties. Finally, we show that the generalized triangle inequality of this family of MMOTs cannot be improved. We illustrate the superiority of our MMOTs over other generalized metrics, and over non-metrics, in both synthetic and real tasks.  ( 2 min )
    Introduction to Machine Learning for Physicians: A Survival Guide for Data Deluge. (arXiv:2212.12303v1 [cs.LG])
    Many modern research fields increasingly rely on collecting and analysing massive, often unstructured, and unwieldy datasets. Consequently, there is growing interest in machine learning and artificial intelligence applications that can harness this `data deluge'. This broad nontechnical overview provides a gentle introduction to machine learning with a specific focus on medical and biological applications. We explain the common types of machine learning algorithms and typical tasks that can be solved, illustrating the basics with concrete examples from healthcare. Lastly, we provide an outlook on open challenges, limitations, and potential impacts of machine-learning-powered medicine.  ( 2 min )
    Principled and Efficient Transfer Learning of Deep Models via Neural Collapse. (arXiv:2212.12206v1 [cs.LG])
    With the ever-growing model size and the limited availability of labeled training data, transfer learning has become an increasingly popular approach in many science and engineering domains. For classification problems, this work delves into the mystery of transfer learning through an intriguing phenomenon termed neural collapse (NC), where the last-layer features and classifiers of learned deep networks satisfy: (i) the within-class variability of the features collapses to zero, and (ii) the between-class feature means are maximally and equally separated. Through the lens of NC, our findings for transfer learning are the following: (i) when pre-training models, preventing intra-class variability collapse (to a certain extent) better preserves the intrinsic structures of the input data, leading to better model transferability; (ii) when fine-tuning models on downstream tasks, obtaining features with more NC on downstream data results in better test accuracy on the given task. The above results not only demystify many widely used heuristics in model pre-training (e.g., data augmentation, projection head, self-supervised learning), but also lead to a more efficient and principled fine-tuning method on downstream tasks, which we demonstrate through extensive experimental results.  ( 2 min )
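    A common numerical proxy for the within-class variability collapse (property (i) above) is the ratio of within-class to between-class scatter; the exact metric used in the paper may differ. A numpy sketch on synthetic features engineered to be nearly collapsed:

        import numpy as np

        def within_class_variability(features, labels):
            # trace of within-class scatter over trace of between-class scatter
            mu_g = features.mean(axis=0)
            sw = sb = 0.0
            for c in np.unique(labels):
                fc = features[labels == c]
                mu_c = fc.mean(axis=0)
                sw += ((fc - mu_c) ** 2).sum()
                sb += len(fc) * ((mu_c - mu_g) ** 2).sum()
            return sw / sb

        rng = np.random.default_rng(0)
        labels = rng.integers(0, 5, size=1000)
        means = 3.0 * rng.normal(size=(5, 16))
        collapsed = means[labels] + 0.01 * rng.normal(size=(1000, 16))
        print(within_class_variability(collapsed, labels))   # near zero: strong collapse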
    A-NeSI: A Scalable Approximate Method for Probabilistic Neurosymbolic Inference. (arXiv:2212.12393v1 [cs.LG])
    We study the problem of combining neural networks with symbolic reasoning. Recently introduced frameworks for Probabilistic Neurosymbolic Learning (PNL), such as DeepProbLog, perform exponential-time exact inference, limiting the scalability of PNL solutions. We introduce Approximate Neurosymbolic Inference (A-NeSI): a new framework for PNL that uses neural networks for scalable approximate inference. A-NeSI 1) performs approximate inference in polynomial time without changing the semantics of probabilistic logics; 2) is trained using data generated by the background knowledge; 3) can generate symbolic explanations of predictions; and 4) can guarantee the satisfaction of logical constraints at test time, which is vital in safety-critical applications. Our experiments show that A-NeSI is the first end-to-end method to scale the Multi-digit MNISTAdd benchmark to sums of 15 MNIST digits, up from 4 in competing systems. Finally, our experiments show that A-NeSI achieves explainability and safety without a penalty in performance.  ( 2 min )
    Physics-Informed Gaussian Process Regression Generalizes Linear PDE Solvers. (arXiv:2212.12474v1 [cs.LG])
    Linear partial differential equations (PDEs) are an important, widely applied class of mechanistic models, describing physical processes such as heat transfer, electromagnetism, and wave propagation. In practice, specialized numerical methods based on discretization are used to solve PDEs. They generally use an estimate of the unknown model parameters and, if available, physical measurements for initialization. Such solvers are often embedded into larger scientific models or analyses with a downstream application such that error quantification plays a key role. However, by entirely ignoring parameter and measurement uncertainty, classical PDE solvers may fail to produce consistent estimates of their inherent approximation error. In this work, we approach this problem in a principled fashion by interpreting solving linear PDEs as physics-informed Gaussian process (GP) regression. Our framework is based on a key generalization of a widely-applied theorem for conditioning GPs on a finite number of direct observations to observations made via an arbitrary bounded linear operator. Crucially, this probabilistic viewpoint allows to (1) quantify the inherent discretization error; (2) propagate uncertainty about the model parameters to the solution; and (3) condition on noisy measurements. Demonstrating the strength of this formulation, we prove that it strictly generalizes methods of weighted residuals, a central class of PDE solvers including collocation, finite volume, pseudospectral, and (generalized) Galerkin methods such as finite element and spectral methods. This class can thus be directly equipped with a structured error estimate and the capability to incorporate uncertain model parameters and observations. In summary, our results enable the seamless integration of mechanistic models as modular building blocks into probabilistic models.  ( 2 min )
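    The special case of the framework -- conditioning a GP prior on finitely many noisy point evaluations -- already shows the mechanics that the paper generalizes to arbitrary bounded linear-operator observations. A numpy sketch with an RBF kernel:

        import numpy as np

        def rbf(a, b, ls=0.5):
            return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls**2)

        x_obs = np.array([-1.0, 0.0, 1.0])
        y_obs = np.sin(np.pi * x_obs) + 0.05 * np.random.default_rng(0).normal(size=3)
        x_new = np.linspace(-2.0, 2.0, 5)
        K = rbf(x_obs, x_obs) + 0.05**2 * np.eye(3)          # observation noise on the diagonal
        k_star = rbf(x_new, x_obs)
        mean = k_star @ np.linalg.solve(K, y_obs)            # posterior mean
        cov = rbf(x_new, x_new) - k_star @ np.linalg.solve(K, k_star.T)
        print(mean)
        print(np.sqrt(np.clip(np.diag(cov), 0.0, None)))     # posterior error bars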
    A data-driven interpretation of the stability of molecular crystals. (arXiv:2209.10709v2 [physics.chem-ph] UPDATED)
    Due to the subtle balance of intermolecular interactions that govern structure-property relations, predicting the stability of crystal structures formed from molecular building blocks is a highly non-trivial scientific problem. A particularly active and fruitful approach involves classifying the different combinations of interacting chemical moieties, as understanding the relative energetics of different interactions enables the design of molecular crystals and fine-tuning their stabilities. While this is usually performed based on the empirical observation of the most commonly encountered motifs in known crystal structures, we propose to apply a combination of supervised and unsupervised machine-learning techniques to automate the construction of an extensive library of molecular building blocks. We introduce a structural descriptor tailored to the prediction of the binding (lattice) energy, apply it to a curated dataset of organic crystals, and exploit its atom-centered nature to obtain a data-driven assessment of the contribution of different chemical groups to the lattice energy of the crystal. We then interpret this library using a low-dimensional representation of the structure-energy landscape and discuss selected examples of the insights into crystal engineering that can be extracted from this analysis, providing a complete database to guide the design of molecular materials.  ( 2 min )
    Langevin algorithms for Markovian Neural Networks and Deep Stochastic control. (arXiv:2212.12018v1 [q-fin.CP])
    Stochastic Gradient Langevin Dynamics (SGLD) algorithms, which add noise to the classic gradient descent, are known to improve the training of neural networks in some cases where the neural network is very deep. In this paper, we study the potential for accelerating training in the numerical resolution of stochastic control problems through gradient descent, where the control is parametrized by a neural network. If the control is applied at many discretization times, then solving the stochastic control problem reduces to minimizing the loss of a very deep neural network. We numerically show that Langevin algorithms improve training on various stochastic control problems such as hedging and resource management, and for different choices of gradient descent methods.  ( 2 min )
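    For reference, the core SGLD update is just a gradient step plus Gaussian noise whose scale is tied to the step size. A minimal sketch in PyTorch (the step-size and temperature values are illustrative, not the paper's settings):

        import torch

        def sgld_step(params, lr=1e-3, temperature=1e-4):
            # theta <- theta - lr * grad + sqrt(2 * lr * temperature) * xi
            with torch.no_grad():
                for p in params:
                    if p.grad is None:
                        continue
                    noise = torch.randn_like(p) * (2 * lr * temperature) ** 0.5
                    p.add_(-lr * p.grad + noise)

        # Usage inside a standard training loop:
        #   loss = criterion(model(x), y); model.zero_grad(); loss.backward()
        #   sgld_step(model.parameters())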
    Disentanglement and Generalization Under Correlation Shifts. (arXiv:2112.14754v2 [cs.LG] UPDATED)
    Correlations between factors of variation are prevalent in real-world data. Exploiting such correlations may increase predictive performance on noisy data; however, often correlations are not robust (e.g., they may change between domains, datasets, or applications) and models that exploit them do not generalize when correlations shift. Disentanglement methods aim to learn representations which capture different factors of variation in latent subspaces. A common approach involves minimizing the mutual information between latent subspaces, such that each encodes a single underlying attribute. However, this fails when attributes are correlated. We solve this problem by enforcing independence between subspaces conditioned on the available attributes, which allows us to remove only dependencies that are not due to the correlation structure present in the training data. We achieve this via an adversarial approach to minimize the conditional mutual information (CMI) between subspaces with respect to categorical variables. We first show theoretically that CMI minimization is a good objective for robust disentanglement on linear problems. We then apply our method on real-world datasets based on MNIST and CelebA, and show that it yields models that are disentangled and robust under correlation shift, including in weakly supervised settings.  ( 2 min )

  • Open

    ChatGPT Can Write Literature and Could Automate Most Writing Jobs
    When I first started playing around with ChatGPT, I wanted to know whether, with a bit of human direction and editing, it could write literature. This was my way of telling whether it was good enough to automate most commercial writing. Surprisingly, it works. It by no means writes high literature, but it's good enough for most commercial writing. If you want to check out my project, here's a link to a 3500 word mythological story about the thinking machine Talos, his creation of thinking machines like him, and his quest to overthrow the gods. It took slightly more than an hour to write, edit, and publish. Talos' War Against the Gods submitted by /u/Ancient_Spring2000 [link] [comments]  ( 51 min )
    Insane Inkpunk Diffusion - Deforum
    submitted by /u/oridnary_artist [link] [comments]  ( 48 min )
    Hi! I am collecting signatures asking Google to translate the book "Deep Learning" by Ian Goodfellow into Spanish. I am directing the petition at Google because it supported the creation of the book. If you sign, you help many people access more knowledge. Thanks a lot.
    Can you help me sign this petition? https://chng.it/Z6Nf64Q7vc Thanks a lot. submitted by /u/sergiCrack9 [link] [comments]  ( 49 min )
    I created an AI to replace Fox and CNN
    Hey everyone, I think we can all agree that the quality of the major news networks has really taken a nosedive in recent years, so I built an AI system to replace them. Specifically, I think today's mainstream media tends to suffer from two problems: political bias, and emotional manipulation to drive outrage and clicks. I'm building a system called ANN (Artificial News Network) to produce balanced, well-researched news stories 24/7. You can see my initial prototype here, which is focused on tech news: Twitter.com/FutureNewsAI It is currently capable of analyzing thousands of news stories per day, compiling balanced investigative reports using AI, automatically generating memes that summarize each article's content, as well as generating AI forecasts of future technologies. Over time I'm going to expand on this functionality significantly until it is the single most reliable source of news across a wide range of topics (business, politics, current events, law, etc.). What kind of stories do y'all want to see supported by my system? I'm really interested to hear your feedback. submitted by /u/redditguyjustinp [link] [comments]  ( 54 min )
    Elon Musk issues dire warning on AI: Nobody expected this rate of improvement
    submitted by /u/Microsis [link] [comments]  ( 48 min )
    I use AI to generate horror stories, I take requests as well, like this video. What do you think?
    submitted by /u/mGoldie_ [link] [comments]  ( 50 min )
    Video made entirely with the help of AI
    Hello guys, I created a video about dishes for New Year's Eve. It was made entirely with the help of AI: the preview, description, video, and speech were all done with AI services. If you're interested in checking out the video, you can find it here: https://youtu.be/lkm5TD6tV1g I hope you enjoy the recipes and have a happy and safe New Year's Eve celebration! submitted by /u/EugenTraveler [link] [comments]  ( 49 min )
    OpenAI CEO: AI may enable us to "cure all disease," "travel the stars," and "have unlimited power"
    submitted by /u/Microsis [link] [comments]  ( 50 min )
    If you could start learning AI from scratch again, where would you begin? What would you do differently?
    submitted by /u/linkuei-teaparty [link] [comments]  ( 51 min )
    PaLM vs. GPT-3
    submitted by /u/jrstelle [link] [comments]  ( 53 min )
    Can A.I. Help to Beat Cancer?
    submitted by /u/BackgroundResult [link] [comments]  ( 51 min )
    The Limit of Language Models | LessWrong
    submitted by /u/DragonGod2718 [link] [comments]  ( 49 min )
    Insane Anime Results - Stable Diffusion
    submitted by /u/oridnary_artist [link] [comments]  ( 53 min )
    Search engine within a text document
    Hi everyone! I have a 300+ page odt file, which I can simply convert to a txt file. On the other hand, I have several dozen scattered notes that I have to insert into this document. I would like to know if there is a piece of software, a library, or a project on GitHub (preferably in Python) that can help me find the best, most coherent place to insert each note. Alternatively, if you were to create the code from scratch, how would you go about it? submitted by /u/iacoposk8 [link] [comments]  ( 48 min )
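    One way to approach the question above (a sketch, assuming the sentence-transformers package, a plain-text version of the document, and hypothetical file names): embed every paragraph and every note, then place each note after its most similar paragraph by cosine similarity:

        from sentence_transformers import SentenceTransformer, util

        model = SentenceTransformer("all-MiniLM-L6-v2")

        with open("document.txt") as fh:
            paragraphs = [p for p in fh.read().split("\n\n") if p.strip()]
        notes = ["first scattered note...", "second scattered note..."]

        doc_emb = model.encode(paragraphs, convert_to_tensor=True)
        for note in notes:
            note_emb = model.encode(note, convert_to_tensor=True)
            scores = util.cos_sim(note_emb, doc_emb)[0]
            best = int(scores.argmax())
            print(f"insert after paragraph {best}: {paragraphs[best][:60]}...")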
    Crazy Train But Every Lyric is an AI Generated Animation! Ozzy Osborne🔥
    submitted by /u/Available_Tadpole829 [link] [comments]  ( 48 min )
    PaLM vs. ChatGPT: Who Will Win the AI Race?
    submitted by /u/liquidocelotYT [link] [comments]  ( 50 min )
    Midjourney's Incredible Copying of Images - Is Scraping the internet at scale suddenly okay?
    submitted by /u/BackgroundResult [link] [comments]  ( 46 min )
    ChatGPT Makes History as the First AI to Write & Direct a Film
    submitted by /u/lambolifeofficial [link] [comments]  ( 53 min )
    Lord Shiva Trippy Animation
    submitted by /u/oridnary_artist [link] [comments]  ( 55 min )
  • Open

    [D] Good MovieLens recommender system tutorial using PyTorch?
    Looking for a good tutorial on creating a basic recommender system using PyTorch: basically, have the input be a user and a list of candidate titles, and the output be a score (0-1) for each movie for that user. Or anything that explains how to build good user or movie embeddings... I'm just not finding much high-quality material; most of the tutorials I've found so far either stop at data analysis or skip everything complicated and jump straight to "ok, that's a good base, you just need to do the rest now" without actually doing it... submitted by /u/Secure-Examination95 [link] [comments]  ( 66 min )
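    In lieu of a full tutorial, here is a minimal sketch of the kind of model the post describes: user and movie embeddings whose dot product (plus biases), passed through a sigmoid, scores a (user, movie) pair in [0, 1]. Sizes are illustrative, and IDs are assumed to be contiguous integers:

        import torch
        import torch.nn as nn

        class MFRecommender(nn.Module):
            def __init__(self, n_users, n_movies, dim=32):
                super().__init__()
                self.user_emb = nn.Embedding(n_users, dim)
                self.movie_emb = nn.Embedding(n_movies, dim)
                self.user_bias = nn.Embedding(n_users, 1)
                self.movie_bias = nn.Embedding(n_movies, 1)

            def forward(self, user_ids, movie_ids):
                u = self.user_emb(user_ids)               # (batch, dim)
                m = self.movie_emb(movie_ids)             # (batch, dim)
                logits = ((u * m).sum(-1)
                          + self.user_bias(user_ids).squeeze(-1)
                          + self.movie_bias(movie_ids).squeeze(-1))
                return torch.sigmoid(logits)              # score in (0, 1)

        # Scoring a list of candidate titles for one user; train the model
        # with nn.BCELoss() against 0/1 interaction labels.
        model = MFRecommender(n_users=1000, n_movies=2000)
        user = torch.tensor([42])
        candidates = torch.tensor([5, 17, 99])
        scores = model(user.repeat(len(candidates)), candidates)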
    [P] I built an API that makes it easy and cheap for developers to build ML-powered apps using Stable Diffusion
    Hey folks, I built TuneMyAI to make it incredibly simple for developers to finetune and deploy Stable Diffusion models to production so they can focus on building great products. As an app developer myself, I spent a while trying to figure out how to go beyond local GPUs and notebooks and set up our own infra using Kubernetes. In summary, we wanted to make it really simple for anyone to build applications on top of Stable Diffusion without worrying about all the MLOps overhead. Our API allows you to finetune your Stable Diffusion models for your specific data sets. We handle everything from storage, finetuning, model deployment & inference and integrate with HuggingFace as well. We're working on a bunch of new features including hosted WebUIs, support for additional models like Whisper and more. Would love for y'all to check us out and share any feedback. You can learn more on ProductHunt. Thanks & Happy Holidays! submitted by /u/TrueBlueDreamin [link] [comments]  ( 65 min )
    [R] Character-Aware Models Improve Visual Text Rendering - Google Research 2022 - Training the text encoder on the actual characters instead of tokens improves spelling capabilities!
    Paper: https://arxiv.org/abs/2212.10562#google Abstract: Current image generation models struggle to reliably produce well-formed visual text. In this paper, we investigate a key contributing factor: popular text-to-image models lack character-level input features, making it much harder to predict a word's visual makeup as a series of glyphs. To quantify the extent of this effect, we conduct a series of controlled experiments comparing character-aware vs. character-blind text encoders. In the text-only domain, we find that character-aware models provide large gains on a novel spelling task (WikiSpell). Transferring these learnings onto the visual domain, we train a suite of image generation models, and show that character-aware variants outperform their character-blind counterparts across a range of novel text rendering tasks (our DrawText benchmark). Our models set a much higher state-of-the-art on visual spelling, with 30+ point accuracy gains over competitors on rare words, despite training on far fewer examples. https://preview.redd.it/m4ycamclmb8a1.jpg?width=1245&format=pjpg&auto=webp&s=d520e797c1bd7df2f6a60dc820f8066a205a389e https://preview.redd.it/anzemadlmb8a1.jpg?width=1353&format=pjpg&auto=webp&s=1755357041967630d359d50d93527dcb886ad25f https://preview.redd.it/5ikr8gdlmb8a1.jpg?width=1531&format=pjpg&auto=webp&s=1f3c2da40cf96f12122ddb42c0b258f0247550bb https://preview.redd.it/pkhiwnclmb8a1.jpg?width=746&format=pjpg&auto=webp&s=c78b2cc0682252ac86eff1f130189e94dc82c834 https://preview.redd.it/q5l8psclmb8a1.jpg?width=1538&format=pjpg&auto=webp&s=c65a4234f7a0593f2d5f8757b5fe27748be99fd0 submitted by /u/Singularian2501 [link] [comments]  ( 63 min )
    [D] Image Dataset Visualisation
    Newbie here... so I have a dataset which is stored in .mat format. However, after loading the file, it is saved as a dict and prints numbers in an array. I want to display the images from this dataset, but I am stuck on how to move forward from here. I tried googling this but didn't find any helpful code. Any suggestions would be very helpful! submitted by /u/Turbulent-Complex-25 [link] [comments]  ( 67 min )
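    A sketch of the usual workflow for this (the key name and array layout are assumptions; check the printed keys for your file):

        import scipy.io
        import matplotlib.pyplot as plt

        data = scipy.io.loadmat("dataset.mat")
        # loadmat returns a dict; list the variable names first
        print([key for key in data if not key.startswith("__")])

        images = data["images"]      # hypothetical key; use a name from the print above
        print(images.shape)          # e.g. (n_samples, height, width)

        plt.imshow(images[0], cmap="gray")
        plt.show()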
    [D] A hand-picked selection of the best Python libraries and tools of 2022
    Hi everyone! For the 8th (!) year in a row, we have compiled our picks for the most innovative developments in the Python ecosystem. From this edition, we are expanding our list to include not only libraries but also tools that are built to belong in the Python ecosystem — some of which are not written in Python as you’ll see. The full list with expanded descriptions is available here: https://tryolabs.com/blog/2022/12/26/top-python-libraries-2022 As usual, most of the picks have to do with AI / ML. ➡️ Here are our top 10 picks: Ruff — a fast linter python-benedict — a dict on steroids Memray — a memory profiler Codon — a Python compiler using LLVM LangChain — building LLM-powered apps fugue — distributed computing done easy Diffusers — generative AI LineaPy — notebooks in production whylogs — model monitoring Mito — spreadsheet inside notebooks ➕ Plus we added several more to the “long tail” that we hope are useful plus some that we missed last year, so make sure to check out the full post! So: What do you think about our picks? Did we miss any good ones? Please let us know! We take feedback seriously to improve the selection every year 💪🏻 Congrats to the individuals and teams behind each of these libraries. We know open source is hard. Thank you for your invaluable contributions to the Python community! 🚀🚀🚀 submitted by /u/dekked_ [link] [comments]  ( 67 min )
    [D] Normalized images in UNET
    I am working on a UNet model that takes as input 64x64 Landsat imagery and outputs various classes of agricultural features. The training works OK when I scale the surface reflectance (SR) values to 0-1 (i.e. divide raw SR by the 16-bit max constant 65536). What I've noticed is that the model seems to be memorizing the range of values in each image and not learning the shapes and spatial patterns as much. The result is that predictions vary a bit too much from year to year, and years not appearing in the training dataset have suboptimal predictions. Batch normalization does not seem to change anything: the model converges faster but the problem remains. What I've tried to do is normalize each image individually by subtracting each channel's mean and dividing by its standard deviation. This maintains the relative spatial patterns and shapes but brings all images to a mean of 0 and standard deviation of 1. Feeding these normalized images to the model does not work: I get precision and recall of 0. Pretty much all predictions were 0. Is there a reason why this would happen? Am I missing something about the way UNet works? Any insight would be appreciated. submitted by /u/skn133229 [link] [comments]  ( 71 min )
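    For comparison, a per-image, per-channel standardization like the one described, with an epsilon so near-constant channels don't divide by roughly zero; note that whatever transform is used in training must also be applied identically at inference (a common source of all-zero predictions, though not necessarily the cause here):

        import numpy as np

        def normalize_image(img, eps=1e-6):
            # img: (H, W, C) float array of surface-reflectance values
            mean = img.mean(axis=(0, 1), keepdims=True)
            std = img.std(axis=(0, 1), keepdims=True)
            return (img - mean) / (std + eps)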
    [D] SE for machine learning research
    Hello everyone, I'm trying to figure out how to apply concepts from SE to ML research. It seems like I can find really good settings for my model and dataset, and they can be reproduced. However, I think there's a better way to create code for experimenting. For example, creating and testing baselines and logging test results seem to be the same between most (if not all) of my experiments. I find myself copying and pasting a lot of code snippets between my projects. Yet, every time I try to sit down and write generic code for experimenting, I find that it's either too limiting or impossible for me to write. I think looking into software engineering concepts and principles might help. I really want to know what your experience has been in searching for/applying SE in this field, or whether you even think it's worth it/possible. Some of my colleagues think it's a waste of time, especially considering that the model would run on completely different code. submitted by /u/sad_potato00 [link] [comments]  ( 68 min )
    [D] Panel Data Model Evaluation
    Hi, I'm dealing with a highly unbalanced binary classification of panel data. I am wondering if there are better ways to estimate the performance of the models than splitting the dataset once at a certain date, since that way I can only obtain a point estimate. I don't think group k-fold is suitable since I'd like to respect the temporal order, and I'm unsure if using rolling windows would be a valid strategy. Any opinions? submitted by /u/skagass [link] [comments]  ( 65 min )
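    A sketch of the rolling-window idea mentioned above: repeatedly fit on all panel rows up to a cutoff date and evaluate on the following window, which yields a distribution of scores instead of a single point estimate. Column names and window sizes are assumptions:

        import numpy as np
        import pandas as pd

        def rolling_window_splits(df, n_splits=5, test_months=3):
            dates = np.sort(df["date"].unique())
            cutoff_idx = np.linspace(len(dates) // 2, len(dates) - 1,
                                     n_splits, dtype=int)
            for i in cutoff_idx:
                cutoff = pd.Timestamp(dates[i])
                train = df[df["date"] <= cutoff]
                test = df[(df["date"] > cutoff) &
                          (df["date"] <= cutoff + pd.DateOffset(months=test_months))]
                if len(test):
                    yield train, test

        # for train, test in rolling_window_splits(panel_df):
        #     fit on train, evaluate (e.g. PR-AUC for the imbalanced case) on test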
    [Discussion] Stochastic Depth with BatchNorm ?
    Hi, I am using Stochastic Depth in a ResNet-based architecture that I train for image classification. I am wondering how it interacts with batchnorm and whether there are some things to know to make it work. To go into details, Stochastic Depth randomly drops some ResNet blocks and uses exclusively the shortcut identity connection instead, effectively reducing the depth of the network during training. Hence, with probability p: x_{n+1} = x_n + f(x_n); with probability (1-p): x_{n+1} = x_n. To preserve the expected values during training and inference, they scale the output of the not-skipped blocks (equation 5 in the paper): x_{n+1} = x_n + f(x_n)/p. That seems logical (even though it does not seem to yield better results in practice, but whatever). My question is more related to the variance of the batches. If one batch contains samples that skip a connection and samples that do not ('row' mode in the Torchvision implementation), even if the values are adjusted to preserve the expected value, the variance will be much higher because we have in practice two distributions (for x_n and x_n + f(x_n)/p), which will interfere with the update of the batch normalization. Also, at inference time, all forward passes will be done as x_{n+1} = x_n + f(x_n), which has a different variance. The Torchvision implementation also offers a 'batch' mode that somewhat reduces this issue (because the global variance computed this way will be the mean of both distributions' variances, instead of the variance of the joint distribution), but it does not seem to be the default mode (it does not even exist in the timm implementation). Has anyone here ever thought about this? Is there a specific way to use both stochastic depth and batchnorm? Thank you. submitted by /u/w2ex [link] [comments]  ( 70 min )
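    For concreteness, a sketch of the block being discussed, in 'batch' mode (one coin flip for the whole batch) and with the train-time 1/p scaling described above, so that the expected training output matches the deterministic x + f(x) used at inference:

        import torch
        import torch.nn as nn

        class StochasticDepthBlock(nn.Module):
            def __init__(self, branch, p_keep=0.8):
                super().__init__()
                self.branch, self.p_keep = branch, p_keep

            def forward(self, x):
                if not self.training:
                    return x + self.branch(x)                # deterministic at inference
                if torch.rand(()) < self.p_keep:
                    return x + self.branch(x) / self.p_keep  # kept, rescaled by 1/p
                return x                                     # branch dropped for this batch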
    Trippy Inkpunk Style animation using Stable Diffusion [P]
    submitted by /u/oridnary_artist [link] [comments]  ( 62 min )
  • Open

    Tools for learning machine learning?
    Tools for getting the job done in machine learning  ( 8 min )
    Convolutional Neural Network
    Introduction  ( 11 min )
    What is Machine Learning?
    In the modern era, computers are similar to humans. We can teach them to learn, and even make them learn on their own. Machine learning…  ( 13 min )
  • Open

    Drone Racing RL Environments
    I'm working on training a RL agent for autonomous drone racing (state-based without perception) and I've found three popular options: AirSim Drone Racing Lab (https://github.com/microsoft/AirSim-Drone-Racing-Lab), Flightmare (https://github.com/uzh-rpg/flightmare), and gym-pybullet-drones (https://github.com/utiasDSL/gym-pybullet-drones). AirSim drone racing was used in a NeurIPS 2019 challenge, but there are some issues with the OpenGL/Vulkan drivers on modern GPUs. Flightmare was used in this paper, but apparently it uses simplified physics. Does anyone have any experience with these or any other simulators? I want to avoid any surprises later on in the project. submitted by /u/redfedoradog [link] [comments]  ( 57 min )
    Which simulator is best for behavior cloning?
    Hi, I am starting to work on a project related to behaviour cloning on a manipulator, and for that I need to collect data. Usually people use VR to collect data, but I don't have that setup since this is a personal project. In that case, which simulator would be best for collecting data through keyboard and mouse? Also, if you know of any datasets for manipulators, please share the link in the comments. submitted by /u/Better-Ad8608 [link] [comments]  ( 60 min )
  • Open

    Singularity and Limitations of AI
    submitted by /u/DataHack23 [link] [comments]  ( 49 min )
    In theory, artificial neurons could be compared to biological neurons; has anyone analysed this?
    Or would testing the smallest living things with biological neural networks (BNNs) against artificial neural networks (ANNs) with the same number of neurons be a good way to compare ANNs vs BNNs? For example, could an ANN fly brain survive in a fly simulation, or an ant? We often see articles that compare the number of biological neurons humans have vs the number of artificial neurons an AI has, but are we even measuring on the same scale? How do ANNs and BNNs compare, and what are the main differences? submitted by /u/Arowx [link] [comments]  ( 53 min )
  • Open

    Pascal’s triangle mod row number
    Almost all binomial coefficients are divisible by their row number. This is a theorem from [1]. What does it mean? If you iterate through Pascal’s triangle, left-to-right and top-to-bottom, noting which entries C(m, k) are divisible by m, the proportion approaches 1 in the limit. The author proves that the ratio converges to 1, but […] Pascal’s triangle mod row number first appeared on John D. Cook.  ( 5 min )
    Chebyshev series for sine
    In last week’s post on polynomial approximations for sine, I showed that the polynomial based on Chebyshev series was much better than a couple alternatives. I calculated a few terms of the Chebyshev series for sin(πx) but didn’t include the calculations in that blog post. I calculated the series coefficients numerically, but this post will […] Chebyshev series for sine first appeared on John D. Cook.  ( 5 min )
  • Open

    Important 3D printing & Food Technology Innovations
    3D printing has become one of the most positive advances in tech over the last decade. It offers the ability to create important mechanical parts within minutes. What’s so impressive about this is that it’s also been able to create parts for machinery. That have long been out of service. It has saved money in… Read More »Important 3D printing & Food Technology Innovations The post Important 3D printing & Food Technology Innovations appeared first on Data Science Central.  ( 20 min )

  • Open

    Solar Day vs Sidereal Day
    How long does it take the earth to complete one rotation on its axis? The answer depends on your frame of reference. A solar day is the time it takes for the sun to appear at the same position in the sky. A sidereal day is the time it takes for a distant star to […] Solar Day vs Sidereal Day first appeared on John D. Cook.  ( 7 min )
  • Open

    What is missing here?? I am making an AI iceberg video
    submitted by /u/Ok_Read_2524 [link] [comments]  ( 51 min )
    Will ChatGPT Replace Google?
    submitted by /u/SupPandaHugger [link] [comments]  ( 53 min )
    X-Decoder brings better visual understanding to AI models
    submitted by /u/Number_5_alive [link] [comments]  ( 55 min )
    Do you think In the future ai will bug test games?
    (If this is the wrong sub I'll delete it.) But since AI can learn to do things perfectly, could you run multiple AIs at once, and maybe they would find bugs while trying to optimize the fastest way to beat the game? submitted by /u/CalligrapherSmall241 [link] [comments]  ( 53 min )
    How to continue using Midjourney after the trial period on Discord server?
    In case of complex explanation, I appreciate if you share a YouTube link with the tutorial. submitted by /u/Diogeneselcinico42 [link] [comments]  ( 53 min )
    What are the best AI tools for generating all kinds of FREE content in the browser? [I share some of them]
    - Images: NightCafe Creator (limited to 10 images/day), Midjourney (Discord server, limited)
    - Questions: AISEO
    - Music: SOUNDRAW
    Do you know of any other tools? submitted by /u/Diogeneselcinico42 [link] [comments]  ( 50 min )
    How to transition from Philosophy PhD to AI?
    Wrapping up a PhD in philosophy at an ivy school. Work in philosophy of mind (consciousness, perception). Any advice? Anyone know of similar stories? I’m starting to learn some Python, would love to hear any suggestions. submitted by /u/lurkergroove2 [link] [comments]  ( 61 min )
    Introducing Dramatron: The AI Tool From DeepMind That Writes Film Scripts
    submitted by /u/liquidocelotYT [link] [comments]  ( 53 min )
    I work for a photography team that utilizes automated process in Photoshop and elsewhere quite a bit, but I'm wondering if there's a way to integrate AI into the steps we haven't been able to automate.
    I do not want to elaborate on the details, but if you would like to discuss the situation privately, let me know. submitted by /u/kstevo06 [link] [comments]  ( 57 min )
    AI Dream 125 - Mellow Psychedelic Journey - Calming & Beautiful
    submitted by /u/LordPewPew777 [link] [comments]  ( 50 min )
    you.com search engine gave me this for my question.
    me = is there a chatbot that will output if there is no input after a certain amount of time?
    you.com = Yes, there are chatbots that can be configured to output a response if there is no input after a certain amount of time. For example, you can use the Timeout trigger in the Chatfuel platform to set a timer that will cause the chatbot to send a message if there is no response from the user within a specified time period. Additionally, some chatbot frameworks, such as Botpress, provide a timeout feature that will cause the chatbot to automatically output a response if there is no user input after a certain amount of time. It then appended this (partial) code snippet:
        // or if the user is idle for a certain amount of time;
        // this could be useful to create a chatbot that will keep you from
        // being left alone on your phone, and can also help with keeping an
        // eye out for new people in the area.
        public void run() {
            System.out.println("Chatbot started");
            // create a scanner object
            Scanner sc = new Scanner(System.in);
            // variables to hold input from the user and output from the chatbot
            String userInput = "";
            String botOutput = "";
            // the while loop runs until "quit" is typed into the console window
            while (!userInput.equalsIgnoreCase("quit")) {
                // read what the user types and echo it back
                userInput = sc.nextLine();
                botOutput = "You said: " + userInput;
                System.out.println(botOutput);
            }
        }
    submitted by /u/loopy_fun [link] [comments]  ( 53 min )
    Will this be possible any time soon?
    Hey! ​ For context, I'm blind. I loved video games when I was younger, but can't really play them all that much anymore now that I have lost all eyesight. There are some text based games that I can get into in the way of MUDs, and there are also some fighting games I can play, and a couple games that do have mods which make them fairly playable. ​ While this is great, the selection of games for me to play is extremely low because of this. ​ What I'm imagining though would be game changing for me, and would give me several hours, days, years of enjoyment. I'm thinking of a project that would use a large language model, such as ChatGPT, combined with some sort of sound generation AI, as well as text to speech AI, to create an experience similar to that of a MUD, but full of atmospheric…  ( 60 min )
  • Open

    Tile Coded features as input for NNs?
    I get that one can use tile coding to construct features that can then be used with linear methods for state-value function approximation (e.g. SGD with linear function approximation). But in Sutton and Barto, tile coding is discussed exclusively in the context of linear methods. What if we were to feed the input layer of a NN not with states, but with features created from states via tile coding? I understand that the hidden layers of a NN are "learning features", but that doesn't automatically imply that feeding the input layer with "crafted" features is pointless. Any thoughts? submitted by /u/m_jochim [link] [comments]  ( 64 min )
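    A sketch of what that would look like: a low-dimensional state is tile-coded into a sparse multi-hot vector over several offset tilings, and that vector is fed to the network's input layer (tile counts and ranges are illustrative):

        import numpy as np

        def tile_code(state, n_tilings=8, tiles_per_dim=10, low=0.0, high=1.0):
            dim = len(state)
            features = np.zeros(n_tilings * tiles_per_dim ** dim)
            scaled = (np.asarray(state) - low) / (high - low) * tiles_per_dim
            for t in range(n_tilings):
                offset = t / n_tilings            # each tiling is shifted slightly
                idx = np.clip((scaled + offset).astype(int), 0, tiles_per_dim - 1)
                flat = np.ravel_multi_index(idx, (tiles_per_dim,) * dim)
                features[t * tiles_per_dim ** dim + flat] = 1.0
            return features

        x = tile_code([0.3, 0.7])   # 800-dim multi-hot vector for a 2-D state
        # x can now feed e.g. torch.nn.Linear(len(x), hidden_dim)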
  • Open

    Victorian Holiday cards by AI
    I'll admit I don't understand Victorian holiday cards - why would Christmas be best illustrated by a pipe-smoking kangaroo in a dressing gown painting a portrait of a cigar-smoking stork? Or what would lead someone to give their loved ones a card with a crowd of  ( 4 min )
    Bonus: More Victorian holiday cards
    AI Weirdness: the strange side of machine learning  ( 2 min )
  • Open

    [D] Are reviewer blacklists actually implemented at ML conferences?
    Are blacklists actually implemented in these conferences (ICML / ICLR / NeurIPS) given that the number of reviewers required grows every year? submitted by /u/XalosXandrez [link] [comments]  ( 66 min )
    [D] The case for deep learning for tabular data
    Been an industry data scientist for 6 years in fintech and gaming. In fintech, I sensed a need for interpretability and robustness. Also, I was not working with a lot of data (~500k observations to train models). Consequently, I got into the habit of building tree-based models by default, specifically xgboost, and used explainability techniques such as SHAP to explain the models. After moving to online gaming, the scrutiny is less and the scale is far greater. I now have the freedom to use deep learning. I need to be able to demonstrate effectiveness using experiments, but beyond that, do not need explainability at a granular level. Advantages I see with using deep learning:
    - Custom loss functions: basically any differentiable loss function can be trained on. This has huge advantages when the business goal is not aligned with the out-of-the-box loss functions (see the sketch below).
    - Learning embeddings: the ability to condense features into dense, latent representations which can be used for any number of use cases.
    - Multiple outputs per model, by tweaking the architecture.
    Seeing all this, deep learning seems to offer a lot of advantages, even if the performance might be similar to tree-based methods. What do you guys think? submitted by /u/dhruvnigam93 [link] [comments]  ( 68 min )
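    To make the custom-loss point concrete, a sketch of an asymmetric regression loss in PyTorch (the 3x under-prediction penalty is purely illustrative of a business-aligned objective):

        import torch

        def asymmetric_loss(pred, target, under_weight=3.0, over_weight=1.0):
            # penalize under-prediction (target > pred) more than over-prediction
            diff = target - pred
            return torch.mean(torch.where(diff > 0,
                                          under_weight * diff ** 2,
                                          over_weight * diff ** 2))

        # loss = asymmetric_loss(model(x), y); loss.backward()  # trains as usual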
  • Open

    Are architectures with intentional sparsity common?
    Tl;DR: wondering if there's a use case for implementing a sparse layer in the planning step of the network; the rest is just my rambling about the thought process that led me to this question. I'm trying to code neural networks from scratch as an exercise to understand the whole thing better (background in EE, didn't study ML in undergrad, trying to make sure I understand the entire concept). From what I've seen, a common way to code networks from scratch is to work with layers: basically define a layer as an object holding a matrix of weights and a vector of biases; the object also has a forward method that takes an input vector, multiplies it by the weight matrix, adds the biases, and so on... and it has a backward method which takes the gradient vector and outputs a gradient vector…  ( 49 min )
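    One common way to build the kind of intentionally sparse layer the post asks about is a fixed binary mask on the weight matrix; because the mask multiplies the weights inside the graph, it zeroes both the forward contributions and the corresponding gradients. A minimal PyTorch sketch (the random mask and sparsity level are illustrative):

        import torch
        import torch.nn as nn

        class MaskedLinear(nn.Module):
            def __init__(self, in_features, out_features, sparsity=0.9):
                super().__init__()
                self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.01)
                self.bias = nn.Parameter(torch.zeros(out_features))
                mask = (torch.rand(out_features, in_features) > sparsity).float()
                self.register_buffer("mask", mask)   # fixed, not trained

            def forward(self, x):
                return x @ (self.weight * self.mask).T + self.bias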
  • Open

    EDICT: Exact Diffusion Inversion via Coupled Transformations. (arXiv:2211.12446v2 [cs.CV] UPDATED)
    Finding an initial noise vector that produces an input image when fed into the diffusion process (known as inversion) is an important problem in denoising diffusion models (DDMs), with applications for real image editing. The state-of-the-art approach for real image editing with inversion uses denoising diffusion implicit models (DDIMs) to deterministically noise the image to the intermediate state along the path that the denoising would follow given the original conditioning. However, DDIM inversion for real images is unstable as it relies on local linearization assumptions, which result in the propagation of errors, leading to incorrect image reconstruction and loss of content. To alleviate these problems, we propose Exact Diffusion Inversion via Coupled Transformations (EDICT), an inversion method that draws inspiration from affine coupling layers. EDICT enables mathematically exact inversion of real and model-generated images by maintaining two coupled noise vectors which are used to invert each other in an alternating fashion. Using Stable Diffusion, a state-of-the-art latent diffusion model, we demonstrate that EDICT successfully reconstructs real images with high fidelity. On complex image datasets like MS-COCO, EDICT reconstruction significantly outperforms DDIM, improving the mean square error of reconstruction by a factor of two. Using noise vectors inverted from real images, EDICT enables a wide range of image edits--from local and global semantic edits to image stylization--while maintaining fidelity to the original image structure. EDICT requires no model training/finetuning, prompt tuning, or extra data and can be combined with any pretrained DDM. Code is available at https://github.com/salesforce/EDICT.  ( 2 min )
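    A toy sketch of the coupling idea the abstract describes (not the actual EDICT update rules): two sequences are updated alternately, each step depending only on the other variable, so every step can be undone exactly by reversing the order and flipping the signs:

        import numpy as np

        def forward_step(x, y, f):
            x = x + f(y)    # x-update depends only on y
            y = y + f(x)    # y-update depends only on the new x
            return x, y

        def inverse_step(x, y, f):
            y = y - f(x)    # undo the y-update first, in reverse order
            x = x - f(y)
            return x, y

        f = lambda v: np.tanh(v)     # stand-in for the diffusion-model call
        x0, y0 = np.random.randn(4), np.random.randn(4)
        x, y = forward_step(x0, y0, f)
        xr, yr = inverse_step(x, y, f)
        print(np.allclose(x0, xr), np.allclose(y0, yr))   # True True: exact inversion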
    An Algorithm for Routing Vectors in Sequences. (arXiv:2211.11754v3 [cs.LG] UPDATED)
    We propose a routing algorithm that takes a sequence of vectors and computes a new sequence with specified length and vector size. Each output vector maximizes "bang per bit," the difference between a net benefit to use and net cost to ignore data, by better predicting the input vectors. We describe output vectors as geometric objects, as latent variables that assign credit, as query states in a model of associative memory, and as agents in a model of a Society of Mind. We implement the algorithm with optimizations that reduce parameter count, computation, and memory use by orders of magnitude, enabling us to route sequences of greater length than previously possible. We evaluate our implementation on natural language and visual classification tasks, obtaining competitive or state-of-the-art accuracy and end-to-end credit assignments that are interpretable.  ( 2 min )

  • Open

    [D] Productionizing large scale ML model that can forecast sales for hundred-thousands of products for multiple stores (SKU/store)
    Does anyone have experience in deploying a similar large-scale forecasting system? (Assuming enough data is available.) What did the final model/system look like? What ML algorithm was deployed in production? Did one ML model fit all the data to forecast accurately for all products, or were multiple models trained specifically for every product/store? What loss functions/metrics were used? It would be great to hear your experiences. submitted by /u/k-deeplearning99 [link] [comments]  ( 65 min )
    [D] What are some applied domains where academic ML researchers are hoping to produce impressive results soon?
    Like AlphaFold, but not in a corporate setting. Even small-data. Does any github 'awesome' list exist for such applied areas being worked upon? submitted by /u/D0ODU [link] [comments]  ( 69 min )
    [P] I made a project to find good real-estate deals online using machine learning
    submitted by /u/Emotional_Aardvark26 [link] [comments]  ( 67 min )
    [P] Implementing Convolutional Neural Network for Reverse Engineering
    submitted by /u/Emotional_Aardvark26 [link] [comments]  ( 69 min )
    [R][P] I made an app for Instant Image/Text to 3D using PointE from OpenAI
    submitted by /u/perception-eng [link] [comments]  ( 66 min )
    [D] GPT-3 concrete applications (with Python code snippets). Do you see other ones?
    Any other applications you can think of? submitted by /u/AImSamy [link] [comments]  ( 64 min )
    [D] Quadratic Discriminant Analysis
    QDA is a non-linear classification algorithm. However, does its decision boundary always have to be quadratic? I mean, can it produce linear decision boundaries? submitted by /u/Visual-Arm-7375 [link] [comments]  ( 65 min )
    [N] open source AR and VR algorithm toolbox launched with SLAM, structure from motion, visual localization, motion capture, motion generation, and NeRF
    submitted by /u/SpatialComputing [link] [comments]  ( 63 min )
    [R] Researchers developed computational method for finding Causal Functional Connectivity
    submitted by /u/neuro_researcher [link] [comments]  ( 65 min )
    [P] Annotated History of Modern AI and Deep Learning (Schmidhuber)
    submitted by /u/hardmaru [link] [comments]  ( 73 min )
  • Open

    Is there some AI software I can use to illustrate a kids' book?
    Wondering if any such software exists, and whether it would be legal in the US to publish a book using such pictures. submitted by /u/Conanzulu [link] [comments]  ( 48 min )
    This Ai Almost Ruined The Music Industry
    submitted by /u/CookingGod [link] [comments]  ( 49 min )
    What’s Your Power, Strong AI?
    submitted by /u/akolonin [link] [comments]  ( 47 min )
    /r/InappropriateAI
    I started a sub, maybe you'll have something to share. /r/InappropriateAI All in fun, thanks submitted by /u/DropNationalism [link] [comments]  ( 49 min )
    🔎 You.com now has 👀 YouChat - Alternative to ChatGPT!
    What does everyone think of the YouChat bot on the You.com search site? You.com says it reaches over 1 million actively searching users and has grown over 400% in the last six months. I tested their ChatGPT alternative and found it half-decent. I'm curious what others think. For background, You.com, the search engine startup founded in 2020 with a moonshot bid to take on Google, announced today that it has opened its search platform to allow external developers and organizations to build their own apps for the search results page. This includes generative AI apps, it says, that have never been seen inside traditional search engines, using generative AI technology that enables users to generate text (YouWrite), code (YouCode), or images (YouImagine) from plain English — all within the search results page. You can try it out here, let me know if the link works: https://you.com/search?q=who+are+you&tbm=youchat It says: 👋 Hello! My name is YouChat, I'm an AI that can answer general questions, explain things, suggest ideas, translate, summarize text, compose emails, and write code for you. I'm powered by artificial intelligence and natural language processing, allowing you to have human-like conversations with me. I am constantly learning from huge amounts of information on the internet, which means I sometimes may get some answers wrong. My AI is always improving and I will often share sources for my answers. In 2023 there are going to be so many tools like this, to be fair, paired with search. submitted by /u/BackgroundResult [link] [comments]  ( 49 min )
    How long until we get ChatGPT into our voice assistants?
    How long until we get ChatGPT into Alexa? submitted by /u/TheVellerShow [link] [comments]  ( 49 min )
    Why do most datasets have just 28x28 px images, e.g. MNIST?
    submitted by /u/gtrocksr [link] [comments]  ( 48 min )
    A.I. Moist Critical reads you a bedtime story.
    submitted by /u/Surrounded_By_Sheep [link] [comments]  ( 50 min )
    Do I need to learn ML to learn AI? Or is it just beneficial, or does it have nothing to do with it, so I can go straight to AI?
    submitted by /u/Wonderful_Ad3441 [link] [comments]  ( 48 min )
    Poe and ChatGPT: The New Kids on the Block
    submitted by /u/liquidocelotYT [link] [comments]  ( 47 min )
    guys it can recognise black people now
    submitted by /u/ivan697 [link] [comments]  ( 47 min )
    Wacky future prediction for humans counter AI
    I'm fighting the usual existential crisis due to AI, and I drown my sorrows with magic mushrooms. Hear me out with this speculation about the future. I think we all agree AI is just getting better. I heard that theory that the next generation of startups might be companies that build hyper-specific layers for AI (e.g. a law firm building a lawyer app). If we keep on this path of large models based on curated data, then even if AI gets better it'll stay kinda samey. Plus we might also have a built-in watermark for the generated text. But what if AI being too consistent gets to people? I can see myself reading AI-generated emails a few years from now and just seeing at a glance the dry lingo and winding lines of text with minimal meaning. What if people just counter that by being inconsistent? I can see myself adding little flairs to my emails to make sure the reader knows I'm a human. I imagine this is how artists might adapt: if their style is too generic and AI begins to be a major competitor, they might just stylize more. I'd draw a parallel between this and zoomer humor. There's just stuff in there I don't find funny, but zoomers do, because the goal was to create something they can call their own. So maybe humans adapt by becoming more human, I dunno. I'm still high off mushrooms and just wanted to textualize my thoughts. submitted by /u/Organic_Fudge1041 [link] [comments]  ( 50 min )
    🤖 How 2022 Was a Big Year for A.I.
    submitted by /u/BackgroundResult [link] [comments]  ( 48 min )
    How good are LLMs at data compression? For example, ChatGPT was trained on the entire internet and that info can be retrieved by talking to the model. What is the size (in Gb) of the model? How does that compare to the amount of data it is trained on that is theoretically stored in its parameters?
    submitted by /u/elonmusk12345_ [link] [comments]  ( 62 min )
    Companies offering AI products.
    submitted by /u/Notalabel_4566 [link] [comments]  ( 48 min )
    Trainable stable diffusion keras model or similar?
    Hey, I think I remember there being SD models that are trainable in a reasonable amount of time. Maybe the trade-off was that they are not text-conditioned? If that's the case, can you point me toward a Keras model or similar? I really want to train my own (from scratch, not transfer learning) and make some modifications. Thank you! submitted by /u/elfballs [link] [comments]  ( 49 min )
    Google's AI is Allegedly 3x More Powerful than ChatGPT
    submitted by /u/lambolifeofficial [link] [comments]  ( 47 min )
  • Open

    Practical RL books
    I have been working with deep RL for a year now. During this time I have learned lots of practical things, like the importance of reward normalisation or the fact that imposing symmetrical action spaces improves the performance of continuous-control models. I would like to know if there is any deep RL book that focuses on these kinds of tips for obtaining good model performance. submitted by /u/random-redditor9 [link] [comments]  ( 60 min )
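    Not a book, but as a concrete instance of the reward-normalization tip: divide rewards by a running estimate of their standard deviation (Welford-style running statistics), which keeps value-loss scales stable across environments. A minimal sketch:

        class RunningRewardNormalizer:
            def __init__(self, eps=1e-8):
                self.count, self.mean, self.m2, self.eps = 0, 0.0, 0.0, eps

            def __call__(self, reward):
                # Welford's online update of mean and variance
                self.count += 1
                delta = reward - self.mean
                self.mean += delta / self.count
                self.m2 += delta * (reward - self.mean)
                std = (self.m2 / max(self.count - 1, 1)) ** 0.5
                return reward / (std + self.eps)

        # norm = RunningRewardNormalizer(); r = norm(raw_reward)  # in the env loop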
    Does attention help in vision-based policies?
    Recently I read many papers on combining robotic manipulation with natural language processing. Those papers used something called a Transformer block, which relies on an attention mechanism. They said it is used so that the policy can learn to focus on what is relevant in the current image frame. I have never used attention blocks or Transformers in general because they are not usually applied to robotics. Anyone with experience in using Transformers/attention, please explain whether there is any benefit to adding these instead of simple LSTM models. If possible, please also explain the structure of the policy neural networks with an example. submitted by /u/Better-Ad8608 [link] [comments]  ( 62 min )
    Current model-free algorithms?
    I was wondering what the best baseline RL algorithm would be for a model-free MDP problem. Currently, I have model-free, off-policy (for adequate exploration), and deterministic-policy (restricted by the problem formulation) settings in a high-dimensional continuous state (dim > millions) / action (dim > hundreds) space. In particular, my environment has a non-stationary state distribution and long episode length (T > 3000). However, I just want to see how it does with standard model-free off-policy algorithms first. I've tried Deep Deterministic Policy Gradient, but it doesn't seem to generalize well enough, giving a high train score but a low test score (maybe because of poor exploration with Ornstein-Uhlenbeck noise, or due to the effect of the underlying non-stationary state distribution). I'd really appreciate it if you could help me find some recent model-free algorithms that work well in this type of setting. submitted by /u/g6ssgs [link] [comments]  ( 63 min )
  • Open

    A Mutation-based Text Generation for Adversarial Machine Learning Applications. (arXiv:2212.11808v1 [cs.CL])
    Many natural-language-related applications involve text generation, created by humans or machines. While in many of those applications machines support humans, in a few others (e.g. adversarial machine learning, social bots and trolls) machines try to impersonate humans. In this scope, we proposed and evaluated several mutation-based text generation approaches. Unlike machine-generated text, mutation-based generated text needs human text samples as inputs. We showed examples of mutation operators, but this work can be extended in many aspects, such as proposing new text-based mutation operators based on the nature of the application.
    A Simple Way to Learn Metrics Between Attributed Graphs. (arXiv:2209.12727v2 [cs.LG] UPDATED)
    The choice of good distances and similarity measures between objects is important for many machine learning methods. Therefore, many metric learning algorithms have been developed in recent years, mainly for Euclidean data in order to improve performance of classification or clustering methods. However, due to difficulties in establishing computable, efficient and differentiable distances between attributed graphs, few metric learning algorithms adapted to graphs have been developed despite the strong interest of the community. In this paper, we address this issue by proposing a new Simple Graph Metric Learning - SGML - model with few trainable parameters based on Simple Graph Convolutional Neural Networks - SGCN - and elements of Optimal Transport theory. This model allows us to build an appropriate distance from a database of labeled (attributed) graphs to improve the performance of simple classification algorithms such as $k$-NN. This distance can be quickly trained while maintaining good performances as illustrated by the experimental study presented in this paper.
    Generalized Stable Weights via Neural Gibbs Density. (arXiv:2211.07533v2 [stat.ML] UPDATED)
    We present a generalized balancing method -- stable weights via Neural Gibbs Density -- fully available for estimating causal effects for an arbitrary mixture of discrete and continuous interventions. Our weights are trainable through back-propagation and can be obtained with neural network algorithms. In addition, we also provide a method to measure the performance of our weights by estimating the mutual information for the balanced distribution. Our method is easy to implement with any present deep learning library, and the weights from it can be used in most state-of-the-art supervised algorithms.
    Mind Your Heart: Stealthy Backdoor Attack on Dynamic Deep Neural Network in Edge Computing. (arXiv:2212.11751v1 [cs.CR])
    Transforming off-the-shelf deep neural network (DNN) models into dynamic multi-exit architectures can achieve inference and transmission efficiency by fragmenting and distributing a large DNN model in edge computing scenarios (e.g., edge devices and cloud servers). In this paper, we propose a novel backdoor attack specifically on the dynamic multi-exit DNN models. Particularly, we inject a backdoor by poisoning one DNN model's shallow hidden layers targeting not this vanilla DNN model but only its dynamically deployed multi-exit architectures. Our backdoored vanilla model behaves normally on performance and cannot be activated even with the correct trigger. However, the backdoor will be activated when the victims acquire this model and transform it into a dynamic multi-exit architecture at their deployment. We conduct extensive experiments to prove the effectiveness of our attack on three structures (ResNet-56, VGG-16, and MobileNet) with four datasets (CIFAR-10, SVHN, GTSRB, and Tiny-ImageNet) and our backdoor is stealthy to evade multiple state-of-the-art backdoor detection or removal methods.
    Certified Policy Smoothing for Cooperative Multi-Agent Reinforcement Learning. (arXiv:2212.11746v1 [cs.LG])
    Cooperative multi-agent reinforcement learning (c-MARL) is widely applied in safety-critical scenarios, thus the analysis of robustness for c-MARL models is profoundly important. However, robustness certification for c-MARLs has not yet been explored in the community. In this paper, we propose a novel certification method, which is the first work to leverage a scalable approach for c-MARLs to determine actions with guaranteed certified bounds. c-MARL certification poses two key challenges compared with single-agent systems: (i) the accumulated uncertainty as the number of agents increases; (ii) the potential lack of impact when changing the action of a single agent into a global team reward. These challenges prevent us from directly using existing algorithms. Hence, we employ the false discovery rate (FDR) controlling procedure considering the importance of each agent to certify per-state robustness and propose a tree-search-based algorithm to find a lower bound of the global reward under the minimal certified perturbation. As our method is general, it can also be applied in single-agent environments. We empirically show that our certification bounds are much tighter than state-of-the-art RL certification solutions. We also run experiments on two popular c-MARL algorithms: QMIX and VDN, in two different environments, with two and four agents. The experimental results show that our method produces meaningful guaranteed robustness for all models and environments. Our tool CertifyCMARL is available at https://github.com/TrustAI/CertifyCMA
    Missing Data Imputation and Acquisition with Deep Hierarchical Models and Hamiltonian Monte Carlo. (arXiv:2202.04599v5 [cs.LG] UPDATED)
    Variational Autoencoders (VAEs) have recently been highly successful at imputing and acquiring heterogeneous missing data. However, within this specific application domain, existing VAE methods are restricted by using only one layer of latent variables and strictly Gaussian posterior approximations. To address these limitations, we present HH-VAEM, a Hierarchical VAE model for mixed-type incomplete data that uses Hamiltonian Monte Carlo with automatic hyper-parameter tuning for improved approximate inference. Our experiments show that HH-VAEM outperforms existing baselines in the tasks of missing data imputation and supervised learning with missing features. Finally, we also present a sampling-based approach for efficiently computing the information gain when missing features are to be acquired with HH-VAEM. Our experiments show that this sampling-based approach is superior to alternatives based on Gaussian approximations.
    On the Sparse DAG Structure Learning Based on Adaptive Lasso. (arXiv:2209.02946v2 [stat.ML] UPDATED)
    Learning the underlying Bayesian Networks (BNs), represented by directed acyclic graphs (DAGs), of the concerned events from purely-observational data is a crucial part of evidential reasoning. This task remains challenging due to the large and discrete search space. A recent flurry of developments following NOTEARS [1] recast this combinatorial problem into a continuous optimization problem by leveraging an algebraic equality characterization of acyclicity. However, the continuous optimization methods suffer from obtaining non-sparse graphs after the numerical optimization, which leads to inflexibility in ruling out the potentially cycle-inducing edges or false-discovery edges with small values. To address this issue, in this paper, we develop a completely data-driven DAG structure learning method without a predefined value to post-threshold small values. We name our method NOTEARS with adaptive Lasso (NOTEARS-AL), which is achieved by applying the adaptive penalty method to ensure the sparsity of the estimated DAG. Moreover, we show that NOTEARS-AL also inherits the oracle properties under some specific conditions. Extensive experiments on both synthetic and a real-world dataset verify the efficacy of the proposed method.
    Offline Clustering Approach to Self-supervised Learning for Class-imbalanced Image Data. (arXiv:2212.11444v1 [cs.LG])
    Class-imbalanced datasets are known to cause the problem of the model being biased towards the majority classes. In this project, we set up two research questions: 1) when is the class-imbalance problem more prevalent in self-supervised pre-training? and 2) can offline clustering of feature representations help pre-training on class-imbalanced data? Our experiments investigate the former question by adjusting the degree of class imbalance when training the baseline models, namely SimCLR and SimSiam, on the CIFAR-10 database. To answer the latter question, we train an expert model on each subset of the feature clusters. We then distill the knowledge of the expert models into a single model, so that we can compare the performance of this model to our baselines.
    Sequential Decision Problems with Weak Feedback. (arXiv:2212.11603v1 [cs.LG])
    This thesis considers sequential decision problems, where the loss/reward incurred by selecting an action may not be inferred from observed feedback. A major part of this thesis focuses on the unsupervised sequential selection problem, where one can not infer the loss incurred for selecting an action from observed feedback. We also introduce a new setup named Censored Semi Bandits, where the loss incurred for selecting an action can be observed under certain conditions. Finally, we study the channel selection problem in the communication networks, where the reward for an action is only observed when no other player selects that action to play in the round. These problems find applications in many fields like healthcare, crowd-sourcing, security, adaptive resource allocation, among many others. This thesis aims to address the above-described sequential decision problems by exploiting specific structures these problems exhibit. We develop provably optimal algorithms for each of these setups with weak feedback and validate their empirical performance on different problem instances derived from synthetic and real datasets.
    GENIE: Large Scale Pre-training for Text Generation with Diffusion Model. (arXiv:2212.11685v1 [cs.CL])
    In this paper, we propose a large-scale language pre-training for text GENeration using dIffusion modEl, which is named GENIE. GENIE is a pre-training sequence-to-sequence text generation model which combines Transformer and diffusion. The diffusion model accepts the latent information from the encoder, which is used to guide the denoising of the current time step. After multiple such denoising iterations, the diffusion model can restore the Gaussian noise to the diverse output text which is controlled by the input text. Moreover, such an architecture design also allows us to adopt large-scale pre-training on GENIE. We propose a novel pre-training method named continuous paragraph denoise based on the characteristics of the diffusion model. Extensive experiments on the XSum, CNN/DailyMail, and Gigaword benchmarks show that GENIE achieves comparable performance with various strong baselines; especially after pre-training, the generation quality of GENIE is greatly improved. We have also conducted many experiments on the generation diversity and parameter impact of GENIE. The code for GENIE will be made publicly available.
    Mediastinal Lymph Node Detection and Segmentation Using Deep Learning. (arXiv:2212.11956v1 [eess.IV])
    Automatic lymph node (LN) segmentation and detection is critical for cancer staging. In clinical practice, computed tomography (CT) and positron emission tomography (PET) imaging are used to detect abnormal LNs. Owing to their low contrast and the variety in nodal size and form, LN segmentation remains a challenging task. Deep convolutional neural networks are frequently used to segment structures in medical images, but most state-of-the-art techniques degrade the image's resolution through pooling and convolution, so the models provide unsatisfactory results. With these issues in mind, the well-established UNet architecture was modified using bilinear interpolation and a total generalized variation (TGV) based upsampling strategy to segment and detect mediastinal lymph nodes. The modified UNet maintains texture discontinuities, selects noisy areas, searches for appropriate balance points through backpropagation, and recreates the image resolution. A dataset was prepared, with the help of experienced medical experts, from CT images collected from TCIA, a 5-patient set, and the public ELCAP dataset. The UNet was trained on these datasets, and three different data combinations were used for testing. Using the proposed approach, the model achieved 94.8% accuracy, 91.9% Jaccard, 94.1% recall, and 93.1% precision on COMBO_3. The performance was measured on different datasets and compared with state-of-the-art approaches; the UNet++ model with the hybridized strategy performed better than the others.
    Localizing Anatomical Landmarks in Ocular Images using Zoom-In Attentive Networks. (arXiv:2210.02445v2 [eess.IV] UPDATED)
    Localizing anatomical landmarks is an important task in medical image analysis. However, the landmarks to be localized often lack prominent visual features. Their locations are elusive and easily confused with the background, and thus precise localization highly depends on the context formed by their surrounding areas. In addition, the required precision is usually higher than in segmentation and object detection tasks. Therefore, localization has its own unique challenges, different from segmentation or detection. In this paper, we propose a zoom-in attentive network (ZIAN) for anatomical landmark localization in ocular images. First, a coarse-to-fine, or "zoom-in", strategy is utilized to learn the contextualized features at different scales. Then, an attentive fusion module is adopted to aggregate multi-scale features, which consists of 1) a co-attention network with a multiple regions-of-interest (ROIs) scheme that learns complementary features from the multiple ROIs, and 2) an attention-based fusion module which integrates the multi-ROI features and non-ROI features. We evaluated ZIAN on two open challenge tasks, i.e., fovea localization in fundus images and scleral spur localization in AS-OCT images. Experiments show that ZIAN achieves promising performance and outperforms state-of-the-art localization methods. The source code and trained models of ZIAN are available at https://github.com/leixiaofeng-astar/OMIA9-ZIAN.
    Multiscale Graph Neural Networks for Protein Residue Contact Map Prediction. (arXiv:2212.02251v2 [q-bio.QM] UPDATED)
    Machine learning (ML) is revolutionizing protein structural analysis, including an important subproblem of predicting protein residue contact maps, i.e., which amino-acid residues are in close spatial proximity given the amino-acid sequence of a protein. Despite recent progress in ML-based protein contact prediction, predicting contacts with a wide range of distances (commonly classified into short-, medium- and long-range contacts) remains a challenge. Here, we propose a multiscale graph neural network (GNN) based approach taking a cue from multiscale physics simulations, in which a standard pipeline involving a recurrent neural network (RNN) is augmented with three GNNs to refine predictive capability for short-, medium- and long-range residue contacts, respectively. Test results on the ProteinNet dataset show improved accuracy for contacts of all ranges using the proposed multiscale RNN+GNN approach over the conventional approach, including the most challenging case of long-range contact prediction.
    Reinforcement Learning Based Approaches to Adaptive Context Caching in Distributed Context Management Systems. (arXiv:2212.11709v1 [eess.SY])
    Performance metrics-driven context caching has a profound impact on throughput and response time in distributed context management systems for real-time context queries. This paper proposes a reinforcement learning based approach to adaptively cache context, with the objective of minimizing the cost incurred by context management systems in responding to context queries. Our novel algorithms enable context queries and sub-queries to reuse and repurpose cached context in an efficient manner. This approach is distinct from traditional data caching approaches in three main features. First, we make selective context cache admissions using no prior knowledge of the context or the context query load. Second, we develop and incorporate innovative heuristic models to calculate the expected performance of caching an item when making the decisions. Third, our strategy defines a time-aware continuous cache action space. We present two reinforcement learning agents: a value function estimating actor-critic agent and a policy search agent using the deep deterministic policy gradient method. The paper also proposes adaptive policies, such as eviction and cache memory scaling, to complement our objective. Our method is evaluated using a synthetically generated load of context sub-queries and a synthetic dataset inspired by real-world data and query samples. We further investigate optimal adaptive caching configurations under different settings. This paper presents, compares, and discusses our findings, showing that the proposed selective caching methods achieve short- and long-term cost- and performance-efficiency. The paper demonstrates that the proposed methods outperform other modes of context management, such as redirector mode, database mode, and the cache-all policy, by up to 60% in cost efficiency.
    GOOD: Exploring Geometric Cues for Detecting Objects in an Open World. (arXiv:2212.11720v1 [cs.CV])
    We address the task of open-world class-agnostic object detection, i.e., detecting every object in an image by learning from a limited number of base object classes. State-of-the-art RGB-based models suffer from overfitting the training classes and often fail at detecting novel-looking objects. This is because RGB-based models primarily rely on appearance similarity to detect novel objects and are also prone to overfitting short-cut cues such as textures and discriminative parts. To address these shortcomings of RGB-based object detectors, we propose incorporating geometric cues such as depth and normals, predicted by general-purpose monocular estimators. Specifically, we use the geometric cues to train an object proposal network for pseudo-labeling unannotated novel objects in the training set. Our resulting Geometry-guided Open-world Object Detector (GOOD) significantly improves detection recall for novel object categories and already performs well with only a few training classes. Using a single "person" class for training on the COCO dataset, GOOD surpasses SOTA methods by 5.0% AR@100, a relative improvement of 24%.
    Training Integer-Only Deep Recurrent Neural Networks. (arXiv:2212.11791v1 [cs.LG])
    Recurrent neural networks (RNNs) are the backbone of many text and speech applications. These architectures are typically made up of several computationally complex components such as non-linear activation functions, normalization, bi-directional dependence, and attention. In order to maintain good accuracy, these components are frequently run using full-precision floating-point computation, making them slow, inefficient, and difficult to deploy on edge devices. In addition, the complex nature of these operations makes them challenging to quantize using standard quantization methods without a significant performance drop. We present a quantization-aware training method for obtaining a highly accurate integer-only recurrent neural network (iRNN). Our approach supports layer normalization, attention, and an adaptive piecewise linear (PWL) approximation of activation functions, to serve a wide range of state-of-the-art RNNs. The proposed method enables RNN-based language models to run on edge devices with a $2\times$ improvement in runtime and a $4\times$ reduction in model size while maintaining similar accuracy as the full-precision counterpart.  ( 2 min )
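    As a rough illustration of the PWL idea, the sketch below approximates tanh with a handful of linear segments and evaluates the selected segment with an integer multiply-add and a right shift; the breakpoints, scales, and fixed-point format are assumptions for the example, not the paper's recipe.

        import numpy as np

        def pwl_tanh_int(x_q, s_in, s_out, knots=16, frac_bits=8):
            # Fit PWL segments to tanh offline, then evaluate the selected
            # segment using integer slope/intercept in fixed point.
            xs = np.linspace(-4.0, 4.0, knots + 1)
            slope = np.diff(np.tanh(xs)) / np.diff(xs)
            inter = np.tanh(xs[:-1]) - slope * xs[:-1]
            idx = np.clip(np.searchsorted(xs, x_q * s_in) - 1, 0, knots - 1)
            a_q = np.round(slope[idx] * s_in / s_out * 2 ** frac_bits).astype(np.int64)
            b_q = np.round(inter[idx] / s_out * 2 ** frac_bits).astype(np.int64)
            return (a_q * x_q + b_q) >> frac_bits          # integer multiply-add + shift

        x_q = np.arange(-128, 128, dtype=np.int64)         # int8-style activations
        y_q = pwl_tanh_int(x_q, s_in=4 / 128, s_out=1 / 127)
        err = np.abs(y_q / 127 - np.tanh(x_q * 4 / 128)).max()
        print(f"max abs error vs float tanh: {err:.4f}")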
    Realizing Molecular Machine Learning through Communications for Biological AI: Future Directions and Challenges. (arXiv:2212.11910v1 [cs.ET])
    Artificial Intelligence (AI) and Machine Learning (ML) are weaving their way into the fabric of society, where they are playing a crucial role in numerous facets of our lives. As we witness the increased deployment of AI and ML in various types of devices, we benefit from their use in energy-efficient algorithms for low-powered devices. In this paper, we investigate a scale and medium far smaller than conventional devices as we move towards molecular systems that can be utilized to perform machine learning functions, i.e., Molecular Machine Learning (MML). Fundamental to the operation of MML is the transport, processing, and interpretation of information propagated by molecules through chemical reactions. We begin by reviewing the current approaches that have been developed for MML, before moving towards potential new directions that rely on gene regulatory networks inside biological organisms, as well as their population interactions, to create neural networks. We then investigate mechanisms for training machine learning structures in biological cells based on calcium signaling and demonstrate their application to building an Analog to Digital Converter (ADC). Lastly, we look at potential future directions as well as challenges that this area could help solve.  ( 2 min )
    Accelerating Barnes-Hut t-SNE Algorithm by Efficient Parallelization on Multi-Core CPUs. (arXiv:2212.11506v1 [cs.LG])
    t-SNE remains one of the most popular embedding techniques for visualizing high-dimensional data. Most standard packages of t-SNE, such as scikit-learn, use the Barnes-Hut t-SNE (BH t-SNE) algorithm for large datasets. However, existing CPU implementations of this algorithm are inefficient. In this work, we accelerate the BH t-SNE on CPUs via cache optimizations, SIMD, parallelizing sequential steps, and improving parallelization of multithreaded steps. Our implementation (Acc-t-SNE) is up to 261x and 4x faster than scikit-learn and the state-of-the-art BH t-SNE implementation from daal4py, respectively, on a 32-core Intel(R) Icelake cloud instance.
    Satellite-derived solar radiation for intra-hour and intra-day applications: Biases and uncertainties by season and altitude. (arXiv:2212.11745v1 [physics.ao-ph])
    Accurate estimates of the surface solar radiation (SSR) are a prerequisite for intra-day forecasts of solar resources and photovoltaic power generation. Intra-day SSR forecasts are of interest to power traders and to operators of solar plants and power grids who seek to optimize their revenues and maintain grid stability by matching power supply and demand. Our study analyzes the systematic biases and the uncertainty of SSR estimates derived from Meteosat with the SARAH-2 and HelioMont algorithms at intra-hour and intra-day time scales. The satellite SSR estimates are analyzed against 136 ground stations in Switzerland, across altitudes from 200 m to 3570 m, in 2018. We find major biases and uncertainties in the instantaneous, hourly, and daily-mean SSR. In peak daytime periods, the instantaneous satellite SSR deviates from the ground-measured SSR by a mean absolute deviation (MAD) of 110.4 and 99.6 W/m2 for SARAH-2 and HelioMont, respectively. For the daytime SSR, the instantaneous, hourly, and daily-mean MADs amount to 91.7, 81.1, and 50.8 W/m2 for SARAH-2, and 82.5, 66.7, and 42.9 W/m2 for HelioMont. Further, the SARAH-2 instantaneous SSR drastically underestimates the solar resources at altitudes above 1000 m in the winter half-year. A possible explanation, in line with the seasonality of the bias, is that snow cover may be misinterpreted as clouds at higher altitudes.
    End-to-End Learned Early Classification of Time Series for In-Season Crop Type Mapping. (arXiv:1901.10681v2 [cs.LG] UPDATED)
    Remote sensing satellites capture the cyclic dynamics of our planet at regular time intervals, recorded in satellite time series data. End-to-end trained deep learning models use this time series data to make predictions at a large scale, for instance, to produce up-to-date crop cover maps. Most time series classification approaches focus on the accuracy of predictions. However, the earliness of the prediction is also of great importance, since coming to an early decision can make a crucial difference in time-sensitive applications. In this work, we present an End-to-End Learned Early Classification of Time Series (ELECTS) model that estimates a classification score and a probability of whether sufficient data has been observed to come to an early and still accurate decision. ELECTS is modular: any deep time series classification model can adopt the ELECTS conceptual idea by adding a second prediction head that outputs a probability of stopping the classification. The ELECTS loss function then optimizes the overall model on a balanced objective of earliness and accuracy. Our experiments on four crop classification datasets from Europe and Africa show that ELECTS reaches state-of-the-art accuracy while massively reducing the quantity of data to be downloaded, stored, and processed. The source code is available at https://github.com/marccoru/elects.
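    A minimal sketch of the recipe described above: a backbone with two heads, one for class scores and one for a stopping probability, trained with a loss that weights per-step classification error and an earliness cost by the probability of halting at each step. The GRU backbone, the trade-off weight alpha, and the linear earliness cost are assumptions for illustration, not the exact ELECTS objective.

        import torch
        import torch.nn as nn

        class Elects(nn.Module):
            def __init__(self, bands, hidden, n_classes):
                super().__init__()
                self.rnn = nn.GRU(bands, hidden, batch_first=True)
                self.cls = nn.Linear(hidden, n_classes)    # classification head
                self.stop = nn.Linear(hidden, 1)           # stopping-probability head

            def forward(self, x):                          # x: (batch, time, bands)
                h, _ = self.rnn(x)
                return self.cls(h), torch.sigmoid(self.stop(h)).squeeze(-1)

        def elects_loss(logits, p_stop, y, alpha=0.5):
            B, T, C = logits.shape
            ce = nn.functional.cross_entropy(
                logits.reshape(B * T, C), y.repeat_interleave(T), reduction="none"
            ).reshape(B, T)
            # probability of halting exactly at step t
            halt = p_stop * torch.cumprod(
                torch.cat([torch.ones(B, 1), 1 - p_stop[:, :-1]], dim=1), dim=1)
            earliness = torch.arange(T).float() / T        # later stops cost more
            return (halt * (alpha * ce + (1 - alpha) * earliness)).sum(1).mean()

        model = Elects(bands=4, hidden=32, n_classes=5)
        x, y = torch.randn(8, 20, 4), torch.randint(0, 5, (8,))
        logits, p_stop = model(x)
        print(elects_loss(logits, p_stop, y).item())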
    Directed Acyclic Graph Factorization Machines for CTR Prediction via Knowledge Distillation. (arXiv:2211.11159v2 [cs.IR] UPDATED)
    With the growth of high-dimensional sparse data in web-scale recommender systems, the computational cost of learning high-order feature interactions in the CTR prediction task increases substantially, which limits the use of high-order interaction models in real industrial applications. Some recent knowledge distillation based methods transfer knowledge from complex teacher models to shallow student models to accelerate online model inference. However, they suffer from a degradation of model accuracy in the knowledge distillation process, and it is challenging to balance the efficiency and effectiveness of the shallow student models. To address this problem, we propose a Directed Acyclic Graph Factorization Machine (KD-DAGFM) to learn the high-order feature interactions from existing complex interaction models for CTR prediction via knowledge distillation. The proposed lightweight student model, DAGFM, can learn arbitrary explicit feature interactions from teacher networks and achieves approximately lossless performance, as proved by a dynamic programming algorithm. Besides, an improved general model, KD-DAGFM+, is shown to be effective in distilling both explicit and implicit feature interactions from any complex teacher model. Extensive experiments are conducted on four real-world datasets, including a large-scale industrial dataset from the WeChat platform with billions of feature dimensions. KD-DAGFM achieves the best performance with less than 21.5% of the FLOPs of the state-of-the-art method in both online and offline experiments, showing the superiority of DAGFM in dealing with industrial-scale data in the CTR prediction task. Our implementation code is available at: https://github.com/RUCAIBox/DAGFM.
    StoRM: A Diffusion-based Stochastic Regeneration Model for Speech Enhancement and Dereverberation. (arXiv:2212.11851v1 [eess.AS])
    Diffusion models have shown a great ability to bridge the performance gap between predictive and generative approaches for speech enhancement. We have shown that they may even outperform their predictive counterparts for non-additive corruption types or when they are evaluated on mismatched conditions. However, diffusion models suffer from a high computational burden, mainly because they require running a neural network for each reverse diffusion step, whereas predictive approaches only require one pass. Since diffusion models are generative approaches, they may also produce vocalizing and breathing artifacts in adverse conditions. In comparison, in such difficult scenarios, predictive models typically do not produce such artifacts but tend to distort the target speech instead, thereby degrading the speech quality. In this work, we present a stochastic regeneration approach in which an estimate given by a predictive model is provided as a guide for further diffusion. We show that the proposed approach uses the predictive model to remove the vocalizing and breathing artifacts while producing very high quality samples thanks to the diffusion model, even in adverse conditions. We further show that this approach enables the use of lighter sampling schemes with fewer diffusion steps without sacrificing quality, thus lifting the computational burden by an order of magnitude. Source code and audio examples are available online (https://uhh.de/inf-sp-storm).  ( 2 min )
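    Conceptually, the stochastic regeneration loop can be sketched as follows: a cheap predictive estimate seeds the sampler, is re-noised, and only a few guided reverse steps are run. The stand-in predictor, score function, and Euler-style update below are illustrative assumptions, not StoRM's networks or sampler.

        import numpy as np

        def stochastic_regeneration(noisy, predictor, score, n_steps=5, sigma0=0.3):
            x = predictor(noisy)                           # cheap estimate (may distort speech)
            x = x + sigma0 * np.random.randn(*x.shape)     # re-noise the estimate
            for sigma in np.linspace(sigma0, 1e-3, n_steps):
                x = x + sigma ** 2 * score(x, sigma, noisy)  # crude Euler reverse step
            return x

        predictor = lambda y: 0.8 * y                      # stand-in predictive model
        score = lambda x, sigma, y: (y - x) / (sigma ** 2 + 1e-4)  # stand-in score
        y = np.random.randn(16000)                         # 1 s of 16 kHz "audio"
        print(stochastic_regeneration(y, predictor, score).shape)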
    Model Based Co-clustering of Mixed Numerical and Binary Data. (arXiv:2212.11725v1 [cs.LG])
    Co-clustering is a data mining technique used to extract the underlying block structure between the rows and columns of a data matrix. Many approaches have been studied and have shown their capacity to extract such structures in continuous, binary, or contingency tables. However, very little work has been done on co-clustering mixed-type data. In this article, we extend latent block model based co-clustering to the case of mixed data (continuous and binary variables). We then evaluate the effectiveness of the proposed approach on simulated data and discuss its advantages and potential limits.
    Variational Quantum Soft Actor-Critic for Robotic Arm Control. (arXiv:2212.11681v1 [quant-ph])
    Deep Reinforcement Learning is emerging as a promising approach for the continuous control task of robotic arm movement. However, the challenges of learning robust and versatile control capabilities are still far from being resolved for real-world applications, mainly because of two common issues of this learning paradigm: the exploration strategy and the slow learning speed, sometimes known as "the curse of dimensionality". This work aims at exploring and assessing the advantages of applying Quantum Computing to one of the state-of-the-art Reinforcement Learning techniques for continuous control, namely Soft Actor-Critic. Specifically, the performance of a Variational Quantum Soft Actor-Critic on the movement of a virtual robotic arm has been investigated by means of digital simulations of quantum circuits. A quantum advantage over the classical algorithm has been found in terms of a significant decrease in the number of required parameters for satisfactory model training, paving the way for further promising developments.
    GAN-based Domain Inference Attack. (arXiv:2212.11810v1 [cs.LG])
    Model-based attacks can infer training data information from deep neural network models. These attacks heavily depend on the attacker's knowledge of the application domain, e.g., using it to determine the auxiliary data for model-inversion attacks. However, attackers may not know what the model is used for in practice. We propose a generative adversarial network (GAN) based method to explore likely or similar domains of a target model -- the model domain inference (MDI) attack. For a given target (classification) model, we assume that the attacker knows nothing but the input and output formats and can use the model to derive the prediction for any input in the desired form. Our basic idea is to use the target model to affect a GAN training process for a candidate domain's dataset that is easy to obtain. We find that the target model may distract the training procedure less if the domain is more similar to the target domain. We then measure the distraction level with the distance between GAN-generated datasets, which can be used to rank candidate domains for the target model. Our experiments show that the auxiliary dataset from an MDI top-ranked domain can effectively boost the result of model-inversion attacks.  ( 2 min )
    Evaluation for Change. (arXiv:2212.11670v1 [cs.CL])
    Evaluation is the central means for assessing, understanding, and communicating about NLP models. In this position paper, we argue evaluation should be more than that: it is a force for driving change, carrying a sociological and political character beyond its technical dimensions. As a force, evaluation's power arises from its adoption: under our view, evaluation succeeds when it achieves the desired change in the field. Further, by framing evaluation as a force, we consider how it competes with other forces. Under our analysis, we conjecture that the current trajectory of NLP suggests evaluation's power is waning, in spite of its potential for realizing more pluralistic ambitions in the field. We conclude by discussing the legitimacy of this power, who acquires it, and how it is distributed. Ultimately, we hope the research community will more aggressively harness evaluation for change.
    Set-Transformer BeamsNet for AUV Velocity Forecasting in Complete DVL Outage Scenarios. (arXiv:2212.11671v1 [cs.RO])
    Autonomous underwater vehicles (AUVs) are regularly used for deep ocean applications. Commonly, the autonomous navigation task is carried out by a fusion between two sensors: the inertial navigation system and the Doppler velocity log (DVL). The DVL operates by transmitting four acoustic beams to the sea floor; once they are reflected back, the AUV velocity vector can be estimated. However, in real-life scenarios, such as an uneven seabed, sea creatures blocking the DVL's view, and roll/pitch maneuvers, the acoustic beams fail to reflect back, resulting in a scenario known as DVL outage. Consequently, a velocity update is not available to bound the inertial solution drift. To cope with such situations, in this paper we leverage our BeamsNet framework and propose a Set-Transformer-based BeamsNet (ST-BeamsNet) that utilizes inertial data readings and previous DVL velocity measurements to regress the current AUV velocity in case of a complete DVL outage. The proposed approach was evaluated using data from experiments held in the Mediterranean Sea with the Snapir AUV and was compared to a moving average (MA) estimator. Our ST-BeamsNet estimated the AUV velocity vector with an 8.547% speed error, which is 26% better than the MA approach.
    TransPath: Learning Heuristics For Grid-Based Pathfinding via Transformers. (arXiv:2212.11730v1 [cs.AI])
    Heuristic search algorithms, e.g. A*, are the commonly used tools for pathfinding on grids, i.e. graphs of regular structure that are widely employed to represent environments in robotics, video games, etc. Instance-independent heuristics for grid graphs, e.g. Manhattan distance, do not take obstacles into account, and thus the search led by such heuristics performs poorly in obstacle-rich environments. To this end, we suggest learning instance-dependent heuristic proxies that are supposed to notably increase the efficiency of the search. The first heuristic proxy we suggest to learn is the correction factor, i.e. the ratio between the instance-independent cost-to-go estimate and the perfect one (computed offline at the training phase). Unlike learning the absolute values of the cost-to-go heuristic function, which was known before, learning the correction factor utilizes the knowledge of the instance-independent heuristic. The second heuristic proxy is the path probability, which indicates how likely the grid cell is to lie on the shortest path. This heuristic can be utilized in the Focal Search framework as the secondary heuristic, allowing us to preserve the guarantees on the bounded sub-optimality of the solution. We learn both suggested heuristics in a supervised fashion with state-of-the-art neural networks containing attention blocks (transformers). We conduct a thorough empirical evaluation on a comprehensive dataset of planning tasks, showing that the suggested techniques i) reduce the computational effort of A* by up to a factor of $4$x while producing solutions whose costs exceed those of the optimal solutions by less than $0.3$% on average; ii) outperform the competitors, which include conventional techniques from heuristic search, i.e. weighted A*, as well as state-of-the-art learnable planners.
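    To show where a learned correction factor plugs into the search, here is a minimal A* on a 4-connected grid whose heuristic is cf(cell) * Manhattan(cell, goal); the stub cf returning 1.0 recovers plain A*, and a transformer-predicted factor would simply replace it (at the price of possibly losing admissibility). This is an illustrative sketch, not the paper's code.

        import heapq

        def astar(grid, start, goal, cf=lambda cell: 1.0):
            # heuristic = correction factor * Manhattan distance; cf == 1 is plain A*
            h = lambda c: cf(c) * (abs(c[0] - goal[0]) + abs(c[1] - goal[1]))
            frontier, g = [(h(start), start)], {start: 0}
            while frontier:
                _, cur = heapq.heappop(frontier)
                if cur == goal:
                    return g[cur]
                for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    nxt = (cur[0] + dr, cur[1] + dc)
                    if (0 <= nxt[0] < len(grid) and 0 <= nxt[1] < len(grid[0])
                            and grid[nxt[0]][nxt[1]] == 0
                            and g[cur] + 1 < g.get(nxt, float("inf"))):
                        g[nxt] = g[cur] + 1
                        heapq.heappush(frontier, (g[nxt] + h(nxt), nxt))
            return None

        grid = [[0, 0, 0], [1, 1, 0], [0, 0, 0]]           # 1 = obstacle
        print(astar(grid, (0, 0), (2, 0)))                 # -> 6, detouring the wall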
    Word Embedding Neural Networks to Advance Knee Osteoarthritis Research. (arXiv:2212.11933v1 [cs.AI])
    Osteoarthritis (OA) is the most prevalent chronic joint disease worldwide, and knee OA accounts for more than 80% of commonly affected joints. Knee OA is not yet a curable disease, and it affects a large number of patients, making it costly to patients and healthcare systems. The etiology, diagnosis, and treatment of knee OA are complicated by the variability in its clinical and physical manifestations. Although knee OA carries a list of well-known terminology aiming to standardize the nomenclature of the diagnosis, prognosis, treatment, and clinical outcomes of the chronic joint disease, in practice there is a wide range of terminology associated with knee OA across different data sources, including but not limited to biomedical literature, clinical notes, healthcare literacy, and health-related social media. Among these data sources, the scientific articles published in the biomedical literature usually provide a principled pipeline to study the disease. Rapid yet accurate text mining of large-scale scientific literature may discover novel knowledge and terminology to better understand knee OA and to improve the quality of knee OA diagnosis, prevention, and treatment. The present work aims to utilize artificial neural network strategies to automatically extract vocabularies associated with knee OA. Our findings indicate the feasibility of developing word embedding neural networks for autonomous keyword extraction and abstraction of knee OA.  ( 2 min )
    AsyncFLEO: Asynchronous Federated Learning for LEO Satellite Constellations with High-Altitude Platforms. (arXiv:2212.11522v1 [cs.LG])
    Low Earth Orbit (LEO) constellations, each comprising a large number of satellites, have become a new source of big data "from the sky". Downloading such data to a ground station (GS) for big data analytics demands very high bandwidth and involves large propagation delays. Federated Learning (FL) offers a promising solution because it allows data to stay in-situ (never leaving the satellites) and only requires transmitting machine learning model parameters (trained on the satellites' data). However, the conventional, synchronous FL process can take several days to train a single FL model in the context of satellite communication (Satcom), due to a bottleneck caused by straggler satellites. In this paper, we propose an asynchronous FL framework for LEO constellations called AsyncFLEO to improve FL efficiency in Satcom. Not only does AsyncFLEO address the bottleneck (idle waiting) in synchronous FL, but it also solves the issue of model staleness caused by straggler satellites. AsyncFLEO utilizes high-altitude platforms (HAPs) positioned "in the sky" as parameter servers, and consists of three technical components: (1) a ring-of-stars communication topology, (2) a model propagation algorithm, and (3) a model aggregation algorithm with satellite grouping and staleness discounting. Our extensive evaluation with both IID and non-IID data shows that AsyncFLEO outperforms the state of the art by a large margin, cutting down convergence delay by a factor of 22 and increasing accuracy by 40%.
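    A small sketch of the staleness-discounting idea in the aggregation step: updates from satellites that trained on older model versions are down-weighted before being mixed into the global model. The exponential discount, the mixing ratio, and all names are assumptions for illustration, not AsyncFLEO's exact rule.

        import numpy as np

        def aggregate(global_w, updates, beta=0.5, mix=0.5):
            # updates: (weights, staleness) pairs; staleness = number of global
            # model versions elapsed since the satellite fetched the model
            disc = np.array([beta ** s for _, s in updates])
            coeff = disc / disc.sum()
            agg = sum(c * w for c, (w, _) in zip(coeff, updates))
            return (1 - mix) * global_w + mix * agg        # mix into the global model

        g = np.zeros(4)
        ups = [(np.ones(4), 0), (2 * np.ones(4), 3)]       # fresh vs. stale update
        print(aggregate(g, ups))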
    Towards Causal Credit Assignment. (arXiv:2212.11636v1 [cs.LG])
    Adequately assigning credit to actions for future outcomes based on their contributions is a long-standing open challenge in Reinforcement Learning. The assumptions of the most commonly used credit assignment method are disadvantageous in tasks where the effects of decisions are not immediately evident. Furthermore, this method can only evaluate actions that have been selected by the agent, making it highly inefficient. Still, no alternative methods have been widely adopted in the field. Hindsight Credit Assignment is a promising, but still unexplored candidate, which aims to solve the problems of both long-term and counterfactual credit assignment. In this thesis, we empirically investigate Hindsight Credit Assignment to identify its main benefits, and key points to improve. Then, we apply it to factored state representations, and in particular to state representations based on the causal structure of the environment. In this setting, we propose a variant of Hindsight Credit Assignment that effectively exploits a given causal structure. We show that our modification greatly decreases the workload of Hindsight Credit Assignment, making it more efficient and enabling it to outperform the baseline credit assignment method on various tasks. This opens the way to other methods based on given or learned causal structures.
    Understanding and Improving the Role of Projection Head in Self-Supervised Learning. (arXiv:2212.11491v1 [cs.LG])
    Self-supervised learning (SSL) aims to produce useful feature representations without access to any human-labeled data annotations. Due to the success of recent SSL methods based on contrastive learning, such as SimCLR, this problem has gained popularity. Most current contrastive learning approaches append a parametrized projection head to the end of some backbone network to optimize the InfoNCE objective and then discard the learned projection head after training. This raises a fundamental question: Why is a learnable projection head required if we are to discard it after training? In this work, we first perform a systematic study on the behavior of SSL training focusing on the role of the projection head layers. By formulating the projection head as a parametric component for the InfoNCE objective rather than a part of the network, we present an alternative optimization scheme for training contrastive learning based SSL frameworks. Our experimental study on multiple image classification datasets demonstrates the effectiveness of the proposed approach over alternatives in the SSL literature.
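    The setup under study can be sketched in a few lines: a backbone followed by a projection head optimizes the InfoNCE objective on two augmented views, and only the backbone's features are kept for downstream use. The sizes, temperature, and random "views" below are illustrative assumptions.

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        backbone = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 64))
        proj_head = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 16))

        def info_nce(z1, z2, tau=0.5):
            z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
            logits = z1 @ z2.t() / tau                     # pairwise similarities
            labels = torch.arange(z1.size(0))              # positives on the diagonal
            return F.cross_entropy(logits, labels)

        x1, x2 = torch.randn(8, 32), torch.randn(8, 32)    # two "augmented views"
        loss = info_nce(proj_head(backbone(x1)), proj_head(backbone(x2)))
        loss.backward()
        features = backbone(x1).detach()                   # head discarded downstream
        print(loss.item(), tuple(features.shape))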
    Smooth Sailing: Improving Active Learning for Pre-trained Language Models with Representation Smoothness Analysis. (arXiv:2212.11680v1 [cs.LG])
    Developed as a solution to a practical need, active learning (AL) methods aim to reduce label complexity and annotation costs in supervised learning. While recent work has demonstrated the benefit of using AL in combination with large pre-trained language models (PLMs), it has often overlooked the practical challenges that hinder the feasibility of AL in realistic settings. We address these challenges by leveraging representation smoothness analysis to improve the effectiveness of AL. We develop an early stopping technique that does not require a validation set -- often unavailable in realistic AL settings -- and observe significant improvements across multiple datasets and AL methods. Additionally, we find that task adaptation improves AL, whereas standard short fine-tuning in AL does not provide improvements over random sampling. Our work establishes the usefulness of representation smoothness analysis in AL and presents an AL stopping criterion that reduces label complexity.
    Reusable Options through Gradient-based Meta Learning. (arXiv:2212.11726v1 [cs.LG])
    Hierarchical methods in reinforcement learning have the potential to reduce the number of decisions that the agent needs to make when learning new tasks. However, finding reusable, useful temporal abstractions that facilitate fast learning remains a challenging problem. Recently, several deep learning approaches were proposed to learn such temporal abstractions in the form of options in an end-to-end manner. In this work, we point out several shortcomings of these methods and discuss their potential negative consequences. Subsequently, we formulate the desiderata for reusable options and use these to frame the problem of learning options as a gradient-based meta-learning problem. This allows us to formulate an objective that explicitly incentivizes options which allow a higher-level decision maker to adjust to different tasks in few steps. Experimentally, we show that our method is able to learn transferable components which accelerate learning, and that it performs better than existing prior methods developed for this setting. Additionally, we perform ablations to quantify the impact of using gradient-based meta-learning as well as the other proposed changes.
    EuclidNets: An Alternative Operation for Efficient Inference of Deep Learning Models. (arXiv:2212.11803v1 [cs.LG])
    With the advent of deep learning applications on edge devices, researchers actively try to optimize their deployment on low-power and memory-restricted devices. There are established compression methods, such as quantization, pruning, and architecture search, that leverage commodity hardware. Apart from conventional compression algorithms, one may redesign the operations of deep learning models to obtain more efficient implementations. To this end, we propose EuclidNet, a compression method designed to be implemented on hardware, which replaces multiplication, $xw$, with the squared Euclidean distance $(x-w)^2$. We show that EuclidNet is aligned with matrix multiplication and can be used as a measure of similarity in the case of convolutional layers. Furthermore, we show that under various transformations and noise scenarios, EuclidNet exhibits the same performance as deep learning models designed with multiplication operations.  ( 2 min )
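    The core substitution is easy to see numerically: expanding $-(x-w)^2 = 2xw - x^2 - w^2$ shows the Euclidean response differs from the dot product only by input- and weight-norm offsets, which is why it can act as a similarity measure. A toy check, with illustrative shapes:

        import numpy as np

        rng = np.random.default_rng(0)
        x = rng.normal(size=(4, 8))                        # batch of 4, 8 features
        W = rng.normal(size=(8, 3))                        # 3 output units

        dot = x @ W                                        # multiplication-based layer
        euclid = -((x[:, :, None] - W[None, :, :]) ** 2).sum(axis=1)
        # euclid = 2*dot - |x|^2 - |w|^2, so it ranks inputs like a dot product
        check = 2 * dot - (x ** 2).sum(1, keepdims=True) - (W ** 2).sum(0)
        print(np.allclose(euclid, check))                  # True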
    Predicting Companies' ESG Ratings from News Articles Using Multivariate Timeseries Analysis. (arXiv:2212.11765v1 [q-fin.GN])
    Environmental, social and governance (ESG) engagement of companies moved into the focus of public attention over recent years. With the requirements of compulsory reporting being implemented and investors incorporating sustainability in their investment decisions, the demand for transparent and reliable ESG ratings is increasing. However, automatic approaches for forecasting ESG ratings have been quite scarce despite the increasing importance of the topic. In this paper, we build a model to predict ESG ratings from news articles using the combination of multivariate timeseries construction and deep learning techniques. A news dataset for about 3,000 US companies together with their ratings is also created and released for training. Through the experimental evaluation we find out that our approach provides accurate results outperforming the state-of-the-art, and can be used in practice to support a manual determination or analysis of ESG ratings.  ( 2 min )
    Learning to swim efficiently in a nonuniform flow field. (arXiv:2212.11482v1 [physics.flu-dyn])
    Microswimmers can acquire information on the surrounding fluid by sensing mechanical cues. They can then navigate in response to these signals. We analyse this navigation by combining deep reinforcement learning with direct numerical simulations to resolve the hydrodynamics. We study how local and non-local information can be used to train a swimmer to achieve particular swimming tasks in a non-uniform flow field, in particular a zig-zag shear flow. The swimming tasks are (1) learning how to swim in the vorticity direction, (2) in the shear-gradient direction, and (3) in the shear flow direction. We find that access to lab frame information on the swimmer's instantaneous orientation is all that is required to reach the optimal policy for tasks (1) and (2). However, information on both the translational and rotational velocities seems to be required to achieve task (3). Inspired by biological microorganisms, we also consider the case where the swimmers sense local information, i.e. surface hydrodynamic forces, together with a signal direction. This might correspond to gravity or, for micro-organisms with light sensors, a light source. In this case, we show that the swimmer can reach a comparable level of performance to a swimmer with access to lab frame variables. We also analyse the role of different swimming modes, i.e. pusher, puller, and neutral swimmers.
    Local Policy Improvement for Recommender Systems. (arXiv:2212.11431v1 [cs.LG])
    Recommender systems aim to answer the following question: given the items that a user has interacted with, what items will this user likely interact with next? Historically this problem is often framed as a predictive task via (self-)supervised learning. In recent years, we have seen more emphasis placed on approaching the recommendation problem from a policy optimization perspective: learning a policy that maximizes some reward function (e.g., user engagement). However, it is commonly the case in recommender systems that we are only able to train a new policy given data collected from a previously-deployed policy. The conventional way to address such a policy mismatch is through importance sampling correction, which unfortunately comes with its own limitations. In this paper, we suggest an alternative approach, which involves the use of local policy improvement without off-policy correction. Drawing from a number of related results in the fields of causal inference, bandits, and reinforcement learning, we present a suite of methods that compute and optimize a lower bound of the expected reward of the target policy. Crucially, this lower bound is a function that is easy to estimate from data, and which does not involve density ratios (such as those appearing in importance sampling correction). We argue that this local policy improvement paradigm is particularly well suited for recommender systems, given that in practice the previously-deployed policy is typically of reasonably high quality, and furthermore it tends to be re-trained frequently and gets continuously updated. We discuss some practical recipes on how to apply some of the proposed techniques in a sequential recommendation setting.
    A Mathematical Framework for Learning Probability Distributions. (arXiv:2212.11481v1 [stat.ML])
    The modeling of probability distributions, specifically generative modeling and density estimation, has become an immensely popular subject in recent years by virtue of its outstanding performance on sophisticated data such as images and texts. Nevertheless, a theoretical understanding of its success is still incomplete. One mystery is the paradox between memorization and generalization: In theory, the model is trained to be exactly the same as the empirical distribution of the finite samples, whereas in practice, the trained model can generate new samples or estimate the likelihood of unseen samples. Likewise, the overwhelming diversity of distribution learning models calls for a unified perspective on this subject. This paper provides a mathematical framework such that all the well-known models can be derived based on simple principles. To demonstrate its efficacy, we present a survey of our results on the approximation error, training error and generalization error of these models, which can all be established based on this framework. In particular, the aforementioned paradox is resolved by proving that these models enjoy implicit regularization during training, so that the generalization error at early-stopping avoids the curse of dimensionality. Furthermore, we provide some new results on landscape analysis and the mode collapse phenomenon.
    Supervised Anomaly Detection Method Combining Generative Adversarial Networks and Three-Dimensional Data in Vehicle Inspections. (arXiv:2212.11507v1 [cs.CV])
    The external visual inspection of rolling stock's underfloor equipment is currently performed by human inspectors. In this study, we attempt to partly automate visual inspection by investigating anomaly inspection algorithms that use image processing technology. As railroad maintenance studies tend to have little anomaly data, unsupervised learning methods are usually preferred for anomaly detection; however, training cost and accuracy remain a challenge. Moreover, while prior work has created anomalous images from normal images by adding noise, the anomaly targeted in this study, the rotation of piping cocks, is difficult to create using noise. Therefore, we propose a new method that uses style conversion via generative adversarial networks on three-dimensional computer graphics to imitate anomaly images, enabling anomaly detection based on supervised learning. A geometry-consistent style conversion model was used to convert the images, and because of this, the color and texture of the generated images successfully imitated the real images while maintaining the anomalous shape. Using the generated anomaly images as supervised data, the anomaly detection model can be trained easily, without complex adjustments, and successfully detects anomalies.
    Hybrid Quantum-Classical Generative Adversarial Network for High Resolution Image Generation. (arXiv:2212.11614v1 [quant-ph])
    Quantum machine learning (QML) has received increasing attention due to its potential to outperform classical machine learning methods in various problems. A subclass of QML methods is quantum generative adversarial networks (QGANs), which have been studied as a quantum counterpart of the classical GANs widely used in image manipulation and generation tasks. The existing work on QGANs is still limited to small-scale proof-of-concept examples based on images with significant down-scaling. Here we integrate classical and quantum techniques to propose a new hybrid quantum-classical GAN framework. We demonstrate its superior learning capabilities by generating $28 \times 28$ pixel grey-scale images without dimensionality reduction or classical pre/post-processing on multiple classes of the standard MNIST and Fashion MNIST datasets, achieving comparable results to classical frameworks with three orders of magnitude fewer trainable generator parameters. To gain further insight into the workings of our hybrid approach, we systematically explore the impact of its parameter space by varying the number of qubits, the size of image patches, the number of layers in the generator, the shape of the patches, and the choice of prior distribution. Our results show that increasing the quantum generator size generally improves the learning capability of the network. The developed framework provides a foundation for the future design of QGANs with an optimal parameter set tailored for complex image generation tasks.
    Federated Learning -- Methods, Applications and beyond. (arXiv:2212.11729v1 [cs.LG])
    In recent years, the applications of machine learning models have increased rapidly, due to the large amount of available data and technological progress. While some domains, like web analysis, can benefit from this with only minor restrictions, other fields, like medicine with its patient data, are more strongly regulated. In particular, \emph{data privacy} plays an important role, as recently highlighted by the trustworthy AI initiative of the EU and by general privacy regulations in legislation. Another major challenge is that the required training \emph{data} is often \emph{distributed} in terms of features or samples and unavailable for classical batch learning approaches. In 2016, Google came up with a framework called \emph{Federated Learning} to solve both of these problems. We provide a brief overview of existing methods and applications in the fields of vertical and horizontal \emph{Federated Learning}, as well as \emph{Federated Transfer Learning}.
    Commitment with Signaling under Double-sided Information Asymmetry. (arXiv:2212.11446v1 [cs.GT])
    Information asymmetry in games enables players with the information advantage to manipulate others' beliefs by strategically revealing information to other players. This work considers a double-sided information asymmetry in a Bayesian Stackelberg game, where the leader's realized action, sampled from the mixed strategy commitment, is hidden from the follower. In contrast, the follower holds private information about his payoff. Given asymmetric information on both sides, an important question arises: \emph{Does the leader's information advantage outweigh the follower's?} We answer this question affirmatively in this work, where we demonstrate that by adequately designing a signaling device that reveals partial information regarding the leader's realized action to the follower, the leader can achieve a higher expected utility than that without signaling. Moreover, unlike previous works on the Bayesian Stackelberg game where mathematical programming tools are utilized, we interpret the leader's commitment as a probability measure over the belief space. Such a probabilistic language greatly simplifies the analysis and allows an indirect signaling scheme, leading to a geometric characterization of the equilibrium under the proposed game model.
    A topological analysis of cointegrated data: a Z24 Bridge case study. (arXiv:2212.11727v1 [math.AT])
    The paper studies the topological changes, from before and after cointegration, in the natural frequencies of the Z24 Bridge. The second natural frequency is known to be nonlinear in temperature, and this serves as the main focal point of this work. Cointegration is a method of normalising time series data with respect to one another, often applied to strongly-correlated time series. Cointegration is used in this paper to remove the effects of Environmental and Operational Variations, by cointegrating the first four natural frequencies of the Z24 Bridge data. The temperature effects on the natural frequency data are clearly visible within the data, and it is desirable, for the purposes of structural health monitoring, that these effects are removed. The univariate time series are embedded in a higher-dimensional space, such that interesting topologies are formed. Topological data analysis is used to analyse the raw time series and their cointegrated equivalents. A standard topological data analysis pipeline is enacted, where simplicial complexes are constructed from the embedded point clouds. Topological properties are then calculated from the simplicial complexes, such as the persistent homology. The persistent homology is then analysed to determine the topological structure of all the time series.
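    A compressed sketch of such a pipeline, assuming the ripser package as one possible persistent-homology backend and a synthetic sine series standing in for a natural-frequency signal: delay-embed the univariate series into a point cloud, then compute persistence diagrams.

        import numpy as np
        from ripser import ripser                          # assumed TDA backend (pip install ripser)

        def delay_embed(series, dim=3, tau=20):
            # Takens-style embedding: rows (x_t, x_{t+tau}, ..., x_{t+(dim-1)tau})
            n = len(series) - (dim - 1) * tau
            return np.column_stack([series[i * tau:i * tau + n] for i in range(dim)])

        t = np.linspace(0, 6 * np.pi, 400)
        freq = np.sin(t) + 0.05 * np.random.randn(t.size)  # stand-in frequency series
        dgms = ripser(delay_embed(freq), maxdim=1)["dgms"]  # H0 and H1 diagrams
        print(len(dgms[1]))                                # number of loop (H1) features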
    The State of the Art in Enhancing Trust in Machine Learning Models with the Use of Visualizations. (arXiv:2212.11737v1 [cs.LG])
    Machine learning (ML) models are nowadays used in complex applications in various domains, such as medicine, bioinformatics, and other sciences. Due to their black box nature, however, it may sometimes be hard to understand and trust the results they provide. This has increased the demand for reliable visualization tools related to enhancing trust in ML models, which has become a prominent topic of research in the visualization community over the past decades. To provide an overview and present the frontiers of current research on the topic, we present a State-of-the-Art Report (STAR) on enhancing trust in ML models with the use of interactive visualization. We define and describe the background of the topic, introduce a categorization for visualization techniques that aim to accomplish this goal, and discuss insights and opportunities for future research directions. Among our contributions is a categorization of trust against different facets of interactive ML, expanded and improved from previous research. Our results are investigated from different analytical perspectives: (a) providing a statistical overview, (b) summarizing key findings, (c) performing topic analyses, and (d) exploring the data sets used in the individual papers, all with the support of an interactive web-based survey browser. We intend this survey to be beneficial for visualization researchers whose interests involve making ML models more trustworthy, as well as researchers and practitioners from other disciplines in their search for effective visualization techniques suitable for solving their tasks with confidence and conveying meaning to their data.
    Device Selection for the Coexistence of URLLC and Distributed Learning Services. (arXiv:2212.11805v1 [cs.NI])
    Recent advances in distributed artificial intelligence (AI) have led to tremendous breakthroughs in various communication services, from fault-tolerant factory automation to smart cities. When distributed learning is run over a set of wirelessly connected devices, random channel fluctuations and the incumbent services running on the same network impact the performance of both distributed learning and the coexisting service. In this paper, we investigate a mixed service scenario where distributed AI workflow and ultra-reliable low latency communication (URLLC) services run concurrently over a network. Consequently, we propose a risk sensitivity-based formulation for device selection to minimize the AI training delays during its convergence period while ensuring that the operational requirements of the URLLC service are met. To address this challenging coexistence problem, we transform it into a deep reinforcement learning problem and address it via a framework based on soft actor-critic algorithm. We evaluate our solution with a realistic and 3GPP-compliant simulator for factory automation use cases. Our simulation results confirm that our solution can significantly decrease the training delay of the distributed AI service while keeping the URLLC availability above its required threshold and close to the scenario where URLLC solely consumes all network resources.  ( 2 min )
    Synopsis: Sequential Decision Problems with Weak Feedback. (arXiv:2212.11599v1 [cs.LG])
    This thesis considers sequential decision problems, where the loss/reward incurred by selecting an action may not be inferred from observed feedback. A major part of this thesis focuses on the unsupervised sequential selection problem, where one cannot infer the loss incurred for selecting an action from observed feedback. We also introduce a new setup named Censored Semi-Bandits, where the loss incurred for selecting an action can be observed under certain conditions. Finally, we study the channel selection problem in communication networks, where the reward for an action is only observed when no other player selects that action to play in the round. These problems find applications in many fields, such as healthcare, crowd-sourcing, security, and adaptive resource allocation. This thesis aims to address the above-described sequential decision problems by exploiting the specific structures these problems exhibit. We develop provably optimal algorithms for each of these setups with weak feedback and validate their empirical performance on different problem instances derived from synthetic and real datasets.
    Robust Meta-Representation Learning via Global Label Inference and Classification. (arXiv:2212.11702v1 [cs.LG])
    Few-shot learning (FSL) is a central problem in meta-learning, where learners must efficiently learn from few labeled examples. Within FSL, feature pre-training has recently become an increasingly popular strategy to significantly improve generalization performance. However, the contribution of pre-training is often overlooked and understudied, with limited theoretical understanding of its impact on meta-learning performance. Further, pre-training requires a consistent set of global labels shared across training tasks, which may be unavailable in practice. In this work, we address the above issues by first showing the connection between pre-training and meta-learning. We discuss why pre-training yields a more robust meta-representation and connect the theoretical analysis to existing works and empirical results. Secondly, we introduce Meta Label Learning (MeLa), a novel meta-learning algorithm that learns task relations by inferring global labels across tasks. This allows us to exploit pre-training for FSL even when global labels are unavailable or ill-defined. Lastly, we introduce an augmented pre-training procedure that further improves the learned meta-representation. Empirically, MeLa outperforms existing methods across a diverse range of benchmarks, in particular under a more challenging setting where the number of training tasks is limited and labels are task-specific. We also provide an extensive ablation study to highlight its key properties.
    CatlNet: Learning Communication and Coordination Policies from CaTL+ Specifications. (arXiv:2212.11792v1 [cs.LG])
    In this paper, we propose a learning-based framework to simultaneously learn the communication and distributed control policies for a heterogeneous multi-agent system (MAS) under complex mission requirements from Capability Temporal Logic plus (CaTL+) specifications. Both policies are trained, implemented, and deployed using a novel neural network model called CatlNet. Taking advantage of the robustness measure of CaTL+, we train CatlNet centrally to maximize it where network parameters are shared among all agents, allowing CatlNet to scale to large teams easily. CatlNet can then be deployed distributedly. A plan repair algorithm is also introduced to guide CatlNet's training and improve both training efficiency and the overall performance of CatlNet. The CatlNet approach is tested in simulation and results show that, after training, CatlNet can steer the decentralized MAS system online to satisfy a CaTL+ specification with a high success rate.  ( 2 min )
    Goal-Conditioned Q-Learning as Knowledge Distillation. (arXiv:2208.13298v2 [cs.LG] UPDATED)
    Many applications of reinforcement learning can be formalized as goal-conditioned environments, where, in each episode, there is a "goal" that affects the rewards obtained during that episode but does not affect the dynamics. Various techniques have been proposed to improve performance in goal-conditioned environments, such as automatic curriculum generation and goal relabeling. In this work, we explore a connection between off-policy reinforcement learning in goal-conditioned settings and knowledge distillation. In particular: the current Q-value function and the target Q-value estimate are both functions of the goal, and we would like to train the Q-value function to match its target for all goals. We therefore apply Gradient-Based Attention Transfer (Zagoruyko and Komodakis 2017), a knowledge distillation technique, to the Q-function update. We empirically show that this can improve the performance of goal-conditioned off-policy reinforcement learning when the space of goals is high-dimensional. We also show that this technique can be adapted to allow for efficient learning in the case of multiple simultaneous sparse goals, where the agent can attain a reward by achieving any one of a large set of objectives, all specified at test time. Finally, to provide theoretical support, we give examples of classes of environments where (under some assumptions) standard off-policy algorithms such as DDPG require at least O(d^2) replay buffer transitions to learn an optimal policy, while our proposed technique requires only O(d) transitions, where d is the dimensionality of the goal and state space. Code is available at https://github.com/alevine0/ReenGAGE.
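    A hedged sketch of the core update: alongside a TD-style loss, the gradients of the current and target Q-values with respect to the goal are normalized and matched, in the spirit of attention transfer. The network sizes, the stand-in TD target, and the weighting are assumptions, not the paper's exact objective.

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        q_net = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))
        q_tgt = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))

        sa = torch.randn(16, 4)                            # state-action features
        goal = torch.randn(16, 4, requires_grad=True)      # high-dimensional goal
        inp = torch.cat([sa, goal], dim=1)

        q = q_net(inp).squeeze(-1)
        q_t = q_tgt(inp).squeeze(-1)
        g_cur, = torch.autograd.grad(q.sum(), goal, create_graph=True)
        g_tgt, = torch.autograd.grad(q_t.sum(), goal)      # no graph: acts as a target
        at_loss = F.mse_loss(F.normalize(g_cur, dim=1), F.normalize(g_tgt, dim=1))
        td_loss = F.mse_loss(q, q_t.detach())              # stand-in TD objective
        (td_loss + 0.1 * at_loss).backward()
        print(td_loss.item(), at_loss.item())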
    Fixed-budget online adaptive mesh learning for physics-informed neural networks. Towards parameterized problem inference. (arXiv:2212.11776v1 [cs.LG])
    Physics-Informed Neural Networks (PINNs) have gained much attention in various fields of engineering thanks to their capability of incorporating physical laws into the models. PINNs integrate the physical constraints by minimizing the partial differential equation (PDE) residuals on a set of collocation points. The distribution of these collocation points appears to have a huge impact on the performance of PINNs, and the assessment of sampling methods for these points is still an active topic. In this paper, we propose a Fixed-Budget Online Adaptive Mesh Learning (FBOAML) method, which decomposes the domain into sub-domains and selects collocation points based on the local maxima and local minima of the PDE residuals. The stopping criterion is based on a reference data set, which leads to an adaptive number of iterations for each specific problem. The effectiveness of FBOAML is demonstrated in the context of non-parameterized and parameterized problems. The impact of the hyper-parameters in FBOAML is investigated in this work. The comparison with other adaptive sampling methods is also illustrated. The numerical results demonstrate important gains in terms of the accuracy of PINNs with FBOAML over classical PINNs with non-adaptive collocation points. We also apply FBOAML in a complex industrial application involving coupling between mechanical and thermal fields. We show that FBOAML is able to identify the high-gradient locations and even gives better predictions for some physical fields than classical PINNs with collocation points taken on a pre-adapted finite element mesh.
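    The fixed-budget selection step might look like the following sketch: split a 1-D domain into sub-domains, evaluate the PDE residual on candidate points in each, and keep the candidates at local extrema of the residual magnitude, a fixed number per sub-domain. The stand-in residual and all constants are illustrative assumptions.

        import numpy as np

        def select_collocation(residual, n_sub=4, n_cand=100, keep=5):
            chosen = []
            edges = np.linspace(0.0, 1.0, n_sub + 1)
            for a, b in zip(edges[:-1], edges[1:]):
                cand = np.sort(np.random.uniform(a, b, n_cand))
                r = np.abs(residual(cand))
                # local extrema of the residual magnitude inside the sub-domain
                ext = np.r_[False, (r[1:-1] > r[:-2]) & (r[1:-1] > r[2:]), False]
                ranked = cand[ext][np.argsort(-r[ext])]
                chosen.append(ranked[:keep])               # fixed budget per sub-domain
            return np.concatenate(chosen)

        residual = lambda x: np.sin(20 * x) * np.exp(-x)   # stand-in PDE residual
        print(select_collocation(residual).round(3))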
    Novel Deep Learning Framework For Bovine Iris Segmentation. (arXiv:2212.11439v1 [eess.IV])
    Iris segmentation is the initial step in identifying animal biometrics to establish a traceability system for livestock. In this study, we propose a novel deep learning framework for pixel-wise segmentation with minimal use of annotation labels, using the public BovineAAEyes80 dataset. In the experiment, U-Net with a VGG16 backbone was selected as the best combination of encoder and decoder, demonstrating 99.50% accuracy and a 98.35% Dice coefficient score. Remarkably, the selected model accurately segmented corrupted images even without proper annotation data. This study contributes to the advancement of iris segmentation and the development of a reliable DNN training framework.
    Efficient Induction of Language Models Via Probabilistic Concept Formation. (arXiv:2212.11937v1 [cs.CL])
    This paper presents a novel approach to the acquisition of language models from corpora. The framework builds on Cobweb, an early system for constructing taxonomic hierarchies of probabilistic concepts that used a tabular, attribute-value encoding of training cases and concepts, making it unsuitable for sequential input like language. In response, we explore three new extensions to Cobweb -- the Word, Leaf, and Path variants. These systems encode each training case as an anchor word and surrounding context words, and they store probabilistic descriptions of concepts as distributions over anchor and context information. As in the original Cobweb, a performance element sorts a new instance downward through the hierarchy and uses the final node to predict missing features. Learning is interleaved with performance, updating concept probabilities and hierarchy structure as classification occurs. Thus, the new approaches process training cases in an incremental, online manner that is very different from most methods for statistical language learning. We examine how well the three variants place synonyms together and keep homonyms apart, their ability to recall synonyms as a function of training set size, and their training efficiency. Finally, we discuss related work on incremental learning and directions for further research.  ( 2 min )
    The Quantum Path Kernel: a Generalized Quantum Neural Tangent Kernel for Deep Quantum Machine Learning. (arXiv:2212.11826v1 [quant-ph])
    Building a quantum analog of classical deep neural networks represents a fundamental challenge in quantum computing. A key issue is how to address the inherent non-linearity of classical deep learning, which is problematic in the quantum domain because the composition of an arbitrary number of quantum gates, consisting of a series of sequential unitary transformations, is intrinsically linear. This problem has been variously approached in the literature, principally via the introduction of measurements between layers of unitary transformations. In this paper, we introduce the Quantum Path Kernel, a formulation of quantum machine learning capable of replicating those aspects of deep machine learning typically associated with superior generalization performance in the classical domain, specifically, hierarchical feature learning. Our approach generalizes the notion of the Quantum Neural Tangent Kernel, which has been used to study the dynamics of classical and quantum machine learning models. The Quantum Path Kernel exploits the parameter trajectory, i.e., the curve delineated by model parameters as they evolve during training, enabling the representation of differential layer-wise convergence behaviors, or the formation of hierarchical parametric dependencies, in terms of their manifestation in the gradient space of the predictor function. We evaluate our approach on variants of the classification of Gaussian XOR mixtures - an artificial but emblematic problem that intrinsically requires multilevel learning in order to achieve optimal class separation.  ( 2 min )
    Decoding surface codes with deep reinforcement learning and probabilistic policy reuse. (arXiv:2212.11890v1 [quant-ph])
    Quantum computing (QC) promises significant advantages on certain hard computational tasks over classical computers. However, current quantum hardware, also known as noisy intermediate-scale quantum (NISQ) computers, is still unable to carry out computations faithfully, mainly because of the lack of quantum error correction (QEC) capability. A significant body of theoretical work has provided various types of QEC codes; one of the notable topological codes is the surface code, whose features, such as the requirement of only nearest-neighboring two-qubit control gates and a large error threshold, make it a leading candidate for scalable quantum computation. Recent machine learning (ML)-based techniques, especially reinforcement learning (RL) methods, have been applied to the decoding problem and have already made certain progress. Nevertheless, the device noise pattern may change over time, making trained decoder models ineffective. In this paper, we propose a continual reinforcement learning method to address these decoding challenges. Specifically, we implement a double deep Q-learning with probabilistic policy reuse (DDQN-PPR) model to learn surface code decoding strategies for quantum environments with varying noise patterns. Through numerical simulations, we show that the proposed DDQN-PPR model can significantly reduce the computational complexity. Moreover, increasing the number of trained policies can further improve the agent's performance. Our results open a way to build more capable RL agents which can leverage previously gained knowledge to tackle QEC challenges.  ( 2 min )
    CSI: Contrastive Data Stratification for Interaction Prediction and its Application to Compound-Protein Interaction Prediction. (arXiv:2111.09467v2 [cs.LG] UPDATED)
    Accurately predicting the likelihood of interaction between two objects (compound-protein sequence, user-item, author-paper, etc.) is a fundamental problem in Computer Science. Current deep-learning models rely on learning accurate representations of the interacting objects. Importantly, relationships between the interacting objects, or features of the interaction, offer an opportunity to partition the data to create multi-views of the interacting objects. The resulting congruent and non-congruent views can then be exploited via contrastive learning techniques to learn enhanced representations of the objects.  ( 2 min )
    Scalable Adaptive Computation for Iterative Generation. (arXiv:2212.11972v1 [cs.LG])
    We present the Recurrent Interface Network (RIN), a neural net architecture that allocates computation adaptively to the input according to the distribution of information, allowing it to scale to iterative generation of high-dimensional data. Hidden units of RINs are partitioned into the interface, which is locally connected to inputs, and latents, which are decoupled from inputs and can exchange information globally. The RIN block selectively reads from the interface into latents for high-capacity processing, with incremental updates written back to the interface. Stacking multiple blocks enables effective routing across local and global levels. While routing adds overhead, the cost can be amortized in recurrent computation settings where inputs change gradually while more global context persists, such as iterative generation using diffusion models. To this end, we propose a latent self-conditioning technique that "warm-starts" the latents at each iteration of the generation process. When applied to diffusion models operating directly on pixels, RINs yield state-of-the-art image and video generation without cascades or guidance, while being domain-agnostic and up to 10$\times$ more efficient compared to specialized 2D and 3D U-Nets.  ( 2 min )
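    The read-process-write pattern can be sketched with standard attention primitives; the dimensions, depth, and residual structure below are assumptions for illustration rather than the paper's exact architecture:

        import torch.nn as nn

        class RINBlock(nn.Module):
            def __init__(self, dim, n_heads=8):
                super().__init__()
                self.read = nn.MultiheadAttention(dim, n_heads, batch_first=True)
                self.process = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
                self.write = nn.MultiheadAttention(dim, n_heads, batch_first=True)

            def forward(self, interface, latents):
                # Read: latents attend to the locally connected interface tokens.
                latents = latents + self.read(latents, interface, interface)[0]
                # Process: high-capacity global computation on the latents only.
                latents = self.process(latents)
                # Write: interface tokens receive incremental updates from the latents.
                interface = interface + self.write(interface, latents, latents)[0]
                return interface, latents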
    Normalized Contrastive Learning for Text-Video Retrieval. (arXiv:2212.11790v1 [cs.IR])
    Cross-modal contrastive learning has led the recent advances in multimodal retrieval with its simplicity and effectiveness. In this work, however, we reveal that cross-modal contrastive learning suffers from incorrect normalization of the sum retrieval probabilities of each text or video instance. Specifically, we show that many test instances are either over- or under-represented during retrieval, significantly hurting the retrieval performance. To address this problem, we propose Normalized Contrastive Learning (NCL) which utilizes the Sinkhorn-Knopp algorithm to compute the instance-wise biases that properly normalize the sum retrieval probabilities of each instance so that every text and video instance is fairly represented during cross-modal retrieval. Empirical study shows that NCL brings consistent and significant gains in text-video retrieval on different model architectures, with new state-of-the-art multimodal retrieval metrics on the ActivityNet, MSVD, and MSR-VTT datasets without any architecture engineering.  ( 2 min )
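    As a sketch of the normalization step, Sinkhorn-Knopp iterations can rescale a text-video similarity matrix so that each instance's retrieval probabilities carry comparable mass; the temperature, iteration count, and the way the resulting scalings serve as instance-wise biases are assumptions for illustration:

        import torch

        def sinkhorn_knopp(sim, n_iters=50, eps=0.05):
            # Kernel from similarities; eps acts as a temperature.
            K = torch.exp(sim / eps)                 # (n_text, n_video)
            u = torch.ones(K.size(0), device=sim.device)
            v = torch.ones(K.size(1), device=sim.device)
            for _ in range(n_iters):
                u = 1.0 / (K @ v)                    # rescale rows (text instances)
                v = 1.0 / (K.t() @ u)                # rescale columns (video instances)
            # Normalized matching matrix; log(u), log(v) act as instance-wise biases.
            return u[:, None] * K * v[None, :]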
    A machine learning framework for neighbor generation in metaheuristic search. (arXiv:2212.11451v1 [math.OC])
    This paper presents a methodology for integrating machine learning techniques into metaheuristics for solving combinatorial optimization problems. Namely, we propose a general machine learning framework for neighbor generation in metaheuristic search. We first define an efficient neighborhood structure constructed by applying a transformation to a selected subset of variables from the current solution. Then, the key to the proposed methodology is to generate promising neighbors by selecting a proper subset of variables that yields a descent of the objective in the solution space. To learn a good variable selection strategy, we formulate the problem as a classification task that exploits structural information from the characteristics of the problem and from high-quality solutions. We validate our methodology on two metaheuristic applications: a Tabu Search scheme for solving a Wireless Network Optimization problem and a Large Neighborhood Search heuristic for solving Mixed-Integer Programs. The experimental results show that our approach is able to achieve a satisfactory trade-off between the exploration of a larger solution space and the exploitation of high-quality solution regions on both applications.
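    A toy sketch of the learned variable-selection step, assuming per-variable feature vectors and a binary classifier trained on labels derived from high-quality solutions; the classifier choice and the top-k rule are illustrative assumptions:

        import numpy as np
        from sklearn.ensemble import RandomForestClassifier

        def train_selector(var_features, improving_labels):
            # Labels mark variables whose transformation led to an improving neighbor.
            return RandomForestClassifier(n_estimators=200).fit(var_features, improving_labels)

        def select_variables(clf, var_features, k):
            # Score each variable and keep the k most promising ones to transform.
            scores = clf.predict_proba(var_features)[:, 1]
            return np.argsort(scores)[-k:]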
    Mind the Retrosynthesis Gap: Bridging the divide between Single-step and Multi-step Retrosynthesis Prediction. (arXiv:2212.11809v1 [physics.chem-ph])
    Retrosynthesis is the task of breaking down a chemical compound recursively step-by-step into molecular precursors until a set of commercially available molecules is found. Consequently, the goal is to provide a valid synthesis route for a molecule. As more single-step models are developed, we see increasing accuracy in the prediction of molecular disconnections, potentially improving the creation of synthetic paths. Multi-step approaches repeatedly apply the chemical information stored in single-step retrosynthesis models. However, this connection is not reflected in contemporary research, which fixes either the single-step model or the multi-step algorithm. In this work, we establish a bridge between both tasks by benchmarking the performance and transfer of different single-step retrosynthesis models to the multi-step domain by leveraging two common search algorithms, Monte Carlo Tree Search and Retro*. We show that models designed for single-step retrosynthesis, when extended to multi-step, can have a tremendous impact on the route-finding capabilities of current multi-step methods, improving performance by up to +30% compared to the most widely used model. Furthermore, we observe no clear link between contemporary single-step and multi-step evaluation metrics, showing that single-step models need to be developed and tested for the multi-step domain and not as an isolated task to find synthesis routes for molecules of interest.  ( 2 min )
    Unlocking the potential of two-point cells for energy-efficient and resilient training of deep nets. (arXiv:2211.01950v3 [cs.NE] UPDATED)
    Context-sensitive two-point layer 5 pyramidal cells (L5PCs) were discovered as long ago as 1999. However, the potential of this discovery to provide useful neural computation has yet to be demonstrated. Here we show for the first time how a transformative L5PCs-driven deep neural network (DNN), termed the multisensory cooperative computing (MCC) architecture, can effectively process large amounts of heterogeneous real-world audio-visual (AV) data, using far less energy compared to the best available 'point' neuron-driven DNNs. A novel highly-distributed parallel implementation on a Xilinx UltraScale+ MPSoC device estimates energy savings up to 245759 $\times$ 50000 $\mu$J (i.e., 62% less than the baseline model in a semi-supervised learning setup), where a single synapse consumes $8e^{-5}\mu$J. In a supervised learning setup, the energy savings can potentially reach up to 1250x less (per feedforward transmission) than the baseline model. The significantly reduced neural activity in MCC leads to inherently fast learning and resilience against sudden neural damage. This remarkable performance in pilot experiments demonstrates the embodied neuromorphic intelligence of our proposed cooperative L5PCs, which receive input from diverse neighbouring neurons as context and thereby amplify only the most salient and relevant information for onward transmission, out of the overwhelmingly large multimodal information utilised at the early stages of on-chip training. Our proposed approach opens new cross-disciplinary avenues for future on-chip DNN training implementations and posits a radical shift in current neuromorphic computing paradigms.
    Scalable Multi-Agent Reinforcement Learning for Warehouse Logistics with Robotic and Human Co-Workers. (arXiv:2212.11498v1 [cs.LG])
    This project leverages advances in multi-agent reinforcement learning (MARL) to improve the efficiency and flexibility of order-picking systems for commercial warehouses. We envision a warehouse of the future in which dozens of mobile robots and human pickers work together to collect and deliver items within the warehouse. The fundamental problem we tackle, called the order-picking problem, is how these worker agents must coordinate their movement and actions in the warehouse to maximise performance (e.g. order throughput) under given resource constraints. Established industry methods using heuristic approaches require large engineering efforts to optimise for innately variable warehouse configurations. In contrast, the MARL framework can be flexibly applied to any warehouse configuration (e.g. size, layout, number/types of workers, item replenishment frequency) and the agents learn via a process of trial-and-error how to optimally cooperate with one another. This paper details the current status of the R&D effort initiated by Dematic and the University of Edinburgh towards a general-purpose and scalable MARL solution for the order-picking problem in realistic warehouses.
    Is Out-of-Distribution Detection Learnable?. (arXiv:2210.14707v2 [cs.LG] UPDATED)
    Supervised learning aims to train a classifier under the assumption that training and test data are from the same distribution. To ease the above assumption, researchers have studied a more realistic setting: out-of-distribution (OOD) detection, where test data may come from classes that are unknown during training (i.e., OOD data). Due to the unavailability and diversity of OOD data, good generalization ability is crucial for effective OOD detection algorithms. To study the generalization of OOD detection, in this paper, we investigate the probably approximately correct (PAC) learning theory of OOD detection, which was proposed by researchers as an open problem. First, we find a necessary condition for the learnability of OOD detection. Then, using this condition, we prove several impossibility theorems for the learnability of OOD detection under some scenarios. Although the impossibility theorems are frustrating, we find that some conditions of these impossibility theorems may not hold in some practical scenarios. Based on this observation, we next give several necessary and sufficient conditions to characterize the learnability of OOD detection in some practical scenarios. Lastly, we also offer theoretical support for several representative OOD detection works based on our OOD theory.
    Theoretically Motivated Data Augmentation and Regularization for Portfolio Construction. (arXiv:2106.04114v3 [cs.LG] UPDATED)
    The task we consider is portfolio construction in a speculative market, a fundamental problem in modern finance. While various empirical works now exist to explore deep learning in finance, the theory side is almost non-existent. In this work, we focus on developing a theoretical framework for understanding the use of data augmentation for deep-learning-based approaches to quantitative finance. The proposed theory clarifies the role and necessity of data augmentation for finance; moreover, our theory implies that a simple algorithm of injecting random noise of strength $\sqrt{|r_{t-1}|}$ into the observed return $r_{t}$ is better than not injecting any noise and a few other financially irrelevant data augmentation techniques.
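    The proposed augmentation is simple enough to state in a few lines; the noise distribution and the scaling constant below are assumptions for illustration:

        import numpy as np

        def augment_returns(returns, sigma=1.0, rng=None):
            # Perturb each return r_t with Gaussian noise of strength sqrt(|r_{t-1}|).
            rng = rng or np.random.default_rng()
            r = np.asarray(returns, dtype=float)
            scale = np.sqrt(np.abs(np.roll(r, 1)))
            scale[0] = 0.0  # the first return has no predecessor
            return r + sigma * scale * rng.standard_normal(r.shape)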
    Towards Futuristic Autonomous Experimentation--A Surprise-Reacting Sequential Experiment Policy. (arXiv:2112.00600v2 [cs.LG] UPDATED)
    An autonomous experimentation platform in manufacturing is supposedly capable of conducting a sequential search for finding suitable manufacturing conditions for advanced materials by itself, or even of discovering new materials with minimal human intervention. The core of the intelligent control of such platforms is the policy directing sequential experiments, namely, deciding where to conduct the next experiment based on what has been done thus far. Such a policy inevitably trades off exploitation versus exploration, and the current practice is under the Bayesian optimization framework using the expected improvement criterion or its variants. We discuss whether it is beneficial to trade off exploitation versus exploration by measuring the element and degree of surprise associated with the immediate past observation. We devise a surprise-reacting policy using two existing surprise metrics, known as the Shannon surprise and the Bayesian surprise. Our analysis shows that the surprise-reacting policy appears to be better suited for quickly characterizing the overall landscape of a response surface or a design space under resource constraints. We argue that such capability is much needed for futuristic autonomous experimentation platforms. We do not claim to have a fully autonomous experimentation platform, but believe that our current effort sheds new light and provides a different viewpoint as researchers race to elevate the autonomy of various primitive autonomous experimentation systems.
    Black-box Error Diagnosis in Deep Neural Networks for Computer Vision: a Survey of Tools. (arXiv:2201.06444v4 [cs.LG] UPDATED)
    The application of Deep Neural Networks (DNNs) to a broad variety of tasks demands methods for coping with the complex and opaque nature of these architectures. When a gold standard is available, performance assessment treats the DNN as a black box and computes standard metrics based on the comparison of the predictions with the ground truth. A deeper understanding of performance requires going beyond such evaluation metrics to diagnose the model behavior and the prediction errors. This goal can be pursued in two complementary ways. On one hand, model interpretation techniques "open the box" and assess the relationship between the input, the inner layers, and the output, so as to identify the architecture modules most likely to cause the performance loss. On the other hand, black-box error diagnosis techniques study the correlation between the model response and some properties of the input not used for training, so as to identify the features of the inputs that make the model fail. Both approaches give hints on how to improve the architecture and/or the training process. This paper focuses on the application of DNNs to Computer Vision (CV) tasks and presents a survey of the tools that support the black-box performance diagnosis paradigm. It illustrates the features and gaps of the current proposals, discusses the relevant research directions and provides a brief overview of the diagnosis tools in sectors other than CV.
    Dataset of Pathloss and ToA Radio Maps With Localization Application. (arXiv:2212.11777v1 [cs.NI])
    In this article, we present a collection of radio map datasets in a dense urban setting, which we generated and made publicly available. The datasets include simulated pathloss/received signal strength (RSS) and time of arrival (ToA) radio maps over a large collection of realistic dense urban settings in real city maps. The two main applications of the presented dataset are 1) learning methods that predict the pathloss from input city maps (namely, deep learning-based simulations), and 2) wireless localization. The fact that the RSS and ToA maps are computed by the same simulations over the same city maps allows for a fair comparison of RSS- and ToA-based localization methods.  ( 2 min )
    A3GC-IP: Attention-Oriented Adjacency Adaptive Recurrent Graph Convolutions for Human Pose Estimation from Sparse Inertial Measurements. (arXiv:2107.11214v4 [cs.CV] UPDATED)
    Conventional methods for human pose estimation either require a high degree of instrumentation, by relying on many inertial measurement units (IMUs), or constrain the recording space, by relying on extrinsic cameras. These deficits are tackled through the approach of human pose estimation from sparse IMU data. We define attention-oriented adjacency adaptive graph convolutional long-short term memory networks (A3GC-LSTM) to tackle human pose estimation based on six IMUs, by incorporating the human body graph structure directly into the network. The A3GC-LSTM combines both spatial and temporal dependency in a single network operation, in a more memory-efficient way than previous approaches. The recurrent graph learning on arbitrarily long sequences is made possible by equipping graph convolutions with adjacency adaptivity, which eliminates the problem of information loss in deep or recurrent graph networks, while it also allows for learning unknown dependencies between the human body joints. To further boost accuracy, a spatial attention formalism is incorporated into the recurrent LSTM cell. With our presented approach, we are able to utilize the inherent graph nature of the human body, and thus can outperform the state of the art for human pose estimation from sparse IMU data.
    MetaFormer Baselines for Vision. (arXiv:2210.13452v2 [cs.CV] UPDATED)
    MetaFormer, the abstracted architecture of Transformer, has been found to play a significant role in achieving competitive performance. In this paper, we further explore the capacity of MetaFormer, again without focusing on token mixer design: we introduce several baseline models under MetaFormer using the most basic or common mixers, and summarize our observations as follows: (1) MetaFormer ensures a solid lower bound of performance. By merely adopting identity mapping as the token mixer, the MetaFormer model, termed IdentityFormer, achieves >80% accuracy on ImageNet-1K. (2) MetaFormer works well with arbitrary token mixers. When specifying the token mixer as even a random matrix to mix tokens, the resulting model RandFormer yields an accuracy of >81%, outperforming IdentityFormer. Rest assured of MetaFormer's results when new token mixers are adopted. (3) MetaFormer effortlessly offers state-of-the-art results. With just conventional token mixers dating back five years, the models instantiated from MetaFormer already beat the state of the art. (a) ConvFormer outperforms ConvNeXt. Taking the common depthwise separable convolutions as the token mixer, the model termed ConvFormer, which can be regarded as a pure CNN, outperforms the strong CNN model ConvNeXt. (b) CAFormer sets a new record on ImageNet-1K. By simply applying depthwise separable convolutions as the token mixer in the bottom stages and vanilla self-attention in the top stages, the resulting model CAFormer sets a new record on ImageNet-1K: it achieves an accuracy of 85.5% at 224x224 resolution, under normal supervised training without external data or distillation. In our expedition to probe MetaFormer, we also find that a new activation, StarReLU, reduces activation FLOPs by 71% compared with GELU yet achieves better performance. We expect StarReLU to find great potential in MetaFormer-like models alongside other neural networks.
    Compositional generalization in semantic parsing with pretrained transformers. (arXiv:2109.15101v3 [cs.CL] UPDATED)
    Large-scale pretraining instills large amounts of knowledge in deep neural networks. This, in turn, improves the generalization behavior of these models in downstream tasks. What exactly are the limits to the generalization benefits of large-scale pretraining? Here, we report observations from some simple experiments aimed at addressing this question in the context of two semantic parsing tasks involving natural language, SCAN and COGS. We show that language models pretrained exclusively with non-English corpora, or even with programming language corpora, significantly improve out-of-distribution generalization in these benchmarks, compared with models trained from scratch, even though both benchmarks are English-based. This demonstrates the surprisingly broad transferability of pretrained representations and knowledge. Pretraining with a large-scale protein sequence prediction task, on the other hand, mostly deteriorates the generalization performance in SCAN and COGS, suggesting that pretrained representations do not transfer universally and that there are constraints on the similarity between the pretraining and downstream domains for successful transfer. Finally, we show that larger models are harder to train from scratch and their generalization accuracy is lower when trained up to convergence on the relatively small SCAN and COGS datasets, but the benefits of large-scale pretraining become much clearer with larger models.
    MissDAG: Causal Discovery in the Presence of Missing Data with Continuous Additive Noise Models. (arXiv:2205.13869v2 [cs.LG] UPDATED)
    State-of-the-art causal discovery methods usually assume that the observational data is complete. However, the missing data problem is pervasive in many practical scenarios such as clinical trials, economics, and biology. One straightforward way to address the missing data problem is first to impute the data using off-the-shelf imputation methods and then apply existing causal discovery methods. However, such a two-step method may suffer from suboptimality, as the imputation algorithm may introduce bias for modeling the underlying data distribution. In this paper, we develop a general method, which we call MissDAG, to perform causal discovery from data with incomplete observations. Focusing mainly on the assumptions of ignorable missingness and the identifiable additive noise models (ANMs), MissDAG maximizes the expected likelihood of the visible part of observations under the expectation-maximization (EM) framework. In the E-step, in cases where computing the posterior distributions of parameters in closed-form is not feasible, Monte Carlo EM is leveraged to approximate the likelihood. In the M-step, MissDAG leverages the density transformation to model the noise distributions with simpler and specific formulations by virtue of the ANMs and uses a likelihood-based causal discovery algorithm with directed acyclic graph constraint. We demonstrate the flexibility of MissDAG for incorporating various causal discovery algorithms and its efficacy through extensive simulations and real data experiments.
    Sparsely-gated MoE Layers for CNN Interpretability. (arXiv:2204.10598v2 [cs.CV] UPDATED)
    Sparsely-gated Mixture of Expert (MoE) layers have been recently successfully applied for scaling large transformers, especially for language modeling tasks. An intriguing side effect of sparse MoE layers is that they convey inherent interpretability to a model via natural expert specialization. In this work, we apply sparse MoE layers to CNNs for computer vision tasks and analyze the resulting effect on model interpretability. To stabilize MoE training, we present both soft and hard constraint-based approaches. With hard constraints, the weights of certain experts are allowed to become zero, while soft constraints balance the contribution of experts with an additional auxiliary loss. As a result, soft constraints handle expert utilization better and support the expert specialization process, while hard constraints maintain more generalized experts and increase overall model performance. Our findings demonstrate that experts can implicitly focus on individual sub-domains of the input space. For example, experts trained for CIFAR-100 image classification specialize in recognizing different domains such as flowers or animals without previous data clustering. Experiments with RetinaNet and the COCO dataset further indicate that object detection experts can also specialize in detecting objects of distinct sizes.
    Many-Objective Reinforcement Learning for Online Testing of DNN-Enabled Systems. (arXiv:2210.15432v2 [cs.LG] UPDATED)
    Deep Neural Networks (DNNs) have been widely used to perform real-world tasks in cyber-physical systems such as Autonomous Driving Systems (ADS). Ensuring the correct behavior of such DNN-Enabled Systems (DES) is a crucial topic. Online testing is one of the promising modes for testing such systems with their application environments (simulated or real) in a closed loop taking into account the continuous interaction between the systems and their environments. However, the environmental variables (e.g., lighting conditions) that might change during the systems' operation in the real world, causing the DES to violate requirements (safety, functional), are often kept constant during the execution of an online test scenario due to the two major challenges: (1) the space of all possible scenarios to explore would become even larger if they changed and (2) there are typically many requirements to test simultaneously. In this paper, we present MORLOT (Many-Objective Reinforcement Learning for Online Testing), a novel online testing approach to address these challenges by combining Reinforcement Learning (RL) and many-objective search. MORLOT leverages RL to incrementally generate sequences of environmental changes while relying on many-objective search to determine the changes so that they are more likely to achieve any of the uncovered objectives. We empirically evaluate MORLOT using CARLA, a high-fidelity simulator widely used for autonomous driving research, integrated with Transfuser, a DNN-enabled ADS for end-to-end driving. The evaluation results show that MORLOT is significantly more effective and efficient than alternatives with a large effect size. In other words, MORLOT is a good option to test DES with dynamically changing environments while accounting for multiple safety requirements.
    Adversarial Machine Learning and Defense Game for NextG Signal Classification with Deep Learning. (arXiv:2212.11778v1 [cs.NI])
    This paper presents a game-theoretic framework to study the interactions of attack and defense for deep learning-based NextG signal classification. NextG systems such as the one envisioned for a massive number of IoT devices can employ deep neural networks (DNNs) for various tasks such as user equipment identification, physical layer authentication, and detection of incumbent users (such as in the Citizens Broadband Radio Service (CBRS) band). By training another DNN as the surrogate model, an adversary can launch an inference (exploratory) attack to learn the behavior of the victim model, predict successful operation modes (e.g., channel access), and jam them. A defense mechanism can increase the adversary's uncertainty by introducing controlled errors in the victim model's decisions (i.e., poisoning the adversary's training data). This defense is effective against an attack but reduces the performance when there is no attack. The interactions between the defender and the adversary are formulated as a non-cooperative game, where the defender selects the probability of defending or the defense level itself (i.e., the ratio of falsified decisions) and the adversary selects the probability of attacking. The defender's objective is to maximize its reward (e.g., throughput or transmission success ratio), whereas the adversary's objective is to minimize this reward and its attack cost. The Nash equilibrium strategies are determined as operation modes such that no player can unilaterally improve its utility given the other's strategy is fixed. A fictitious play is formulated for each player to play the game repeatedly in response to the empirical frequency of the opponent's actions. The performance in Nash equilibrium is compared to the fixed attack and defense cases, and the resilience of NextG signal classification against attacks is quantified.  ( 2 min )
    SlimFL: Federated Learning with Superposition Coding over Slimmable Neural Networks. (arXiv:2203.14094v3 [cs.LG] UPDATED)
    Federated learning (FL) is a key enabler for efficient communication and computing, leveraging devices' distributed computing capabilities. However, applying FL in practice is challenging due to the local devices' heterogeneous energy, wireless channel conditions, and non-independently and identically distributed (non-IID) data distributions. To cope with these issues, this paper proposes a novel learning framework by integrating FL and width-adjustable slimmable neural networks (SNN). Integrating FL with SNNs is challenging due to time-varying channel conditions and data distributions. In addition, existing multi-width SNN training algorithms are sensitive to the data distributions across devices, which makes SNN ill-suited for FL. Motivated by this, we propose a communication and energy-efficient SNN-based FL (named SlimFL) that jointly utilizes superposition coding (SC) for global model aggregation and superposition training (ST) for updating local models. By applying SC, SlimFL exchanges the superposition of multiple-width configurations decoded as many times as possible for a given communication throughput. Leveraging ST, SlimFL aligns the forward propagation of different width configurations while avoiding inter-width interference during backpropagation. We formally prove the convergence of SlimFL. The result reveals that SlimFL is not only communication-efficient but also deals with non-IID data distributions and poor channel conditions, which is also corroborated by data-intensive simulations.
    Fast expansion into harmonics on the disk: a steerable basis with fast radial convolutions. (arXiv:2207.13674v2 [math.NA] UPDATED)
    We present a fast and numerically accurate method for expanding digitized $L \times L$ images representing functions on $[-1,1]^2$ supported on the disk $\{x \in \mathbb{R}^2 : |x|<1\}$ in the harmonics (Dirichlet Laplacian eigenfunctions) on the disk. Our method, which we refer to as the Fast Disk Harmonics Transform (FDHT), runs in $O(L^2 \log L)$ operations. This basis is also known as the Fourier-Bessel basis, and it has several computational advantages: it is orthogonal, ordered by frequency, and steerable in the sense that images expanded in the basis can be rotated by applying a diagonal transform to the coefficients. Moreover, we show that convolution with radial functions can also be efficiently computed by applying a diagonal transform to the coefficients.
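    Steerability means that rotating the image acts as a diagonal map on the expansion coefficients: the coefficient attached to angular frequency m is simply multiplied by exp(i m theta). A one-line sketch (the array layout is an assumption for illustration):

        import numpy as np

        def rotate_coefficients(coeffs, angular_freqs, theta):
            # coeffs[j] multiplies the harmonic with angular frequency angular_freqs[j];
            # rotating the underlying image by theta is this diagonal transform.
            return coeffs * np.exp(1j * np.asarray(angular_freqs) * theta)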
    Truncated Matrix Power Iteration for Differentiable DAG Learning. (arXiv:2208.14571v2 [cs.LG] UPDATED)
    Recovering underlying Directed Acyclic Graph (DAG) structures from observational data is highly challenging due to the combinatorial nature of the DAG-constrained optimization problem. Recently, DAG learning has been cast as a continuous optimization problem by characterizing the DAG constraint as a smooth equality constraint, generally based on polynomials over adjacency matrices. Existing methods place very small coefficients on high-order polynomial terms for stabilization, arguing that large coefficients on the higher-order terms are harmful due to numerical explosion. On the contrary, we discover that large coefficients on higher-order terms are beneficial for DAG learning when the spectral radiuses of the adjacency matrices are small, and that larger coefficients for higher-order terms can approximate the DAG constraints much better than their small counterparts. Based on this, we propose a novel DAG learning method with efficient truncated matrix power iteration to approximate geometric-series-based DAG constraints. Empirically, our DAG learning method outperforms previous state-of-the-art methods in various settings, often by a factor of $3$ or more in terms of structural Hamming distance.
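    A sketch of a truncated geometric-series DAG penalty, where the acyclicity constraint is approximated by the trace of a truncated power series of the elementwise-squared adjacency matrix; the truncation order and coefficients are illustrative assumptions, not the paper's exact formulation:

        import numpy as np

        def dag_penalty(A, K=10, c=1.0):
            # h(A) ~ tr(sum_{i=1}^{K} c^i M^i), with M = A*A elementwise (nonnegative).
            M = A * A
            P = np.eye(A.shape[0])
            h = 0.0
            for i in range(1, K + 1):
                P = P @ M                      # P = M^i counts weighted length-i cycles
                h += (c ** i) * np.trace(P)
            return h                           # zero iff no cycles survive the truncation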
    Impossibility Theorems for Feature Attribution. (arXiv:2212.11870v1 [cs.LG])
    Despite a sea of interpretability methods that can produce plausible explanations, the field has also empirically seen many failure cases of such methods. In light of these results, it remains unclear for practitioners how to use these methods and choose between them in a principled way. In this paper, we show that for even moderately rich model classes (easily satisfied by neural networks), any feature attribution method that is complete and linear--for example, Integrated Gradients and SHAP--can provably fail to improve on random guessing for inferring model behaviour. Our results apply to common end-tasks such as identifying local model behaviour, spurious feature identification, and algorithmic recourse. One takeaway from our work is the importance of concretely defining end-tasks. In particular, we show that once such an end-task is defined, a simple and direct approach of repeated model evaluations can outperform many other complex feature attribution methods.  ( 2 min )
    Convergence of Invariant Graph Networks. (arXiv:2201.10129v3 [cs.LG] UPDATED)
    Although theoretical properties such as expressive power and over-smoothing of graph neural networks (GNN) have been extensively studied recently, its convergence property is a relatively new direction. In this paper, we investigate the convergence of one powerful GNN, Invariant Graph Network (IGN) over graphs sampled from graphons. We first prove the stability of linear layers for general $k$-IGN (of order $k$) based on a novel interpretation of linear equivariant layers. Building upon this result, we prove the convergence of $k$-IGN under the model of \citet{ruiz2020graphon}, where we access the edge weight but the convergence error is measured for graphon inputs. Under the more natural (and more challenging) setting of \citet{keriven2020convergence} where one can only access 0-1 adjacency matrix sampled according to edge probability, we first show a negative result that the convergence of any IGN is not possible. We then obtain the convergence of a subset of IGNs, denoted as IGN-small, after the edge probability estimation. We show that IGN-small still contains function class rich enough that can approximate spectral GNNs arbitrarily well. Lastly, we perform experiments on various graphon models to verify our statements.
    Bit-Metric Decoding Rate in Multi-User MIMO Systems: Theory. (arXiv:2203.06271v4 [cs.IT] UPDATED)
    Link-adaptation (LA) is one of the most important aspects of wireless communications where the modulation and coding scheme (MCS) used by the transmitter is adapted to the channel conditions in order to meet a certain target error-rate. In a single-user SISO (SU-SISO) system with out-of-cell interference, LA is performed by computing the post-equalization signal-to-interference-noise ratio (SINR) at the receiver. The same technique can be employed in multi-user MIMO (MU-MIMO) receivers that use linear detectors. Another important use of post-equalization SINR is for physical layer (PHY) abstraction, where several PHY blocks like the channel encoder, the detector, and the channel decoder are replaced by an abstraction model in order to speed up system-level simulations. However, for MU-MIMO systems with non-linear receivers, there is no known equivalent of post-equalization SINR which makes both LA and PHY abstraction extremely challenging. This important issue is addressed in this two-part paper. In this part, a metric called the bit-metric decoding rate (BMDR) of a detector, which is the proposed equivalent of post-equalization SINR, is presented. Since BMDR does not have a closed form expression that would enable its instantaneous calculation, a machine-learning approach to predict it is presented along with extensive simulation results.
    Self-Supervised Contrastive Representation Learning for 3D Mesh Segmentation. (arXiv:2208.04278v2 [cs.CV] UPDATED)
    3D deep learning is a growing field of interest due to the vast amount of information stored in 3D formats. Triangular meshes are an efficient representation for irregular, non-uniform 3D objects. However, meshes are often challenging to annotate due to their high geometrical complexity. Specifically, creating segmentation masks for meshes is tedious and time-consuming. Therefore, it is desirable to train segmentation networks with limited-labeled data. Self-supervised learning (SSL), a form of unsupervised representation learning, is a growing alternative to fully-supervised learning which can decrease the burden of supervision for training. We propose SSL-MeshCNN, a self-supervised contrastive learning method for pre-training CNNs for mesh segmentation. We take inspiration from traditional contrastive learning frameworks to design a novel contrastive learning algorithm specifically for meshes. Our preliminary experiments show promising results in reducing the heavy labeled data requirement needed for mesh segmentation by at least 33%.
    Renormalization in the neural network-quantum field theory correspondence. (arXiv:2212.11811v1 [hep-th])
    A statistical ensemble of neural networks can be described in terms of a quantum field theory (NN-QFT correspondence). The infinite-width limit is mapped to a free field theory, while finite N corrections are mapped to interactions. After reviewing the correspondence, we will describe how to implement renormalization in this context and discuss preliminary numerical results for translation-invariant kernels. A major outcome is that changing the standard deviation of the neural network weight distribution corresponds to a renormalization flow in the space of networks.  ( 2 min )
    Yes We Care! -- Certification for Machine Learning Methods through the Care Label Framework. (arXiv:2105.10197v2 [cs.LG] UPDATED)
    Machine learning applications have become ubiquitous. Their applications range from embedded control in production machines over process optimization in diverse areas (e.g., traffic, finance, sciences) to direct user interactions like advertising and recommendations. This has led to an increased effort of making machine learning trustworthy. Explainable and fair AI have already matured. They address the knowledgeable user and the application engineer. However, there are users who want to deploy a learned model in much the same way as they use their washing machine. These stakeholders do not want to spend time understanding the model, but want to rely on guaranteed properties. What are the relevant properties? How can they be expressed to the stakeholder without presupposing machine learning knowledge? How can they be guaranteed for a certain implementation of a machine learning model? These questions move far beyond the current state of the art, and we want to address them here. We propose a unified framework that certifies learning methods via care labels. They are easy to understand and draw inspiration from well-known certificates like textile labels or property cards of electronic devices. Our framework considers both the machine learning theory and a given implementation. We test the implementation's compliance with theoretical properties and bounds.
    ReViSe: Remote Vital Signs Measurement Using Smartphone Camera. (arXiv:2206.08748v2 [cs.CV] UPDATED)
    We propose an end-to-end framework to measure people's vital signs, including Heart Rate (HR), Heart Rate Variability (HRV), Oxygen Saturation (SpO2) and Blood Pressure (BP), based on the rPPG methodology, from the video of a user's face captured with a smartphone camera. We extract face landmarks with a deep learning-based neural network model in real-time. Multiple face patches, also called Regions-of-Interest (RoIs), are extracted by using the predicted face landmarks. Several filters are applied to reduce the noise from the RoIs in the extracted cardiac signals, called the Blood Volume Pulse (BVP) signal. The measurements of HR, HRV and SpO2 are validated on two public rPPG datasets, namely the TokyoTech rPPG and the Pulse Rate Detection (PURE) datasets, on which our models achieved the following Mean Absolute Errors (MAE): a) for HR, 1.73 beats per minute (bpm) and 3.95 bpm, respectively; b) for HRV, 18.55 ms and 25.03 ms, respectively; and c) for SpO2, an MAE of 1.64% on the PURE dataset. We validated our end-to-end rPPG framework, ReViSe, in a daily living environment, and thereby created the Video-HR dataset. Our HR estimation model achieved an MAE of 2.49 bpm on this dataset. Since no publicly available rPPG datasets existed for BP measurement with face videos, we used a dataset with signals from a fingertip sensor to train our deep learning-based BP estimation model and also created our own video dataset, Video-BP. On our Video-BP dataset, our BP estimation model achieved an MAE of 6.7 mmHg for Systolic Blood Pressure (SBP) and an MAE of 9.6 mmHg for Diastolic Blood Pressure (DBP). The ReViSe framework has been validated on datasets with videos recorded in a daily living environment, as opposed to the less noisy laboratory environments reported by most state-of-the-art techniques.
    Automatically Bounding the Taylor Remainder Series: Tighter Bounds and New Applications. (arXiv:2212.11429v1 [cs.LG])
    We present a new algorithm for automatically bounding the Taylor remainder series. In the special case of a scalar function $f: \mathbb{R} \mapsto \mathbb{R}$, our algorithm takes as input a reference point $x_0$, trust region $[a, b]$, and integer $k \ge 0$, and returns an interval $I$ such that $f(x) - \sum_{i=0}^k \frac {f^{(i)}(x_0)} {i!} (x - x_0)^i \in I (x - x_0)^{k+1}$ for all $x \in [a, b]$. As in automatic differentiation, the function $f$ is provided to the algorithm in symbolic form, and must be composed of known elementary functions. At a high level, our algorithm has two steps. First, for a variety of commonly-used elementary functions (e.g., $\exp$, $\log$), we derive sharp polynomial upper and lower bounds on the Taylor remainder series. We then recursively combine the bounds for the elementary functions using an interval arithmetic variant of Taylor-mode automatic differentiation. Our algorithm can make efficient use of machine learning hardware accelerators, and we provide an open source implementation in JAX. We then turn our attention to applications. Most notably, we use our new machinery to create the first universal majorization-minimization optimization algorithms: algorithms that iteratively minimize an arbitrary loss using a majorizer that is derived automatically, rather than by hand. Applied to machine learning, this leads to architecture-specific optimizers for training deep networks that converge from any starting point, without hyperparameter tuning. Our experiments show that for some optimization problems, these hyperparameter-free optimizers outperform tuned versions of gradient descent, Adam, and AdaGrad. We also show that our automatically-derived bounds can be used for verified global optimization and numerical integration, and to prove sharper versions of Jensen's inequality.
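    For intuition, here is a hand-derived interval of the kind the algorithm returns, for f = exp via the Lagrange remainder; the paper's algorithm derives sharper, composable bounds automatically, so this is only an illustrative baseline:

        import math

        def exp_remainder_interval(a, b, k):
            # For x0, x in [a, b]: exp(x) - sum_{i<=k} exp(x0)(x-x0)^i / i!
            #   = exp(xi) (x-x0)^{k+1} / (k+1)!  for some xi in [a, b],
            # so I = [e^a, e^b] / (k+1)! satisfies the membership in the abstract.
            fact = math.factorial(k + 1)
            return (math.exp(a) / fact, math.exp(b) / fact)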
    Co-clustering based exploratory analysis of mixed-type data tables. (arXiv:2212.11728v1 [cs.LG])
    Co-clustering is a class of unsupervised data analysis techniques that extract the existing underlying dependency structure between the instances and variables of a data table as homogeneous blocks. Most of these techniques are limited to variables of the same type. In this paper, we propose a mixed-data co-clustering method based on a two-step methodology. In the first step, all the variables are binarized according to a number of bins chosen by the analyst, by equal-frequency discretization in the numerical case, or by keeping the most frequent values in the categorical case. The second step applies a co-clustering to the instances and the binary variables, leading to groups of instances and groups of variable parts. We apply this methodology to several data sets and compare the results with those of a Multiple Correspondence Analysis applied to the same data.
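    The first (binarization) step can be sketched as follows, assuming a pandas data frame: equal-frequency bins handle numerical columns and the most frequent values handle categorical ones. The bin count and the "OTHER" bucket are illustrative assumptions:

        import pandas as pd

        def discretize(df, n_bins=5):
            parts = {}
            for col in df.columns:
                if pd.api.types.is_numeric_dtype(df[col]):
                    # Equal-frequency discretization for numerical variables.
                    parts[col] = pd.qcut(df[col], q=n_bins, duplicates="drop")
                else:
                    # Keep the most frequent categories, group the rest together.
                    top = df[col].value_counts().index[:n_bins]
                    parts[col] = df[col].where(df[col].isin(top), other="OTHER")
            return pd.DataFrame(parts).astype(str)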
    Fully 3D Implementation of the End-to-end Deep Image Prior-based PET Image Reconstruction Using Block Iterative Algorithm. (arXiv:2212.11844v1 [physics.med-ph])
    Deep image prior (DIP) has recently attracted attention owing to its use in unsupervised positron emission tomography (PET) image reconstruction, which does not require any prior training dataset. In this paper, we present the first attempt to implement an end-to-end DIP-based fully 3D PET image reconstruction method that incorporates a forward-projection model into a loss function. To implement practical fully 3D PET image reconstruction, which was previously infeasible due to graphics processing unit memory limitations, we modify the DIP optimization to a block-iterative scheme and sequentially learn an ordered sequence of block sinograms. Furthermore, a relative difference penalty (RDP) term is added to the loss function to enhance the quantitative accuracy of the PET images. We evaluated our proposed method using a Monte Carlo simulation with [$^{18}$F]FDG PET data of a human brain and a preclinical study on monkey-brain [$^{18}$F]FDG PET data. The proposed method was compared with the maximum-likelihood expectation maximization (EM), maximum-a-posteriori EM with RDP, and hybrid DIP-based PET reconstruction methods. The simulation results showed that the proposed method improved the PET image quality by reducing statistical noise, and preserved the contrast of brain structures and the inserted tumor better than the other algorithms. In the preclinical experiment, finer structures and better contrast recovery were obtained with the proposed method. This indicates that the proposed method can produce high-quality images without a prior training dataset. Thus, the proposed method is a key enabling technology for the straightforward and practical implementation of end-to-end DIP-based fully 3D PET image reconstruction.
    Dataset Condensation with Distribution Matching. (arXiv:2110.04181v3 [cs.LG] UPDATED)
    Computational cost of training state-of-the-art deep models in many learning problems is rapidly increasing due to more sophisticated models and larger datasets. A recent promising direction for reducing training cost is dataset condensation that aims to replace the original large training set with a significantly smaller learned synthetic set while preserving the original information. While training deep models on the small set of condensed images can be extremely fast, their synthesis remains computationally expensive due to the complex bi-level optimization and second-order derivative computation. In this work, we propose a simple yet effective method that synthesizes condensed images by matching feature distributions of the synthetic and original training images in many sampled embedding spaces. Our method significantly reduces the synthesis cost while achieving comparable or better performance. Thanks to its efficiency, we apply our method to more realistic and larger datasets with sophisticated neural architectures and obtain a significant performance boost. We also show promising practical benefits of our method in continual learning and neural architecture search.
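    The core objective can be sketched in a few lines: match the mean features of real and synthetic images under randomly sampled embedding networks. The sampler interface and the squared-error form below are illustrative assumptions, not the authors' implementation:

        import torch

        def distribution_matching_loss(real_imgs, syn_imgs, sample_embed_net):
            # A randomly initialized feature extractor defines one embedding space.
            f = sample_embed_net()
            with torch.no_grad():
                mu_real = f(real_imgs).mean(dim=0)   # real-data statistics are fixed
            mu_syn = f(syn_imgs).mean(dim=0)         # gradients flow to syn_imgs only
            return ((mu_real - mu_syn) ** 2).sum()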
    Feature Acquisition using Monte Carlo Tree Search. (arXiv:2212.11360v1 [cs.LG])
    Feature acquisition algorithms address the problem of acquiring informative features while balancing the costs of acquisition to improve the learning performance of ML models. Previous approaches have focused on calculating the expected utility values of features to determine the acquisition sequences. Other approaches formulated the problem as a Markov Decision Process (MDP) and applied reinforcement learning based algorithms. In comparison to previous approaches, we focus on 1) formulating the feature acquisition problem as an MDP and applying Monte Carlo Tree Search, 2) calculating the intermediary rewards for each acquisition step based on model improvements and acquisition costs, and 3) simultaneously optimizing model improvement and acquisition costs with multi-objective Monte Carlo Tree Search. With the Proximal Policy Optimization and Deep Q-Network algorithms as benchmarks, we show the effectiveness of our proposed approach through an experimental study.  ( 2 min )
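    The intermediary reward in point 2) can be sketched as a simple improvement-versus-cost trade-off, or kept as a vector for the multi-objective search in point 3); the trade-off weight is an illustrative assumption:

        def acquisition_reward(score_before, score_after, cost, lam=1.0, scalarize=True):
            # Model improvement from acquiring a feature, weighed against its cost.
            improvement = score_after - score_before
            if scalarize:
                return improvement - lam * cost      # single-objective MCTS
            return (improvement, -cost)              # multi-objective MCTS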
    Towards Quantum-Enabled 6G Slicing. (arXiv:2212.11755v1 [cs.NI])
    Quantum machine learning (QML) paradigms and their synergies with network slicing can be envisioned as a disruptive technology on the cusp of entering the sixth-generation (6G) era, in which mobile communication systems are underpinned by advanced tenancy-based digital use-cases to meet different service requirements. To overcome the challenges of massive slices, such as handling the increased dynamism, heterogeneity, amount of data, extended training time, and variety of security levels for slice instances, the power of quantum computing, enabling distributed computation and learning, can be deemed a promising prerequisite. In this intent, we propose a cloud-native federated learning framework based on quantum deep reinforcement learning (QDRL), where distributed decision agents, deployed as micro-services at the edge and cloud through a Kubernetes infrastructure, are dynamically connected to the radio access network (RAN). Specifically, the decision agents leverage the remolding of a classical deep reinforcement learning (DRL) algorithm into variational quantum circuits (VQCs) to obtain optimal cooperative control of slice resources. The initial numerical results show that the proposed federated QDRL (FQDRL) scheme provides performance comparable to benchmark solutions and reveals a quantum advantage in parameter reduction. To the best of our knowledge, this is the first exploratory study considering an FQDRL approach for 6G communication networks.  ( 2 min )
    When and Why Test Generators for Deep Learning Produce Invalid Inputs: an Empirical Study. (arXiv:2212.11368v1 [cs.SE])
    Testing Deep Learning (DL) based systems inherently requires large and representative test sets to evaluate whether DL systems generalise beyond their training datasets. Diverse Test Input Generators (TIGs) have been proposed to produce artificial inputs that expose issues of the DL systems by triggering misbehaviours. Unfortunately, such generated inputs may be invalid, i.e., not recognisable as part of the input domain, thus providing an unreliable quality assessment. Automated validators can ease the burden of manually checking the validity of inputs for human testers, although input validity is a concept that is difficult to formalise and, thus, to automate. In this paper, we investigate to what extent TIGs can generate valid inputs, according to both automated and human validators. We conduct a large empirical study, involving 2 different automated validators, 220 human assessors, 5 different TIGs and 3 classification tasks. Our results show that 84% of the artificially generated inputs are valid, according to automated validators, but their expected label is not always preserved. Automated validators reach a good consensus with humans (78% accuracy), but still have limitations when dealing with feature-rich datasets.  ( 2 min )
    MM811 Project Report: Cloud Detection and Removal in Satellite Images. (arXiv:2212.11369v1 [cs.CV])
    For satellite images, the presence of clouds is a problem, as clouds can obscure half to two-thirds of the ground information. This causes many issues for applications that require reliable, noise-free data and seamless monitoring. Removing the clouds from the images while keeping the background pixels intact can help address these issues. Recently, deep learning methods have become popular for research on cloud removal, demonstrating promising results, among which Generative Adversarial Networks (GAN) have shown considerably better performance. In this project, we aim to address cloud removal from satellite images using AttentionGAN and then compare our results with those obtained using traditional GANs and auto-encoders. We use the RICE dataset. The outcome of this project can be used to develop applications that require cloud-free satellite images. Moreover, our results could be helpful for further research improvements.  ( 2 min )
    Decision-making and control with metasurface-based diffractive neural networks. (arXiv:2212.11278v1 [cs.LG])
    The ultimate goal of artificial intelligence is to mimic the human brain to perform decision-making and control directly from high-dimensional sensory input. All-optical diffractive neural networks provide a promising solution for realizing artificial intelligence with high speed and low power consumption. To date, most of the reported diffractive neural networks focus on single or multiple tasks that do not involve interaction with the environment, such as object recognition and image classification, while networks that can perform decision-making and control have, to our knowledge, not been developed yet. Here, we propose to use deep reinforcement learning to realize diffractive neural networks that imitate the human-level capability of decision-making and control. Such networks allow for finding optimal control policies through interaction with the environment and can be readily realized with dielectric metasurfaces. The superior performance of these networks is verified on three classic games, Tic-Tac-Toe, Super Mario Bros., and Car Racing, where they achieve levels comparable to or even higher than human players. Our work represents a solid advance in diffractive neural networks, which promises a fundamental shift from the target-driven control of a pre-designed state for simple recognition or classification tasks to the high-level sensory capability of artificial intelligence. It may find exciting applications in autonomous driving, intelligent robots, and intelligent manufacturing.  ( 2 min )
    Few-shot human motion prediction for heterogeneous sensors. (arXiv:2212.11771v1 [cs.LG])
    Human motion prediction is a complex task as it involves forecasting variables over time on a graph of connected sensors. This is especially true in the case of few-shot learning, where we strive to forecast motion sequences for previously unseen actions based on only a few examples. Despite this, almost all related approaches for few-shot motion prediction do not incorporate the underlying graph, while it is a common component in classical motion prediction. Furthermore, state-of-the-art methods for few-shot motion prediction are restricted to motion tasks with a fixed output space, meaning these tasks are all limited to the same sensor graph. In this work, we extend recent work on few-shot time-series forecasting with heterogeneous attributes using graph neural networks to introduce the first few-shot motion approach that explicitly incorporates the spatial graph while also generalizing across motion tasks with heterogeneous sensors. In our experiments on motion tasks with heterogeneous sensors, we demonstrate significant performance improvements, with lifts from 10.4% up to 39.3% compared to the best state-of-the-art models. Moreover, we show that our model can perform on par with the best approach so far when evaluated on tasks with a fixed output space, while using two orders of magnitude fewer parameters.  ( 2 min )
    What do LLMs Know about Financial Markets? A Case Study on Reddit Market Sentiment Analysis. (arXiv:2212.11311v1 [cs.CL])
    Market sentiment analysis on social media content requires knowledge of both financial markets and social media jargon, which makes it a challenging task for human raters. The resulting lack of high-quality labeled data stands in the way of conventional supervised learning methods. Instead, we approach this problem using semi-supervised learning with a large language model (LLM). Our pipeline generates weak financial sentiment labels for Reddit posts with an LLM and then uses that data to train a small model that can be served in production. We find that prompting the LLM to produce Chain-of-Thought summaries and forcing it through several reasoning paths helps generate more stable and accurate labels, while using a regression loss further improves distillation quality. With only a handful of prompts, the final model performs on par with existing supervised models. Though production applications of our model are limited by ethical considerations, the model's competitive performance points to the great potential of using LLMs for tasks that otherwise require skill-intensive annotation.  ( 2 min )
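    As a rough illustration of the weak-labeling step described above, the sketch below samples several Chain-of-Thought completions and majority-votes over their final answers. The `ask_llm` callable and the prompt wording are assumptions standing in for whatever LLM API and prompts the authors actually used.

    ```python
    from collections import Counter

    def weak_sentiment_label(post: str, ask_llm, n_paths: int = 5) -> str:
        """Label a post by sampling several Chain-of-Thought reasoning paths
        from an LLM and majority-voting over the final answers.

        ask_llm is a hypothetical callable: (prompt, temperature) -> str.
        """
        prompt = (
            "Decide whether the financial sentiment of this post is "
            "bullish, bearish, or neutral. Think step by step, then end "
            f"with 'Answer: <label>'.\n\nPost: {post}"
        )
        answers = []
        for _ in range(n_paths):
            completion = ask_llm(prompt, temperature=0.7)  # one sampled reasoning path
            # Keep only the final verdict, discarding the chain of thought.
            answers.append(completion.rsplit("Answer:", 1)[-1].strip().lower())
        label, _count = Counter(answers).most_common(1)[0]
        return label
    ```

    Voting across reasoning paths is what makes the labels more stable: an occasional derailed chain of thought is outvoted by the majority.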
    DExT: Detector Explanation Toolkit. (arXiv:2212.11409v1 [cs.CV])
    State-of-the-art object detectors are treated as black boxes due to their highly non-linear internal computations. Even with unprecedented advancements in detector performance, the inability to explain how their outputs are generated limits their use in safety-critical applications. Previous work fails to produce explanations for both bounding box and classification decisions, and generally produces individual explanations for various detectors. In this paper, we propose an open-source Detector Explanation Toolkit (DExT) which implements the proposed approach to generate a holistic explanation for all detector decisions using certain gradient-based explanation methods. We suggest various multi-object visualization methods to merge the explanations of multiple objects detected in an image, as well as the corresponding detections, into a single image. The quantitative evaluation shows that the Single Shot MultiBox Detector (SSD) is more faithfully explained compared to other detectors regardless of the explanation method. Both quantitative and human-centric evaluations identify that SmoothGrad with Guided Backpropagation (GBP) provides more trustworthy explanations among the selected methods across all detectors. We expect that DExT will motivate practitioners to evaluate object detectors from the interpretability perspective by explaining both bounding box and classification decisions.  ( 2 min )
    Sensitivity analysis of biological washout and depth selection for a machine learning based dose verification framework in proton therapy. (arXiv:2212.11352v1 [physics.med-ph])
    Dose verification based on proton-induced positron emitters is a promising quality assurance tool and may leverage the strength of artificial intelligence. To move a step closer towards practical application, a sensitivity analysis of two factors needs to be performed: biological washout and depth selection. A bi-directional recurrent neural network (RNN) model was developed. The training dataset was generated based upon a CT image-based phantom (abdomen region) and multiple beam energies/pathways, using Monte-Carlo simulation (1 mm spatial resolution, no biological washout). For the modeling of biological washout, a simplified analytical model was applied to change raw activity profiles over a period of 5 minutes, incorporating both physical decay and biological washout. For the study of depth selection (a challenge linked to multi-field/angle irradiation), truncations were applied at different window lengths (100, 125, 150 mm) to raw activity profiles. Finally, the performance of a worst-case scenario was examined by combining both factors (depth selection: 125 mm, biological washout: 5 min). The accuracy was quantitatively evaluated in terms of range uncertainty, mean absolute error (MAE) and mean relative error (MRE). Our proposed AI framework shows good immunity to the perturbations associated with the two factors. The detection of proton-induced positron emitters, combined with machine learning, has great potential to implement online patient-specific verification in proton therapy.  ( 2 min )
    Semi-supervised GAN for Bladder Tissue Classification in Multi-Domain Endoscopic Images. (arXiv:2212.11375v1 [eess.IV])
    Objective: Accurate visual classification of bladder tissue during Trans-Urethral Resection of Bladder Tumor (TURBT) procedures is essential to improve early cancer diagnosis and treatment. During TURBT interventions, White Light Imaging (WLI) and Narrow Band Imaging (NBI) techniques are used for lesion detection. Each imaging technique provides diverse visual information that allows clinicians to identify and classify cancerous lesions. Computer vision methods that use both imaging techniques could improve endoscopic diagnosis. We address the challenge of tissue classification when annotations are available only in one domain, in our case WLI, and the endoscopic images correspond to an unpaired dataset, i.e. there is no exact equivalent for every image in both NBI and WLI domains. Method: We propose a semi-supervised Generative Adversarial Network (GAN)-based method composed of three main components: a teacher network trained on the labeled WLI data; a cycle-consistency GAN to perform unpaired image-to-image translation, and a multi-input student network. To ensure the quality of the synthetic images generated by the proposed GAN we perform a detailed quantitative and qualitative analysis with the help of specialists. Conclusion: The overall average classification accuracy, precision, and recall obtained with the proposed method for tissue classification are 0.90, 0.88, and 0.89 respectively, while the same metrics obtained in the unlabeled domain (NBI) are 0.92, 0.64, and 0.94 respectively. The quality of the generated images is reliable enough to deceive specialists. Significance: This study shows the potential of using semi-supervised GAN-based classification to improve bladder tissue classification when annotations are limited in multi-domain data.  ( 2 min )
    Target Conditioned Representation Independence (TCRI); From Domain-Invariant to Domain-General Representations. (arXiv:2212.11342v1 [cs.LG])
    We propose a Target Conditioned Representation Independence (TCRI) objective for domain generalization. TCRI addresses the limitations of existing domain generalization methods due to incomplete constraints. Specifically, TCRI implements regularizers motivated by conditional independence constraints that are sufficient to strictly learn complete sets of invariant mechanisms, which we show are necessary and sufficient for domain generalization. Empirically, we show that TCRI is effective on both synthetic and real-world data. TCRI is competitive with baselines in average accuracy while outperforming them in worst-domain accuracy, indicating desired cross-domain stability.  ( 2 min )
    Contrastive Distillation Is a Sample-Efficient Self-Supervised Loss Policy for Transfer Learning. (arXiv:2212.11353v1 [cs.CL])
    Traditional approaches to RL have focused on learning decision policies directly from episodic decisions, while slowly and implicitly learning the semantics of compositional representations needed for generalization. Some approaches refine representations via auxiliary self-supervised losses while simultaneously learning decision policies, but learning compositional representations from hand-designed, context-independent self-supervised losses (multi-view) still adapts relatively slowly to the real world, which contains many non-IID subspaces requiring rapid distribution shift in both time and spatial attention patterns at varying levels of abstraction. In contrast, supervised language model cascades have shown the flexibility to adapt to many diverse manifolds, and hints of the self-learning needed for autonomous task transfer. However, to date, transfer methods for language models like few-shot learning and fine-tuning still require human supervision, and transfer learning using self-learning methods has been underexplored. We propose a self-supervised loss policy called contrastive distillation which manifests latent variables with high mutual information with both source and target tasks, from weights to tokens. We show how this outperforms common methods of transfer learning and suggests a useful design axis of trading off compute for generalizability for online transfer. Contrastive distillation is improved through sampling from memory and suggests a simple algorithm for more efficiently sampling negative examples for contrastive losses than random sampling.  ( 2 min )
    Language models are better than humans at next-token prediction. (arXiv:2212.11281v1 [cs.CL])
    Current language models are considered to have sub-human capabilities at natural language tasks like question-answering or writing code. However, language models are not trained to perform well at these tasks; they are trained to accurately predict the next token given previous tokens in tokenized text. It is not clear whether language models are better or worse than humans at next-token prediction. To try to answer this question, we performed two distinct experiments to directly compare humans and language models on this front: one measuring top-1 accuracy and the other measuring perplexity. In both experiments, we find humans to be consistently \emph{worse} than even relatively small language models like GPT3-Ada at next-token prediction.  ( 2 min )
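    Both comparison metrics are straightforward to compute once per-token model outputs are available. A minimal sketch, assuming the probabilities assigned to the true tokens and the argmax predictions have already been extracted:

    ```python
    import math

    def top1_accuracy_and_perplexity(token_probs, predictions, targets):
        """token_probs[i]: probability the model assigned to the true i-th token;
        predictions[i]: the model's argmax token; targets[i]: the true token."""
        acc = sum(p == t for p, t in zip(predictions, targets)) / len(targets)
        # Perplexity is the exponential of the average negative log-likelihood.
        nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
        return acc, math.exp(nll)
    ```

    The same two quantities can be computed for humans by eliciting a single guess (top-1) or a probability distribution over next tokens (perplexity), which is what makes the head-to-head comparison possible.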
    ABODE-Net: An Attention-based Deep Learning Model for Non-intrusive Building Occupancy Detection Using Smart Meter Data. (arXiv:2212.11396v1 [cs.LG])
    Occupancy information is useful for efficient energy management in the building sector. The massive high-resolution electrical power consumption data collected by smart meters in the advanced metering infrastructure (AMI) network make it possible to infer buildings' occupancy status in a non-intrusive way. In this paper, we propose a deep learning model called ABODE-Net which employs a novel Parallel Attention (PA) block for building occupancy detection using smart meter data. The PA block combines temporal, variable, and channel attention modules in a parallel way to signify important features for occupancy detection. We adopt two smart meter datasets widely used for building occupancy detection in our performance evaluation. A set of state-of-the-art shallow machine learning and deep learning models are included for performance comparison. The results show that ABODE-Net significantly outperforms other models in all experimental cases, which proves its validity as a solution for non-intrusive building occupancy detection.  ( 2 min )
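    A loose PyTorch sketch of the parallel composition idea follows; the tensor layout, pooling choices, and gating layers here are assumptions for illustration, not the paper's exact PA block.

    ```python
    import torch
    import torch.nn as nn

    class ParallelAttentionSketch(nn.Module):
        """Sketch: temporal, variable, and channel attention computed
        independently and fused by averaging. Input: (batch, channels,
        variables, time). The actual ABODE-Net block may differ."""

        def __init__(self, channels: int, variables: int, timesteps: int):
            super().__init__()
            self.time_gate = nn.Sequential(nn.Linear(timesteps, timesteps), nn.Sigmoid())
            self.var_gate = nn.Sequential(nn.Linear(variables, variables), nn.Sigmoid())
            self.chan_gate = nn.Sequential(nn.Linear(channels, channels), nn.Sigmoid())

        def forward(self, x):  # x: (B, C, V, T)
            t = self.time_gate(x.mean(dim=(1, 2)))  # (B, T) gate over time
            v = self.var_gate(x.mean(dim=(1, 3)))   # (B, V) gate over variables
            c = self.chan_gate(x.mean(dim=(2, 3)))  # (B, C) gate over channels
            # Re-weight the input along each axis and fuse the branches in parallel.
            return (x * t[:, None, None, :]
                    + x * v[:, None, :, None]
                    + x * c[:, :, None, None]) / 3.0
    ```

    The key design point is that the three attention modules see the same input and their outputs are fused, rather than being stacked sequentially.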
    Improving Automated Program Repair with Domain Adaptation. (arXiv:2212.11414v1 [cs.SE])
    Automated Program Repair (APR) is defined as the process of fixing a bug/defect in the source code by an automated tool. APR tools have recently shown promising results by leveraging state-of-the-art Neural Language Processing (NLP) techniques; tools such as TFix and CodeXGLUE, which combine text-to-text transformers with software-specific techniques, currently outperform alternatives. However, in most APR studies the train and test sets are chosen from the same set of projects, whereas in reality APR models are meant to generalize to new and different projects. There is therefore a potential threat that APR models reported as highly effective perform poorly when the characteristics of the new project or its bugs differ from the training set's (domain shift). In this study, we first define and measure the domain shift problem in automated program repair. We then propose a domain adaptation framework that can adapt an APR model for a given target project. We conduct an empirical study with three domain adaptation methods (FullFineTuning, TuningWithLightWeightAdapterLayers, and CurriculumLearning) using two state-of-the-art APR models (TFix and CodeXGLUE) on 611 bugs from 19 projects. The results show that our proposed framework can improve the effectiveness of TFix by 13.05% and CodeXGLUE by 23.4%. Another contribution of this study is a data synthesis method that addresses the lack of labelled data in APR: we leverage transformers to create a bug generator model and use the generated synthetic data to domain-adapt TFix and CodeXGLUE on projects with no data (zero-shot learning), which results in average improvements of 5.76% and 24.42% for TFix and CodeXGLUE, respectively.  ( 2 min )
    ReVISE: Self-Supervised Speech Resynthesis with Visual Input for Universal and Generalized Speech Enhancement. (arXiv:2212.11377v1 [eess.AS])
    Prior works on improving speech quality with visual input typically study each type of auditory distortion separately (e.g., separation, inpainting, video-to-speech) and present tailored algorithms. This paper proposes to unify these subjects and study Generalized Speech Enhancement, where the goal is not to reconstruct the exact reference clean signal, but to focus on improving certain aspects of speech. In particular, this paper concerns intelligibility, quality, and video synchronization. We cast the problem as audio-visual speech resynthesis, which is composed of two steps: pseudo audio-visual speech recognition (P-AVSR) and pseudo text-to-speech synthesis (P-TTS). P-AVSR and P-TTS are connected by discrete units derived from a self-supervised speech model. Moreover, we utilize a self-supervised audio-visual speech model to initialize P-AVSR. The proposed model is coined ReVISE. ReVISE is the first high-quality model for in-the-wild video-to-speech synthesis and achieves superior performance on all LRS3 audio-visual enhancement tasks with a single model. To demonstrate its applicability in the real world, ReVISE is also evaluated on EasyCom, an audio-visual benchmark collected under challenging acoustic conditions with only 1.6 hours of training data. Similarly, ReVISE greatly suppresses noise and improves quality. Project page: https://wnhsu.github.io/ReVISE.  ( 2 min )
    Forecasting West Nile Virus with Graph Neural Networks: Harnessing Spatial Dependence in Irregularly Sampled Geospatial Data. (arXiv:2212.11367v1 [q-bio.PE])
    Machine learning methods have seen increased application to geospatial environmental problems, such as precipitation nowcasting, haze forecasting, and crop yield prediction. However, many of the machine learning methods applied to mosquito population and disease forecasting do not inherently take into account the underlying spatial structure of the given data. In our work, we apply a spatially aware graph neural network model consisting of GraphSAGE layers to forecast the presence of West Nile virus in Illinois, to aid mosquito surveillance and abatement efforts within the state. More generally, we show that graph neural networks applied to irregularly sampled geospatial data can exceed the performance of a range of baseline methods including logistic regression, XGBoost, and fully-connected neural networks.  ( 2 min )
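    For reference, a single GraphSAGE layer with mean aggregation reduces to a few lines of NumPy; the feature semantics mentioned in the comments (per-trap counts, weather covariates) are illustrative assumptions, not the paper's exact inputs.

    ```python
    import numpy as np

    def graphsage_mean_layer(H, adj, W_self, W_neigh):
        """One GraphSAGE layer with mean aggregation (NumPy sketch).

        H:       (num_nodes, d_in) node features, e.g. per-trap mosquito
                 counts and weather covariates.
        adj:     (num_nodes, num_nodes) binary adjacency over spatially
                 linked monitoring sites.
        W_self:  (d_in, d_out) weight applied to a node's own features.
        W_neigh: (d_in, d_out) weight applied to the aggregated neighbourhood.
        """
        deg = adj.sum(axis=1, keepdims=True).clip(min=1)  # avoid divide-by-zero
        neigh_mean = (adj @ H) / deg                      # mean over neighbours
        return np.maximum(H @ W_self + neigh_mean @ W_neigh, 0.0)  # ReLU
    ```

    This is exactly the mechanism that injects spatial dependence: each site's representation mixes in the averaged features of its spatial neighbours, which tabular baselines like logistic regression or XGBoost cannot do natively.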
    Deep Unfolded Tensor Robust PCA with Self-supervised Learning. (arXiv:2212.11346v1 [stat.ML])
    Tensor robust principal component analysis (RPCA), which seeks to separate a low-rank tensor from its sparse corruptions, has been crucial in data science and machine learning where tensor structures are becoming more prevalent. While powerful, existing tensor RPCA algorithms can be difficult to use in practice, as their performance can be sensitive to the choice of additional hyperparameters, which are not straightforward to tune. In this paper, we describe a fast and simple self-supervised model for tensor RPCA using deep unfolding by only learning four hyperparameters. Despite its simplicity, our model expunges the need for ground truth labels while maintaining competitive or even greater performance compared to supervised deep unfolding. Furthermore, our model is capable of operating in extreme data-starved scenarios. We demonstrate these claims on a mix of synthetic data and real-world tasks, comparing performance against previously studied supervised deep unfolding methods and Bayesian optimization baselines.  ( 2 min )
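    To make the unfolding idea concrete, the following sketch unrolls the classical alternating RPCA iteration in the matrix case, with per-layer thresholds standing in for the handful of learnable hyperparameters; the paper's tensor formulation and exact update rules differ.

    ```python
    import numpy as np

    def soft_threshold(X, tau):
        """Elementwise shrinkage, the proximal operator of the L1 norm."""
        return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

    def svt(X, tau):
        """Singular-value thresholding: shrink singular values by tau."""
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

    def unfolded_rpca(M, taus_L, taus_S):
        """Unrolled alternating RPCA (matrix case for readability).
        taus_L / taus_S play the role of the few learnable per-layer
        thresholds that deep unfolding would train end-to-end."""
        L = np.zeros_like(M)
        S = np.zeros_like(M)
        for tL, tS in zip(taus_L, taus_S):  # one "layer" per iteration
            L = svt(M - S, tL)              # low-rank update
            S = soft_threshold(M - L, tS)   # sparse update
        return L, S
    ```

    Deep unfolding treats each iteration as a network layer and learns the thresholds from data, which is how the method gets away with only a few trainable hyperparameters.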
    KL Regularized Normalization Framework for Low Resource Tasks. (arXiv:2212.11275v1 [cs.CL])
    Large pre-trained models, such as BERT, GPT, and Wav2Vec, have demonstrated great potential for learning representations that are transferable to a wide variety of downstream tasks. However, it is difficult to obtain a large quantity of supervised data due to limited availability of resources and time. In light of this, a significant amount of research has been conducted on adapting large pre-trained models for diverse downstream tasks via fine-tuning, linear probing, or prompt tuning in low-resource settings. Normalization techniques are essential for accelerating training and improving the generalization of deep neural networks and have been successfully used in a wide variety of applications. Many normalization techniques have been proposed, but the success of normalization in low-resource downstream NLP and speech tasks is limited. One of the reasons is the inability to capture expressiveness by rescaling the parameters of normalization. We propose Kullback-Leibler (KL) Regularized Normalization (KL-Norm), which makes the normalized data well behaved and helps generalization: it reduces over-fitting, generalizes well on out-of-domain distributions, and removes irrelevant biases and features, with a negligible increase in model parameters and memory overhead. Detailed experimental evaluation on multiple low-resource NLP and speech tasks demonstrates the superior performance of KL-Norm compared to other popular normalization and regularization techniques.  ( 2 min )
    Circumventing interpretability: How to defeat mind-readers. (arXiv:2212.11415v1 [cs.LG])
    The increasing capabilities of artificial intelligence (AI) systems make it ever more important that we interpret their internals to ensure that their intentions are aligned with human values. Yet there is reason to believe that misaligned artificial intelligence will have a convergent instrumental incentive to make its thoughts difficult for us to interpret. In this article, I discuss many ways that a capable AI might circumvent scalable interpretability methods and suggest a framework for thinking about these potential future risks.  ( 2 min )
    Online Statistical Inference for Matrix Contextual Bandit. (arXiv:2212.11385v1 [stat.ML])
    Contextual bandit has been widely used for sequential decision-making based on the current contextual information and historical feedback data. In modern applications, such context format can be rich and can often be formulated as a matrix. Moreover, while existing bandit algorithms mainly focused on reward-maximization, less attention has been paid to the statistical inference. To fill in these gaps, in this work we consider a matrix contextual bandit framework where the true model parameter is a low-rank matrix, and propose a fully online procedure to simultaneously make sequential decision-making and conduct statistical inference. The low-rank structure of the model parameter and the adaptive nature of the data collection process make this difficult: standard low-rank estimators are not fully online and are biased, while existing inference approaches in bandit algorithms fail to account for the low-rankness and are also biased. To address these, we introduce a new online doubly-debiasing inference procedure to simultaneously handle both sources of bias. In theory, we establish the asymptotic normality of the proposed online doubly-debiased estimator and prove the validity of the constructed confidence interval. Our inference results are built upon a newly developed low-rank stochastic gradient descent estimator and its non-asymptotic convergence result, which is also of independent interest.  ( 2 min )
    Towards Neural Variational Monte Carlo That Scales Linearly with System Size. (arXiv:2212.11296v1 [quant-ph])
    Quantum many-body problems are some of the most challenging problems in science and are central to demystifying some exotic quantum phenomena, e.g., high-temperature superconductors. The combination of neural networks (NN) for representing quantum states, coupled with the Variational Monte Carlo (VMC) algorithm, has been shown to be a promising method for solving such problems. However, the run-time of this approach scales quadratically with the number of simulated particles, constraining the practically usable NNs to - in machine learning terms - minuscule sizes (<10M parameters). Considering the many breakthroughs brought by extremely large NNs at the 1B+ parameter scale in other domains, lifting this constraint could significantly expand the set of quantum systems we can accurately simulate on classical computers, both in size and complexity. We propose a NN architecture called Vector-Quantized Neural Quantum States (VQ-NQS) that utilizes vector-quantization techniques to leverage redundancies in the local-energy calculations of the VMC algorithm - the source of the quadratic scaling. In our preliminary experiments, we demonstrate the ability of VQ-NQS to reproduce the ground state of the 2D Heisenberg model across various system sizes, while reporting a significant reduction of about ${\times}10$ in the number of FLOPs in the local-energy calculation.  ( 2 min )
    Adaptive and Dynamic Multi-Resolution Hashing for Pairwise Summations. (arXiv:2212.11408v1 [cs.DS])
    In this paper, we propose Adam-Hash: an adaptive and dynamic multi-resolution hashing data-structure for fast pairwise summation estimation. Given a data-set $X \subset \mathbb{R}^d$, a binary function $f:\mathbb{R}^d\times \mathbb{R}^d\to \mathbb{R}$, and a point $y \in \mathbb{R}^d$, the Pairwise Summation Estimate is defined as $\mathrm{PSE}_X(y) := \frac{1}{|X|} \sum_{x \in X} f(x,y)$. For any given data-set $X$, we need to design a data-structure such that given any query point $y \in \mathbb{R}^d$, the data-structure approximately estimates $\mathrm{PSE}_X(y)$ in time that is sub-linear in $|X|$. Prior works on this problem have focused exclusively on the case where the data-set is static and the queries are independent. In this paper, we design a hashing-based PSE data-structure which works for the more practical \textit{dynamic} setting in which insertions, deletions, and replacements of points are allowed. Moreover, our proposed Adam-Hash is also robust to adaptive PSE queries, where an adversary can choose query $q_j \in \mathbb{R}^d$ depending on the output from previous queries $q_1, q_2, \dots, q_{j-1}$.  ( 2 min )
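    For concreteness, the exact linear-time estimator that Adam-Hash is designed to outrun can be written directly from the definition; the Gaussian kernel below is just one example choice of $f$.

    ```python
    import numpy as np

    def pse_naive(X, y, f):
        """Exact PSE_X(y) = (1/|X|) * sum_x f(x, y); O(|X|) per query.
        Adam-Hash aims to answer such queries in sublinear time; this
        brute-force version serves only as the correctness baseline."""
        return np.mean([f(x, y) for x in X])

    # Example with a Gaussian kernel as the binary function f.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 8))
    y = rng.normal(size=8)
    gauss = lambda a, b: np.exp(-np.sum((a - b) ** 2))
    print(pse_naive(X, y, gauss))
    ```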
    Audio Denoising for Robust Audio Fingerprinting. (arXiv:2212.11277v1 [cs.SD])
    Music discovery services let users identify songs from short mobile recordings. These solutions are often based on Audio Fingerprinting (AFP), and rely more specifically on the extraction of spectral peaks in order to be robust to a number of distortions. Little work has been done to study the robustness of these algorithms to background noise captured in real environments. In particular, AFP systems still struggle when the signal-to-noise ratio is low, i.e., when the background noise is strong. In this project, we tackle this problem with Deep Learning. We test a new hybrid strategy which consists of inserting a denoising DL model in front of a peak-based AFP algorithm. We simulate noisy music recordings using a realistic data augmentation pipeline, and train a DL model to denoise them. The denoising model limits the impact of background noise on the AFP system's extracted peaks, improving its robustness to noise. We further propose a novel loss function to adapt the DL model to the considered AFP system, increasing its precision in terms of retrieved spectral peaks. To the best of our knowledge, this hybrid strategy has not been tested before.  ( 2 min )
    Debiased machine learning for estimating the causal effect of urban traffic on pedestrian crossing behaviour. (arXiv:2212.11322v1 [cs.LG])
    Before the transition of autonomous vehicles (AVs) to urban roads and the subsequent unprecedented changes in traffic conditions, the evaluation of transportation policies and futuristic road designs related to pedestrian crossing behavior is of vital importance. Recent studies analyzed the non-causal impact of various variables on pedestrian waiting time in the presence of AVs. Here, we mainly investigate the causal effect of traffic density on pedestrian waiting time. We develop a Double/Debiased Machine Learning (DML) model in which the impact of confounding variables influencing both a policy and an outcome of interest is addressed, resulting in unbiased policy evaluation. Furthermore, we analyze the effect of traffic density by developing a copula-based joint model of the two main components of pedestrian crossing behavior, pedestrian stress level and waiting time. The copula approach has been widely used in the literature for addressing self-selection problems, which can be classified as a causality analysis in travel behavior modeling. The results obtained from the copula approach and DML are compared based on the effect of traffic density. In the DML model structure, the standard error of the density parameter is lower than in the copula approach and the confidence interval is considerably more reliable. In addition, despite the similar sign of the effect, the copula approach estimates the effect of traffic density lower than DML, due to the spurious effect of confounders. In short, the DML model structure can flexibly adjust for the impact of confounders by using machine learning algorithms and is more reliable for planning future policies.  ( 2 min )
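    A generic cross-fitted partialling-out DML estimator, in the spirit of the model described here, looks like the sketch below; the variable roles (W = confounders, D = traffic density, Y = waiting time) and the random-forest nuisance learners are assumptions for illustration, not the paper's exact specification.

    ```python
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import KFold

    def dml_effect(W, D, Y, n_splits=2):
        """Partialling-out DML sketch. W: (n, p) confounders, D: (n,)
        treatment (traffic density), Y: (n,) outcome (waiting time).
        Returns the estimated causal coefficient of D on Y."""
        res_D = np.zeros_like(D, dtype=float)
        res_Y = np.zeros_like(Y, dtype=float)
        for train, test in KFold(n_splits, shuffle=True, random_state=0).split(W):
            # Cross-fitting: nuisances are fit on one fold and
            # residualized on the held-out fold to avoid overfitting bias.
            res_D[test] = D[test] - RandomForestRegressor().fit(W[train], D[train]).predict(W[test])
            res_Y[test] = Y[test] - RandomForestRegressor().fit(W[train], Y[train]).predict(W[test])
        return res_D @ res_Y / (res_D @ res_D)  # OLS of residual on residual
    ```

    Regressing residual on residual is what removes the spurious contribution of the confounders, which is exactly the mechanism credited for the tighter standard errors relative to the copula approach.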
    End-to-end AI Framework for Hyperparameter Optimization, Model Training, and Interpretable Inference for Molecules and Crystals. (arXiv:2212.11317v1 [cond-mat.mtrl-sci])
    We introduce an end-to-end computational framework that enables hyperparameter optimization with the DeepHyper library, accelerated training, and interpretable AI inference with a suite of state-of-the-art AI models, including CGCNN, PhysNet, SchNet, MPNN, MPNN-transformer, and TorchMD-Net. We use these AI models and the benchmark QM9, hMOF, and MD17 datasets to showcase the prediction of user-specified materials properties in modern computing environments, and to demonstrate translational applications for the modeling of small molecules, crystals and metal organic frameworks with a unified, stand-alone framework. We deployed and tested this framework on the ThetaGPU supercomputer at the Argonne Leadership Computing Facility, and on the Delta supercomputer at the National Center for Supercomputing Applications, to provide researchers with modern tools to conduct accelerated AI-driven discovery in leadership-class computing environments.  ( 2 min )
  • Open

    Is Out-of-Distribution Detection Learnable?. (arXiv:2210.14707v2 [cs.LG] UPDATED)
    Supervised learning aims to train a classifier under the assumption that training and test data are from the same distribution. To ease the above assumption, researchers have studied a more realistic setting: out-of-distribution (OOD) detection, where test data may come from classes that are unknown during training (i.e., OOD data). Due to the unavailability and diversity of OOD data, good generalization ability is crucial for effective OOD detection algorithms. To study the generalization of OOD detection, in this paper, we investigate the probably approximately correct (PAC) learning theory of OOD detection, which is proposed by researchers as an open problem. First, we find a necessary condition for the learnability of OOD detection. Then, using this condition, we prove several impossibility theorems for the learnability of OOD detection under some scenarios. Although the impossibility theorems are frustrating, we find that some conditions of these impossibility theorems may not hold in some practical scenarios. Based on this observation, we next give several necessary and sufficient conditions to characterize the learnability of OOD detection in some practical scenarios. Lastly, we also offer theoretical support for several representative OOD detection works based on our OOD theory.  ( 2 min )
    On the Sparse DAG Structure Learning Based on Adaptive Lasso. (arXiv:2209.02946v2 [stat.ML] UPDATED)
    Learning the underlying Bayesian Networks (BNs), represented by directed acyclic graphs (DAGs), of the concerned events from purely observational data is a crucial part of evidential reasoning. This task remains challenging due to the large and discrete search space. A recent flurry of developments following NOTEARS [1] recast this combinatorial problem as a continuous optimization problem by leveraging an algebraic equality characterization of acyclicity. However, the continuous optimization methods suffer from obtaining non-sparse graphs after the numerical optimization, which makes it inflexible to rule out potentially cycle-inducing edges or false-discovery edges with small values. To address this issue, in this paper we develop a completely data-driven DAG structure learning method that requires no predefined threshold for pruning small values. We name our method NOTEARS with adaptive Lasso (NOTEARS-AL); it applies the adaptive penalty method to ensure the sparsity of the estimated DAG. Moreover, we show that NOTEARS-AL also inherits the oracle properties under some specific conditions. Extensive experiments on both synthetic and a real-world dataset verify the efficacy of the proposed method.  ( 2 min )
    Generalized Stable Weights via Neural Gibbs Density. (arXiv:2211.07533v2 [stat.ML] UPDATED)
    We present a generalized balancing method -- stable weights via Neural Gibbs Density -- fully available for estimating causal effects for an arbitrary mixture of discrete and continuous interventions. Our weights are trainable through back-propagation and can be obtained with neural network algorithms. In addition, we also provide a method to measure the performance of our weights by estimating the mutual information for the balanced distribution. Our method is easy to implement with any present deep learning library, and the weights from it can be used in most state-of-the-art supervised algorithms.  ( 2 min )
    Variable Selection with the Knockoffs: Composite Null Hypotheses. (arXiv:2203.02849v3 [math.ST] UPDATED)
    The Fixed-X knockoff filter is a flexible framework for variable selection with false discovery rate (FDR) control in linear models with arbitrary (non-singular) design matrices and it allows for finite-sample selective inference via the LASSO estimates. In this paper, we extend the theory of the knockoff procedure to tests with composite null hypotheses, which are usually more relevant to real-world problems. The main technical challenge lies in handling composite nulls in tandem with dependent features from arbitrary designs. We develop two methods for composite inference with the knockoffs, namely, shifted ordinary least-squares (S-OLS) and feature-response product perturbation (FRPP), building on new structural properties of test statistics under composite nulls. We also propose two heuristic variants of the S-OLS method that outperform the celebrated Benjamini-Hochberg (BH) procedure for composite nulls, which serves as a heuristic baseline under dependent test statistics. Finally, we analyze the loss in FDR when the original knockoff procedure is naively applied on composite tests.  ( 2 min )
    Learning-based Optimal Admission Control in a Single Server Queuing System. (arXiv:2212.11316v1 [math.OC])
    We consider a long-term average profit maximizing admission control problem in an M/M/1 queuing system with a known arrival rate but an unknown service rate. With a fixed reward collected upon service completion and a cost per unit of time enforced on customers waiting in the queue, a dispatcher decides upon arrivals whether to admit the arriving customer or not, based on the full history of observations of the queue length of the system. Naor (Econometrica) showed that if all the parameters of the model are known, then it is optimal to use a static threshold policy: admit if the queue length is less than a predetermined threshold, and otherwise do not. We propose a learning-based dispatching algorithm and characterize its regret with respect to optimal dispatch policies for the full-information model of Naor. We show that the algorithm achieves an $O(1)$ regret when all optimal thresholds with full information are non-zero, and achieves an $O(\ln^{3+\epsilon}(N))$ regret in the case that an optimal threshold with full information is $0$ (i.e., an optimal policy is to reject all arrivals), where $N$ is the number of arrivals and $\epsilon>0$.  ( 2 min )
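    For intuition, with all parameters known the static threshold policy can be evaluated in closed form from the stationary distribution of the resulting M/M/1/n queue, and the optimal threshold found by direct search. The sketch below assumes this standard birth-death analysis; the rate and cost values are arbitrary examples.

    ```python
    def avg_profit(lam, mu, R, c, n):
        """Long-run average profit of the static threshold policy that admits
        an arrival iff the queue length is below n (M/M/1/n stationary analysis).
        R: reward per completed service; c: holding cost per customer per unit time."""
        rho = lam / mu
        weights = [rho ** k for k in range(n + 1)]
        Z = sum(weights)
        pi = [w / Z for w in weights]                    # stationary distribution
        mean_in_system = sum(k * p for k, p in enumerate(pi))
        throughput = lam * (1 - pi[n])                   # arrivals blocked at the threshold
        return R * throughput - c * mean_in_system

    # With known rates, Naor's optimal threshold is simply the best n:
    lam, mu, R, c = 0.8, 1.0, 10.0, 1.0
    best_n = max(range(30), key=lambda n: avg_profit(lam, mu, R, c, n))
    ```

    The learning problem studied in the paper is exactly this search when mu is unknown and must be estimated from observed service completions, which is what generates regret relative to the full-information threshold.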
    A Mathematical Framework for Learning Probability Distributions. (arXiv:2212.11481v1 [stat.ML])
    The modeling of probability distributions, specifically generative modeling and density estimation, has become an immensely popular subject in recent years by virtue of its outstanding performance on sophisticated data such as images and texts. Nevertheless, a theoretical understanding of its success is still incomplete. One mystery is the paradox between memorization and generalization: In theory, the model is trained to be exactly the same as the empirical distribution of the finite samples, whereas in practice, the trained model can generate new samples or estimate the likelihood of unseen samples. Likewise, the overwhelming diversity of distribution learning models calls for a unified perspective on this subject. This paper provides a mathematical framework such that all the well-known models can be derived based on simple principles. To demonstrate its efficacy, we present a survey of our results on the approximation error, training error and generalization error of these models, which can all be established based on this framework. In particular, the aforementioned paradox is resolved by proving that these models enjoy implicit regularization during training, so that the generalization error at early-stopping avoids the curse of dimensionality. Furthermore, we provide some new results on landscape analysis and the mode collapse phenomenon.  ( 2 min )
    Truncated Matrix Power Iteration for Differentiable DAG Learning. (arXiv:2208.14571v2 [cs.LG] UPDATED)
    Recovering underlying Directed Acyclic Graph (DAG) structures from observational data is highly challenging due to the combinatorial nature of the DAG-constrained optimization problem. Recently, DAG learning has been cast as a continuous optimization problem by characterizing the DAG constraint as a smooth equality, generally based on polynomials over adjacency matrices. Existing methods place very small coefficients on high-order polynomial terms for stabilization, arguing that large coefficients on the higher-order terms are harmful due to numerical explosion. On the contrary, we discover that large coefficients on higher-order terms are beneficial for DAG learning when the spectral radii of the adjacency matrices are small, and that larger coefficients for higher-order terms can approximate the DAG constraints much better than small ones. Based on this, we propose a novel DAG learning method with efficient truncated matrix power iteration to approximate geometric-series-based DAG constraints. Empirically, our DAG learning method outperforms the previous state of the art in various settings, often by a factor of $3$ or more in terms of structural Hamming distance.  ( 2 min )
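    A minimal sketch of a truncated geometric-series DAG constraint follows; the truncation order K and the constant coefficient schedule are illustrative assumptions, not the paper's exact choices.

    ```python
    import numpy as np

    def dag_constraint_truncated(W, K=20, c=2.0):
        """Truncated geometric-series DAG constraint (sketch).
        h(W) = trace( sum_{k=1..K} c^k * (W*W)^k ); h(W) = 0 iff the
        weighted graph is acyclic. Large coefficients c on higher-order
        terms sharpen the constraint when the spectral radius of W*W
        is small."""
        A = W * W                       # non-negative adjacency proxy
        term = np.eye(A.shape[0])
        h = 0.0
        for _ in range(K):
            term = c * term @ A         # accumulates c^k * A^k incrementally
            h += np.trace(term)
        return h
    ```

    Each trace term counts weighted cycles of length k, so the sum vanishes exactly when no directed cycle of any length up to K exists; the iteration avoids ever forming high matrix powers explicitly.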
    The Cosmic Graph: Optimal Information Extraction from Large-Scale Structure using Catalogues. (arXiv:2207.05202v3 [astro-ph.CO] UPDATED)
    We present an implicit likelihood approach to quantifying cosmological information over discrete catalogue data, assembled as graphs. To do so, we explore cosmological parameter constraints using mock dark matter halo catalogues. We employ Information Maximising Neural Networks (IMNNs) to quantify Fisher information extraction as a function of graph representation. We a) demonstrate the high sensitivity of modular graph structure to the underlying cosmology in the noise-free limit, b) show that graph neural network summaries automatically combine mass and clustering information through comparisons to traditional statistics, c) demonstrate that networks can still extract information when catalogues are subject to noisy survey cuts, and d) illustrate how nonlinear IMNN summaries can be used as asymptotically optimal compressed statistics for Bayesian simulation-based inference. We reduce the area of joint $\Omega_m, \sigma_8$ parameter constraints with small ($\sim$100 object) halo catalogues by a factor of 42 over the two-point correlation function, and demonstrate that the networks automatically combine mass and clustering information. This work utilises a new IMNN implementation over graph data in Jax, which can take advantage of either numerical or auto-differentiability. We also show that graph IMNNs successfully compress simulations away from the fiducial model at which the network is fitted, indicating a promising alternative to n-point statistics in catalogue simulation-based analyses.  ( 2 min )
    Time to Market Reduction for Hydrogen Fuel Cell Stacks using Generative Adversarial Networks. (arXiv:2212.11733v1 [cs.AI])
    To face the dependency on fossil fuels and limit carbon emissions, fuel cells are a very promising technology and appear to be a key candidate to tackle the increase in energy demand and promote the energy transition. To meet future needs for both transport and stationary applications, the time to market of fuel cell stacks must be drastically reduced. Here, a new concept to shorten their development time by introducing a disruptive, high-efficiency data augmentation approach based on artificial intelligence is presented. Our results allow reducing the testing time before introducing a product on the market from a thousand hours to a few hours. The innovative concept proposed here can support engineering and research tasks during the fuel cell development process to achieve decreased development costs alongside a reduced time to market.  ( 2 min )
    Renormalization in the neural network-quantum field theory correspondence. (arXiv:2212.11811v1 [hep-th])
    A statistical ensemble of neural networks can be described in terms of a quantum field theory (NN-QFT correspondence). The infinite-width limit is mapped to a free field theory, while finite N corrections are mapped to interactions. After reviewing the correspondence, we will describe how to implement renormalization in this context and discuss preliminary numerical results for translation-invariant kernels. A major outcome is that changing the standard deviation of the neural network weight distribution corresponds to a renormalization flow in the space of networks.  ( 2 min )
    MissDAG: Causal Discovery in the Presence of Missing Data with Continuous Additive Noise Models. (arXiv:2205.13869v2 [cs.LG] UPDATED)
    State-of-the-art causal discovery methods usually assume that the observational data is complete. However, the missing data problem is pervasive in many practical scenarios such as clinical trials, economics, and biology. One straightforward way to address the missing data problem is first to impute the data using off-the-shelf imputation methods and then apply existing causal discovery methods. However, such a two-step method may suffer from suboptimality, as the imputation algorithm may introduce bias for modeling the underlying data distribution. In this paper, we develop a general method, which we call MissDAG, to perform causal discovery from data with incomplete observations. Focusing mainly on the assumptions of ignorable missingness and the identifiable additive noise models (ANMs), MissDAG maximizes the expected likelihood of the visible part of observations under the expectation-maximization (EM) framework. In the E-step, in cases where computing the posterior distributions of parameters in closed-form is not feasible, Monte Carlo EM is leveraged to approximate the likelihood. In the M-step, MissDAG leverages the density transformation to model the noise distributions with simpler and specific formulations by virtue of the ANMs and uses a likelihood-based causal discovery algorithm with directed acyclic graph constraint. We demonstrate the flexibility of MissDAG for incorporating various causal discovery algorithms and its efficacy through extensive simulations and real data experiments.  ( 2 min )
    End-to-End Learned Early Classification of Time Series for In-Season Crop Type Mapping. (arXiv:1901.10681v2 [cs.LG] UPDATED)
    Remote sensing satellites capture the cyclic dynamics of our Planet in regular time intervals recorded in satellite time series data. End-to-end trained deep learning models use this time series data to make predictions at a large scale, for instance, to produce up-to-date crop cover maps. Most time series classification approaches focus on the accuracy of predictions. However, the earliness of the prediction is also of great importance since coming to an early decision can make a crucial difference in time-sensitive applications. In this work, we present an End-to-End Learned Early Classification of Time Series (ELECTS) model that estimates a classification score and a probability of whether sufficient data has been observed to come to an early and still accurate decision. ELECTS is modular: any deep time series classification model can adopt the ELECTS conceptual idea by adding a second prediction head that outputs a probability of stopping the classification. The ELECTS loss function then optimizes the overall model on a balanced objective of earliness and accuracy. Our experiments on four crop classification datasets from Europe and Africa show that ELECTS allows reaching state-of-the-art accuracy while reducing the quantity of data massively to be downloaded, stored, and processed. The source code is available at https://github.com/marccoru/elects.  ( 2 min )
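    The two-head idea can be sketched as a single loss that weights per-timestep classification error and an earliness penalty by the probability of stopping at each step. The formulation below is a simplified stand-in for the actual ELECTS loss (the repository linked above has the authoritative version).

    ```python
    import torch
    import torch.nn.functional as F

    def elects_style_loss(logits, stop_prob, targets, alpha=0.5):
        """Simplified earliness/accuracy objective in the spirit of ELECTS.
        logits: (B, T, C) per-timestep class scores from the first head;
        stop_prob: (B, T) per-timestep stopping probability from the second
        head; targets: (B,) class labels. Details differ from the paper."""
        B, T, C = logits.shape
        # Probability of stopping exactly at t: d_t * prod_{s<t} (1 - d_s).
        keep_going = torch.cumprod(1.0 - stop_prob, dim=1)
        p_stop_at = stop_prob * torch.cat(
            [torch.ones(B, 1), keep_going[:, :-1]], dim=1)
        ce = torch.stack(
            [F.cross_entropy(logits[:, t], targets, reduction="none")
             for t in range(T)], dim=1)                       # (B, T)
        earliness = torch.arange(T, dtype=torch.float32) / T  # later = costlier
        loss = (p_stop_at * (alpha * ce + (1 - alpha) * earliness)).sum(dim=1)
        return loss.mean()
    ```

    Because the stopping distribution multiplies both terms, the model is rewarded for stopping early only when the classification at that timestep is already reliable.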
    Missing Data Imputation and Acquisition with Deep Hierarchical Models and Hamiltonian Monte Carlo. (arXiv:2202.04599v5 [cs.LG] UPDATED)
    Variational Autoencoders (VAEs) have recently been highly successful at imputing and acquiring heterogeneous missing data. However, within this specific application domain, existing VAE methods are restricted by using only one layer of latent variables and strictly Gaussian posterior approximations. To address these limitations, we present HH-VAEM, a Hierarchical VAE model for mixed-type incomplete data that uses Hamiltonian Monte Carlo with automatic hyper-parameter tuning for improved approximate inference. Our experiments show that HH-VAEM outperforms existing baselines in the tasks of missing data imputation and supervised learning with missing features. Finally, we also present a sampling-based approach for efficiently computing the information gain when missing features are to be acquired with HH-VAEM. Our experiments show that this sampling-based approach is superior to alternatives based on Gaussian approximations.  ( 2 min )
    Robust Meta-Representation Learning via Global Label Inference and Classification. (arXiv:2212.11702v1 [cs.LG])
    Few-shot learning (FSL) is a central problem in meta-learning, where learners must efficiently learn from few labeled examples. Within FSL, feature pre-training has recently become an increasingly popular strategy to significantly improve generalization performance. However, the contribution of pre-training is often overlooked and understudied, with limited theoretical understanding of its impact on meta-learning performance. Further, pre-training requires a consistent set of global labels shared across training tasks, which may be unavailable in practice. In this work, we address the above issues by first showing the connection between pre-training and meta-learning. We discuss why pre-training yields more robust meta-representation and connect the theoretical analysis to existing works and empirical results. Secondly, we introduce Meta Label Learning (MeLa), a novel meta-learning algorithm that learns task relations by inferring global labels across tasks. This allows us to exploit pre-training for FSL even when global labels are unavailable or ill-defined. Lastly, we introduce an augmented pre-training procedure that further improves the learned meta-representation. Empirically, MeLa outperforms existing methods across a diverse range of benchmarks, in particular under a more challenging setting where the number of training tasks is limited and labels are task-specific. We also provide an extensive ablation study to highlight its key properties.  ( 2 min )
    Co-clustering based exploratory analysis of mixed-type data tables. (arXiv:2212.11728v1 [cs.LG])
    Co-clustering is a class of unsupervised data analysis techniques that extract the existing underlying dependency structure between the instances and variables of a data table as homogeneous blocks. Most of those techniques are limited to variables of the same type. In this paper, we propose a mixed data co-clustering method based on a two-step methodology. In the first step, all the variables are binarized according to a number of bins chosen by the analyst, by equal frequency discretization in the numerical case, or keeping the most frequent values in the categorical case. The second step applies a co-clustering to the instances and the binary variables, leading to groups of instances and groups of variable parts. We apply this methodology on several data sets and compare with the results of a Multiple Correspondence Analysis applied to the same data.  ( 2 min )
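    The first (binarization) step is easy to express with pandas; the sketch below assumes equal-frequency binning for numeric columns and most-frequent-value grouping for categorical ones, as described, with the residual categories collapsed into an "other" level.

    ```python
    import pandas as pd

    def binarize_mixed(df: pd.DataFrame, n_bins: int = 4) -> pd.DataFrame:
        """Step 1 of the two-step methodology: turn every variable into
        indicator columns. Numerical columns are split by equal-frequency
        discretization; categorical columns keep their most frequent values."""
        parts = []
        for col in df.columns:
            if pd.api.types.is_numeric_dtype(df[col]):
                binned = pd.qcut(df[col], q=n_bins, duplicates="drop")
            else:
                top = df[col].value_counts().nlargest(n_bins - 1).index
                binned = df[col].where(df[col].isin(top), other="other")
            parts.append(pd.get_dummies(binned, prefix=col))
        return pd.concat(parts, axis=1)
    ```

    The resulting binary table is then handed to an ordinary binary co-clustering algorithm in the second step, grouping instances against variable parts.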
    The State of the Art in Enhancing Trust in Machine Learning Models with the Use of Visualizations. (arXiv:2212.11737v1 [cs.LG])
    Machine learning (ML) models are nowadays used in complex applications in various domains, such as medicine, bioinformatics, and other sciences. Due to their black box nature, however, it may sometimes be hard to understand and trust the results they provide. This has increased the demand for reliable visualization tools related to enhancing trust in ML models, which has become a prominent topic of research in the visualization community over the past decades. To provide an overview and present the frontiers of current research on the topic, we present a State-of-the-Art Report (STAR) on enhancing trust in ML models with the use of interactive visualization. We define and describe the background of the topic, introduce a categorization for visualization techniques that aim to accomplish this goal, and discuss insights and opportunities for future research directions. Among our contributions is a categorization of trust against different facets of interactive ML, expanded and improved from previous research. Our results are investigated from different analytical perspectives: (a) providing a statistical overview, (b) summarizing key findings, (c) performing topic analyses, and (d) exploring the data sets used in the individual papers, all with the support of an interactive web-based survey browser. We intend this survey to be beneficial for visualization researchers whose interests involve making ML models more trustworthy, as well as researchers and practitioners from other disciplines in their search for effective visualization techniques suitable for solving their tasks with confidence and conveying meaning to their data.  ( 2 min )
    Mean-field neural networks-based algorithms for McKean-Vlasov control problems *. (arXiv:2212.11518v1 [math.OC])
    This paper is devoted to the numerical resolution of McKean-Vlasov control problems via the class of mean-field neural networks introduced in our companion paper [25] in order to learn the solution on the Wasserstein space. We propose several algorithms either based on dynamic programming with control learning by policy or value iteration, or backward SDE from stochastic maximum principle with global or local loss functions. Extensive numerical results on different examples are presented to illustrate the accuracy of each of our eight algorithms. We discuss and compare the pros and cons of all the tested methods.  ( 2 min )
    Inference of Nonlinear Partial Differential Equations via Constrained Gaussian Processes. (arXiv:2212.11880v1 [math.NA])
    Partial differential equations (PDEs) are widely used for the description of physical and engineering phenomena. Some key parameters involved in PDEs, which represent certain physical properties with important scientific interpretations, are difficult or even impossible to measure directly. Estimating these parameters from noisy and sparse experimental data of related physical quantities is an important task. Many methods for PDE parameter inference involve a large number of evaluations of the numerical solution of the PDE through algorithms such as the finite element method, which can be time-consuming, especially for nonlinear PDEs. In this paper, we propose a novel method for estimating unknown parameters in PDEs, called PDE-Informed Gaussian Process Inference (PIGPI). By modeling the PDE solution as a Gaussian process (GP), we derive the manifold constraints induced by the (linear) PDE structure such that, under the constraints, the GP satisfies the PDE. For nonlinear PDEs, we propose an augmentation method that transfers the nonlinear PDE into an equivalent PDE system linear in all derivatives, which our PIGPI can handle. PIGPI can be applied to multi-dimensional PDE systems and PDE systems with unobserved components. The method completely bypasses the numerical solver for the PDE, thus achieving drastic savings in computation time, especially for nonlinear PDEs. Moreover, the PIGPI method can give uncertainty quantification for both the unknown parameters and the PDE solution. The proposed method is demonstrated by several application examples from different areas.  ( 2 min )
    Federated Learning -- Methods, Applications and beyond. (arXiv:2212.11729v1 [cs.LG])
    In recent years the applications of machine learning models have increased rapidly, due to the large amount of available data and technological progress. While some domains like web analysis can benefit from this with only minor restrictions, other fields, like medicine with patient data, are more strongly regulated. In particular, \emph{data privacy} plays an important role, as recently highlighted by the trustworthy AI initiative of the EU and general privacy regulations in legislation. Another major challenge is that the required training \emph{data} is often \emph{distributed} in terms of features or samples and unavailable for classical batch learning approaches. In 2016 Google came up with a framework called \emph{Federated Learning} to solve both of these problems. We provide a brief overview of existing methods and applications in the field of vertical and horizontal \emph{Federated Learning}, as well as \emph{Federated Transfer Learning}.  ( 2 min )
    Model Based Co-clustering of Mixed Numerical and Binary Data. (arXiv:2212.11725v1 [cs.LG])
    Co-clustering is a data mining technique used to extract the underlying block structure between the rows and columns of a data matrix. Many approaches have been studied and have shown their capacity to extract such structures in continuous, binary or contingency tables. However, very little work has been done to perform co-clustering on mixed type data. In this article, we extend the latent block models based co-clustering to the case of mixed data (continuous and binary variables). We then evaluate the effectiveness of the proposed approach on simulated data and we discuss its advantages and potential limits.  ( 2 min )

  • Open

    Seeking Feedback on a new YouTube Video Series I'm Working on Called "AI News"
    Hey r/artificial! I'm new here, and also fairly new to video content creation, so please be kind! I don't have many close friends that are very interested in AI so I figured I might turn to Reddit for some constructive feedback. The intent of this series will be to cover AI news once every few weeks, and to have it be centered around content that appeals to a diverse audience (from fellow tech nerds, to your parents alike). I know the text scrolling is not great so I will work on figuring out something better for the next video, but any other suggestions or advice on how I can improve this are greatly appreciated! AI News - Episode 1 - Dec. 2022 - https://www.youtube.com/watch?v=QtjY_dINLyU submitted by /u/Kitten-Smuggler [link] [comments]  ( 50 min )
    Artificial neural network language model. Bad logic and false claims in compelling language?
    I asked ChatGPT which animal was the biggest: a whale shark or an elephant (inspired by a LinkedIn post). ChatGPT claims the elephant is bigger even though that's wrong. I therefore started wondering how reliable a (C)NN (or other subsymbolic method) can become as an AI. I think many people regard ChatGPT as a kind of AGI and not as a model that can put words together nicely (which in itself is super impressive). Do we expect reasoning and logic to arrive with more parameters and model layers? Or should we always expect these models to make mistakes? Doesn't mathematical logic require hard-coded rules rather than (gradient descent) optimized models? What I'm asking is: is the NN approach ever gonna achieve reliable logic? (I'm really not a GPT hater! I have huge respect for the work of OpenAI!!) submitted by /u/Keepitsimpleistaken [link] [comments]  ( 59 min )
    OPT-IML: Meta releases open source language model optimized for tasks
    submitted by /u/Number_5_alive [link] [comments]  ( 49 min )
    ChatGPT plays tic-tac-toe
    submitted by /u/ArdArt [link] [comments]  ( 49 min )
    AI Dream 138 - EPIC AI ANIMATION - TRUE MASTERPIECE
    submitted by /u/LordPewPew777 [link] [comments]  ( 48 min )
    Is This The End For AI Art?
    submitted by /u/PuppetHere [link] [comments]  ( 49 min )
    A question as a beginner
    Hi, guys. Could you please suggest to me any forum or a chat where I can find people who are truly beginners in AI with which I could learn AI and also try to create some pet projects as a team? submitted by /u/flexjump [link] [comments]  ( 49 min )
    I made a site that uses AI to generate storyboards
    I started working on this yesterday, so still very much a work in progress. The idea is that you can use AI art generation and save those images to panels to create either story boards or web comics. I plan on adding toggles for style and uniformity between the panels. Let me know what you think please! https://storaiboard.com/ submitted by /u/Iz4k4y4 [link] [comments]  ( 50 min )
    How can artificial intelligence help visually impaired students in the future?
    Seeing artificial intelligence developing in an unstoppable way, I feel like it can make everything possible. Let me know your views regarding this question. submitted by /u/iTsForza [link] [comments]  ( 48 min )
    Notes of Diffusion Models and Self-supervised Computer Vision Papers
    I will collect some important generative model papers, make some notes, and keep this post updated. Just for self-summarization; further discussion on this topic is welcome. :D Generative Models: High-Resolution Image Synthesis with Latent Diffusion Models: motivation and overview. [cvpr2022] {cite:300} Robin Rombach, Björn Ommer Traditional models are very space-consuming because they are applied to pixels. This paper applies diffusion models to the latent space, reducing computational requirements while maintaining the quality of the generated images. Palette: Image-to-Image Diffusion Models [arxiv] {cite:100} Chitwan Saharia, Mohammad Norouzi This is an image-to-image paper based on diffusion, with a simple idea of adding a conditioning input to the denoising model. f_{\the…  ( 76 min )
    🤯 A.I. Can Help Detect Brain Tumor Boundaries
    submitted by /u/BackgroundResult [link] [comments]  ( 48 min )
    Maybe not the right thread, but it would be interesting to teach an AI to make a game, with its learned examples being genuinely groundbreaking games like BOTW, Elden Ring, Mass Effect 1-3, Skyrim, etc. It might not be good, but it'd be interesting to see the result!
    submitted by /u/TheblcklistedX01 [link] [comments]  ( 48 min )
    A fictional story created by multiple AI with only a title (Blue Eye vs Goku)
    Well here's the story https://docs.google.com/document/d/1u5Am2j5koDBzss8kf5iJySG2dJ0Z1rhwa2hdTqSVeQ8/edit?usp=drivesdk Just remember this story was created by ten different AI each sentence created by a different one. I did give the first AI that created the first sentence a little bit of backstory about both characters. This was an experiment to see if different AI could work together to create something. submitted by /u/Competitive_Case1076 [link] [comments]  ( 51 min )
    The Universe That AI Created. We posted a similar video here before, but we've cut it down to get straight to the story we put together with AI programs such as Hotpot and ChatGPT. We think the story is definitely worth a view!
    submitted by /u/GFWaltz [link] [comments]  ( 48 min )
    What is Constitutional AI: Harmlessness from AI Feedback
    submitted by /u/BackgroundResult [link] [comments]  ( 55 min )
    Google Is Working on Some Amazing Artificial Intelligence Products
    submitted by /u/liquidocelotYT [link] [comments]  ( 47 min )
    How AI is giving hackers and cyber criminals more ways to pull off heists, focusing on the story of a $35 million hack that was pulled off using artificial intelligence and deep-voice software
    submitted by /u/deron666 [link] [comments]  ( 50 min )
    AI Dream 127 - AMAZING new AI Settings - Detailed & Smooth
    submitted by /u/LordPewPew777 [link] [comments]  ( 49 min )
    MIT's PoolText uses AI for making scientific publishing easier around the world
    submitted by /u/qptbook [link] [comments]  ( 47 min )
    Google trends graph showing the early impact of ChatGPT on public perception of AI
    submitted by /u/unununium333 [link] [comments]  ( 49 min )
    𖦹I asked AI to make a Music Video… the results are trippy𖦹
    submitted by /u/Prior_Appearance_44 [link] [comments]  ( 46 min )
    🚨 Google Issues "Code Red" Over ChatGPT
    submitted by /u/BackgroundResult [link] [comments]  ( 58 min )
  • Open

    [D] Fourth brain review
    Did anyone take this boot camp? How was it? thanks submitted by /u/cambridgecoder415 [link] [comments]  ( 66 min )
    I’m looking for resources for building a chatbot without libraries like chatterbot
    submitted by /u/ADumFuk [link] [comments]  ( 65 min )
    [D] Meta AI Residency 2023
    Creating this thread for AI Residency applicants at Meta 2023. Any new information or update on application/interview are welcome. Can also have general discussion about applicant backgrounds, backups and future plans. submitted by /u/Around-star [link] [comments]  ( 65 min )
    [P] I trained a model to tell if you were naughty this year
    https://mlem-nice-or-naughty.fly.dev Please enjoy! :) And, here is the blog post about DDoSing Santa's website and training a Christmas decision tree ^ https://medium.com/@mike0sv/i-trained-a-model-to-tell-if-you-were-naughty-this-year-11a36ca6d472 Have a good New Year's mood! submitted by /u/1aguschin [link] [comments]  ( 65 min )
    [P] Combining Kakaobrain's Karlo text-conditional diffusion model with Stable-Diffusion 2.1 (WebUI)
    Github Link: https://github.com/kpthedev/stable-karlo I made stable-karlo, an app that combines Kakaobrain's Karlo image generation model with Stable-Diffusion 2.1 in a nice webUI. Recently, Kakaobrain released Karlo, their own image generating diffusion model which uses OpenAI's unCLIP architecture. The model is great at understanding text and relationships, but it only outputs 256x256 pixel images. I had the idea to combine Karlo with the new Stable-Diffusion v2 upscaler to get large images and the results are very promising. Please check out the Github and share your thoughts! submitted by /u/kpthedev [link] [comments]  ( 73 min )
    [D] Has anyone integrated ChatGPT with scientific papers?
    A guy on Twitter shared a ChatGPT that is aware of all the podcasts from Andrew Huberman, which is great (https://huberman.rile.yt/?query=) Has anyone open sourced something like ChatGPT that is easy to fine-tune with external knowledge, potentially tested on scientific papers? It would be great for brainstorming, writing research proposals and exploring the literature in a different way. Maybe even integrating it with Zotero. So far I have talked about fine-tuning the model, but let's say I take the easier path of few-shot learning instead. Is there a way to save the state of ChatGPT? In other words, if I open a new chat and feed it all the papers by copy and paste, for example, is there a way I can use it next week? Sometimes I have found the session to expire, but recently it seems past chats are saved. Do you believe this will last indefinitely? TL;DR: best way to adapt ChatGPT to specific knowledge? submitted by /u/justrandomtourist [link] [comments]  ( 66 min )
    An Empirical Study of Training End-to-End Vision-and-Language Transformers: Are there any mathematical or logical proofs as to why pre-trained embeddings perform better or worse on downstream NLP tasks?
    submitted by /u/wise0807 [link] [comments]  ( 116 min )
    [D] Web scraping from Google scholar articles or journal articles
    Hi! I'm relatively new to machine learning and came up w a project of my own. I'm hoping to create a database to suit the needs of my project and was thinking whether there are any APIs available to assist me. The data that I am looking for are molecular data, mainly their optical properties and ADME-T. Please let me know if this is the wrong place to ask, thanks! submitted by /u/NotPaulDirac [link] [comments]  ( 66 min )
    [Discussion] Anyone else having a hard time not getting mad/cringing at the general public anthropomorphizing the hell out of chatGPT?
    It was one thing with DALLE-2, but at least it couldn't talk back to them. I have been in board meetings where powerful people in leadership positions that have nothing to do with tech have absolutely horrendous ideas about what ChatGPT is. I am not lying, I have genuinely heard them say they believe it's basically conscious, and they are using excerpt screenshots of it saying it hates humans as a basis for business decisions about the future of AI in their company. Like... WHAT? Have other people heard absurd things like this too? I think it's just hard to watch the professional reality of machine learning become so divorced from the general public's idea of machine learning. I'm sure as we all get even better at our jobs it's only going to get much, much worse. I wouldn't be surprised if soon we are the new magical witches of the world. I'll see you guys on the pyres in 20 years. (OK, really, I'm just joking on that last part.) What do you all think? submitted by /u/gettheflyoffmycock [link] [comments]  ( 91 min )
  • Open

    Multi-Objective RL where I want to optimize ratio of two total returns
    I have an environment that produces two reward functions r1(s, a, s') and r2(s, a, s') and the task is continuing, i.e. there is no episode end. I want to find the policy that maximizes the ratio `R1 / R2` where R1, R2 are the total discounted reward for r1 and r2. In particular, the environment is such that there is a trade-off between r1 and r2, i.e. some action might increase r1 at the expense of r2. To the best of my knowledge, there is no way of breaking down R1 / R2 into a sum of single-step rewards. And if I create a linear combination alpha * r1 - beta * r2, it becomes very difficult for me to balance alpha and beta to achieve maximal R1 / R2. Do you know any work that deals with a similar problem? Would an on-policy algorithm like PPO with two value estimators V1 and V2, and the policy being updated with respect to V1/V2 work? I am going to try soon, but I'm unsure about its theoretical grounds. submitted by /u/fedetask [link] [comments]  ( 57 min )
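    One hypothetical way to wire up the two-critic idea from the post: keep separate advantage estimates A1, A2 and critics V1, V2 for the two reward streams, and combine them with the quotient rule, so the per-step policy-gradient signal locally follows d(R1/R2) = (dR1 * R2 - R1 * dR2) / R2^2. This is a heuristic sketch with no theoretical guarantees, not an established algorithm.

```python
import numpy as np

# Heuristic combined advantage for maximizing a ratio of returns R1 / R2.
# adv1/adv2: per-step advantages under rewards r1 and r2
# v1/v2: critic estimates of the discounted returns R1 and R2
def combined_advantage(adv1, adv2, v1, v2, eps=1e-8):
    # quotient rule: d(R1/R2) = (dR1 * R2 - R1 * dR2) / R2^2
    return (adv1 * v2 - v1 * adv2) / (v2 ** 2 + eps)

# toy numbers: an action that raises R1 a lot while barely raising R2
adv1, adv2 = np.array([1.0]), np.array([0.1])
v1, v2 = np.array([5.0]), np.array([10.0])
print(combined_advantage(adv1, adv2, v1, v2))  # positive -> reinforce action
```

    In a PPO-style loop this combined quantity would simply replace the usual advantage in the clipped surrogate loss, while V1 and V2 are each trained against their own discounted returns.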
    Is Monte Carlo good for problems which only terminate in rare cases with a positive reward? Because my simulation is taking a hell lot of episodes to even encounter that positive reward goal.
    submitted by /u/FailedMesh [link] [comments]  ( 59 min )
    ChatGPT's RLHF methodology described - and its shortcomings
    submitted by /u/mrx-ai [link] [comments]  ( 59 min )
    How to get started learning RL
    What resources do you recommend for someone just getting started in RL? I am pretty familiar with supervised learning and different algorithms/tradeoffs, but RL is a newer paradigm for me. I'm finishing the RL 2021 lecture series from DeepMind and am fascinated to learn more, but just trying to find a good place to start. Thanks! submitted by /u/code4funle [link] [comments]  ( 59 min )
    [D] What are some fun RL hobby project ideas that don't require TOO much compute?
    Recently I've been really inspired by the superhuman self-driving AI that Polyphony Digital made a few years ago for Gran Turismo, and ideally I would have loved to create a similar AI that performs as well on a different racing game, but looking into the paper it's clear it might be a little out of reach for me (4 PS4s x 20 cars simulated each + 4 1080s for training x several days of wallclock time = oof my poor i3 6100, not to mention the features used, which are going to be difficult to obtain without access to the game's code). Looking into more general algorithms like MuZero and EfficientZero doesn't help much either, as even a simple Atari game needs billions of frames and hundreds of GPUs to properly converge. So basically I'm looking for ideas that I could realistically implement, though it doesn't have to run locally only; maybe it could work like AlphaZero, where I'd gather random data locally, train a network with the new data on Kaggle, gather new data using the new network, and so on. Or maybe something that could run entirely on Kaggle, though that would mean no desktop environment, which could be limiting. Other than self-driving AIs I've also been impressed by applications in the engineering sector, like that AI from a while back that could design chips, or 3D topology optimization with "generative design". So I'm open to anything really. Thanks! submitted by /u/real_beary [link] [comments]  ( 59 min )
  • Open

    How to redact PII data in conversation transcripts
    Customer service interactions often contain personally identifiable information (PII) such as names, phone numbers, and dates of birth. As organizations incorporate machine learning (ML) and analytics into their applications, using this data can provide insights on how to create more seamless customer experiences. However, the presence of PII information often restricts the use of this […]  ( 7 min )
  • Open

    Little-known Secrets About Synthetic Data
    For many people, synthetic data is synonymous with simulations, mock or fake data. The reality is very different. The purpose of this article is to explain what it is about. I discuss potential applications, benefits, limitations, and some little-known facts that synthetic data vendors don't want you to know, mostly because they are unaware of…  ( 21 min )
  • Open

    Two-view Graph Neural Networks for Knowledge Graph Completion. (arXiv:2112.09231v3 [cs.CL] UPDATED)
    We present an effective graph neural network (GNN)-based knowledge graph embedding model, which we name WGE, to capture entity- and relation-focused graph structures. Given a knowledge graph, WGE builds a single undirected entity-focused graph that views entities as nodes. WGE also constructs another single undirected graph from relation-focused constraints, which views entities and relations as nodes. WGE then uses a GNN-based architecture to better learn vector representations of entities and relations from these two entity- and relation-focused graphs. WGE feeds the learned entity and relation representations into a weighted score function to return triple scores for knowledge graph completion. Experimental results show that WGE outperforms competitive baselines, obtaining state-of-the-art performance on seven benchmark datasets for knowledge graph completion.  ( 2 min )
    Understanding Stereotypes in Language Models: Towards Robust Measurement and Zero-Shot Debiasing. (arXiv:2212.10678v1 [cs.CL])
    Generated texts from large pretrained language models have been shown to exhibit a variety of harmful, human-like biases about various demographics. These findings prompted large efforts aiming to understand and measure such effects, with the goal of providing benchmarks that can guide the development of techniques mitigating these stereotypical associations. However, as recent research has pointed out, the current benchmarks lack a robust experimental setup, consequently hindering the inference of meaningful conclusions from their evaluation metrics. In this paper, we extend these arguments and demonstrate that existing techniques and benchmarks aiming to measure stereotypes tend to be inaccurate and contain a high degree of experimental noise that severely limits the knowledge we can gain from benchmarking language models based on them. Accordingly, we propose a new framework for robustly measuring and quantifying biases exhibited by generative language models. Finally, we use this framework to investigate GPT-3's occupational gender bias and propose prompting techniques for mitigating these biases without the need for fine-tuning.  ( 2 min )
    A Non-Asymptotic Analysis of Oversmoothing in Graph Neural Networks. (arXiv:2212.10701v1 [cs.LG])
    A central challenge of building more powerful Graph Neural Networks (GNNs) is the oversmoothing phenomenon, where increasing the network depth leads to homogeneous node representations and thus worse classification performance. While previous works have only demonstrated that oversmoothing is inevitable when the number of graph convolutions tends to infinity, in this paper, we precisely characterize the mechanism behind the phenomenon via a non-asymptotic analysis. Specifically, we distinguish between two different effects when applying graph convolutions -- an undesirable mixing effect that homogenizes node representations in different classes, and a desirable denoising effect that homogenizes node representations in the same class. By quantifying these two effects on random graphs sampled from the Contextual Stochastic Block Model (CSBM), we show that oversmoothing happens once the mixing effect starts to dominate the denoising effect, and the number of layers required for this transition is $O(\log N/\log (\log N))$ for sufficiently dense graphs with $N$ nodes. We also extend our analysis to study the effects of Personalized PageRank (PPR) on oversmoothing. Our results suggest that while PPR mitigates oversmoothing at deeper layers, PPR-based architectures still achieve their best performance at a shallow depth and are outperformed by the graph convolution approach on certain graphs. Finally, we support our theoretical results with numerical experiments, which further suggest that the oversmoothing phenomenon observed in practice may be exacerbated by the difficulty of optimizing deep GNN models.  ( 2 min )
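    The homogenizing effect itself is easy to reproduce numerically. The sketch below is a toy illustration of oversmoothing (our own construction, not the paper's CSBM analysis): repeatedly applying a symmetrically normalized adjacency to random node features collapses them toward the dominant eigenvector, so the spread of features across nodes shrinks with depth.

```python
import numpy as np

# Toy oversmoothing demo: stacking graph convolutions (here, plain repeated
# multiplication by a normalized adjacency) homogenizes node features.

rng = np.random.default_rng(1)
n = 50
A = (rng.random((n, n)) < 0.2).astype(float)
A = np.triu(A, 1)
A = A + A.T + np.eye(n)                              # symmetric, self-loops
d = A.sum(1)
A_hat = A / np.sqrt(d[:, None] * d[None, :])         # sym-normalized adjacency

X = rng.normal(size=(n, 4))                          # random node features
for layer in range(30):
    X = A_hat @ X
    if (layer + 1) % 10 == 0:
        # feature spread across nodes collapses as depth grows
        print(layer + 1, float(X.std(axis=0).mean()))
```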
    NADBenchmarks -- a compilation of Benchmark Datasets for Machine Learning Tasks related to Natural Disasters. (arXiv:2212.10735v1 [cs.LG])
    Climate change has increased the intensity, frequency, and duration of extreme weather events and natural disasters across the world. While the increased data on natural disasters improves the scope of machine learning (ML) in this field, progress is relatively slow. One bottleneck is the lack of benchmark datasets that would allow ML researchers to quantify their progress against a standard metric. The objective of this short paper is to explore the state of benchmark datasets for ML tasks related to natural disasters, categorizing them according to the disaster management cycle. We compile a list of existing benchmark datasets introduced in the past five years. We propose a web platform - NADBenchmarks - where researchers can search for benchmark datasets for natural disasters, and we develop a preliminary version of such a platform using our compiled list. This paper is intended to aid researchers in finding benchmark datasets to train their ML models on, and provide general directions for topics where they can contribute new benchmark datasets.  ( 2 min )
    Prompt-Augmented Linear Probing: Scaling Beyond The Limit of Few-shot In-Context Learners. (arXiv:2212.10873v1 [cs.CL])
    Through in-context learning (ICL), large-scale language models are effective few-shot learners without additional model fine-tuning. However, the ICL performance does not scale well with the number of available training samples as it is limited by the inherent input length constraint of the underlying language model. Meanwhile, many studies have revealed that language models are also powerful feature extractors, allowing them to be utilized in a black-box manner and enabling the linear probing paradigm, where lightweight discriminators are trained on top of the pre-extracted input representations. This paper proposes prompt-augmented linear probing (PALP), a hybrid of linear probing and ICL, which leverages the best of both worlds. PALP inherits the scalability of linear probing and the capability of enforcing language models to derive more meaningful representations via tailoring input into a more conceivable form. Throughout in-depth investigations on various datasets, we verified that PALP significantly enhances the input representations closing the gap between ICL in the data-hungry scenario and fine-tuning in the data-abundant scenario with little training overhead, potentially making PALP a strong alternative in a black-box scenario.  ( 2 min )
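    For reference, the linear-probing half of PALP is just a lightweight classifier fit on frozen LM features; the paper's contribution is feeding prompt-formatted inputs into that pipeline. A minimal sketch, with random vectors standing in for real pre-extracted LM representations:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Linear probing sketch: only a linear head is trained on top of frozen
# features. In PALP those features would come from an LM applied to
# prompt-augmented inputs; here random vectors stand in for them.

rng = np.random.default_rng(0)
n, d = 200, 768                        # e.g. 768-dim hidden states
features = rng.normal(size=(n, d))     # stand-in for pre-extracted features
labels = (features[:, 0] > 0).astype(int)

probe = LogisticRegression(max_iter=1000).fit(features[:150], labels[:150])
print("probe accuracy:", probe.score(features[150:], labels[150:]))
```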
    PABAU: Privacy Analysis of Biometric API Usage. (arXiv:2212.10861v1 [cs.CR])
    Biometric data privacy is becoming a major concern for many organizations in the age of big data, particularly in the ICT sector, because it may be easily exploited in apps. Most apps utilize biometrics by accessing common application programming interfaces (APIs); hence, we aim to categorize their usage. The categorization based on behavior may be closely correlated with the sensitive processing of a user's biometric data, hence highlighting crucial biometric data privacy assessment concerns. We propose PABAU, Privacy Analysis of Biometric API Usage. PABAU learns semantic features of methods in biometric APIs and uses them to detect and categorize the usage of biometric API implementation in the software according to their privacy-related behaviors. This technique bridges the communication and background knowledge gap between technical and non-technical individuals in organizations by providing an automated method for both parties to acquire a rapid understanding of the essential behaviors of biometric API in apps, as well as future support to data protection officers (DPO) with legal documentation, such as conducting a Data Protection Impact Assessment (DPIA).
    Deconstructing Self-Supervised Monocular Reconstruction: The Design Decisions that Matter. (arXiv:2208.01489v4 [cs.CV] UPDATED)
    This paper presents an open and comprehensive framework to systematically evaluate state-of-the-art contributions to self-supervised monocular depth estimation. This includes pretraining, backbone, architectural design choices and loss functions. Many papers in this field claim novelty in either architecture design or loss formulation. However, simply updating the backbone of historical systems results in relative improvements of 25%, allowing them to outperform the majority of existing systems. A systematic evaluation of papers in this field was not straightforward. The need to compare like-with-like in previous papers means that longstanding errors in the evaluation protocol are ubiquitous in the field. It is likely that many papers were not only optimized for particular datasets, but also for errors in the data and evaluation criteria. To aid future research in this area, we release a modular codebase (https://github.com/jspenmar/monodepth_benchmark), allowing for easy evaluation of alternate design decisions against corrected data and evaluation criteria. We re-implement, validate and re-evaluate 16 state-of-the-art contributions and introduce a new dataset (SYNS-Patches) containing dense outdoor depth maps in a variety of both natural and urban scenes. This allows for the computation of informative metrics in complex regions such as depth boundaries.
    Exponentially Improving the Complexity of Simulating the Weisfeiler-Lehman Test with Graph Neural Networks. (arXiv:2211.03232v2 [cs.LG] UPDATED)
    Recent work shows that the expressive power of Graph Neural Networks (GNNs) in distinguishing non-isomorphic graphs is exactly the same as that of the Weisfeiler-Lehman (WL) graph test. In particular, they show that the WL test can be simulated by GNNs. However, those simulations involve neural networks for the 'combine' function of size polynomial or even exponential in the number of graph nodes $n$, as well as feature vectors of length linear in $n$. We present an improved simulation of the WL test on GNNs with \emph{exponentially} lower complexity. In particular, the neural network implementing the combine function in each node has only a polylogarithmic number of parameters in $n$, and the feature vectors exchanged by the nodes of the GNN consist of only $O(\log n)$ bits. We also give logarithmic lower bounds for the feature vector length and the size of the neural networks, showing the (near-)optimality of our construction.
    The Ties that matter: From the perspective of Similarity Measure in Online Social Networks. (arXiv:2212.10960v1 [cs.SI])
    Online social networks have highlighted the importance of connection strength measures, which have a broad array of applications such as analyzing diffusion behaviors, community detection, link prediction, and recommender systems. Though some connection strength measures exist, the density that a connection shares with its neighbors and the directionality aspect have not received much attention. In this paper, we propose an asymmetric edge similarity measure, Neighborhood Density-based Edge Similarity (NDES), which provides a principled basis for deriving connection strength. The time complexity of NDES is $O(nk^2)$. An application of NDES to community detection in social networks is shown. We consider a similarity-based community detection technique and substitute its similarity measure with NDES. The performance of NDES is evaluated on several small real-world datasets in terms of effectiveness in detecting communities and compared with three widely used similarity measures. Empirical results show NDES enables detecting comparatively better communities, both in terms of accuracy and quality.
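    The abstract does not spell out the NDES formula, so the sketch below uses a hypothetical Jaccard-style stand-in only to show what "asymmetric edge similarity" means: the score of edge (u, v) is normalized by u's own neighborhood size, so sim(u, v) and sim(v, u) generally differ.

```python
# Illustrative asymmetric neighborhood-overlap measure in the spirit of NDES
# (NOT the paper's exact formula): shared neighbors of (u, v), normalized by
# the size of u's neighborhood, so the measure depends on the direction.

adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"a", "c"},
}

def edge_sim(u, v):
    shared = adj[u] & adj[v]
    return len(shared) / len(adj[u])    # asymmetric: denominator depends on u

print(edge_sim("a", "b"), edge_sim("b", "a"))  # 1/3 vs 1/2
```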
    Time to augment self-supervised visual representation learning. (arXiv:2207.13492v2 [cs.LG] UPDATED)
    Biological vision systems are unparalleled in their ability to learn visual representations without supervision. In machine learning, self-supervised learning (SSL) has led to major advances in forming object representations in an unsupervised fashion. Such systems learn representations invariant to augmentation operations over images, like cropping or flipping. In contrast, biological vision systems exploit the temporal structure of the visual experience during natural interactions with objects. This gives access to "augmentations" not commonly used in SSL, like watching the same object from multiple viewpoints or against different backgrounds. Here, we systematically investigate and compare the potential benefits of such time-based augmentations during natural interactions for learning object categories. Our results show that time-based augmentations achieve large performance gains over state-of-the-art image augmentations. Specifically, our analyses reveal that: 1) 3-D object manipulations drastically improve the learning of object categories; 2) viewing objects against changing backgrounds is important for learning to discard background-related information from the latent representation. Overall, we conclude that time-based augmentations during natural interactions with objects can substantially improve self-supervised learning, narrowing the gap between artificial and biological vision systems.
    Predicting the Score of Atomic Candidate OWL Class Axioms. (arXiv:2212.10841v1 [cs.AI])
    Candidate axiom scoring is the task of assessing the acceptability of a candidate axiom against the evidence provided by known facts or data. The ability to score candidate axioms reliably is required for automated schema or ontology induction, but it can also be valuable for ontology and/or knowledge graph validation. Accurate axiom scoring heuristics are often computationally expensive, which is an issue if you wish to use them in iterative search techniques like level-wise generate-and-test or evolutionary algorithms, which require scoring a large number of candidate axioms. We address the problem of developing a predictive model as a substitute for reasoning that predicts the possibility score of candidate class axioms and is quick enough to be employed in such situations. We use a semantic similarity measure taken from an ontology's subsumption structure for this purpose. We show that the approach provided in this work can accurately learn the possibility scores of candidate OWL class axioms and that it can do so for a variety of OWL class axioms.
    FedDAG: Federated DAG Structure Learning. (arXiv:2112.03555v2 [cs.LG] UPDATED)
    To date, most directed acyclic graphs (DAGs) structure learning approaches require data to be stored in a central server. However, due to the consideration of privacy protection, data owners gradually refuse to share their personalized raw data to avoid private information leakage, making this task more troublesome by cutting off the first step. Thus, a puzzle arises: \textit{how do we discover the underlying DAG structure from decentralized data?} In this paper, focusing on the additive noise models (ANMs) assumption of data generation, we take the first step in developing a gradient-based learning framework named FedDAG, which can learn the DAG structure without directly touching the local data and also can naturally handle the data heterogeneity. Our method benefits from a two-level structure of each local model. The first level structure learns the edges and directions of the graph and communicates with the server to get the model information from other clients during the learning procedure, while the second level structure approximates the mechanisms among variables and personally updates on its own data to accommodate the data heterogeneity. Moreover, FedDAG formulates the overall learning task as a continuous optimization problem by taking advantage of an equality acyclicity constraint, which can be solved by gradient descent methods to boost the searching efficiency. Extensive experiments on both synthetic and real-world datasets verify the efficacy of the proposed method.
    On the Convergence of Momentum-Based Algorithms for Federated Bilevel Optimization Problems. (arXiv:2204.13299v2 [cs.LG] UPDATED)
    In this paper, we study the federated bilevel optimization problem, which has widespread applications in machine learning. In particular, we develop two momentum-based algorithms for optimizing this kind of problem and establish the convergence rates of both algorithms, providing their sample and communication complexities. Importantly, to the best of our knowledge, our convergence rate is the first to achieve linear speedup with respect to the number of devices for federated bilevel optimization algorithms. Finally, our extensive experimental results confirm the effectiveness of our two algorithms.
    Why Deep Learning Generalizes. (arXiv:2211.09639v2 [cs.LG] UPDATED)
    Very large deep learning models trained using gradient descent are remarkably resistant to memorization given their huge capacity, but are at the same time capable of fitting large datasets of pure noise. Here, methods are introduced by which models may be trained to memorize datasets that would normally be generalized. We find that memorization is difficult relative to generalization, but that adding noise makes memorization easier. Increasing the dataset size exaggerates the characteristics of that dataset: model access to more training samples makes overfitting easier for random data, but somewhat harder for natural images. The bias of deep learning towards generalization is explored theoretically, and we show that generalization results from a model's parameters being attracted to points of maximal stability with respect to that model's inputs during gradient descent.
    Does human speech follow Benford's Law?. (arXiv:2203.13352v2 [cs.CL] UPDATED)
    Researchers have observed that the frequencies of leading digits in many man-made and naturally occurring datasets follow a logarithmic curve, with digits that start with the number 1 accounting for $\sim 30\%$ of all numbers in the dataset and digits that start with the number 9 accounting for $\sim 5\%$ of all numbers in the dataset. This phenomenon, known as Benford's Law, is highly repeatable and appears in lists of numbers from electricity bills, stock prices, tax returns, house prices, death rates, lengths of rivers, and naturally occurring images. In this paper we demonstrate that human speech spectra also follow Benford's Law on average. That is, when averaged over many speakers, the frequencies of leading digits in speech magnitude spectra follow this distribution, although with some variability at the individual sample level. We use this observation to motivate a new set of features that can be efficiently extracted from speech and demonstrate that these features can be used to classify between human speech and synthetic speech.
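    Benford's Law is easy to check yourself: the expected frequency of leading digit d is log10(1 + 1/d). The quick sketch below verifies this on a synthetic heavy-tailed sample (a stand-in for the speech magnitude spectra the paper analyzes):

```python
import numpy as np

# Benford's Law check: leading digit d should occur with frequency
# log10(1 + 1/d). We test on lognormal data spanning many orders of
# magnitude; the paper reports the same behavior for averaged speech spectra.

benford = np.log10(1 + 1 / np.arange(1, 10))

data = np.random.lognormal(mean=0, sigma=4, size=100_000)
lead = np.array([int(f"{x:e}"[0]) for x in data])   # first significant digit
observed = np.array([(lead == d).mean() for d in range(1, 10)])

for d in range(1, 10):
    print(d, f"benford={benford[d - 1]:.3f}", f"observed={observed[d - 1]:.3f}")
```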
    FAIR principles for AI models with a practical application for accelerated high energy diffraction microscopy. (arXiv:2207.00611v3 [cs.AI] UPDATED)
    A concise and measurable set of FAIR (Findable, Accessible, Interoperable and Reusable) principles for scientific data is transforming the state-of-practice for data management and stewardship, supporting and enabling discovery and innovation. Learning from this initiative, and acknowledging the impact of artificial intelligence (AI) in the practice of science and engineering, we introduce a set of practical, concise, and measurable FAIR principles for AI models. We showcase how to create and share FAIR data and AI models within a unified computational framework combining the following elements: the Advanced Photon Source at Argonne National Laboratory, the Materials Data Facility, the Data and Learning Hub for Science, funcX, and the Argonne Leadership Computing Facility (ALCF), in particular the ThetaGPU supercomputer and the SambaNova DataScale system at the ALCF AI Testbed. We describe how this domain-agnostic computational framework may be harnessed to enable autonomous AI-driven discovery.
    Multi-branch Cascaded Swin Transformers with Attention to k-space Sampling Pattern for Accelerated MRI Reconstruction. (arXiv:2207.08412v2 [eess.IV] UPDATED)
    Global correlations are widely seen in human anatomical structures due to similarity across tissues and bones. These correlations are reflected in magnetic resonance imaging (MRI) scans as a result of close-range proton density and T1/T2 parameters. Furthermore, to achieve accelerated MRI, k-space data are undersampled which causes global aliasing artifacts. Convolutional neural network (CNN) models are widely utilized for accelerated MRI reconstruction, but those models are limited in capturing global correlations due to the intrinsic locality of the convolution operation. The self-attention-based transformer models are capable of capturing global correlations among image features, however, the current contributions of transformer models for MRI reconstruction are minute. The existing contributions mostly provide CNN-transformer hybrid solutions and rarely leverage the physics of MRI. In this paper, we propose a physics-based stand-alone (convolution free) transformer model titled, the Multi-head Cascaded Swin Transformers (McSTRA) for accelerated MRI reconstruction. McSTRA combines several interconnected MRI physics-related concepts with the transformer networks: it exploits global MR features via the shifted window self-attention mechanism; it extracts MR features belonging to different spectral components separately using a multi-head setup; it iterates between intermediate de-aliasing and k-space correction via a cascaded network with data consistency in k-space and intermediate loss computations; furthermore, we propose a novel positional embedding generation mechanism to guide self-attention utilizing the point spread function corresponding to the undersampling mask. Our model significantly outperforms state-of-the-art MRI reconstruction methods both visually and quantitatively while depicting improved resolution and removal of aliasing artifacts.
    Provably Reliable Large-Scale Sampling from Gaussian Processes. (arXiv:2211.08036v2 [stat.ML] UPDATED)
    When comparing approximate Gaussian process (GP) models, it can be helpful to be able to generate data from any GP. If we are interested in how approximate methods perform at scale, we may wish to generate very large synthetic datasets to evaluate them. Na\"{i}vely doing so would cost \(\mathcal{O}(n^3)\) flops and \(\mathcal{O}(n^2)\) memory to generate a size \(n\) sample. We demonstrate how to scale such data generation to large \(n\) whilst still providing guarantees that, with high probability, the sample is indistinguishable from a sample from the desired GP.
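    For context, the naive baseline the paper improves on is exact sampling via a Cholesky factorization, which is what costs \(\mathcal{O}(n^3)\) flops and \(\mathcal{O}(n^2)\) memory. A minimal sketch of that baseline (kernel and sizes are our own arbitrary choices):

```python
import numpy as np

# Naive exact GP sampling: build the full n x n covariance, factorize it,
# and draw a sample. The Cholesky step is the O(n^3) bottleneck that makes
# this infeasible for very large n.

def rbf_kernel(x, ell=0.5):
    return np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / ell ** 2)

n = 2000
x = np.linspace(0, 10, n)
K = rbf_kernel(x) + 1e-6 * np.eye(n)   # jitter for numerical stability
L = np.linalg.cholesky(K)              # O(n^3) time, O(n^2) memory
sample = L @ np.random.randn(n)        # one exact draw from GP(0, K)
print(sample.shape)
```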
    Can large language models reason about medical questions?. (arXiv:2207.08143v2 [cs.CL] UPDATED)
    Although large language models (LLMs) often produce impressive outputs, it remains unclear how they perform in real-world scenarios requiring strong reasoning skills and expert domain knowledge. We set out to investigate whether GPT-3.5 (Codex and InstructGPT) can be applied to answer and reason about difficult real-world-based questions. We utilize two multiple-choice medical exam question datasets (USMLE and MedMCQA) and a medical reading comprehension dataset (PubMedQA). We investigate multiple prompting scenarios: Chain-of-Thought (CoT, think step-by-step), zero- and few-shot (prepending the question with question-answer exemplars) and retrieval augmentation (injecting Wikipedia passages into the prompt). For a subset of the USMLE questions, a medical expert reviewed and annotated the model's CoT. We found that InstructGPT can often read, reason and recall expert knowledge. Failures are primarily due to lack of knowledge and reasoning errors; trivial guessing heuristics are also observed, e.g.\ too often predicting labels A and D on USMLE. Sampling and combining many completions overcomes some of these limitations. Using 100 samples, Codex 5-shot CoT not only gives close to well-calibrated predictive probability but also achieves human-level performance on the three datasets. USMLE: 60.2%, MedMCQA: 57.5% and PubMedQA: 78.2%.  ( 2 min )
    Diversity-Promoting Ensemble for Medical Image Segmentation. (arXiv:2210.12388v2 [eess.IV] UPDATED)
    Medical image segmentation is an actively studied task in medical imaging, where the precision of the annotations is of utter importance towards accurate diagnosis and treatment. In recent years, the task has been approached with various deep learning systems, among the most popular models being U-Net. In this work, we propose a novel strategy to generate ensembles of different architectures for medical image segmentation, by leveraging the diversity (decorrelation) of the models forming the ensemble. More specifically, we utilize the Dice score among model pairs to estimate the correlation between the outputs of the two models forming each pair. To promote diversity, we select models with low Dice scores among each other. We carry out gastro-intestinal tract image segmentation experiments to compare our diversity-promoting ensemble (DiPE) with another strategy to create ensembles based on selecting the top scoring U-Net models. Our empirical results show that DiPE surpasses both individual models as well as the ensemble creation strategy based on selecting the top scoring models.
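    The selection mechanism is concrete enough to sketch: score each candidate model by its mean Dice overlap with the models already chosen, and greedily keep the least-overlapping (most diverse) ones. The code below is our own illustration of that idea, with random binary masks standing in for real model outputs; it is not the paper's implementation.

```python
import numpy as np

# DiPE-style diverse ensemble selection sketch: pairwise Dice is used as a
# correlation proxy, and low-Dice (decorrelated) models are kept.

def dice(a, b, eps=1e-8):
    return (2 * (a & b).sum() + eps) / (a.sum() + b.sum() + eps)

rng = np.random.default_rng(0)
masks = [rng.random((64, 64)) > 0.5 for _ in range(6)]  # stand-in predictions

selected = [0]                       # start from an arbitrary model
while len(selected) < 3:
    # pick the candidate with the lowest mean Dice to everything selected
    best = min(
        (i for i in range(len(masks)) if i not in selected),
        key=lambda i: np.mean([dice(masks[i], masks[j]) for j in selected]),
    )
    selected.append(best)
print("diverse ensemble members:", selected)
```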
    NP4G : Network Programming for Generalization. (arXiv:2212.11118v1 [cs.PL])
    Automatic programming has long been actively studied through various approaches, including genetic programming. In recent years, automatic programming using neural networks such as GPT-3 has been actively studied and is attracting a lot of attention. However, these methods perform inference based on experience acquired through enormous amounts of learning rather than logic, and their thinking process is unclear. Even among methods based on logical inference with a clear thinking process, no system that automatically generates arbitrary programs has yet been realized. In particular, inductive inference that generalizes from a single example via logical inference is an important step toward artificial intelligence that can acquire knowledge by itself. In this study, we propose NP4G: Network Programming for Generalization, which can automatically generate programs by inductive inference. Because the proposed method can realize "sequence", "selection", and "iteration" in programming and can satisfy the conditions of the structured program theorem, NP4G is expected to be able to acquire arbitrary programs automatically by inductive inference. As an example, we automatically construct a bitwise NOT operation program from several training data by generalization using NP4G. Although NP4G only randomly selects and connects nodes, by adjusting the number of nodes and the number of phases of "Phased Learning", we show that bitwise NOT operation programs are acquired in a comparatively short time and in about 7 out of 10 runs. The source code of NP4G is available on GitHub as a public repository.
    The challenges of HTR model training: Feedbacks from the project Donner le gout de l'archive a l'ere numerique. (arXiv:2212.11146v1 [cs.CV])
    The arrival of handwriting recognition technologies offers new possibilities for research in heritage studies. However, it is now necessary to reflect on the experiences and practices developed by research teams. Our use of the Transkribus platform since 2018 has led us to search for the most significant ways to improve the performance of our handwritten text recognition (HTR) models, which are made to transcribe French handwriting dating from the 17th century. This article therefore reports on the impact of creating transcription protocols, using lexical elements at full scale, and determining the best way to use base models in order to increase the performance of HTR models. Combining all of these elements can indeed increase the performance of a single model by more than 20% (reaching a Character Error Rate below 5%). It also discusses some challenges regarding the collaborative nature of HTR platforms such as Transkribus and the way researchers can share the data generated in the process of creating or training handwritten text recognition models.
    Improving Narrative Relationship Embeddings by Training with Additional Inverse-Relationship Constraints. (arXiv:2212.11234v1 [cs.CL])
    We consider the problem of embedding character-entity relationships from the reduced semantic space of narratives, proposing and evaluating the assumption that these relationships hold under a reflection operation. We analyze this assumption and compare the approach to a baseline state-of-the-art model with a unique evaluation that simulates efficacy on a downstream clustering task with human-created labels. Although our model creates clusters that achieve Silhouette scores of -.084, outperforming the baseline -.227, our analysis reveals that the models approach the task much differently and perform well on very different examples. We conclude that our assumption might be useful for specific types of data and should be evaluated on a wider range of tasks.
    Learning Spectral Unions of Partial Deformable 3D Shapes. (arXiv:2104.00514v3 [cs.GR] UPDATED)
    Spectral geometric methods have brought revolutionary changes to the field of geometry processing. Of particular interest is the study of the Laplacian spectrum as a compact, isometry and permutation-invariant representation of a shape. Some recent works show how the intrinsic geometry of a full shape can be recovered from its spectrum, but there are approaches that consider the more challenging problem of recovering the geometry from the spectral information of partial shapes. In this paper, we propose a possible way to fill this gap. We introduce a learning-based method to estimate the Laplacian spectrum of the union of partial non-rigid 3D shapes, without actually computing the 3D geometry of the union or any correspondence between those partial shapes. We do so by operating purely in the spectral domain and by defining the union operation between short sequences of eigenvalues. We show that the approximated union spectrum can be used as-is to reconstruct the complete geometry [MRC*19], perform region localization on a template [RTO*19] and retrieve shapes from a database, generalizing ShapeDNA [RWP06] to work with partialities. Working with eigenvalues allows us to deal with unknown correspondence, different sampling, and different discretizations (point clouds and meshes alike), making this operation especially robust and general. Our approach is data-driven and can generalize to isometric and non-isometric deformations of the surface, as long as these stay within the same semantic class (e.g., human bodies or horses), as well as to partiality artifacts not seen at training time.
    Adjoint-Matching Neural Network Surrogates for Fast 4D-Var Data Assimilation. (arXiv:2111.08626v2 [cs.LG] UPDATED)
    Data assimilation is the process of fusing information from imperfect computer simulations with noisy, sparse measurements of reality to obtain improved estimates of the state or parameters of a dynamical system of interest. The data assimilation procedures used in many geoscience applications, such as numerical weather forecasting, are variants of the four-dimensional variational (4D-Var) algorithm. The cost of solving the underlying 4D-Var optimization problem is dominated by the cost of repeated forward and adjoint model runs. This motivates substituting the evaluations of the physical model and its adjoint with fast, approximate surrogate models. Neural networks offer a promising approach for the data-driven creation of surrogate models. The accuracy of the surrogate 4D-Var solution depends on the accuracy with which the surrogate captures both the forward and the adjoint model dynamics. We formulate and analyze several approaches to incorporating adjoint information into the construction of neural network surrogates. The resulting networks are tested on unseen data and in a sequential data assimilation problem using the Lorenz-63 system. Surrogates constructed using adjoint information demonstrate superior performance on the 4D-Var data assimilation problem compared to a standard neural network surrogate that uses only forward dynamics information.
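    For concreteness, the Lorenz-63 test system mentioned above is a three-variable chaotic ODE; a surrogate network would be trained to mimic this forward model (and, in the paper's approach, its adjoint). A minimal forward-model sketch with a standard RK4 integrator (our own setup, not the paper's code):

```python
import numpy as np

# Lorenz-63: the classic three-variable chaotic system used as the paper's
# data assimilation test bed, integrated here with fourth-order Runge-Kutta.

def lorenz63(state, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    x, y, z = state
    return np.array([sigma * (y - x), x * (rho - z) - y, x * y - beta * z])

def rk4_step(f, state, dt=0.01):
    k1 = f(state)
    k2 = f(state + 0.5 * dt * k1)
    k3 = f(state + 0.5 * dt * k2)
    k4 = f(state + dt * k3)
    return state + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

state = np.array([1.0, 1.0, 1.0])
traj = [state]
for _ in range(1000):
    state = rk4_step(lorenz63, state)
    traj.append(state)
print(np.array(traj).shape)  # (1001, 3) chaotic trajectory
```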
    Adapting to Latent Subgroup Shifts via Concepts and Proxies. (arXiv:2212.11254v1 [stat.ML])
    We address the problem of unsupervised domain adaptation when the source domain differs from the target domain because of a shift in the distribution of a latent subgroup. When this subgroup confounds all observed data, neither covariate shift nor label shift assumptions apply. We show that the optimal target predictor can be non-parametrically identified with the help of concept and proxy variables available only in the source domain, and unlabeled data from the target. The identification results are constructive, immediately suggesting an algorithm for estimating the optimal predictor in the target. For continuous observations, when this algorithm becomes impractical, we propose a latent variable model specific to the data generation process at hand. We show how the approach degrades as the size of the shift changes, and verify that it outperforms both covariate and label shift adjustment.
    Contrastive Language-Vision AI Models Pretrained on Web-Scraped Multimodal Data Exhibit Sexual Objectification Bias. (arXiv:2212.11261v1 [cs.CY])
    Nine language-vision AI models trained on web scrapes with the Contrastive Language-Image Pretraining (CLIP) objective are evaluated for evidence of a bias studied by psychologists: the sexual objectification of girls and women, which occurs when a person's human characteristics are disregarded and the person is treated as a body or a collection of body parts. A first experiment uses standardized images of women from the Sexual OBjectification and EMotion Database, and finds that, commensurate with prior research in psychology, human characteristics are disassociated from images of objectified women: the model's recognition of emotional state is mediated by whether the subject is fully or partially clothed. Embedding association tests (EATs) return significant effect sizes for both anger (d >.8) and sadness (d >.5). A second experiment measures the effect in a representative application: an automatic image captioner (Antarctic Captions) includes words denoting emotion less than 50% as often for images of partially clothed women than for images of fully clothed women. A third experiment finds that images of female professionals (scientists, doctors, executives) are likely to be associated with sexual descriptions relative to images of male professionals. A fourth experiment shows that a prompt of "a [age] year old girl" generates sexualized images (as determined by an NSFW classifier) up to 73% of the time for VQGAN-CLIP (age 17), and up to 40% of the time for Stable Diffusion (ages 14 and 18); the corresponding rate for boys never surpasses 9%. The evidence indicates that language-vision AI models trained on automatically collected web scrapes learn biases of sexual objectification, which propagate to downstream applications.
    Functional Linear Regression of Cumulative Distribution Functions. (arXiv:2205.14545v2 [cs.LG] UPDATED)
    The estimation of cumulative distribution functions (CDFs) is an important learning task with a great variety of downstream applications, such as risk assessments in predictions and decision making. In this paper, we study functional regression of contextual CDFs where each data point is sampled from a linear combination of context dependent CDF basis functions. We propose functional ridge-regression-based estimation methods that estimate CDFs accurately everywhere. In particular, given $n$ samples with $d$ basis functions, we show estimation error upper bounds of $\widetilde{O}(\sqrt{d/n})$ for fixed design, random design, and adversarial context cases. We also derive matching information theoretic lower bounds, establishing minimax optimality for CDF functional regression. Furthermore, we remove the burn-in time in the random design setting using an alternative penalized estimator. Then, we consider agnostic settings where there is a mismatch in the data generation process. We characterize the error of the proposed estimators in terms of the mismatched error, and show that the estimators are well-behaved under model mismatch. Finally, to complete our study, we formalize infinite dimensional models where the parameter space is an infinite dimensional Hilbert space, and establish self-normalized estimation error upper bounds for this setting.
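    A toy version of the data model (not the paper's estimator or its minimax theory) makes the setup concrete: observed CDF values are a linear combination of known CDF basis functions, and a ridge regression recovers the mixing weights. All the specific choices below (sigmoid bases, weights, noise level) are our own illustration.

```python
import numpy as np

# Toy functional ridge regression for CDFs: y(t) = sum_k w_k * F_k(t) + noise,
# where F_k are known CDF basis functions; ridge recovers the weights w.

t = np.linspace(-3, 3, 200)
basis = np.stack(
    [1 / (1 + np.exp(-(t - mu))) for mu in (-1.0, 0.0, 1.0)],  # sigmoid CDFs
    axis=1,
)                                                              # (200, 3)

w_true = np.array([0.2, 0.5, 0.3])                             # convex weights
y = basis @ w_true + 0.01 * np.random.randn(len(t))            # noisy CDF data

lam = 1e-3
w_hat = np.linalg.solve(basis.T @ basis + lam * np.eye(3), basis.T @ y)
print(w_hat)   # close to w_true
```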
    MammoDL: Mammographic Breast Density Estimation using Federated Learning. (arXiv:2206.05575v3 [eess.IV] UPDATED)
    Assessing breast cancer risk from imaging remains a subjective process, in which radiologists employ simple computer aided detection (CAD) systems or qualitative visual assessment to estimate breast percent density (PD). Machine learning (ML) models have become the most promising way to quantify breast cancer risk for early, accurate, and equitable diagnoses, but training such models in medical research is often restricted to small, single-institution data. Since patient demographics and imaging characteristics may vary considerably across imaging sites, models trained on single-institution data tend not to generalize well. In response to this problem, MammoDL is proposed, an open-source software tool that leverages a U-Net architecture to accurately estimate breast PD and complexity from mammography. With the Open Federated Learning (OpenFL) library, this solution enables secure training on datasets across multiple institutions. MammoDL is a leaner, more flexible model than its predecessors, boasting improved generalization due to federation-enabled training on larger, more representative datasets.
    Polynomial-Time Reachability for LTI Systems with Two-Level Lattice Neural Network Controllers. (arXiv:2209.09400v2 [cs.LG] UPDATED)
    In this paper, we consider the computational complexity of bounding the reachable set of a Linear Time-Invariant (LTI) system controlled by a Rectified Linear Unit (ReLU) Two-Level Lattice (TLL) Neural Network (NN) controller. In particular, we show that for such a system and controller, it is possible to compute the exact one-step reachable set in polynomial time in the size of the TLL NN controller (number of neurons). Additionally, we show that a tight bounding box of the reachable set is computable via two polynomial-time methods: one with polynomial complexity in the size of the TLL and the other with polynomial complexity in the Lipschitz constant of the controller and other problem parameters. Finally, we propose a pragmatic algorithm that adaptively combines the benefits of (semi-)exact reachability and approximate reachability, which we call L-TLLBox. We evaluate L-TLLBox with an empirical comparison to a state-of-the-art NN controller reachability tool. In our experiments, L-TLLBox completed reachability analysis as much as 5000x faster than this tool on the same network/system, while producing reach boxes that were from 0.08 to 1.42 times the area.
    A Tunable Loss Function for Robust Classification: Calibration, Landscape, and Generalization. (arXiv:1906.02314v6 [cs.LG] UPDATED)
    We introduce a tunable loss function called $\alpha$-loss, parameterized by $\alpha \in (0,\infty]$, which interpolates between the exponential loss ($\alpha = 1/2$), the log-loss ($\alpha = 1$), and the 0-1 loss ($\alpha = \infty$), for the machine learning setting of classification. Theoretically, we illustrate a fundamental connection between $\alpha$-loss and Arimoto conditional entropy, verify the classification-calibration of $\alpha$-loss in order to demonstrate asymptotic optimality via Rademacher complexity generalization techniques, and build-upon a notion called strictly local quasi-convexity in order to quantitatively characterize the optimization landscape of $\alpha$-loss. Practically, we perform class imbalance, robustness, and classification experiments on benchmark image datasets using convolutional-neural-networks. Our main practical conclusion is that certain tasks may benefit from tuning $\alpha$-loss away from log-loss ($\alpha = 1$), and to this end we provide simple heuristics for the practitioner. In particular, navigating the $\alpha$ hyperparameter can readily provide superior model robustness to label flips ($\alpha > 1$) and sensitivity to imbalanced classes ($\alpha < 1$).
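    The loss family is compact enough to write down. The closed form below is the parameterization commonly used in the $\alpha$-loss literature, acting on the probability $p$ the model assigns to the true class; it matches the three interpolation points named in the abstract ($\alpha = 1/2$ gives $1/p - 1$, the exponential loss under a logistic model; $\alpha = 1$ gives $-\log p$; $\alpha \to \infty$ gives $1 - p$), but verify against the paper before relying on it.

```python
import numpy as np

# alpha-loss on the true-class probability p:
#   alpha = 1      -> log-loss, -log(p)
#   alpha = 1/2    -> exponential loss, 1/p - 1
#   alpha -> inf   -> 1 - p (a smooth surrogate for the 0-1 loss)
def alpha_loss(p, alpha):
    if np.isclose(alpha, 1.0):
        return -np.log(p)
    return (alpha / (alpha - 1.0)) * (1.0 - p ** ((alpha - 1.0) / alpha))

p = np.array([0.5, 0.9, 0.99])
for a in [0.5, 1.0, 1e6]:
    print(a, alpha_loss(p, a))
```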
    MMDialog: A Large-scale Multi-turn Dialogue Dataset Towards Multi-modal Open-domain Conversation. (arXiv:2211.05719v3 [cs.CL] UPDATED)
    Responding with multi-modal content has been recognized as an essential capability for an intelligent conversational agent. In this paper, we introduce the MMDialog dataset to better facilitate multi-modal conversation. MMDialog is composed of a curated set of 1.08 million real-world dialogues with 1.53 million unique images across 4,184 topics. MMDialog has two main and unique advantages. First, it is the largest multi-modal conversation dataset by number of dialogues, 88x larger than prior datasets. Second, it contains a massive range of topics, generalizing to the open domain. To build an engaging dialogue system with this dataset, we propose and normalize two response-producing tasks based on retrieval and generative scenarios. In addition, we build two baselines for the above tasks with state-of-the-art techniques and report their experimental performance. We also propose a novel evaluation metric, MM-Relevance, to measure the multi-modal responses. Our dataset and scripts are available at https://github.com/victorsungo/MMDialog.
    A Memetic Algorithm with Reinforcement Learning for Sociotechnical Production Scheduling. (arXiv:2212.10936v1 [cs.LG])
    This article presents a memetic algorithm that applies deep reinforcement learning (DRL) to solve practically oriented dual resource constrained flexible job shop scheduling problems (DRC-FJSSP). In recent years, there has been extensive research on DRL techniques, but without considering realistic, flexible and human-centered shopfloors. A research gap can be identified in the context of make-to-order oriented discontinuous manufacturing, as it is often represented in medium-size companies with high service levels. From practical industry projects in this domain, we recognize requirements to depict flexible machines, human workers and capabilities, setup and processing operations, material arrival times, complex job paths with parallel tasks for bill of material (BOM) manufacturing, sequence-dependent setup times and (partially) automated tasks. On the other hand, intensive research has been done on metaheuristics in the context of DRC-FJSSP. However, there is a lack of suitable and generic scheduling methods that can be holistically applied in sociotechnical production and assembly processes. In this paper, we first formulate an extended DRC-FJSSP induced by the practical requirements mentioned above. Then we present our proposed hybrid framework with parallel computing for multicriteria optimization. Through numerical experiments with real-world data, we confirm that the framework generates feasible schedules efficiently and reliably. Utilizing DRL instead of random operations leads to better results and outperforms traditional approaches.
    Diamond Abrasive Electroplated Surface Anomaly Detection using Convolutional Neural Networks for Industrial Quality Inspection. (arXiv:2212.11122v1 [cs.CV])
    Electroplated diamond abrasive tools require nickel coating on a metal surface for abrasive bonding and part functionality. The electroplated nickel-coated abrasive tool is expected to deliver high-quality part performance through a nickel coating thickness between 50% and 60% of the abrasive median diameter, uniformity of the nickel layer, good abrasive distribution over the electroplated surface, and a bright gloss. Electroplating parameters are set accordingly for this purpose. Industrial quality inspection of these abrasive electroplated parts with optical inspection instruments is extremely challenging due to the diamond's light refraction and dispersion and the reflective bright nickel surface. This challenge requires parts to be quality inspected manually with an eye loupe, which is subjective and costly. In this study, we use a Convolutional Neural Network (CNN) model in the production line to detect abrasive electroplated part anomalies, allowing us to repair or remove defective parts from the production chain and ultimately reduce manual quality inspection costs. We used 744 samples to train our model. Our model successfully identified over 99% of the parts with an anomaly. Keywords: Artificial Intelligence, Anomaly Detection, Industrial Quality Inspection, Electroplating, Diamond Abrasive Tool
    A Seven-Layer Model for Standardising AI Fairness Assessment. (arXiv:2212.11207v1 [cs.AI])
    Problem statement: Standardisation of AI fairness rules and benchmarks is challenging because AI fairness and other ethical requirements depend on multiple factors, such as context, use case, and the type of the AI system. In this paper, we elaborate that the AI system is prone to biases at every stage of its lifecycle, from inception to usage, and that all stages require due attention for mitigating AI bias. We need a standardised approach to handle AI fairness at every stage. Gap analysis: While AI fairness is a hot research topic, a holistic strategy for AI fairness is generally missing. Most researchers focus only on a few facets of AI model-building. A review of the literature shows excessive focus on biases in the datasets, fairness metrics, and algorithmic bias. In the process, other aspects affecting AI fairness get ignored. Proposed solution: We propose a comprehensive approach in the form of a novel seven-layer model, inspired by the Open System Interconnection (OSI) model, to standardise AI fairness handling. Despite the differences in the various aspects, most AI systems have similar model-building stages. The proposed model splits the AI system lifecycle into seven abstraction layers, each corresponding to a well-defined AI model-building or usage stage. We also provide checklists for each layer and deliberate on potential sources of bias in each layer and their mitigation methodologies. This work will facilitate layer-wise standardisation of AI fairness rules and benchmarking parameters.
    Order Optimal Bounds for One-Shot Federated Learning over non-Convex Loss Functions. (arXiv:2108.08677v2 [cs.LG] UPDATED)
    We consider the problem of federated learning in a one-shot setting in which there are $m$ machines, each observing $n$ sample functions from an unknown distribution on non-convex loss functions. Let $F:[-1,1]^d\rightarrow\mathbb{R}$ be the expected loss function with respect to this unknown distribution. The goal is to find an estimate of the minimizer of $F$. Based on its observations, each machine generates a signal of bounded length $B$ and sends it to a server. The server collects signals of all machines and outputs an estimate of the minimizer of $F$. We show that the expected loss of any algorithm is lower bounded by $\max\big(1/(\sqrt{n}(mB)^{1/d}), 1/\sqrt{mn}\big)$, up to a logarithmic factor. We then prove that this lower bound is order optimal in $m$ and $n$ by presenting a distributed learning algorithm, called Multi-Resolution Estimator for Non-Convex loss function (MRE-NC), whose expected loss matches the lower bound for large $mn$ up to polylogarithmic factors.
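    To make the bound concrete, the snippet below evaluates both terms (ignoring logarithmic factors) for illustrative problem sizes; which term dominates depends on $m$, $B$, and $d$:

        import math

        def one_shot_lower_bound(m, n, B, d):
            # max(1/(sqrt(n) (mB)^(1/d)), 1/sqrt(mn)), up to log factors.
            term_comm = 1.0 / (math.sqrt(n) * (m * B) ** (1.0 / d))
            term_stat = 1.0 / math.sqrt(m * n)
            return max(term_comm, term_stat)

        # With few machines/bits the communication term dominates; with
        # many, the statistical term 1/sqrt(mn) does.
        print(one_shot_lower_bound(m=10, n=1000, B=4, d=8))     # ~0.020
        print(one_shot_lower_bound(m=1000, n=1000, B=64, d=2))  # 0.001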
    It is not "accuracy vs. explainability" -- we need both for trustworthy AI systems. (arXiv:2212.11136v1 [cs.LG])
    We are witnessing the emergence of an AI economy and society where AI technologies are increasingly impacting health care, business, transportation and many aspects of everyday life. Many successes have been reported where AI systems even surpassed the accuracy of human experts. However, AI systems may produce errors, can exhibit bias, may be sensitive to noise in the data, and often lack technical and judicial transparency, resulting in reduced trust and challenges to their adoption. These shortcomings and concerns have been documented in both the scientific and general press: accidents with self-driving cars, biases in healthcare, hiring, and face recognition systems for people of color, and seemingly correct medical decisions later found to be made for the wrong reasons, among others. This has resulted in the emergence of many government and regulatory initiatives requiring trustworthy and ethical AI to provide accuracy and robustness, some form of explainability, human control and oversight, elimination of bias, judicial transparency, and safety. The challenges in the delivery of trustworthy AI systems have motivated intense research on explainable AI systems (XAI). The aim of XAI is to provide human-understandable information about how AI systems make their decisions. In this paper, we first briefly summarize current XAI work and then challenge the recent arguments that accuracy and explainability are mutually exclusive and that XAI is focused only on deep learning. We then present our recommendations for the use of XAI in the full lifecycle of high-stakes trustworthy AI systems delivery, e.g., development, validation and certification, and trustworthy production and maintenance.
    Simple Neighborhood Representative Pre-processing Boosts Outlier Detectors. (arXiv:2010.12061v3 [cs.LG] UPDATED)
    Over the decades, traditional outlier detectors have ignored the group-level factor when calculating outlier scores for objects in data, evaluating only the object-level factor and thereby failing to capture collective outliers. To mitigate this issue, we present a method called neighborhood representative (NR), which empowers all existing outlier detectors to efficiently detect outliers, including collective outliers, while maintaining their computational integrity. It achieves this by selecting representative objects, scoring these objects, and then applying the scores of the representative objects to the objects they represent. Without altering existing detectors, NR improves performance on real-world datasets by +8% (0.72 to 0.78 AUC) relative to state-of-the-art outlier detectors.
    An Extensive Data Processing Pipeline for MIMIC-IV. (arXiv:2204.13841v5 [cs.LG] UPDATED)
    An increasing amount of research is being devoted to applying machine learning methods to electronic health record (EHR) data for various clinical purposes. This growing area of research has exposed the challenges of the accessibility of EHRs. MIMIC is a popular, public, and free EHR dataset in a raw format that has been used in numerous studies. The absence of standardized pre-processing steps, however, can be a significant barrier to the wider adoption of this rare resource. Additionally, this absence can reduce the reproducibility of the developed tools and limit the ability to compare results among similar studies. In this work, we provide a highly customizable pipeline to extract, clean, and pre-process the data available in the fourth version of the MIMIC dataset (MIMIC-IV). The pipeline also presents an end-to-end wizard-like package supporting predictive model creation and evaluation. The pipeline covers a range of clinical prediction tasks which can be broadly classified into four categories - readmission, length of stay, mortality, and phenotype prediction. The tool is publicly available at https://github.com/healthylaife/MIMIC-IV-Data-Pipeline.
    A Novel Plug-and-Play Approach for Adversarially Robust Generalization. (arXiv:2208.09449v2 [cs.LG] UPDATED)
    In this work, we propose a robust framework that employs adversarially robust training to safeguard the machine learning models against perturbed testing data. We achieve this by incorporating the worst-case additive adversarial error within a fixed budget for each sample during model estimation. Our main focus is to provide a plug-and-play solution that can be incorporated in the existing machine learning algorithms with minimal changes. To that end, we derive the ready-to-use solution for several widely used loss functions with a variety of norm constraints on adversarial perturbation for various supervised and unsupervised ML problems, including regression, classification, two-layer neural networks, graphical models, and matrix completion. The solutions are either in closed-form, 1-D optimization, semidefinite programming, difference of convex programming or a sorting-based algorithm. Finally, we validate our approach by showing significant performance improvement on real-world datasets for supervised problems such as regression and classification, as well as for unsupervised problems such as matrix completion and learning graphical models, with very little computational overhead.
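    As one illustration of the kind of ready-to-use solution referred to above: for linear regression with an $\ell_2$-bounded additive perturbation per sample, the inner maximization has the standard closed form $\max_{\|\delta\|_2\le\epsilon}(y - w^\top(x+\delta))^2 = (|y - w^\top x| + \epsilon\|w\|_2)^2$. A minimal sketch of this closed-form robust objective (not the paper's full framework):

        import numpy as np

        def adv_linreg_loss(w, X, y, eps):
            # Exact worst-case squared loss under ||delta||_2 <= eps per
            # sample: (|y - w.x| + eps * ||w||_2)^2, averaged over the data.
            r = np.abs(y - X @ w)
            return np.mean((r + eps * np.linalg.norm(w)) ** 2)

        rng = np.random.default_rng(1)
        X = rng.normal(size=(100, 5))
        w_true = rng.normal(size=5)
        y = X @ w_true + 0.1 * rng.normal(size=100)
        print(adv_linreg_loss(w_true, X, y, eps=0.0))   # plain MSE
        print(adv_linreg_loss(w_true, X, y, eps=0.1))   # robust objective

    Minimizing this quantity over w performs adversarially robust training without ever generating adversarial examples, which is the plug-and-play appeal.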
    Score-based denoising for atomic structure identification. (arXiv:2212.02421v2 [cond-mat.mtrl-sci] UPDATED)
    We propose an accurate method for removing the thermal vibrations that complicate the task of analyzing complex dynamics in atomistic simulations of condensed matter. Our method iteratively subtracts thermal noise or perturbations in atomic positions using a denoising score function trained on synthetically noised but otherwise perfect crystal lattices. The resulting denoised structures clearly reveal underlying crystal order while retaining disorder associated with crystal defects. Purely geometric, agnostic to interatomic potentials, and trained without inputs from explicit simulations, our denoiser can be applied to simulation data generated from vastly different interatomic interactions. Followed by a simple phase classification tool such as the Common Neighbor Analysis, the denoiser outperforms other existing methods and reaches perfect classification accuracy on a recently proposed benchmark dataset consisting of perturbed crystal structures (DC3). Demonstrated here in a wide variety of atomistic simulation contexts, the denoiser is general, robust, and readily extendable to delineate order from disorder in structurally and chemically complex materials.
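    A toy sketch of the iterative idea, with a hand-coded stand-in for the learned score: if thermal displacements are modeled as isotropic Gaussians around ideal lattice sites, the score points each atom back toward its nearest site, and repeated small steps along it remove the perturbation:

        import numpy as np

        # Ideal sites of a small 2D square lattice, plus thermal perturbation.
        rng = np.random.default_rng(0)
        sites = np.array([(i, j) for i in range(5) for j in range(5)], float)
        atoms = sites + 0.08 * rng.normal(size=sites.shape)

        def score(x, sites, sigma=0.1):
            # Score of a Gaussian centered at the nearest lattice site
            # (a crude stand-in for a trained denoising score function).
            d2 = ((x[:, None] - sites[None]) ** 2).sum(-1)
            nearest = sites[np.argmin(d2, axis=1)]
            return (nearest - x) / sigma**2

        for _ in range(50):                      # iterative denoising
            atoms = atoms + 0.001 * score(atoms, sites)
        print("max residual displacement:", np.abs(atoms - sites).max())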
    Deep Reinforcement Learning for Trajectory Path Planning and Distributed Inference in Resource-Constrained UAV Swarms. (arXiv:2212.11201v1 [cs.DC])
    The deployment flexibility and maneuverability of Unmanned Aerial Vehicles (UAVs) have increased their adoption in various applications, such as wildfire tracking, border monitoring, etc. In many critical applications, UAVs capture images and other sensory data and then send the captured data to remote servers for inference and data processing tasks. However, this approach is not always practical in real-time applications due to connection instability, limited bandwidth, and end-to-end latency. One promising solution is to divide the inference requests into multiple parts (layers or segments), with each part being executed on a different UAV based on the available resources. Furthermore, some applications require the UAVs to traverse certain areas and capture incidents; thus, planning their paths becomes critical, particularly to reduce the latency of the collaborative inference process. Specifically, planning the UAVs' trajectories can reduce data transmission latency by communicating with devices in the same proximity while mitigating transmission interference. This work aims to design a model for distributed collaborative inference requests and path planning in a UAV swarm while respecting the resource constraints due to the computational load and memory usage of the inference requests. The model is formulated as an optimization problem that aims to minimize latency. The formulated problem is NP-hard, so finding the optimal solution is quite complex; thus, this paper introduces a real-time and dynamic solution for online applications using deep reinforcement learning. We conduct extensive simulations and compare our results to state-of-the-art studies, demonstrating that our model outperforms the competing models.
    Compact Graph Representation of molecular crystals using Point-wise Distance Distributions. (arXiv:2212.11246v1 [physics.comp-ph])
    Use of graphs to represent molecular crystals has become popular in recent years as they provide a natural translation from atoms and bonds to nodes and edges. Graphs capture structure while remaining invariant to the symmetries that crystals display. Several works in property prediction, including those with state-of-the-art results, make use of the Crystal Graph. The present work offers a graph based on Point-wise Distance Distributions that retains invariance under crystal symmetries, decreases computational load, and yields similar or better prediction accuracy on both experimental and simulated crystals.
    On Characterizing the Trade-off in Invariant Representation Learning. (arXiv:2109.03386v3 [cs.LG] UPDATED)
    Many applications of representation learning, such as privacy preservation, algorithmic fairness, and domain adaptation, desire explicit control over semantic information being discarded. This goal is formulated as satisfying two objectives: maximizing utility for predicting a target attribute while simultaneously being invariant (independent) to a known semantic attribute. Solutions to invariant representation learning (IRepL) problems lead to a trade-off between utility and invariance when they are competing. While existing works study bounds on this trade-off, two questions remain outstanding: 1) What is the exact trade-off between utility and invariance? and 2) What are the encoders (mapping the data to a representation) that achieve the trade-off, and how can we estimate it from training data? This paper addresses these questions for IRepLs in reproducing kernel Hilbert spaces (RKHSs). Under the assumption that the distribution of a low-dimensional projection of high-dimensional data is approximately normal, we derive a closed-form solution for the global optima of the underlying optimization problem for encoders in RKHSs. This yields closed formulae for a near-optimal trade-off, corresponding optimal representation dimensionality, and the corresponding encoder(s). We also numerically quantify the trade-off on representative problems and compare them to those achieved by baseline IRepL algorithms.
    Dynamic Budget Throttling in Repeated Second-Price Auctions. (arXiv:2207.04690v4 [cs.GT] UPDATED)
    In today's online advertising markets, an important demand for an advertiser (buyer) is to control her total expenditure within a time span under some budget. Among all budget control approaches, throttling stands out as a popular one, where the buyer participates in only a subset of auctions. This paper gives a theoretical panorama of a single buyer's dynamic budget throttling process in repeated second-price auctions, which has been lacking in the literature. We first establish a lower bound on the regret and an upper bound on the asymptotic competitive ratio for any algorithm, for the cases where the buyer's values are stochastic and adversarial, respectively. Second, on the algorithmic side, we consider two different information structures, with increasing difficulty in learning the distribution of the highest competing bid. We further propose the OGD-CB algorithm, which is oblivious to whether the values are stochastic or adversarial and achieves asymptotically equal results under these two information structures. Specifically, with stochastic values, we demonstrate that this algorithm guarantees a near-optimal expected regret. When values are adversarial, we prove that the proposed algorithm reaches the upper bound on the asymptotic competitive ratio. Finally, we compare throttling with pacing, another widely adopted budget control method, in the dynamic setting. In the stochastic case, we illustrate that dynamic pacing is generally better than dynamic throttling for the buyer, extending known results that dynamic pacing is asymptotically optimal in this scenario. However, in the adversarial case, we show that dynamic throttling is the asymptotically optimal dynamic bidding strategy. Our results fill the gaps in the theoretical research of dynamic throttling and comprehensively reveal the ability of this popular budget-smoothing strategy.
    EZNAS: Evolving Zero Cost Proxies For Neural Architecture Scoring. (arXiv:2209.07413v3 [cs.LG] UPDATED)
    Neural Architecture Search (NAS) has significantly improved productivity in the design and deployment of neural networks (NN). As NAS typically evaluates multiple models by training them partially or completely, the improved productivity comes at the cost of significant carbon footprint. To alleviate this expensive training routine, zero-shot/cost proxies analyze an NN at initialization to generate a score, which correlates highly with its true accuracy. Zero-cost proxies are currently designed by experts conducting multiple cycles of empirical testing on possible algorithms, datasets, and neural architecture design spaces. This experimentation lowers productivity and is an unsustainable approach towards zero-cost proxy design as deep learning use-cases diversify in nature. Additionally, existing zero-cost proxies fail to generalize across neural architecture design spaces. In this paper, we propose a genetic programming framework to automate the discovery of zero-cost proxies for neural architecture scoring. Our methodology efficiently discovers an interpretable and generalizable zero-cost proxy that gives state-of-the-art score-accuracy correlation on all datasets and search spaces of NASBench-201 and Network Design Spaces (NDS). We believe that this research indicates a promising direction towards automatically discovering zero-cost proxies that can work across network architecture design spaces, datasets, and tasks.
    MountNet: Learning an Inertial Sensor Mounting Angle with Deep Neural Networks. (arXiv:2212.11120v1 [cs.CV])
    Finding the mounting angle of a smartphone inside a car is crucial for navigation, motion detection, activity recognition, and other applications. It is a challenging task in several aspects: (i) the mounting angle at the drive start is unknown and may differ significantly between users; (ii) the user, or a bad fixture, may change the mounting angle while driving; (iii) a rapid and computationally efficient real-time solution is required for most applications. To tackle these problems, a data-driven approach using deep neural networks (DNNs) is presented to learn the yaw mounting angle of a smartphone equipped with an inertial measurement unit (IMU) and strapped to a car. The proposed model, MountNet, uses only IMU readings as input and, in contrast to existing solutions, does not require inputs from global navigation satellite systems (GNSS). IMU data is collected for training and validation with the sensor mounted at a known yaw mounting angle, and a range of ground truth labels is generated by applying a prescribed rotation to the measurements. Although the training data did not include recordings with real sensor rotations, tests on data with real and synthetic rotations show similar results. An algorithm is formulated for real-time deployment to detect and smooth transitions in the device mounting angle estimated by MountNet. MountNet is shown to find the mounting angle rapidly, which is critical in real-time applications. Our method converges in less than 30 seconds of driving to a mean error of 4 degrees, allowing a fast calibration phase for other algorithms and applications. When the device is rotated in the middle of a drive, large changes converge in 5 seconds and small changes converge in less than 30 seconds.
    Towards dynamic stability analysis of sustainable power grids using graph neural networks. (arXiv:2212.11130v1 [cs.LG])
    To mitigate climate change, the share of renewables needs to be increased. Renewable energies introduce new challenges to power grids due to decentralization, reduced inertia, and volatility in production. The operation of sustainable power grids with a high penetration of renewable energies requires new methods to analyze dynamic stability. We provide new datasets of the dynamic stability of synthetic power grids and find that graph neural networks (GNNs) are surprisingly effective at predicting the highly non-linear target from topological information only. To illustrate the potential to scale to real-sized power grids, we demonstrate successful prediction on a Texan power grid model.
    Named Tensor Notation. (arXiv:2102.13196v2 [cs.LG] UPDATED)
    We propose a notation for tensors with named axes, which relieves the author, reader, and future implementers of machine learning models from the burden of keeping track of the order of axes and the purpose of each. The notation makes it easy to lift operations on low-order tensors to higher order ones, for example, from images to minibatches of images, or from an attention mechanism to multiple attention heads. After a brief overview and formal definition of the notation, we illustrate it through several examples from modern machine learning, from building blocks like attention and convolution to full models like Transformers and LeNet. We then discuss differential calculus in our notation and compare with some alternative notations. Our proposals build on ideas from many previous papers and software libraries. We hope that our notation will encourage more authors to use named tensors, resulting in clearer papers and more precise implementations.
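    A related convention already exists in code; for example, the einops library (shown below as a programmatic analogue, not the paper's mathematical notation) uses axis names to make lifting and reduction explicit:

        import numpy as np
        from einops import rearrange, reduce

        images = np.random.rand(32, 28, 28, 3)   # batch, height, width, channel

        # Lift a per-image flatten to the whole minibatch by naming axes.
        flat = rearrange(images, 'b h w c -> b (h w) c')

        # Split a fused projection into multiple attention heads.
        x = np.random.rand(32, 10, 64)            # batch, time, (heads * dim)
        heads = rearrange(x, 'b t (h d) -> b h t d', h=8)

        # Reductions name what is averaged away instead of using axis numbers.
        channel_means = reduce(images, 'b h w c -> c', 'mean')
        print(flat.shape, heads.shape, channel_means.shape)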
    Towards biologically plausible Dreaming and Planning in recurrent spiking networks. (arXiv:2205.10044v2 [cs.LG] UPDATED)
    Humans and animals can learn new skills after practicing for a few hours, while current reinforcement learning algorithms require a large amount of data to achieve good performance. Recent model-based approaches show promising results by reducing the number of necessary interactions with the environment to learn a desirable policy. However, these methods require biologically implausible ingredients, such as the detailed storage of older experiences and long periods of offline learning. The optimal way to learn and exploit world-models is still an open question. Taking inspiration from biology, we suggest that dreaming might be an efficient expedient to use an inner model. We propose a two-module (agent and model) spiking neural network in which "dreaming" (living new experiences in a model-based simulated environment) significantly boosts learning. We also explore "planning", an online alternative to dreaming, which shows comparable performance. Importantly, our model does not require the detailed storage of experiences, and learns the world-model and the policy online. Moreover, we stress that our network is composed of spiking neurons, further increasing the biological plausibility and implementability in neuromorphic hardware.
    A Unified Experiment Design Approach for Cyclic and Acyclic Causal Models. (arXiv:2205.10083v2 [cs.LG] UPDATED)
    We study experiment design for unique identification of the causal graph of a system where the graph may contain cycles. The presence of cycles in the structure introduces major challenges for experiment design as, unlike acyclic graphs, learning the skeleton of causal graphs with cycles may not be possible from merely the observational distribution. Furthermore, intervening on a variable in such graphs does not necessarily lead to orienting all the edges incident to it. In this paper, we propose an experiment design approach that can learn both cyclic and acyclic graphs and hence, unifies the task of experiment design for both types of graphs. We provide a lower bound on the number of experiments required to guarantee the unique identification of the causal graph in the worst case, showing that the proposed approach is order-optimal in terms of the number of experiments up to an additive logarithmic term. Moreover, we extend our result to the setting where the size of each experiment is bounded by a constant. For this case, we show that our approach is optimal in terms of the size of the largest experiment required for uniquely identifying the causal graph in the worst case.
    MAViC: Multimodal Active Learning for Video Captioning. (arXiv:2212.11109v1 [cs.CV])
    A large number of annotated video-caption pairs are required for training video captioning models, resulting in high annotation costs. Active learning can be instrumental in reducing these annotation requirements. However, active learning for video captioning is challenging because multiple semantically similar captions are valid for a video, resulting in high entropy outputs even for less-informative samples. Moreover, video captioning algorithms are multimodal in nature with a visual encoder and language decoder. Further, the sequential and combinatorial nature of the output makes the problem even more challenging. In this paper, we introduce MAViC which leverages our proposed Multimodal Semantics Aware Sequential Entropy (M-SASE) based acquisition function to address the challenges of active learning approaches for video captioning. Our approach integrates semantic similarity and uncertainty of both visual and language dimensions in the acquisition function. Our detailed experiments empirically demonstrate the efficacy of M-SASE for active learning for video captioning and improve on the baselines by a large margin.
    MutexMatch: Semi-Supervised Learning with Mutex-Based Consistency Regularization. (arXiv:2203.14316v2 [cs.CV] UPDATED)
    The core issue in semi-supervised learning (SSL) lies in how to effectively leverage unlabeled data, yet most existing methods tend to put great emphasis on the utilization of high-confidence samples and seldom fully explore the usage of low-confidence samples. In this paper, we aim to utilize low-confidence samples in a novel way with our proposed mutex-based consistency regularization, namely MutexMatch. Specifically, the high-confidence samples are required to exactly predict "what it is" by a conventional True-Positive Classifier, while the low-confidence samples are employed to achieve a simpler goal, to predict with ease "what it is not" by a True-Negative Classifier. In this sense, we not only mitigate pseudo-labeling errors but also make full use of the low-confidence unlabeled data via consistency of dissimilarity degree. MutexMatch achieves superior performance on multiple benchmark datasets, i.e., CIFAR-10, CIFAR-100, SVHN, STL-10, mini-ImageNet and Tiny-ImageNet. More importantly, our method further shows superiority when the amount of labeled data is scarce, e.g., 92.23% accuracy with only 20 labeled samples on CIFAR-10. Our code and model weights have been released at https://github.com/NJUyued/MutexMatch4SSL.
    Synthesizing Informative Training Samples with GAN. (arXiv:2204.07513v2 [cs.LG] UPDATED)
    Remarkable progress has been achieved in synthesizing photo-realistic images with generative adversarial networks (GANs). Recently, GANs have been utilized as training sample generators when obtaining or storing real training data is expensive or even infeasible. However, images generated by traditional GANs are not as informative as real training samples when used to train deep neural networks. In this paper, we propose a novel method to synthesize Informative Training samples with GAN (IT-GAN). Specifically, we freeze a pre-trained GAN model and learn the informative latent vectors that correspond to informative training samples. The synthesized images are required to preserve information for training deep neural networks rather than visual reality or fidelity. Experiments verify that deep neural networks can learn faster and achieve better performance when trained with our IT-GAN generated images. We also show that our method is a promising solution to the dataset condensation problem.
    Answering Complex Logical Queries on Knowledge Graphs via Query Computation Tree Optimization. (arXiv:2212.09567v2 [cs.LG] UPDATED)
    Answering complex logical queries on incomplete knowledge graphs is a challenging task, and has been widely studied. Embedding-based methods require training on complex queries, and cannot generalize well to out-of-distribution query structures. Recent work frames this task as an end-to-end optimization problem, and it only requires a pretrained link predictor. However, due to the exponentially large combinatorial search space, the optimal solution can only be approximated, limiting the final accuracy. In this work, we propose QTO (Query Computation Tree Optimization) that can efficiently find the exact optimal solution. QTO finds the optimal solution by a forward-backward propagation on the tree-like computation graph, i.e., query computation tree. In particular, QTO utilizes the independence encoded in the query computation tree to reduce the search space, where only local computations are involved during the optimization procedure. Experiments on 3 datasets show that QTO obtains state-of-the-art performance on complex query answering, outperforming previous best results by an average of 22%. Moreover, QTO can interpret the intermediate solutions for each of the one-hop atoms in the query with over 90% accuracy.
    Revisiting Residual Networks for Adversarial Robustness: An Architectural Perspective. (arXiv:2212.11005v1 [cs.CV])
    Efforts to improve the adversarial robustness of convolutional neural networks have primarily focused on developing more effective adversarial training methods. In contrast, little attention has been devoted to analyzing the role of architectural elements (such as topology, depth, and width) in adversarial robustness. This paper seeks to bridge this gap and presents a holistic study of the impact of architectural design on adversarial robustness. We focus on residual networks and consider architecture design at the block level, i.e., topology, kernel size, activation, and normalization, as well as at the network scaling level, i.e., depth and width of each block in the network. In both cases, we first derive insights through systematic ablative experiments. Then we design a robust residual block, dubbed RobustResBlock, and a compound scaling rule, dubbed RobustScaling, to distribute depth and width at the desired FLOP count. Finally, we combine RobustResBlock and RobustScaling and present a portfolio of adversarially robust residual networks, RobustResNets, spanning a broad spectrum of model capacities. Experimental validation across multiple datasets and adversarial attacks demonstrates that RobustResNets consistently outperform both standard WRNs and other existing robust architectures, achieving state-of-the-art AutoAttack robust accuracy of 61.1% without additional data and 63.7% with 500K external data while being $2\times$ more compact in terms of parameters. Code is available at https://github.com/zhichao-lu/robust-residual-network
    Machine Learning for Microcontroller-Class Hardware: A Review. (arXiv:2205.14550v5 [cs.LG] UPDATED)
    Advancements in machine learning have opened a new opportunity to bring intelligence to low-end Internet-of-Things nodes such as microcontrollers. Conventional machine learning deployment has high memory and compute footprints, hindering direct deployment on ultra resource-constrained microcontrollers. This paper highlights the unique requirements of enabling onboard machine learning for microcontroller class devices. Researchers use a specialized model development workflow for resource-limited applications to ensure the compute and latency budget is within the device limits while still maintaining the desired performance. We characterize a closed-loop widely applicable workflow of machine learning model development for microcontroller class devices and show that several classes of applications adopt a specific instance of it. We present both qualitative and numerical insights into different stages of model development by showcasing several use cases. Finally, we identify the open research challenges and unsolved questions demanding careful considerations moving forward.
    A survey on text generation using generative adversarial networks. (arXiv:2212.11119v1 [cs.CL])
    This work presents a thorough review of recent studies and advancements in text generation using Generative Adversarial Networks. The use of adversarial learning for text generation is promising as it provides alternatives for generating so-called "natural" language. Nevertheless, adversarial text generation is not a simple task, as its foremost architecture, the Generative Adversarial Network, was designed to cope with continuous information (images) rather than discrete data (text). Thus, most works are based on three possible options, i.e., Gumbel-Softmax differentiation, Reinforcement Learning, and modified training objectives. All alternatives are reviewed in this survey as they present the most recent approaches for generating text using adversarial-based techniques. The selected works were taken from renowned databases, such as Science Direct, IEEEXplore, Springer, Association for Computing Machinery, and arXiv, and each selected work has been critically analyzed and assessed to present its objective, methodology, and experimental results.
    Balanced Split: A new train-test data splitting strategy for imbalanced datasets. (arXiv:2212.11116v1 [cs.LG])
    Classification datasets with skewed class proportions are called imbalanced. Class imbalance is a problem since most machine learning classification algorithms are built with an assumption of equal representation of all classes in the training dataset. Therefore, to counter the class imbalance problem, many algorithm-level and data-level approaches have been developed, mainly ensemble learning and data augmentation techniques. This paper shows a new way to counter the class imbalance problem through a new data-splitting strategy called balanced split. Data splitting can play an important role in correctly classifying imbalanced datasets. We show that the commonly used data-splitting strategies have some disadvantages, and our proposed balanced split solves those problems.
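    The abstract does not spell out the splitting rule, so the sketch below is one illustrative balanced-split variant (equal per-class test counts, sized by the rarest class), not necessarily the authors' exact procedure:

        import numpy as np

        def balanced_split(X, y, test_frac=0.2, seed=0):
            # Put an equal number of samples from every class into the test
            # set, sized by the rarest class, so test metrics are not
            # dominated by the majority class; the rest goes to training.
            rng = np.random.default_rng(seed)
            classes, counts = np.unique(y, return_counts=True)
            per_class = int(counts.min() * test_frac)
            test_idx = np.concatenate([
                rng.choice(np.where(y == c)[0], size=per_class, replace=False)
                for c in classes
            ])
            train_mask = np.ones(len(y), bool)
            train_mask[test_idx] = False
            return X[train_mask], X[test_idx], y[train_mask], y[test_idx]

        X = np.random.rand(1000, 4)
        y = np.r_[np.zeros(900, int), np.ones(100, int)]   # 9:1 imbalance
        X_tr, X_te, y_tr, y_te = balanced_split(X, y)
        print(np.bincount(y_te))                           # equal class counts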
    Robust Path Selection in Software-defined WANs using Deep Reinforcement Learning. (arXiv:2212.11155v1 [cs.NI])
    In the context of an efficient network traffic engineering process where the network continuously measures a new traffic matrix and updates the set of paths in the network, an automated process is required to quickly and efficiently identify when and what set of paths should be used. Unfortunately, the burden of finding the optimal solution for the network updating process in each given time interval is high, since the computational complexity of optimization approaches using linear programming increases significantly as the size of the network increases. In this paper, we use deep reinforcement learning to derive a data-driven algorithm that performs path selection in the network, considering the overhead of route computation and path updates. Our proposed scheme leverages information about past network behavior to identify a set of robust paths to be used for multiple future time intervals, avoiding the overhead of frequently updating the forwarding behavior of routers. We compare the results of our approach to other traffic engineering solutions through extensive simulations across real network topologies. Our results demonstrate that our scheme reduces link utilization by about 40% compared to traditional TE schemes such as ECMP, while yielding slightly higher link utilization (around 25%) than schemes that only minimize link utilization without regard to path-update overhead.
    Adapting the Exploration Rate for Value-of-Information-Based Reinforcement Learning. (arXiv:2212.11083v1 [cs.LG])
    In this paper, we consider the problem of adjusting the exploration rate when using value-of-information-based exploration. We do this by converting the value-of-information optimization into a problem of finding equilibria of a flow for a changing exploration rate. We then develop an efficient path-following scheme for converging to these equilibria and hence uncovering optimal action-selection policies. Under this scheme, the exploration rate is automatically adapted according to the agent's experiences. Global convergence is theoretically assured. We first evaluate our exploration-rate adaptation on the Nintendo GameBoy games Centipede and Millipede. We demonstrate aspects of the search process. We show that our approach yields better policies in fewer episodes than conventional search strategies relying on heuristic, annealing-based exploration-rate adjustments. We then illustrate that these trends hold for deep, value-of-information-based agents that learn to play ten simple games and over forty more complicated games for the Nintendo GameBoy system. Performance either near or well above the level of human play is observed.
    Efficient First-order Methods for Convex Optimization with Strongly Convex Function Constraints. (arXiv:2212.11143v1 [math.OC])
    Convex function constrained optimization has received growing research interest lately. For a special convex problem which has strongly convex function constraints, we develop a new accelerated primal-dual first-order method that obtains an $\mathcal{O}(1/\sqrt{\varepsilon})$ complexity bound, improving the $\mathcal{O}(1/\varepsilon)$ result for the state-of-the-art first-order methods. The key ingredient to our development is some novel techniques to progressively estimate the strong convexity of the Lagrangian function, which enables adaptive step-size selection and faster convergence performance. In addition, we show that the complexity is further improvable in terms of the dependence on some problem parameter, via a restart scheme that calls the accelerated method repeatedly. As an application, we consider sparsity-inducing constrained optimization which has a separable convex objective and a strongly convex loss constraint. In addition to achieving fast convergence, we show that the restarted method can effectively identify the sparsity pattern (active-set) of the optimal solution in finite steps. To the best of our knowledge, this is the first active-set identification result for sparsity-inducing constrained optimization.
    GenéLive! Generating Rhythm Actions in Love Live!. (arXiv:2202.12823v2 [cs.LG] UPDATED)
    This article presents our generative model for rhythm action games together with applications in business operations. Rhythm action games are video games in which the player is challenged to issue commands at the right timings during a music session. The timings are rendered in the chart, which consists of visual symbols, called notes, flying through the screen. We introduce our deep generative model, GenéLive!, which outperforms the state-of-the-art model by taking into account musical structures through beats and temporal scales. Thanks to its favorable performance, GenéLive! was put into operation at KLab Inc., a Japan-based video game developer, and reduced the business cost of chart generation by as much as half. The application target included the phenomenal "Love Live!," which has more than 10 million users across Asia and beyond, and is one of the few rhythm action franchises that has led the online era of the genre. In this article, we evaluate the generative performance of GenéLive! using production datasets at KLab as well as open datasets for reproducibility, while the model continues to operate in their business. Our code and the model, tuned and trained using a supercomputer, are publicly available.
    Multi-View Active Learning for Short Text Classification in User-Generated Data. (arXiv:2112.02611v2 [cs.CL] UPDATED)
    Mining user-generated data often suffers from the lack of enough labeled data, short document lengths, and the informal user language. In this paper, we propose a novel active learning model to overcome these obstacles in tasks tailored for query phrases, e.g., detecting positive reports of natural disasters. Our model has three novelties: 1) It is the first approach to employ multi-view active learning in this domain. 2) It uses the Parzen-Rosenblatt window method to integrate the representativeness measure into multi-view active learning. 3) It employs a query-by-committee strategy, based on the agreement between predictors, to address the usually noisy language of the documents in this domain. We evaluate our model on four publicly available Twitter datasets with distinctly different applications. We also compare our model with a wide range of baselines, including those with multiple classifiers. The experiments show that our model is highly consistent and outperforms existing models.
    Rank4Class: A Ranking Formulation for Multiclass Classification. (arXiv:2112.09727v2 [cs.LG] UPDATED)
    Multiclass classification (MCC) is a fundamental machine learning problem of classifying each instance into one of a predefined set of classes. In the deep learning era, extensive efforts have been spent on developing more powerful neural embedding models to better represent the instance for improving MCC performance. In this paper, we do not aim to propose new neural models for instance representation learning, but to show that it is promising to boost MCC performance with a novel formulation through the lens of ranking. In particular, by viewing MCC as to rank classes for an instance, we first argue that ranking metrics, such as Normalized Discounted Cumulative Gain, can be more informative than the commonly used Top-$K$ metrics. We further demonstrate that the dominant neural MCC recipe can be transformed to a neural ranking framework. Based on such generalization, we show that it is intuitive to leverage advanced techniques from the learning to rank literature to improve the MCC performance out of the box. Extensive empirical results on both text and image classification tasks with diverse datasets and backbone neural models show the value of our proposed framework.
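    To see why ranking metrics can be more informative than Top-K accuracy: with exactly one relevant class per instance, NDCG gives partial credit that decays with the rank of the true class (a standard reduction, sketched below), whereas Top-1 scores the same prediction as simply wrong.

        import numpy as np

        def ndcg_single_label(scores, true_class):
            # With one relevant class per instance, NDCG reduces to
            # 1 / log2(1 + rank of the true class), where the top rank is 1.
            order = np.argsort(-scores)
            rank = int(np.where(order == true_class)[0][0]) + 1
            return 1.0 / np.log2(1 + rank)

        scores = np.array([0.1, 0.5, 0.3, 0.1])   # scores over 4 classes
        print(ndcg_single_label(scores, true_class=2))  # ranked 2nd -> ~0.63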
    A new weighted ensemble model for phishing detection based on feature selection. (arXiv:2212.11125v1 [cs.CR])
    A phishing attack is a type of cyber attack in which the attacker sends fake communications to entice a human victim to provide personal information or credentials. Phishing website identification can assist visitors in avoiding becoming victims of these attacks. The phishing problem is increasing day by day, and there is no single solution that can properly mitigate all vulnerabilities, thus many techniques are used. In this paper, we propose an ensemble model that combines multiple base models with a weight-based voting technique. Moreover, we applied feature selection methods and standardization to the dataset and compared the results before and after applying feature selection.
    Chatbots in a Botnet World. (arXiv:2212.11126v1 [cs.CR])
    Question-and-answer formats provide a novel experimental platform for investigating cybersecurity questions. Unlike previous chatbots, the latest ChatGPT model from OpenAI supports an advanced understanding of complex coding questions. The research demonstrates thirteen coding tasks that generally qualify as stages in the MITRE ATT&CK framework, ranging from credential access to defense evasion. With varying success, the experimental prompts generate examples of keyloggers, logic bombs, obfuscated worms, and payment-fulfilled ransomware. The empirical results illustrate cases that support the broad gain of functionality, including self-replication and self-modification, evasion, and strategic understanding of complex cybersecurity goals. One surprising feature of ChatGPT as a language-only model centers on its ability to spawn coding approaches that yield images that obfuscate or embed executable programming steps or links.
    Classification and mapping of low-statured 'shrubland' cover types in post-agricultural landscapes of the US Northeast. (arXiv:2205.05047v2 [cs.CV] UPDATED)
    Novel plant communities reshape landscapes and pose challenges for land cover classification and mapping that can constrain research and stewardship efforts. In the US Northeast, emergence of low-statured woody vegetation, or shrublands, instead of secondary forests in post-agricultural landscapes is well-documented by field studies, but poorly understood from a landscape perspective, which limits the ability to systematically study and manage these lands. To address gaps in classification/mapping of low-statured cover types where they have been historically rare, we developed models to predict shrubland distributions at 30m resolution across New York State (NYS), using a stacked ensemble combining a random forest, gradient boosting machine, and artificial neural network to integrate remote sensing of structural (airborne LiDAR) and optical (satellite imagery) properties of vegetation cover. We first classified a 1m canopy height model (CHM), derived from a patchwork of available LiDAR coverages, to define shrubland presence/absence. Next, these non-contiguous maps were used to train a model ensemble based on temporally-segmented imagery to predict shrubland probability for the entire study landscape (NYS). Approximately 2.5% of the CHM coverage area was classified as shrubland. Models using Landsat predictors trained on the classified CHM were effective at identifying shrubland (test set AUC=0.893, real-world AUC=0.904), discriminated between shrub/young forest and other cover classes, and produced qualitatively sensible maps, even when extending beyond the original training data. Our results suggest that incorporation of airborne LiDAR, even from a discontinuous patchwork of coverages, can improve land cover classification of historically rare but increasingly prevalent shrubland habitats across broader areas.
    Planning with Diffusion for Flexible Behavior Synthesis. (arXiv:2205.09991v2 [cs.LG] UPDATED)
    Model-based reinforcement learning methods often use learning only for the purpose of estimating an approximate dynamics model, offloading the rest of the decision-making work to classical trajectory optimizers. While conceptually simple, this combination has a number of empirical shortcomings, suggesting that learned models may not be well-suited to standard trajectory optimization. In this paper, we consider what it would look like to fold as much of the trajectory optimization pipeline as possible into the modeling problem, such that sampling from the model and planning with it become nearly identical. The core of our technical approach lies in a diffusion probabilistic model that plans by iteratively denoising trajectories. We show how classifier-guided sampling and image inpainting can be reinterpreted as coherent planning strategies, explore the unusual and useful properties of diffusion-based planning methods, and demonstrate the effectiveness of our framework in control settings that emphasize long-horizon decision-making and test-time flexibility.
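    Structurally, planning by iterative denoising can be sketched as follows; the denoiser here is an untrained placeholder, and conditioning is done by inpainting (re-clamping the start state after every step), so this shows only the control flow, not the paper's trained model:

        import numpy as np

        def denoise_step(traj, t):
            # Placeholder for a trained trajectory denoiser eps_theta(traj, t);
            # here it just shrinks noise toward zero so the loop runs.
            return 0.1 * traj

        def plan(start, horizon, state_dim, steps=50,
                 rng=np.random.default_rng(0)):
            # Planning as iterative denoising: start from pure noise over
            # the whole trajectory, condition by inpainting the first state.
            traj = rng.normal(size=(horizon, state_dim))
            for t in reversed(range(steps)):
                traj = traj - denoise_step(traj, t)  # one reverse step
                traj[0] = start                      # inpaint the constraint
            return traj

        plan0 = plan(start=np.zeros(3), horizon=16, state_dim=3)
        print(plan0.shape, plan0[0])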
    Sequential Training of Neural Networks with Gradient Boosting. (arXiv:1909.12098v3 [cs.LG] UPDATED)
    This paper presents a novel technique based on gradient boosting to train the final layers of a neural network (NN). Gradient boosting is an additive expansion algorithm in which a series of models are trained sequentially to approximate a given function. A neural network can also be seen as an additive expansion where the scalar product of the responses of the last hidden layer and its weights provides the final output of the network. Instead of training the network as a whole, the proposed algorithm trains the network sequentially in $T$ steps. First, the bias term of the network is initialized with a constant approximation that minimizes the average loss of the data. Then, at each step, a portion of the network, composed of $J$ neurons, is trained to approximate the pseudo-residuals on the training data computed from the previous iterations. Finally, the $T$ partial models and the bias are integrated as a single NN with $T \times J$ neurons in the hidden layer. Extensive experiments in classification and regression tasks, as well as in combination with deep neural networks, are carried out, showing competitive generalization performance with respect to neural networks trained with different standard solvers, such as Adam, L-BFGS, SGD, and deep models. Furthermore, we show that the proposed design permits switching off a number of hidden units during test time (the units that were last trained) without a significant reduction in generalization ability. This permits adapting the model to different classification speed requirements on the fly.
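    Under squared loss, the pseudo-residuals are simply the current errors, so the procedure can be sketched compactly. The toy version below trains each J-neuron block end-to-end with scikit-learn (hyperparameters are arbitrary) rather than exactly reproducing the paper's solver:

        import numpy as np
        from sklearn.neural_network import MLPRegressor

        rng = np.random.default_rng(0)
        X = rng.uniform(-3, 3, size=(500, 1))
        y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=500)

        T, J = 10, 5                   # T boosting steps, J neurons per step
        bias = y.mean()                # constant initializer minimizing MSE
        F = np.full_like(y, bias)
        blocks = []
        for _ in range(T):
            residual = y - F           # pseudo-residuals for squared loss
            block = MLPRegressor(hidden_layer_sizes=(J,),
                                 max_iter=2000).fit(X, residual)
            F += block.predict(X)
            blocks.append(block)
        # The T blocks plus the bias are equivalent to one NN with T*J hidden
        # units whose output weights are concatenated across blocks.
        print("train MSE:", np.mean((y - F) ** 2))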
    Multi-modal Molecule Structure-text Model for Text-based Retrieval and Editing. (arXiv:2212.10789v1 [cs.LG])
    There is increasing adoption of artificial intelligence in drug discovery. However, existing works use machine learning mainly to exploit the chemical structures of molecules, ignoring the vast textual knowledge available in chemistry. Incorporating textual knowledge enables us to realize new drug design objectives, adapt to text-based instructions, and predict complex biological activities. We present a multi-modal molecule structure-text model, MoleculeSTM, which jointly learns molecules' chemical structures and textual descriptions via a contrastive learning strategy. To train MoleculeSTM, we construct the largest multi-modal dataset to date, namely PubChemSTM, with over 280K chemical structure-text pairs. To demonstrate the effectiveness and utility of MoleculeSTM, we design two challenging zero-shot tasks based on text instructions: structure-text retrieval and molecule editing. MoleculeSTM possesses two main properties: open vocabulary and compositionality via natural language. In experiments, MoleculeSTM obtains state-of-the-art generalization ability to novel biochemical concepts across various benchmarks.
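    A contrastive strategy of this kind is typically a symmetric InfoNCE-style objective over a batch of paired embeddings; the sketch below illustrates that generic objective in numpy and is not MoleculeSTM's exact loss:

        import numpy as np

        def contrastive_loss(struct_emb, text_emb, tau=0.07):
            # Rows are paired: struct_emb[i] matches text_emb[i]. Cosine
            # similarity logits; diagonal entries are the positives.
            s = struct_emb / np.linalg.norm(struct_emb, axis=1, keepdims=True)
            t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
            logits = s @ t.T / tau
            ls = logits - np.log(np.exp(logits).sum(1, keepdims=True))
            lt = logits.T - np.log(np.exp(logits.T).sum(1, keepdims=True))
            # Average structure-to-text and text-to-structure directions.
            return -0.5 * (np.diag(ls).mean() + np.diag(lt).mean())

        rng = np.random.default_rng(0)
        print(contrastive_loss(rng.normal(size=(8, 32)),
                               rng.normal(size=(8, 32))))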
    Reservoir Computing Using Complex Systems. (arXiv:2212.11141v1 [cs.LG])
    Reservoir Computing is an emerging machine learning framework which is a versatile option for utilising physical systems for computation. In this paper, we demonstrate how a single node reservoir, made of a simple electronic circuit, can be employed for computation and explore the available options to improve the computational capability of the physical reservoirs. We build a reservoir computing system using a memristive chaotic oscillator as the reservoir. We choose two of the available hyperparameters to find the optimal working regime for the reservoir, resulting in two reservoir versions. We compare the performance of both the reservoirs in a set of three non-temporal tasks: approximating two non-chaotic polynomials and a chaotic trajectory of the Lorenz time series. We also demonstrate how the dynamics of the physical system plays a direct role in the reservoir's hyperparameters and hence in the reservoir's prediction ability.
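    The defining trait of reservoir computing is that the dynamical system stays fixed and only a linear readout is trained; below is a generic echo-state-style sketch (a conventional simulated reservoir, not the memristive chaotic circuit used in the paper):

        import numpy as np

        rng = np.random.default_rng(0)
        N = 200                                      # reservoir size
        W_in = rng.uniform(-0.5, 0.5, size=(N, 1))
        W = rng.normal(size=(N, N))
        W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))  # spectral radius < 1

        def run_reservoir(u):
            x, states = np.zeros(N), []
            for u_t in u:
                x = np.tanh(W @ x + W_in[:, 0] * u_t)    # fixed dynamics
                states.append(x.copy())
            return np.array(states)

        # Task: predict sin one step ahead; only the readout is trained (ridge).
        u = np.sin(0.2 * np.arange(1000))
        S, target = run_reservoir(u[:-1]), u[1:]
        W_out = np.linalg.solve(S.T @ S + 1e-6 * np.eye(N), S.T @ target)
        print("MSE:", np.mean((S[-200:] @ W_out - target[-200:]) ** 2))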
    Personalized Decentralized Multi-Task Learning Over Dynamic Communication Graphs. (arXiv:2212.11268v1 [cs.LG])
    Decentralized and federated learning algorithms face data heterogeneity as one of their biggest challenges, especially when users want to learn a specific task. Even when personalized heads are concatenated to a shared network (PF-MTL), aggregating all the networks with a decentralized algorithm can result in performance degradation due to heterogeneity in the data. Our proposed algorithm uses exchanged gradients to calculate the correlations among tasks automatically, and dynamically adjusts the communication graph to connect mutually beneficial tasks and isolate those that may negatively impact each other. This algorithm improves learning performance and leads to faster convergence compared to the case where all clients are connected to each other regardless of their correlations. We conduct experiments on a synthetic Gaussian dataset and a large-scale celebrity attributes (CelebA) dataset. The experiment with the synthetic data illustrates that our proposed method is capable of detecting tasks that are positively and negatively correlated. Moreover, the results of the experiments with CelebA demonstrate that the proposed method may produce significantly faster training results than fully-connected networks.
    Scalable Hybrid Learning Techniques for Scientific Data Compression. (arXiv:2212.10733v1 [cs.LG])
    Data compression is becoming critical for storing scientific data because many scientific applications need to store large amounts of data and post process this data for scientific discovery. Unlike image and video compression algorithms that limit errors to primary data, scientists require compression techniques that accurately preserve derived quantities of interest (QoIs). This paper presents a physics-informed compression technique implemented as an end-to-end, scalable, GPU-based pipeline for data compression that addresses this requirement. Our hybrid compression technique combines machine learning techniques and standard compression methods. Specifically, we combine an autoencoder, an error-bounded lossy compressor to provide guarantees on raw data error, and a constraint satisfaction post-processing step to preserve the QoIs within a minimal error (generally less than floating point error). The effectiveness of the data compression pipeline is demonstrated by compressing nuclear fusion simulation data generated by a large-scale fusion code, XGC, which produces hundreds of terabytes of data in a single day. Our approach works within the ADIOS framework and results in compression by a factor of more than 150 while requiring only a few percent of the computational resources necessary for generating the data, making the overall approach highly effective for practical scenarios.
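    The guarantee on raw-data error in such pipelines typically comes from quantizing the model's residual with an error-bounded step; below is a minimal sketch of that mechanism, with a crude low-pass filter standing in for the learned autoencoder. In a real pipeline the latent code would be stored and the residual codes entropy-coded rather than kept raw.

        import numpy as np

        def fake_autoencoder(x):
            # Stand-in for a learned autoencoder: low-pass reconstruction.
            k = np.fft.rfft(x)
            k[len(k) // 8:] = 0
            return np.fft.irfft(k, n=len(x))

        def compress(x, eps):
            recon = fake_autoencoder(x)
            # Uniform quantization of the residual with step 2*eps bounds
            # the pointwise reconstruction error by eps.
            q = np.round((x - recon) / (2 * eps)).astype(np.int32)
            return recon, q

        def decompress(recon, q, eps):
            return recon + 2 * eps * q

        x = np.cumsum(np.random.default_rng(0).normal(size=4096))
        recon, q = compress(x, eps=0.01)
        x_hat = decompress(recon, q, eps=0.01)
        print("max abs error:", np.max(np.abs(x - x_hat)))   # <= 0.01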
    An AI-Powered VVPAT Counter for Elections in India. (arXiv:2212.11124v1 [cs.CV])
    The Election Commission of India introduced the Voter Verified Paper Audit Trail (VVPAT) in 2019. This mechanism has increased voter confidence at the time of casting votes. However, physical verification of the VVPATs against the party-level counts from the EVMs is done in only 5 (randomly selected) machines per constituency. The time required to conduct physical verification is the bottleneck that prevents scaling this activity to 100% of machines in all constituencies. We propose an automated counter, powered by image processing and machine learning algorithms, to speed up the process and address this issue.
    Nervus: A Comprehensive Deep Learning Classification, Regression, and Prognostication Tool for both Medical Image and Clinical Data Analysis. (arXiv:2212.11113v1 [eess.IV])
    The goal of our research is to create a comprehensive and flexible library that is easy to use for medical imaging research and capable of handling grayscale images, multiple inputs (both images and tabular data), and multi-label tasks. We have named it Nervus. Built on PyTorch, which is well suited to AI research, it is organized as a four-part model to handle comprehensive inputs and outputs: a dataloader, a feature extractor, a feature mixer, and a classifier. The dataloader preprocesses the input data, the feature extractor learns features relating the training data to the ground-truth labels, the feature mixer combines the features of the extractors, and the classifier makes the final task-specific prediction from the mixed features. Nervus is thus a comprehensive and flexible model library for medical imaging research, able to handle grayscale images, multiple inputs, and multi-label tasks, which should be helpful for researchers in the field of radiology.
    BaCO: A Fast and Portable Bayesian Compiler Optimization Framework. (arXiv:2212.11142v1 [cs.PL])
    We introduce the Bayesian Compiler Optimization framework (BaCO), a general purpose autotuner for modern compilers targeting CPUs, GPUs, and FPGAs. BaCO provides the flexibility needed to handle the requirements of modern autotuning tasks. Particularly, it deals with permutation, ordered, and continuous parameter types along with both known and unknown parameter constraints. To reason about these parameter types and efficiently deliver high-quality code, BaCO uses Bayesian optimization algorithms specialized towards the autotuning domain. We demonstrate BaCO's effectiveness on three modern compiler systems: TACO, RISE & ELEVATE, and HPVM2FPGA for CPUs, GPUs, and FPGAs respectively. For these domains, BaCO outperforms current state-of-the-art autotuners by delivering on average 1.39x-1.89x faster code with a tiny search budget, and BaCO is able to reach expert-level performance 2.89x-8.77x faster.
    Relative Importance Sampling For Off-Policy Actor-Critic in Deep Reinforcement Learning. (arXiv:1810.12558v7 [cs.LG] UPDATED)
    Off-policy learning is less stable than on-policy learning in reinforcement learning (RL). One reason for this instability is the discrepancy between the target ($\pi$) and behavior ($b$) policy distributions. This discrepancy can be alleviated by employing a smooth variant of importance sampling (IS), such as relative importance sampling (RIS), which has a parameter $\beta\in[0, 1]$ that controls the smoothness. To cope with the instability, we present the first relative importance sampling off-policy actor-critic (RIS-Off-PAC) model-free algorithms in RL. In our method, the network yields a target policy (the actor) and a value function (the critic) that assesses the current policy ($\pi$) using samples drawn from the behavior policy. We train our algorithm with the action value generated by the behavior policy in the reward function, rather than the one from the target policy, and we use deep neural networks for both the actor and the critic. We evaluate our algorithm on a number of OpenAI Gym benchmark problems and demonstrate performance that is better than or comparable to several state-of-the-art RL baselines.
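    For intuition, here is a sketch of a smoothed importance weight. The exact RIS form is an assumption on our part, following the relative density-ratio style w_beta = pi / (beta*pi + (1-beta)*b), which recovers plain IS at beta = 0 and a constant weight of 1 at beta = 1:

        def ris_weight(pi_prob, b_prob, beta):
            # assumed form: beta = 0 -> plain IS ratio pi/b, beta = 1 -> 1.0
            return pi_prob / (beta * pi_prob + (1.0 - beta) * b_prob)

        pi_prob, b_prob = 0.9, 0.05    # target vs. behavior action probability
        for beta in (0.0, 0.5, 1.0):
            print(beta, ris_weight(pi_prob, b_prob, beta))
        # 0.0 -> 18.0 (unstable), 0.5 -> ~1.89 (smoothed), 1.0 -> 1.0 (flat)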
    LogAnMeta: Log Anomaly Detection Using Meta Learning. (arXiv:2212.10992v1 [cs.LG])
    Modern telecom systems are monitored with performance and system logs from multiple application layers and components. Detecting anomalous events in these logs is key to identifying security breaches, resource over-utilization, and critical or fatal errors. Current supervised log anomaly detection frameworks tend to perform poorly on new types or signatures of anomalies that have few or no samples in the training data. In this work, we propose a meta-learning-based log anomaly detection framework (LogAnMeta) for detecting anomalies from sequences of log events with few samples. LogAnMeta trains a hybrid few-shot classifier in an episodic manner. The experimental results demonstrate the efficacy of our proposed method.
    Empirical Analysis of Limits for Memory Distance in Recurrent Neural Networks. (arXiv:2212.11085v1 [cs.LG])
    Common to all kinds of recurrent neural networks (RNNs) is the intention to model relations between data points through time. We show that even when there is no immediate relationship between subsequent data points (e.g., when the data points are generated at random), RNNs are still able to remember a few data points back into the sequence by memorizing them outright using standard backpropagation. However, we also show that for classical RNNs, LSTMs, and GRUs, the distance between recurrent calls over which data points can be reproduced this way is highly limited (compared to even a loose connection between data points) and subject to various constraints imposed by the type and size of the RNN in question. This implies the existence of a hard limit (far below the information-theoretic one) on the distance between related data points within which RNNs are still able to recognize said relation.
    Generating music with sentiment using Transformer-GANs. (arXiv:2212.11134v1 [cs.SD])
    The field of Automatic Music Generation has seen significant progress thanks to the advent of Deep Learning. However, most of these results have been produced by unconditional models, which lack the ability to interact with their users and do not allow them to guide the generative process in meaningful and practical ways. Moreover, synthesizing music that remains coherent across longer timescales, while still capturing the local aspects that make it sound ``realistic'' or ``human-like'', is still challenging. This is due to the large computational requirements of working with long sequences of data, and also to limitations imposed by the training schemes that are often employed. In this paper, we propose a generative model of symbolic music conditioned on human sentiment data. The model is a Transformer-GAN trained with labels that correspond to different configurations of the valence and arousal dimensions, which quantitatively represent human affective states. We tackle both of the problems above by employing an efficient linear version of Attention and by using a Discriminator both as a tool to improve the overall quality of the generated music and to improve its ability to follow the conditioning signals.
    In-context Learning Distillation: Transferring Few-shot Learning Ability of Pre-trained Language Models. (arXiv:2212.10670v1 [cs.CL])
    Given the success of large pre-trained language models at in-context learning, we introduce in-context learning distillation to transfer the in-context few-shot learning ability of large models to smaller models. We propose combining in-context learning objectives with language modeling objectives to distill both the ability to read in-context examples and task knowledge into the smaller models. We perform in-context learning distillation under two different few-shot learning paradigms: Meta In-context Tuning (Meta-ICT) and Multitask In-context Tuning (Multitask-ICT). Multitask-ICT performs better on multitask few-shot learning but also requires more computation than Meta-ICT. Our method shows consistent improvements for both Meta-ICT and Multitask-ICT on two benchmarks: LAMA and CrossFit. Our extensive experiments and analysis reveal that in-context learning objectives and language modeling objectives are complementary under the Multitask-ICT paradigm: in-context learning objectives achieve the best performance when combined with language modeling objectives.
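    A minimal sketch of the combined training signal, assuming a simple weighted sum of the two objectives (the weight `alpha` and the loss granularity are illustrative assumptions, not the paper's exact setup):

        import torch
        import torch.nn.functional as F

        def distillation_loss(icl_logits, icl_labels, lm_logits, lm_targets,
                              alpha=0.5):
            # in-context learning loss on answer tokens + LM loss on all tokens
            icl_loss = F.cross_entropy(icl_logits, icl_labels)
            lm_loss = F.cross_entropy(lm_logits, lm_targets)
            return alpha * icl_loss + (1.0 - alpha) * lm_loss

        icl_logits = torch.randn(4, 100)     # 4 queries, 100-way answer space
        icl_labels = torch.randint(0, 100, (4,))
        lm_logits = torch.randn(32, 100)     # 32 context tokens
        lm_targets = torch.randint(0, 100, (32,))
        print(distillation_loss(icl_logits, icl_labels, lm_logits, lm_targets))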
    Holistic risk assessment of inference attacks in machine learning. (arXiv:2212.10628v1 [cs.CR])
    As machine learning applications expand, privacy and safety issues become harder to ignore. In particular, inference attacks against machine learning models allow adversaries to infer sensitive information about the target model, such as its training data or model parameters. Inference attacks can lead to serious consequences, including violating individuals' privacy and compromising the intellectual property of the model's owner. To date, researchers have studied and analyzed several types of inference attacks in depth, albeit in isolation; there is still a lack of a holistic risk assessment of inference attacks against machine learning models, covering their application in different scenarios, the common factors affecting their performance, and the relationships among the attacks. As a result, this paper performs a holistic risk assessment of different inference attacks against machine learning models, focusing on three representative attacks: membership inference, attribute inference, and model stealing. We establish a threat model taxonomy and train a total of 12 target models, using three model architectures (AlexNet, ResNet18, and a simple CNN), on four datasets: CelebA, UTKFace, STL10, and FMNIST.
    Ensemble learning techniques for intrusion detection system in the context of cybersecurity. (arXiv:2212.10913v1 [cs.CR])
    Recently, there has been interest in improving the resources available for Intrusion Detection System (IDS) techniques. Several studies related to cybersecurity show that network intrusions and data hijacking are increasingly frequent and complex, and businesses whose critical operations depend on computing resources cannot afford to leave their information vulnerable. Cybersecurity has become an indispensable concern in corporations, and Security teams deal daily with preventing the risk of intrusions into the environment. The main objective of this study was therefore to investigate the Ensemble Learning technique, using the Stacking method supported by the Support Vector Machine (SVM) and k-Nearest Neighbour (kNN) algorithms, to optimize the results for DDoS attack detection. For this, the Intrusion Detection System concept was applied using the Orange data mining and machine learning tool to obtain better results.
    Semi-Supervised Bifold Teacher-Student Learning for Indoor Presence Detection Under Time-Varying CSI. (arXiv:2212.10802v1 [cs.AI])
    In recent years, abundant research has focused on indoor human presence detection based on labor-intensive supervised learning (SL) and channel state information (CSI). These existing studies adopt the spatial information of CSI to improve detection accuracy. However, the channel is susceptible to arbitrary environmental changes in practice, such as object movement, atmospheric factors, and machine rebooting, which degrade prediction accuracy; moreover, existing SL-based methods require re-training a new model with time-consuming labeling. Designing a semi-supervised learning (SSL) based scheme that continuously monitors the model "life-cycle" therefore becomes imperative. In this paper, we propose bifold teacher-student (BTS) learning for a presence detection system, which applies SSL by utilizing a partially labeled and unlabeled dataset. The proposed primal-dual teacher-student network is capable of intelligently learning spatial and temporal features from labeled and unlabeled CSI. Additionally, an enhanced penalized loss function leveraging entropy and a distance measure can distinguish drifted data, i.e., features of the new dataset that are affected by the time-varying effect and deviate from the original distribution. The experimental results demonstrate that the proposed BTS system sustains asymptotic accuracy after retraining the model with unlabeled data. Moreover, label-free BTS outperforms the existing SSL-based models in detection accuracy, while approaching the performance of SL-based methods.
    Similarity Contrastive Estimation for Image and Video Soft Contrastive Self-Supervised Learning. (arXiv:2212.11187v1 [cs.CV])
    Contrastive representation learning has proven to be an effective self-supervised learning method for images and videos. Most successful approaches are based on Noise Contrastive Estimation (NCE) and use different views of an instance as positives that should be contrasted with other instances, called negatives, which are treated as noise. However, several instances in a dataset are drawn from the same distribution and share underlying semantic information. A good data representation should contain relations between the instances, or semantic similarity and dissimilarity, that contrastive learning harms by treating all negatives as noise. To circumvent this issue, we propose a novel formulation of contrastive learning using semantic similarity between instances called Similarity Contrastive Estimation (SCE). Our training objective is a soft contrastive one that brings the positives closer and estimates a continuous distribution to push or pull negative instances based on their learned similarities. We empirically validate our approach on both image and video representation learning. We show that SCE performs competitively with the state of the art on the ImageNet linear evaluation protocol for fewer pretraining epochs and that it generalizes to several downstream image tasks. We also show that SCE reaches state-of-the-art results for pretraining video representation and that the learned representation can generalize to video downstream tasks.
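    The following sketch illustrates a soft contrastive objective in the spirit of SCE: each anchor's target mixes a one-hot positive with a similarity distribution over other instances. The mixing coefficient, temperatures, and memory-bank setup are illustrative assumptions rather than the paper's tuned configuration.

        import torch
        import torch.nn.functional as F

        def soft_contrastive_loss(z_online, z_pos, z_bank,
                                  lam=0.5, t=0.1, t_target=0.07):
            z_online, z_pos, z_bank = (F.normalize(z, dim=-1)
                                       for z in (z_online, z_pos, z_bank))
            # column 0: the positive pair; columns 1..N: the "negatives"
            logits = torch.cat([(z_online * z_pos).sum(-1, keepdim=True),
                                z_online @ z_bank.T], dim=1) / t
            with torch.no_grad():
                # soft target: one-hot positive mixed with a similarity
                # distribution computed from the positive view
                sim = F.softmax(z_pos @ z_bank.T / t_target, dim=1)
                target = torch.zeros_like(logits)
                target[:, 0] = lam
                target[:, 1:] = (1 - lam) * sim
            return -(target * F.log_softmax(logits, dim=1)).sum(1).mean()

        z = torch.randn(8, 128)
        print(soft_contrastive_loss(z, z + 0.1 * torch.randn_like(z),
                                    torch.randn(64, 128)))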
    A Survey of Mix-based Data Augmentation: Taxonomy, Methods, Applications, and Explainability. (arXiv:2212.10888v1 [cs.LG])
    Data augmentation (DA) is indispensable in modern machine learning and deep neural networks. The basic idea of DA is to construct new training data that improve the model's generalization, by adding slightly perturbed versions of existing data or synthesizing new data. In this work, we review a small but essential subset of DA -- Mix-based Data Augmentation (MixDA), which generates novel samples by mixing multiple examples. Unlike conventional DA approaches that operate on a single sample or require domain knowledge, MixDA is more general in creating a broad spectrum of new data and has received increasing attention in the community. We begin by proposing a new taxonomy that classifies MixDA into Mixup-based, Cutmix-based, and hybrid approaches according to a hierarchical view of the data mix. Various MixDA techniques are then comprehensively reviewed in a fine-grained way. Owing to its generality, MixDA has penetrated a variety of applications, which are also thoroughly reviewed in this work. We further examine why MixDA works, from the perspectives of model performance, generalization, and calibration, while explaining model behavior based on the properties of MixDA. Finally, we recapitulate the critical findings and fundamental challenges of current MixDA studies and outline potential directions for future work. Different from previous related works that summarize DA approaches in a specific domain (e.g., images or natural language processing) or only review a part of MixDA studies, we are the first to provide a systematic survey of MixDA covering its taxonomy, methodology, applications, and explainability. This work can serve as a roadmap to MixDA techniques and application reviews while providing promising directions for researchers interested in this exciting area.
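    As a concrete instance of the Mixup-based category, here is the classic mixup operation, which forms new samples as convex combinations of two examples and of their one-hot labels:

        import numpy as np

        def mixup(x1, y1, x2, y2, alpha=0.2, rng=np.random.default_rng(0)):
            # draw a mixing coefficient from Beta(alpha, alpha) and blend
            # both the inputs and the labels with the same coefficient
            lam = rng.beta(alpha, alpha)
            return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

        x1, x2 = np.ones((32, 32, 3)), np.zeros((32, 32, 3))   # toy "images"
        y1, y2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])    # one-hot labels
        x_mix, y_mix = mixup(x1, y1, x2, y2)
        print(y_mix)    # a soft label, e.g. [0.98, 0.02]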
    Learning List-Level Domain-Invariant Representations for Ranking. (arXiv:2212.10764v1 [cs.IR])
    Domain adaptation aims to transfer the knowledge acquired by models trained on (data-rich) source domains to (low-resource) target domains, for which a popular method is invariant representation learning. While they have been studied extensively for classification and regression problems, how they apply to ranking problems, where the data and metrics have a list structure, is not well understood. Theoretically, we establish a domain adaptation generalization bound for ranking under listwise metrics such as MRR and NDCG. The bound suggests an adaptation method via learning list-level domain-invariant feature representations, whose benefits are empirically demonstrated by unsupervised domain adaptation experiments on real-world ranking tasks, including passage reranking. A key message is that for domain adaptation, the representations should be analyzed at the same level at which the metric is computed, as we show that learning invariant representations at the list level is most effective for adaptation on ranking problems.
    Reward Bonuses with Gain Scheduling Inspired by Iterative Deepening Search. (arXiv:2212.10765v1 [cs.LG])
    This paper introduces a novel method of adding intrinsic bonuses to a task-oriented reward function in order to efficiently facilitate reinforcement learning search. While various bonuses have been designed to date, they are analogous to the depth-first and breadth-first search algorithms in graph theory. This paper therefore first designs two bonuses, one for each of them. Then, a heuristic gain scheduling is applied to the designed bonuses, inspired by iterative deepening search, which is known to inherit the advantages of the two search algorithms. The proposed method is expected to allow the agent to efficiently reach the best solution in deeper states by gradually exploring unknown states. In three locomotion tasks with dense rewards and three simple tasks with sparse rewards, we show that the two types of bonuses contribute to the performance improvement of the different tasks complementarily, and that by combining them with the proposed gain scheduling, all tasks can be accomplished with high performance.
    Crab: Learning Certifiably Fair Predictive Models in the Presence of Selection Bias. (arXiv:2212.10839v1 [cs.LG])
    A recent explosion of research focuses on developing methods and tools for building fair predictive models. However, most of this work relies on the assumption that the training and testing data are representative of the target population on which the model will be deployed. In practice, training data often suffer from selection bias and are not representative of the target population, for many reasons including the cost and feasibility of collecting and labeling data, historical discrimination, and individual biases. In this paper, we introduce a new framework for certifying and ensuring the fairness of predictive models trained on biased data. We take inspiration from query answering over incomplete and inconsistent databases to present and formalize the problem of consistent range approximation (CRA) of answers to queries about aggregate information for the target population. We leverage background knowledge about the data collection process, the biased data, and limited or no auxiliary data sources to compute a range of answers for aggregate queries over the target population that are consistent with the available information. We then develop methods that use CRA of such aggregate queries to build predictive models that are certifiably fair on the target population even when no external information about that population is available during training. We evaluate our methods on real data and demonstrate improvements over the state of the art. Significantly, we show that enforcing fairness using our methods can lead to predictive models that are not only fair, but also more accurate on the target population.
    SoK: Let The Privacy Games Begin! A Unified Treatment of Data Inference Privacy in Machine Learning. (arXiv:2212.10986v1 [cs.LG])
    Deploying machine learning models in production may allow adversaries to infer sensitive information about training data. There is a vast literature analyzing different types of inference risks, ranging from membership inference to reconstruction attacks. Inspired by the success of games (i.e., probabilistic experiments) to study security properties in cryptography, some authors describe privacy inference risks in machine learning using a similar game-based style. However, adversary capabilities and goals are often stated in subtly different ways from one presentation to the other, which makes it hard to relate and compose results. In this paper, we present a game-based framework to systematize the body of knowledge on privacy inference risks in machine learning.
    Vulnerabilities of Deep Learning-Driven Semantic Communications to Backdoor (Trojan) Attacks. (arXiv:2212.11205v1 [cs.CR])
    This paper highlights vulnerabilities of deep learning-driven semantic communications to backdoor (Trojan) attacks. Semantic communication aims to convey a desired meaning while transferring information from a transmitter to its receiver. An encoder-decoder pair that is represented by two deep neural networks (DNNs) as part of an autoencoder is trained to reconstruct signals such as images at the receiver by transmitting latent features of small size over a limited number of channel uses. In the meantime, another DNN of a semantic task classifier at the receiver is jointly trained with the autoencoder to check the meaning conveyed to the receiver. The complex decision space of the DNNs makes semantic communications susceptible to adversarial manipulations. In a backdoor (Trojan) attack, the adversary adds triggers to a small portion of training samples and changes the label to a target label. When the transfer of images is considered, the triggers can be added to the images or equivalently to the corresponding transmitted or received signals. At test time, the adversary activates these triggers by providing poisoned samples as input to the encoder (or decoder) of semantic communications. The backdoor attack can effectively change the semantic information transferred for the poisoned input samples to a target meaning. As the performance of semantic communications improves with the signal-to-noise ratio and the number of channel uses, the success of the backdoor attack increases as well. Also, increasing the Trojan ratio in the training data makes the attack more successful. Meanwhile, the effect of this attack on the unpoisoned input samples remains limited. Overall, this paper shows that the backdoor attack poses a serious threat to semantic communications and presents novel design guidelines to preserve the meaning of transferred information in the presence of backdoor attacks.
    Neighboring state-based RL Exploration. (arXiv:2212.10712v1 [cs.LG])
    Reinforcement Learning is a powerful tool for modeling decision-making processes. However, it relies on an exploration-exploitation trade-off that remains an open challenge for many tasks. In this work, we study neighboring state-based, model-free exploration, guided by the intuition that, for an early-stage agent, considering actions derived from a bounded region of nearby states may lead to better actions when exploring. We propose two algorithms that choose exploratory actions based on a survey of nearby states, and find that one of our methods, ${\rho}$-explore, consistently outperforms the Double DQN baseline in a discrete environment by 49\% in terms of Eval Reward Return.
    VCNet: A self-explaining model for realistic counterfactual generation. (arXiv:2212.10847v1 [cs.AI])
    Counterfactual explanation is a common class of methods for making local explanations of machine learning decisions. For a given instance, these methods aim to find the smallest modification of feature values that changes the decision predicted by a machine learning model. One of the challenges of counterfactual explanation is the efficient generation of realistic counterfactuals. To address this challenge, we propose VCNet (Variational Counter Net), a model architecture that combines a predictor and a counterfactual generator that are jointly trained, for regression or classification tasks. VCNet is able both to generate predictions and to generate counterfactual explanations without having to solve another minimisation problem. Our contribution is the generation of counterfactuals that are close to the distribution of the predicted class, achieved by learning a variational autoencoder conditioned on the output of the predictor in a joint-training fashion. We present an empirical evaluation on tabular datasets and across several interpretability metrics. The results are competitive with the state-of-the-art methods.
    MolCPT: Molecule Continuous Prompt Tuning to Generalize Molecular Representation Learning. (arXiv:2212.10614v1 [cs.LG])
    Molecular representation learning is crucial for the problem of molecular property prediction, where graph neural networks (GNNs) serve as an effective solution due to their structure modeling capabilities. Since labeled data is often scarce and expensive to obtain, it is a great challenge for GNNs to generalize in the extensive molecular space. Recently, the training paradigm of "pre-train, fine-tune" has been leveraged to improve the generalization capabilities of GNNs. It uses self-supervised information to pre-train the GNN, and then performs fine-tuning to optimize the downstream task with just a few labels. However, pre-training does not always yield statistically significant improvement, especially for self-supervised learning with random structural masking. In fact, the molecular structure is characterized by motif subgraphs, which are frequently occurring and influence molecular properties. To leverage the task-related motifs, we propose a novel paradigm of "pre-train, prompt, fine-tune" for molecular representation learning, named molecule continuous prompt tuning (MolCPT). MolCPT defines a motif prompting function that uses the pre-trained model to project the standalone input into an expressive prompt. The prompt effectively augments the molecular graph with meaningful motifs in the continuous representation space; this provides more structural patterns to aid the downstream classifier in identifying molecular properties. Extensive experiments on several benchmark datasets show that MolCPT efficiently generalizes pre-trained GNNs for molecular property prediction, with or without a few fine-tuning steps.
    Greenhouse gases emissions: estimating corporate non-reported emissions using interpretable machine learning. (arXiv:2212.10844v1 [cs.LG])
    As of 2022, greenhouse gas (GHG) emissions reporting and auditing are not yet compulsory for all companies, and methodologies of measurement and estimation are not unified. We propose a machine learning-based model to estimate scope 1 and scope 2 GHG emissions of companies that do not report them yet. Our model, specifically designed to be transparent and fully adapted to this use case, is able to estimate emissions for a large universe of companies. It shows good out-of-sample performance both globally and at a granular level, when evaluated by sector, by country, or by revenue bucket. We also compare our results to those of other providers and find our estimates to be more accurate. Thanks to the proposed explainability tools using Shapley values, our model is fully interpretable: the user is able to understand which factors explain the GHG emissions of each particular company.
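    To illustrate the kind of Shapley-value interpretability layer described above, here is a sketch using the shap library on a generic gradient-boosted regressor with synthetic features; the paper's actual model, feature set, and SHAP variant are not reproduced here.

        import numpy as np
        import shap
        from sklearn.ensemble import GradientBoostingRegressor

        rng = np.random.default_rng(0)
        X = rng.normal(size=(500, 5))    # stand-ins for revenue, sector, etc.
        y = 3 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.1, size=500)

        model = GradientBoostingRegressor().fit(X, y)
        explainer = shap.TreeExplainer(model)
        shap_values = explainer.shap_values(X[:10])
        print(shap_values.shape)   # (10, 5): per-company, per-feature splits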
    Federated Graph Neural Networks: Overview, Techniques and Challenges. (arXiv:2202.07256v2 [cs.DC] UPDATED)
    With its capability to deal with graph data, which is widely found in practical applications, graph neural networks (GNNs) have attracted significant research attention in recent years. As societies become increasingly concerned with the need for data privacy protection, GNNs face the need to adapt to this new normal. Besides, as clients in Federated Learning (FL) may have relationships, more powerful tools are required to utilize such implicit information to boost performance. This has led to the rapid development of the emerging research field of federated graph neural networks (FedGNNs). This promising interdisciplinary field is highly challenging for interested researchers to grasp. The lack of an insightful survey on this topic further exacerbates the entry difficulty. In this paper, we bridge this gap by offering a comprehensive survey of this emerging field. We propose a 2-dimensional taxonomy of the FedGNNs literature: 1) the main taxonomy provides a clear perspective on the integration of GNNs and FL by analyzing how GNNs enhance FL training as well as how FL assists GNNs training, and 2) the auxiliary taxonomy provides a view on how FedGNNs deal with heterogeneity across FL clients. Through discussions of key ideas, challenges, and limitations of existing works, we envision future research directions that can help build more robust, explainable, efficient, fair, inductive, and comprehensive FedGNNs.
    GraphIX: Graph-based In silico XAI (explainable artificial intelligence) for drug repositioning from biopharmaceutical network. (arXiv:2212.10788v1 [cs.LG])
    Drug repositioning holds great promise because it can reduce the time and cost of new drug development. While drug repositioning can omit various R&D processes, confirming pharmacological effects on biomolecules is essential for application to new diseases. Biomedical explainability in a drug repositioning model can support appropriate insights in subsequent in-depth studies. However, the validity of the XAI methodology is still under debate, and the effectiveness of XAI in drug repositioning prediction applications remains unclear. In this study, we propose GraphIX, an explainable drug repositioning framework using biological networks, and quantitatively evaluate its explainability. GraphIX first learns the network weights and node features using a graph neural network from known drug indications and a knowledge graph that consists of three types of nodes (whose type information is not given to the model): disease, drug, and protein. Analysis of the post-learning features showed that the node types, which were not known to the model beforehand, are distinguished through the learning process based on the graph structure. From the learned weights and features, GraphIX then predicts the disease-drug association and calculates the contribution values of the nodes located in the neighborhood of the predicted disease and drug. We hypothesized that a neighboring protein node to which the model assigns a high contribution is important for understanding the actual pharmacological effects. Quantitative evaluation of the validity of protein nodes' contributions using a real-world database showed that the high-contribution proteins identified by GraphIX are reasonable as mechanisms of drug action. GraphIX is a framework for evidence-based drug discovery that can present new disease-drug associations to users and identify the proteins important for understanding their pharmacological effects from a large and complex knowledge base.
    Towards Rapid Prototyping and Comparability in Active Learning for Deep Object Detection. (arXiv:2212.10836v1 [cs.CV])
    Active learning as a paradigm in deep learning is especially important in applications involving intricate perception tasks such as object detection, where labels are difficult and expensive to acquire. Development of active learning methods in such fields is highly computationally expensive and time-consuming, which obstructs the progression of research and leads to a lack of comparability between methods. In this work, we propose and investigate a sandbox setup for rapid development and transparent evaluation of active learning in deep object detection. Our experiments with commonly used configurations of datasets and detection architectures found in the literature show that results obtained in our sandbox environment are representative of results on standard configurations. The total compute time to obtain results and assess the learning behavior can thereby be reduced by factors of up to 14 when comparing with Pascal VOC and up to 32 when comparing with BDD100k. This allows for testing and evaluating data acquisition and labeling strategies in under half a day and contributes to the transparency and development speed in the field of active learning for object detection.
    5G Long-Term and Large-Scale Mobile Traffic Forecasting. (arXiv:2212.10869v1 [cs.LG])
    It is crucial for the service provider to comprehend and forecast mobile traffic in large-scale cellular networks in order to govern and manage mechanisms for base station placement, load balancing, and network planning. The purpose of this article is to extract and simulate traffic patterns from more than 14,000 cells that have been installed in different metropolitan areas. To do this, we create, implement, and assess a method in which cells are first categorized by their point of interest and then clustered based on the temporal distribution of cells in each region. The proposed model has been tested using real-world 5G mobile traffic datasets collected over 31 weeks in various cities. We found that our proposed model performed well in predicting mobile traffic patterns up to 2 weeks in advance. Our model outperformed the base model in most areas of interest and generally achieved up to 15\% less prediction error compared to the na\"ive approach. This indicates that our approach is effective in predicting mobile traffic patterns in large-scale cellular networks.
    SPT: Semi-Parametric Prompt Tuning for Multitask Prompted Learning. (arXiv:2212.10929v1 [cs.CL])
    Pre-trained large language models can efficiently interpolate human-written prompts in a natural way. Multitask prompted learning can help generalization through a diverse set of tasks at once, thus enhancing the potential for more effective downstream fine-tuning. To perform efficient multitask inference in the same batch, parameter-efficient fine-tuning methods such as prompt tuning have been proposed. However, the existing prompt tuning methods may lack generalization. We propose SPT, a semi-parametric prompt tuning method for multitask prompted learning. The novel component of SPT is a memory bank from which memory prompts are retrieved based on discrete prompts. Extensive experiments, such as (i) fine-tuning a full language model with SPT on 31 different tasks from 8 different domains and evaluating zero-shot generalization on 9 heldout datasets under 5 NLP task categories and (ii) pretraining SPT on the GLUE datasets and evaluating fine-tuning on the SuperGLUE datasets, demonstrate the effectiveness of SPT.
    Extractive Text Summarization Using Generalized Additive Models with Interactions for Sentence Selection. (arXiv:2212.10707v1 [cs.CL])
    Automatic Text Summarization (ATS) is becoming increasingly relevant with the growth of textual data; however, with the popularization of public large-scale datasets, some recent machine learning approaches have focused on dense models and architectures that, despite producing notable results, usually result in models that are difficult to interpret. Given the challenge of interpretable learning-based text summarization and the importance it may have for advancing the current state of the ATS field, this work studies the application of two modern Generalized Additive Models with interactions, namely Explainable Boosting Machine and GAMI-Net, to the extractive summarization problem, based on linguistic features and binary classification.
    Hyperparameters in Contextual RL are Highly Situational. (arXiv:2212.10876v1 [cs.LG])
    Although Reinforcement Learning (RL) has shown impressive results in games and simulation, real-world application of RL suffers from its instability under changing environment conditions and hyperparameters. We give a first impression of the extent of this instability by showing that the hyperparameters found by automatic hyperparameter optimization (HPO) methods are not only dependent on the problem at hand, but even on how well the state describes the environment dynamics. Specifically, we show that agents in contextual RL require different hyperparameters if they are shown how environmental factors change. In addition, finding adequate hyperparameter configurations is not equally easy for both settings, further highlighting the need for research into how hyperparameters influence learning and generalization in RL.
    A Nearly Tight Bound for Fitting an Ellipsoid to Gaussian Random Points. (arXiv:2212.11221v1 [math.PR])
    We prove that, for a sufficiently small universal constant $c>0$, a random set of $c d^2/\log^4(d)$ independent Gaussian random points in $\mathbb{R}^d$ lies on a common ellipsoid with high probability. This nearly establishes a conjecture of~\cite{SaundersonCPW12}, within logarithmic factors. The latter conjecture has attracted significant attention over the past decade, due to its connections to machine learning and sum-of-squares lower bounds for certain statistical problems.
    Complete the Missing Half: Augmenting Aggregation Filtering with Diversification for Graph Convolutional Neural Networks. (arXiv:2212.10822v1 [cs.LG])
    The core operation of current Graph Neural Networks (GNNs) is aggregation, enabled by the graph Laplacian or message passing, which filters the neighborhood information of nodes. Though effective for various tasks, in this paper we show that aggregation is potentially a problematic factor underlying all GNN models for learning on certain datasets, as it forces the node representations to become similar, making the nodes gradually lose their identity and become indistinguishable. Hence, we augment the aggregation operations with their dual, i.e., diversification operators that make the nodes more distinct and preserve their identity. Such augmentation replaces aggregation with a two-channel filtering process that, in theory, is beneficial for enriching the node representations. In practice, the proposed two-channel filters can easily be patched onto existing GNN methods with diverse training strategies, including spectral and spatial (message passing) methods. In the experiments, we observe the desired characteristics of the models and a significant performance boost over the baselines on 9 node classification tasks.
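    A minimal sketch of the two-channel idea on a toy graph: a low-pass (aggregation) channel averages a node with its neighbours, while the dual high-pass (diversification) channel keeps the residual that preserves node identity. The mixing weights are illustrative assumptions.

        import numpy as np

        def two_channel_filter(A, X, w_low=0.5, w_high=0.5):
            deg = A.sum(1)
            A_hat = A / np.maximum(deg, 1)[:, None]   # row-normalized adjacency
            low = A_hat @ X          # aggregation: average of the neighbours
            high = X - A_hat @ X     # diversification: identity-preserving residual
            return w_low * low + w_high * high

        A = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)
        X = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
        print(two_channel_filter(A, X))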
    Multi-hop Evidence Retrieval for Cross-document Relation Extraction. (arXiv:2212.10786v1 [cs.CL])
    Relation Extraction (RE) has been extended to cross-document scenarios because many relations are not simply described in a single document. This inevitably brings the challenge of efficient open-space evidence retrieval to support the inference of cross-document relations, along with the challenge of multi-hop reasoning on top of entities and evidence scattered across an open set of documents. To combat these challenges, we propose Mr.CoD, a multi-hop evidence retrieval method based on evidence path mining and ranking with adapted dense retrievers. We explore multiple variants of retrievers to show that evidence retrieval is an essential part of cross-document RE. Experiments on CodRED show that evidence retrieval with Mr.CoD effectively acquires cross-document evidence that essentially supports open-setting cross-document RE. Additionally, we show that Mr.CoD facilitates evidence retrieval and boosts end-to-end RE performance with effective multi-hop reasoning in both closed and open settings of RE.
    A Physics-Informed Neural Network to Model Port Channels. (arXiv:2212.10681v1 [physics.flu-dyn])
    We describe a Physics-Informed Neural Network (PINN) that simulates the flow induced by the astronomical tide in a synthetic port channel, with dimensions based on the Santos - S\~ao Vicente - Bertioga Estuarine System. PINN models aim to combine the knowledge of physical systems and data-driven machine learning models. This is done by training a neural network to minimize the residuals of the governing equations at sampled points. In this work, our flow is governed by the Navier-Stokes equations with some approximations. There are two main novelties in this paper. First, we design our model to assume that the flow is periodic in time, which is not feasible in conventional simulation methods. Second, we evaluate the benefit of resampling the function evaluation points during training, which has a near-zero computational cost and has been verified to improve the final model, especially for small batch sizes. Finally, we discuss some limitations of the approximations used in the Navier-Stokes equations regarding the modeling of turbulence and how it interacts with PINNs.
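    One simple way to hard-wire periodicity in time, sketched below under an assumed architecture and forcing period: the network never sees t directly, only (sin(wt), cos(wt)), so its output is exactly periodic with the chosen tidal period by construction.

        import math
        import torch

        T = 12.42 * 3600.0        # assumed forcing period, e.g. the M2 tide (s)
        w = 2 * math.pi / T

        net = torch.nn.Sequential(               # illustrative architecture
            torch.nn.Linear(4, 64), torch.nn.Tanh(),
            torch.nn.Linear(64, 64), torch.nn.Tanh(),
            torch.nn.Linear(64, 3),              # assumed outputs: u, v, pressure
        )

        def flow(x, y, t):
            # the PINN only ever sees periodic functions of t
            inp = torch.stack([x, y, torch.sin(w * t), torch.cos(w * t)], dim=-1)
            return net(inp)

        x = y = torch.zeros(5)
        t = torch.linspace(0.0, T, 5)
        # periodic by construction (up to float precision): prints True
        print(torch.allclose(flow(x, y, t), flow(x, y, t + T), atol=1e-4))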
    On Reinforcement Learning for the Game of 2048. (arXiv:2212.11087v1 [cs.LG])
    2048 is a single-player stochastic puzzle game. This intriguing and addictive game has been popular worldwide and has attracted researchers to develop game-playing programs. Due to its simplicity and complexity, 2048 has become an interesting and challenging platform for evaluating the effectiveness of machine learning methods. This dissertation conducts comprehensive research on reinforcement learning and computer game algorithms for 2048. First, it proposes optimistic temporal difference learning, which significantly improves the quality of learning by employing optimistic initialization to encourage exploration in 2048. Furthermore, based on this approach, a state-of-the-art program for 2048 is developed, which achieves the highest performance among all learning-based programs, namely an average score of 625377 points and a rate of 72% for reaching 32768-tiles. Second, the dissertation investigates several techniques related to 2048, including n-tuple network ensemble learning, Monte Carlo tree search, and deep reinforcement learning, which are promising for further improving the performance of the current state-of-the-art program. Finally, it discusses pedagogical applications related to 2048 by proposing course designs and summarizing the teaching experience. The proposed course designs use 2048-like games as materials for beginners to learn reinforcement learning and computer game algorithms. The courses have been successfully applied to graduate-level students and were well received in student feedback.
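    A toy sketch of the optimistic-initialization mechanism on a small chain MDP: values start far above any attainable return and decay toward the true returns only for visited states, which is what makes still-optimistic, unvisited states look attractive under greedy action selection. The 2048-specific n-tuple networks are not reproduced here.

        import numpy as np

        n_states, alpha, gamma = 5, 0.1, 0.99
        V = np.full(n_states, 100.0)    # optimistic init: true returns are <= 1

        for _ in range(1000):
            s = 0
            while s < n_states - 1:
                s_next = s + 1
                done = s_next == n_states - 1
                r = 1.0 if done else 0.0
                target = r + (0.0 if done else gamma * V[s_next])
                V[s] += alpha * (target - V[s])     # TD(0) update
                s = s_next

        # visited states decay from 100 towards the true returns;
        # the terminal state is never updated and keeps its optimistic value
        print(V)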
    Benchmarking Large Language Models for Automated Verilog RTL Code Generation. (arXiv:2212.11140v1 [cs.PL])
    Automating hardware design could remove a significant amount of human error from the engineering process and lead to fewer design defects. Verilog is a popular hardware description language for modeling and designing digital systems, so generating Verilog code is a critical first step. Emerging large language models (LLMs) are able to write high-quality code in other programming languages. In this paper, we characterize the ability of LLMs to generate useful Verilog. For this, we fine-tune pre-trained LLMs on Verilog datasets collected from GitHub and Verilog textbooks. We construct an evaluation framework comprising test-benches for functional analysis and a flow to test the syntax of Verilog code generated in response to problems of varying difficulty. Our findings show that, across our problem scenarios, fine-tuning makes the LLMs more capable of producing syntactically correct code (25.9% overall). Further, when analyzing functional correctness, a fine-tuned open-source CodeGen LLM can outperform the state-of-the-art commercial Codex LLM (6.5% overall). Training/evaluation scripts and LLM checkpoints are available: https://github.com/shailja-thakur/VGen.
    METEOR Guided Divergence for Video Captioning. (arXiv:2212.10690v1 [cs.CV])
    Automatic video captioning aims for a holistic visual scene understanding. It requires a mechanism for capturing temporal context in video frames and the ability to comprehend the actions and associations of objects in a given timeframe. Such a system should additionally learn to abstract video sequences into sensible representations as well as to generate natural written language. While the majority of captioning models focus solely on the visual inputs, little attention has been paid to the audiovisual modality. To tackle this issue, we propose a novel two-fold approach. First, we implement a reward-guided KL Divergence to train a video captioning model which is resilient towards token permutations. Second, we utilise a Bi-Modal Hierarchical Reinforcement Learning (BMHRL) Transformer architecture to capture long-term temporal dependencies of the input data as a foundation for our hierarchical captioning module. Using our BMHRL, we show the suitability of the HRL agent in the generation of content-complete and grammatically sound sentences by achieving $4.91$, $2.23$, and $10.80$ in BLEU3, BLEU4, and METEOR scores, respectively on the ActivityNet Captions dataset. Finally, we make our BMHRL framework and trained models publicly available for users and developers at https://github.com/d-rothen/bmhrl.
    Hierarchically branched diffusion models for efficient and interpretable multi-class conditional generation. (arXiv:2212.10777v1 [cs.LG])
    Diffusion models have achieved justifiable popularity by attaining state-of-the-art performance in generating realistic objects from seemingly arbitrarily complex data distributions, including when conditioning generation on labels. Unfortunately, however, their iterative nature renders them very computationally inefficient during the sampling process. For the multi-class conditional generation problem, we propose a novel, structurally unique framework of diffusion models which are hierarchically branched according to the inherent relationships between classes. In this work, we demonstrate that branched diffusion models offer major improvements in efficiently generating samples from multiple classes. We also showcase several other advantages of branched diffusion models, including ease of extension to novel classes in a continual-learning setting, and a unique interpretability that offers insight into these generative models. Branched diffusion models represent an alternative paradigm to their traditional linear counterparts, and can have large impacts in how we use diffusion models for efficient generation, online learning, and scientific discovery.
    Minimizing Worst-Case Violations of Neural Networks. (arXiv:2212.10930v1 [cs.LG])
    Machine learning (ML) algorithms are remarkably good at approximating complex non-linear relationships. Most ML training processes, however, are designed to deliver ML tools with good average performance but offer no guarantees about their worst-case estimation error. For safety-critical systems such as power systems, this places a major barrier to their adoption. So far, existing approaches could only determine the worst-case violations of already-trained ML algorithms. To the best of our knowledge, this is the first paper to introduce a neural network training procedure designed to achieve both good average performance and minimal worst-case violations. Using the Optimal Power Flow (OPF) problem as a guiding application, our approach (i) introduces a framework that reduces the worst-case generation constraint violations during training, incorporating them as a differentiable optimization layer; and (ii) presents a neural network sequential learning architecture to significantly accelerate it. We demonstrate the proposed architecture on four different test systems ranging from 39 to 162 buses, for both AC-OPF and DC-OPF applications.
    Temporal Output Discrepancy for Loss Estimation-based Active Learning. (arXiv:2212.10613v1 [cs.CV])
    While deep learning succeeds in a wide range of tasks, it depends heavily on the massive collection of annotated data, which is expensive and time-consuming. To lower the cost of data annotation, active learning has been proposed to interactively query an oracle to annotate a small proportion of informative samples in an unlabeled dataset. Inspired by the fact that samples with higher loss are usually more informative to the model than samples with lower loss, in this paper we present a novel deep active learning approach that queries the oracle for data annotation when an unlabeled sample is believed to incur high loss. The core of our approach is a measurement, Temporal Output Discrepancy (TOD), which estimates the sample loss by evaluating the discrepancy of outputs given by models at different optimization steps. Our theoretical investigation shows that TOD lower-bounds the accumulated sample loss, so it can be used to select informative unlabeled samples. On the basis of TOD, we further develop an effective unlabeled data sampling strategy as well as an unsupervised learning criterion for active learning. Due to the simplicity of TOD, our methods are efficient, flexible, and task-agnostic. Extensive experimental results demonstrate that our approach achieves superior performance to the state-of-the-art active learning methods on image classification and semantic segmentation tasks. In addition, we show that TOD can be utilized to select the best model, with potentially the highest testing accuracy, from a pool of candidate models.
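    A minimal sketch of a TOD-style acquisition score, assuming two snapshots of the same network taken a few optimization steps apart; the unlabeled samples with the largest output discrepancy are queried first.

        import torch

        def tod_scores(model_t, model_t_plus, unlabeled_x):
            # discrepancy of outputs between two optimization-time snapshots
            # serves as a proxy for the (unknown) sample loss
            with torch.no_grad():
                return (model_t_plus(unlabeled_x) - model_t(unlabeled_x)).norm(dim=1)

        model_t = torch.nn.Linear(16, 10)
        model_t_plus = torch.nn.Linear(16, 10)
        model_t_plus.load_state_dict(model_t.state_dict())
        with torch.no_grad():    # emulate a few SGD updates on the later snapshot
            model_t_plus.weight += 0.01 * torch.randn_like(model_t_plus.weight)

        x = torch.randn(100, 16)
        scores = tod_scores(model_t, model_t_plus, x)
        query = scores.topk(10).indices    # 10 most "informative" samples
        print(query)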
    Interpretability and causal discovery of the machine learning models to predict the production of CBM wells after hydraulic fracturing. (arXiv:2212.10718v1 [cs.LG])
    Machine learning approaches are widely studied for the production prediction of CBM wells after hydraulic fracturing, but rarely used in practice due to their low generalization ability and lack of interpretability. A novel methodology is proposed in this article to discover the latent causality in observed data, which is aimed at finding an indirect way to interpret the machine learning results. Based on the theory of causal discovery, a causal graph is derived with explicit input, output, treatment, and confounding variables. Then, SHAP is employed to analyze the influence of the factors on production capability, which indirectly interprets the machine learning models. The proposed method can capture the underlying nonlinear relationships between the factors and the output, which remedies the limitation of traditional machine learning routines based on correlation analysis of factors. The experiments on CBM data show that the relationship between production and the geological/engineering factors detected by the presented method is consistent with the actual physical mechanism. Meanwhile, compared with traditional methods, the interpretable machine learning models perform better in forecasting production capability, with an average 20% improvement in accuracy.
    Expander Graph Propagation. (arXiv:2210.02997v2 [cs.LG] UPDATED)
    Deploying graph neural networks (GNNs) on whole-graph classification or regression tasks is known to be challenging: it often requires computing node features that are mindful of both local interactions in their neighbourhood and the global context of the graph structure. GNN architectures that navigate this space need to avoid pathological behaviours, such as bottlenecks and oversquashing, while ideally having linear time and space complexity requirements. In this work, we propose an elegant approach based on propagating information over expander graphs. We leverage an efficient method for constructing expander graphs of a given size, and use this insight to propose the EGP model. We show that EGP is able to address all of the above concerns, while requiring minimal effort to set up, and provide evidence of its empirical utility on relevant graph classification datasets and baselines in the Open Graph Benchmark. Importantly, using expander graphs as a template for message passing necessarily gives rise to negative curvature. While this appears to be counterintuitive in light of recent related work on oversquashing, we theoretically demonstrate that negatively curved edges are likely to be required to obtain scalable message passing without bottlenecks. To the best of our knowledge, this is a previously unstudied result in the context of graph representation learning, and we believe our analysis paves the way to a novel class of scalable methods to counter oversquashing in GNNs.
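    A sketch of the expander-template idea under one simplifying assumption: a random d-regular graph stands in for the paper's explicit deterministic construction (random regular graphs are expanders with high probability), giving low diameter with only linearly many edges.

        import networkx as nx

        n, d = 100, 4
        expander = nx.random_regular_graph(d, n, seed=0)   # expander w.h.p.

        # Message passing can alternate between the input graph's own edges
        # and the expander's edges; the expander layer provides low-diameter
        # "shortcuts" that counter bottlenecks and oversquashing.
        if nx.is_connected(expander):
            print("diameter:", nx.diameter(expander))      # O(log n) hops
        print("edges:", expander.number_of_edges())        # n*d/2: linear cost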
    A Theoretical Study of The Effects of Adversarial Attacks on Sparse Regression. (arXiv:2212.11209v1 [cs.LG])
    This paper analyzes $\ell_1$ regularized linear regression under the challenging scenario of having only adversarially corrupted data for training. We use the primal-dual witness paradigm to provide provable performance guarantees for the support of the estimated regression parameter vector to match the actual parameter. Our theoretical analysis shows the counter-intuitive result that an adversary can influence sample complexity by corrupting the irrelevant features, i.e., those corresponding to zero coefficients of the regression parameter vector, which, consequently, do not affect the dependent variable. As any adversarially robust algorithm has its limitations, our theoretical analysis identifies the regimes under which the learning algorithm and adversary can dominate over each other. It helps us to analyze these fundamental limits and address critical scientific questions of which parameters (like mutual incoherence, the maximum and minimum eigenvalue of the covariance matrix, and the budget of adversarial perturbation) play a role in the high or low probability of success of the LASSO algorithm. Also, the derived sample complexity is logarithmic with respect to the size of the regression parameter vector, and our theoretical claims are validated by empirical analysis on synthetic and real-world datasets.
    Lifelong Reinforcement Learning with Modulating Masks. (arXiv:2212.11110v1 [cs.LG])
    Lifelong learning aims to create AI systems that continuously and incrementally learn during a lifetime, similar to biological learning. Attempts so far have encountered problems, including catastrophic forgetting, interference among tasks, and the inability to exploit previous knowledge. While considerable research has focused on learning multiple input distributions, typically in classification, lifelong reinforcement learning (LRL) must also deal with variations in the state and transition distributions, and in the reward functions. Modulating masks, recently developed for classification, are particularly suitable for dealing with such a large spectrum of task variations. In this paper, we adapted modulating masks to work with deep LRL, specifically PPO and IMPALA agents. The comparison with LRL baselines in both discrete and continuous RL tasks shows competitive performance. We further investigated the use of a linear combination of previously learned masks to exploit previous knowledge when learning new tasks: not only is learning faster, but the algorithm also solves tasks that we could not otherwise solve from scratch due to extremely sparse rewards. The results suggest that RL with modulating masks is a promising approach to lifelong learning, to the composition of knowledge to learn increasingly complex tasks, and to knowledge reuse for efficient and faster learning.
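    A sketch of the modulating-mask mechanism: a frozen backbone weight is elementwise-modulated by a per-task mask, and a new task can be initialized from a learnable linear combination of previously learned masks. The layer sizes, sigmoid mask parameterization, and softmax over coefficients are illustrative assumptions.

        import torch

        backbone = torch.randn(64, 64)    # frozen shared weights

        # three masks learned on earlier tasks, plus learnable mixing weights
        old_masks = [torch.sigmoid(torch.randn(64, 64)) for _ in range(3)]
        coeffs = torch.softmax(torch.randn(3), dim=0)    # trained in practice
        new_mask = sum(c * m for c, m in zip(coeffs, old_masks))

        def masked_forward(x, weight, mask):
            # the task-specific mask modulates the frozen backbone elementwise
            return x @ (weight * mask).T

        x = torch.randn(8, 64)
        print(masked_forward(x, backbone, new_mask).shape)   # (8, 64)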
    Anticancer Peptides Classification using Kernel Sparse Representation Classifier. (arXiv:2212.10567v1 [q-bio.QM])
    Cancer is one of the most challenging diseases because of its complexity, variability, and diversity of causes. It has been one of the major research topics over the past decades, yet it is still poorly understood. To this end, multifaceted therapeutic frameworks are indispensable. \emph{Anticancer peptides} (ACPs) are the most promising treatment option, but their large-scale identification and synthesis require reliable prediction methods, which is still a problem. In this paper, we present an intuitive classification strategy that differs from the traditional \emph{black box} method and is based on the well-known statistical theory of \emph{sparse-representation classification} (SRC). Specifically, we create over-complete dictionary matrices by embedding the \emph{composition of the K-spaced amino acid pairs} (CKSAAP). Unlike the traditional SRC frameworks, we use an efficient \emph{matching pursuit} solver instead of the computationally expensive \emph{basis pursuit} solver in this strategy. Furthermore, the \emph{kernel principal component analysis} (KPCA) is employed to cope with non-linearity and dimension reduction of the feature space whereas the \emph{synthetic minority oversampling technique} (SMOTE) is used to balance the dictionary. The proposed method is evaluated on two benchmark datasets for well-known statistical parameters and is found to outperform the existing methods. The results show the highest sensitivity with the most balanced accuracy, which might be beneficial in understanding structural and chemical aspects and developing new ACPs. The Google-Colab implementation of the proposed method is available at the author's GitHub page (\href{https://github.com/ehtisham-Fazal/ACP-Kernel-SRC}{https://github.com/ehtisham-fazal/ACP-Kernel-SRC}).  ( 2 min )
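    A sketch of the SRC decision rule with a matching-pursuit solver on toy two-class data: a test sample is sparsely coded over a dictionary of training samples, then assigned to the class whose atoms reconstruct it best. The CKSAAP features, KPCA, and SMOTE steps of the full pipeline are omitted for brevity.

        import numpy as np
        from sklearn.linear_model import OrthogonalMatchingPursuit

        rng = np.random.default_rng(0)
        X0 = rng.normal(0.0, 1.0, (20, 10))       # class 0 training features
        X1 = rng.normal(3.0, 1.0, (20, 10))       # class 1 (well separated)
        D = np.vstack([X0, X1]).T                 # dictionary: columns = samples
        labels = np.array([0] * 20 + [1] * 20)

        def src_predict(x, n_nonzero=5):
            # sparse code over the dictionary via (orthogonal) matching pursuit
            omp = OrthogonalMatchingPursuit(n_nonzero_coefs=n_nonzero,
                                            fit_intercept=False).fit(D, x)
            code = omp.coef_
            # class-wise reconstruction residuals decide the label
            residuals = [np.linalg.norm(x - D[:, labels == c] @ code[labels == c])
                         for c in (0, 1)]
            return int(np.argmin(residuals))

        print(src_predict(rng.normal(3.0, 1.0, 10)))   # should print 1 here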
    Neural Cloth Simulation. (arXiv:2212.11220v1 [cs.CV])
    We present a general framework for the garment animation problem through unsupervised deep learning inspired by physics-based simulation. Existing trends in the literature already explore this possibility. Nonetheless, these approaches do not handle cloth dynamics. Here, we propose the first methodology able to learn realistic cloth dynamics in an unsupervised manner and, hence, a general formulation for neural cloth simulation. The key to achieving this is to adapt an existing optimization scheme for motion from simulation-based methodologies to deep learning. Then, analyzing the nature of the problem, we devise an architecture able to automatically disentangle static and dynamic cloth subspaces by design. We show how this improves model performance. Additionally, it opens the possibility of a novel motion augmentation technique that greatly improves generalization. Finally, we show that it also allows control over the level of motion in the predictions, a useful tool for artists not seen before. We provide a detailed analysis of the problem to establish the bases of neural cloth simulation and guide future research into the specifics of this domain.
    Task Ambiguity in Humans and Language Models. (arXiv:2212.10711v1 [cs.CL])
    Language models have recently achieved strong performance across a wide range of NLP benchmarks. However, unlike benchmarks, real world tasks are often poorly specified, and agents must deduce the user's intended behavior from a combination of context, instructions, and examples. We investigate how both humans and models behave in the face of such task ambiguity by proposing AmbiBench, a new benchmark of six ambiguously-specified classification tasks. We evaluate humans and models on AmbiBench by seeing how well they identify the intended task using 1) instructions with varying degrees of ambiguity, and 2) different numbers of labeled examples. We find that the combination of model scaling (to 175B parameters) and training with human feedback data enables models to approach or exceed the accuracy of human participants across tasks, but that either one alone is not sufficient. In addition, we show how to dramatically improve the accuracy of language models trained without large-scale human feedback training by finetuning on a small number of ambiguous in-context examples, providing a promising direction for teaching models to generalize well in the face of ambiguity.  ( 2 min )
    Free-Rider Games for Federated Learning with Selfish Clients in NextG Wireless Networks. (arXiv:2212.11194v1 [cs.GT])
    This paper presents a game theoretic framework for participation and free-riding in federated learning (FL), and determines the Nash equilibrium strategies when FL is executed over wireless links. To support spectrum sensing for NextG communications, FL is used by clients, namely spectrum sensors with limited training datasets and computation resources, to train a wireless signal classifier while preserving privacy. In FL, a client may be free-riding, i.e., it does not participate in FL model updates, if the computation and transmission cost for FL participation is high, and receives the global model (learned by other clients) without incurring a cost. However, the free-riding behavior may potentially decrease the global accuracy due to lack of contribution to global model learning. This tradeoff leads to a non-cooperative game where each client aims to individually maximize its utility as the difference between the global model accuracy and the cost of FL participation. The Nash equilibrium strategies are derived for free-riding probabilities such that no client can unilaterally increase its utility given the strategies of its opponents remain the same. The free-riding probability increases with the FL participation cost and the number of clients, and a significant optimality gap exists in Nash equilibrium with respect to the joint optimization for all clients. The optimality gap increases with the number of clients and the maximum gap is evaluated as a function of the cost. These results quantify the impact of free-riding on the resilience of FL in NextG networks and indicate operational modes for FL participation.
    Beyond Contrastive Learning: A Variational Generative Model for Multilingual Retrieval. (arXiv:2212.10726v1 [cs.CL])
    Contrastive learning has been successfully used for retrieval of semantically aligned sentences, but it often requires large batch sizes or careful engineering to work well. In this paper, we instead propose a generative model for learning multilingual text embeddings which can be used to retrieve or score sentence pairs. Our model operates on parallel data in $N$ languages and, through an approximation we introduce, efficiently encourages source separation in this multilingual setting, separating semantic information that is shared between translations from stylistic or language-specific variation. We show careful large-scale comparisons between contrastive and generation-based approaches for learning multilingual text embeddings, a comparison that has not been done to the best of our knowledge despite the popularity of these approaches. We evaluate this method on a suite of tasks including semantic similarity, bitext mining, and cross-lingual question retrieval -- the last of which we introduce in this paper. Overall, our Variational Multilingual Source-Separation Transformer (VMSST) model outperforms both a strong contrastive and generative baseline on these tasks.
    AnchorGAE: General Data Clustering via $O(n)$ Bipartite Graph Convolution. (arXiv:2111.06586v2 [cs.LG] UPDATED)
    Since the representative capacity of graph-based clustering methods is usually limited by the graph constructed on the original features, it is attractive to ask whether graph neural networks (GNNs) can be applied to augment this capacity. The core problems come from two aspects: (1) a graph is unavailable in most clustering scenarios, so constructing high-quality graphs on non-graph data is usually the most important part; (2) given $n$ samples, graph-based clustering methods usually consume at least $\mathcal O(n^2)$ time to build graphs, and graph convolution requires nearly $\mathcal O(n^2)$ operations for a dense graph and $\mathcal O(|\mathcal{E}|)$ for a sparse one with $|\mathcal{E}|$ edges. Accordingly, both graph-based clustering and GNNs suffer from severe inefficiency. To tackle these problems, we propose a novel clustering method, AnchorGAE, with self-supervised graph estimation and efficient graph convolution. We first show how to convert a non-graph dataset into a graph dataset by introducing a generative graph model and anchors. We then show that the constructed bipartite graph reduces the computational complexity of graph convolution from $\mathcal O(n^2)$ and $\mathcal O(|\mathcal{E}|)$ to $\mathcal O(n)$. The succeeding steps for clustering can be easily designed as $\mathcal O(n)$ operations. Interestingly, the anchors naturally lead to a siamese architecture with the help of the Markov process. Furthermore, the estimated bipartite graph is updated dynamically according to the features extracted by the GNN, to promote the quality of the graph. However, we theoretically prove that this self-supervised paradigm frequently results in a collapse, which in experiments often occurs after 2-3 update iterations, especially when the model is well-trained. A specific strategy is accordingly designed to prevent the collapse.
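    The complexity claim is easy to see in code. A minimal sketch follows (our illustration, with a plain Gaussian affinity standing in for the paper's self-supervised graph estimate): each of the $n$ samples connects only to $m \ll n$ anchors, so one convolution is two thin matrix products costing $\mathcal O(nmd)$ rather than $\mathcal O(n^2 d)$.
```python
# Anchor-based bipartite convolution sketch: O(n*m*d) instead of O(n^2*d).
import numpy as np

rng = np.random.default_rng(0)
n, m, d = 10_000, 64, 32
X = rng.standard_normal((n, d))
anchors = X[rng.choice(n, size=m, replace=False)]   # e.g. k-means centres in practice

# Row-normalized sample-to-anchor affinities (stand-in for the learned graph).
dist2 = (X**2).sum(1)[:, None] + (anchors**2).sum(1)[None, :] - 2 * X @ anchors.T
dist2 = np.maximum(dist2, 0.0)            # clip tiny negative values from round-off
B = np.exp(-dist2 / dist2.mean())         # Gaussian affinity with a global bandwidth
B /= B.sum(axis=1, keepdims=True)         # shape (n, m)

H = X                                     # node features
H_conv = B @ (B.T @ H)                    # "two-hop" bipartite convolution, O(n*m*d)
```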
    Control of Continuous Quantum Systems with Many Degrees of Freedom based on Convergent Reinforcement Learning. (arXiv:2212.10705v1 [quant-ph])
    With the development of experimental quantum technology, quantum control has attracted increasing attention due to the realization of controllable artificial quantum systems. However, because quantum-mechanical systems are often too difficult to deal with analytically, heuristic strategies and numerical algorithms that search for proper control protocols are adopted, and deep learning, especially deep reinforcement learning (RL), is a promising generic candidate solution for such control problems. Although there have been a few successful applications of deep RL to quantum control problems, most of the existing RL algorithms suffer from instabilities and unsatisfactory reproducibility, and require a large amount of fine-tuning and a large computational budget, both of which limit their applicability. To resolve the issue of instabilities, in this dissertation, we investigate the non-convergence issue of Q-learning. We then examine the weaknesses of existing convergent approaches and develop a new convergent Q-learning algorithm, which we call the convergent deep Q network (C-DQN) algorithm, as an alternative to the conventional deep Q network (DQN) algorithm. We prove the convergence of C-DQN and apply it to the Atari 2600 benchmark, showing that when DQN fails, C-DQN still learns successfully. We then apply the algorithm to the measurement-feedback cooling problems of a quantum quartic oscillator and a trapped quantum rigid body. We establish the physical models, analyse their properties, and show that although both C-DQN and DQN can learn to cool the systems, C-DQN tends to behave more stably and, when DQN suffers from instabilities, achieves better performance. As the performance of DQN can have a large variance and lack consistency, C-DQN can be a better choice for research on complicated control problems.
    Is it worth it? An experimental comparison of six deep- and classical machine learning methods for unsupervised anomaly detection in time series. (arXiv:2212.11080v1 [cs.LG])
    The detection of anomalies in time series data is crucial in a wide range of applications, such as system monitoring, health care or cyber security. While the vast number of available methods makes selecting the right method for a given application hard enough, different methods also have different strengths, e.g. regarding the type of anomalies they are able to find. In this work, we compare six unsupervised anomaly detection methods of differing complexity to answer two questions: Do the more complex methods usually perform better? And are there specific anomaly types that those methods are tailored to? The comparison is done on the UCR anomaly archive, a recent benchmark dataset for anomaly detection. We compare the six methods by analyzing the experimental results at the dataset and anomaly-type level, after tuning the necessary hyperparameters for each method. Additionally, we examine the ability of individual methods to incorporate prior knowledge about the anomalies and analyze the differences between point-wise and sequence-wise features. We show with broad experiments that the classical machine learning methods show superior performance compared to the deep learning methods across a wide range of anomaly types.
    UnICLAM: Contrastive Representation Learning with Adversarial Masking for Unified and Interpretable Medical Vision Question Answering. (arXiv:2212.10729v1 [cs.CV])
    Medical Visual Question Answering (Medical-VQA) aims to answer clinical questions regarding radiology images, assisting doctors with decision-making. Nevertheless, current Medical-VQA models learn cross-modal representations through vision and text encoders residing in separate spaces, which leads to indirect semantic alignment. In this paper, we propose UnICLAM, a Unified and Interpretable Medical-VQA model based on Contrastive Representation Learning with Adversarial Masking. Specifically, to learn an aligned image-text representation, we first establish a unified dual-stream pre-training structure with a gradually soft parameter-sharing strategy. Technically, the proposed strategy constrains the vision and text encoders to be close in a shared space, a constraint that is gradually loosened in higher layers. Moreover, to grasp the semantic representation, we extend adversarial masking data augmentation to the contrastive representation learning of vision and text in a unified manner, alleviating the meaninglessness of the commonly used random mask. Concretely, while encoder training minimizes the distance between original and masked features, the adversarial masking model is trained adversarially to maximize this distance. Furthermore, we explore the unified adversarial masking strategy in more depth and find that it improves potential ante-hoc interpretability with remarkable performance and efficiency. Experimental results on the VQA-RAD and SLAKE public benchmarks demonstrate that UnICLAM outperforms 11 existing state-of-the-art Medical-VQA models. More importantly, we additionally discuss the performance of UnICLAM in diagnosing heart failure, verifying that it exhibits superior few-shot adaptation performance in practical disease diagnosis.
    Joint Embedding of 2D and 3D Networks for Medical Image Anomaly Detection. (arXiv:2212.10939v1 [cs.CV])
    Obtaining ground-truth data in medical imaging is difficult because it requires a great deal of annotation time from domain experts. Moreover, a model trained with supervised learning detects only the cases included in the labels, whereas in real practice we want to remain open to possibilities beyond the named cases when examining medical images. As a solution, anomaly detection, which can detect and localize abnormalities by learning normal characteristics from normal images alone, is emerging. With medical image data, we can design either 2D or 3D self-supervised networks for the anomaly detection task. Although 3D networks, which learn the 3D structure of the human body, show good performance in 3D medical image anomaly detection, they cannot be stacked into deeper layers due to memory constraints. While 2D networks have an advantage in feature detection, they lack 3D context information. In this paper, we develop a method that combines the strengths of 3D and 2D networks through joint embedding. We also propose a self-supervised pretext task that allows the networks to learn efficiently. Through experiments, we show that the proposed method achieves better performance in both classification and segmentation tasks compared to the SoTA method.
    Inversion of Bayesian Networks. (arXiv:2212.10649v1 [cs.LG])
    Variational autoencoders and Helmholtz machines use a recognition network (encoder) to approximate the posterior distribution of a generative model (decoder). In this paper we study the necessary and sufficient properties of a recognition network so that it can model the true posterior distribution exactly. These results are derived in the general context of probabilistic graphical modelling / Bayesian networks, for which the network represents a set of conditional independence statements. We derive both global conditions, in terms of d-separation, and local conditions for the recognition network to have the desired qualities. It turns out that for the local conditions the property perfectness (for every node, all parents are joined) plays an important role.  ( 2 min )
    DCC: A Cascade based Approach to Detect Communities in Social Networks. (arXiv:2212.10937v1 [cs.SI])
    Community detection in social networks is associated with finding and grouping the most similar nodes inherent in the network. These similar nodes are identified by computing tie strength. Stronger ties indicate higher proximity shared by connected node pairs. This work is motivated by Granovetter's argument that strong ties lie within densely connected nodes, together with the observation that community cores in real-world networks are densely connected. In this paper, we introduce a novel method called \emph{Disjoint Community detection using Cascades (DCC)}, which demonstrates the effectiveness of a new local-density-based tie strength measure for detecting communities. Here, tie strength is utilized to decide the paths followed for propagating information. The idea is to crawl through the tuple information of cascades towards the community core, guided by increasing tie strength. On top of the cascade generation step, a novel preferential membership method has been developed to assign community labels to unassigned nodes. The efficacy of DCC has been analyzed in terms of quality and accuracy on several real-world datasets and against baseline community detection algorithms.
    Video Segmentation Learning Using Cascade Residual Convolutional Neural Network. (arXiv:2212.10570v1 [cs.CV])
    Video segmentation consists of a frame-by-frame selection process of meaningful areas related to foreground moving objects. Some applications include traffic monitoring, human tracking, action recognition, efficient video surveillance, and anomaly detection. In these applications, it is not rare to face challenges such as abrupt changes in weather conditions, illumination issues, shadows, subtle dynamic background motions, and camouflage effects. In this work, we address such shortcomings by proposing a novel deep learning video segmentation approach that incorporates residual information into the foreground detection learning process. The main goal is to provide a method capable of generating an accurate foreground detection given a grayscale video. Experiments conducted on the Change Detection 2014 dataset and on the private dataset PetrobrasROUTES from Petrobras support the effectiveness of the proposed approach against several state-of-the-art video segmentation techniques, with overall F-measures of $\mathbf{0.9535}$ and $\mathbf{0.9636}$ on the Change Detection 2014 and PetrobrasROUTES datasets, respectively. Such a result places the proposed technique amongst the top 3 state-of-the-art video segmentation methods, besides comprising approximately seven times fewer parameters than its top-ranked counterpart.
    Hidden Poison: Machine Unlearning Enables Camouflaged Poisoning Attacks. (arXiv:2212.10717v1 [cs.LG])
    We introduce camouflaged data poisoning attacks, a new attack vector that arises in the context of machine unlearning and other settings when model retraining may be induced. An adversary first adds a few carefully crafted points to the training dataset such that the impact on the model's predictions is minimal. The adversary subsequently triggers a request to remove a subset of the introduced points at which point the attack is unleashed and the model's predictions are negatively affected. In particular, we consider clean-label targeted attacks (in which the goal is to cause the model to misclassify a specific test point) on datasets including CIFAR-10, Imagenette, and Imagewoof. This attack is realized by constructing camouflage datapoints that mask the effect of a poisoned dataset.
    Temporal Disaggregation of the Cumulative Grass Growth. (arXiv:2212.10865v1 [cs.LG])
    Information on grass growth over a year is essential for models simulating the use of this resource to feed animals, on pasture or in the barn with hay or grass silage. Unfortunately, this information is rarely available. The challenge is to reconstruct grass growth from two sources of information: usual daily climate data (rainfall, radiation, etc.) and cumulative growth over the year. We have to be able to capture the effect of seasonal climatic events, which are known to distort the growth curve within the year. In this paper, we formulate this challenge as the problem of disaggregating the cumulative growth into a time series. To address this problem, our method applies time series forecasting using climate information and grass growth from previous time steps. Several variants of the method are proposed and compared experimentally using a database generated from a grassland process-based model. The results show that our method can accurately reconstruct the time series, independently of the use of the cumulative growth information.
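    A hedged sketch of the disaggregation idea (the regressor and all names are illustrative, not the paper's exact models): predict daily increments from climate features, then rescale so they sum exactly to the observed cumulative annual growth.
```python
# Disaggregate cumulative annual growth into a daily series under a sum constraint.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def disaggregate(daily_climate, total_growth, model):
    """daily_climate: (365, n_features) array; total_growth: known annual sum."""
    raw = np.clip(model.predict(daily_climate), 0.0, None)  # non-negative growth
    return raw * total_growth / max(raw.sum(), 1e-9)        # enforce the constraint

# model = GradientBoostingRegressor().fit(climate_history, daily_growth_history)
# series = disaggregate(this_year_climate, this_year_total, model)
```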
    Analyzing Semantic Faithfulness of Language Models via Input Intervention on Conversational Question Answering. (arXiv:2212.10696v1 [cs.CL])
    Transformer-based language models have been shown to be highly effective for several NLP tasks. In this paper, we consider three transformer models, BERT, RoBERTa, and XLNet, in both their small and large versions, and investigate how faithful their representations are with respect to the semantic content of texts. We formalize a notion of semantic faithfulness, in which the semantic content of a text should causally figure in a model's inferences in question answering. We then test this notion by observing a model's behavior on answering questions about a story after performing two novel semantic interventions -- deletion intervention and negation intervention. While transformer models achieve high performance on standard question answering tasks, we show that they fail to be semantically faithful once we perform these interventions for a significant number of cases (~50% for deletion intervention, and ~20% drop in accuracy for negation intervention). We then propose an intervention-based training regime that can mitigate the undesirable effects of deletion intervention by a significant margin (from ~50% to ~6%). We analyze the inner workings of the models to better understand the effectiveness of intervention-based training for deletion intervention. However, we show that this training does not attenuate other aspects of semantic unfaithfulness, such as the models' inability to deal with negation intervention or to capture the predicate-argument structure of texts. We also test InstructGPT, via prompting, for its ability to handle the two interventions and to capture predicate-argument structure. While InstructGPT models do achieve very high performance on the predicate-argument structure task, they fail to respond adequately to our deletion and negation interventions.  ( 2 min )
    Resonant Anomaly Detection with Multiple Reference Datasets. (arXiv:2212.10579v1 [hep-ph])
    An important class of techniques for resonant anomaly detection in high energy physics builds models that can distinguish between reference and target datasets, where only the latter has appreciable signal. Such techniques, including Classification Without Labels (CWoLa) and Simulation Assisted Likelihood-free Anomaly Detection (SALAD), rely on a single reference dataset. They cannot take advantage of commonly-available multiple datasets and thus cannot fully exploit available information. In this work, we propose generalizations of CWoLa and SALAD for settings where multiple reference datasets are available, building on weak supervision techniques. We demonstrate improved performance in a number of settings with realistic and synthetic data. As an added benefit, our generalizations enable us to provide finite-sample guarantees, improving on existing asymptotic analyses.  ( 2 min )
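    For readers unfamiliar with the single-reference setup being generalized, here is a minimal CWoLa-style sketch on toy data (our illustration, with invented toy distributions): a classifier trained to separate the target sample (background plus rare signal) from a reference sample (background only) yields a usable anomaly score.
```python
# CWoLa-style weak supervision sketch on synthetic data.
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier

rng = np.random.default_rng(0)
background = rng.normal(0, 1, size=(20_000, 4))             # reference dataset
target = np.vstack([rng.normal(0, 1, size=(19_000, 4)),      # background ...
                    rng.normal(2, 0.5, size=(1_000, 4))])    # ... plus rare "signal"

X = np.vstack([background, target])
y = np.concatenate([np.zeros(len(background)), np.ones(len(target))])
clf = HistGradientBoostingClassifier().fit(X, y)             # reference vs target
anomaly_score = clf.predict_proba(target)[:, 1]              # high for signal-like events
```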
    Towards Efficient Visual Simplification of Computational Graphs in Deep Neural Networks. (arXiv:2212.10774v1 [cs.HC])
    A computational graph in a deep neural network (DNN) denotes a specific data flow diagram (DFD) composed of many tensors and operators. Existing toolkits for visualizing computational graphs are not applicable when the structure is highly complicated and large-scale (e.g., BERT [1]). To address this problem, we propose leveraging a suite of visual simplification techniques, including a cycle-removing method, a module-based edge-pruning algorithm, and an isomorphic subgraph stacking strategy. We design and implement an interactive visualization system that is suitable for computational graphs with up to 10 thousand elements. Experimental results and usage scenarios demonstrate that our tool reduces 60% elements on average and hence enhances the performance for recognizing and diagnosing DNN models. Our contributions are integrated into an open-source DNN visualization toolkit, namely, MindInsight [2].
    Unsupervised Learning of Neurosymbolic Encoders. (arXiv:2107.13132v2 [cs.LG] UPDATED)
    We present a framework for the unsupervised learning of neurosymbolic encoders, which are encoders obtained by composing neural networks with symbolic programs from a domain-specific language. Our framework naturally incorporates symbolic expert knowledge into the learning process, which leads to more interpretable and factorized latent representations compared to fully neural encoders. We integrate modern program synthesis techniques with the variational autoencoding (VAE) framework, in order to learn a neurosymbolic encoder in conjunction with a standard decoder. The programmatic descriptions from our encoders can benefit many analysis workflows, such as in behavior modeling where interpreting agent actions and movements is important. We evaluate our method on learning latent representations for real-world trajectory data from animal biology and sports analytics. We show that our approach offers significantly better separation of meaningful categories than standard VAEs and leads to practical gains on downstream analysis tasks, such as for behavior classification.
  • Open

    Exponentially Improving the Complexity of Simulating the Weisfeiler-Lehman Test with Graph Neural Networks. (arXiv:2211.03232v2 [cs.LG] UPDATED)
    Recent work shows that the expressive power of Graph Neural Networks (GNNs) in distinguishing non-isomorphic graphs is exactly the same as that of the Weisfeiler-Lehman (WL) graph test. In particular, they show that the WL test can be simulated by GNNs. However, those simulations involve neural networks for the 'combine' function of size polynomial or even exponential in the number of graph nodes $n$, as well as feature vectors of length linear in $n$. We present an improved simulation of the WL test on GNNs with \emph{exponentially} lower complexity. In particular, the neural network implementing the combine function in each node has only a polylogarithmic number of parameters in $n$, and the feature vectors exchanged by the nodes of the GNN consist of only $O(\log n)$ bits. We also give logarithmic lower bounds for the feature vector length and the size of the neural networks, showing the (near-)optimality of our construction.
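    The test in question is ordinary 1-WL colour refinement, sketched below for reference:
```python
# Standard 1-WL colour refinement -- the test the paper shows GNNs can simulate
# with exponentially smaller networks.
def wl_colors(adj, rounds=3):
    """adj: dict node -> iterable of neighbours. Returns refined colour labels."""
    colors = {v: 0 for v in adj}
    for _ in range(rounds):
        # Each node's signature: its colour plus the multiset of neighbour colours.
        signatures = {v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
                      for v in adj}
        # Re-index signatures as fresh integer colours.
        palette = {sig: i for i, sig in enumerate(sorted(set(signatures.values())))}
        colors = {v: palette[signatures[v]] for v in adj}
    return colors

# Two graphs are WL-distinguishable if their colour multisets differ.
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
path = {0: [1], 1: [0, 2], 2: [1]}
print(sorted(wl_colors(triangle).values()) == sorted(wl_colors(path).values()))  # False
```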
    Provably Reliable Large-Scale Sampling from Gaussian Processes. (arXiv:2211.08036v2 [stat.ML] UPDATED)
    When comparing approximate Gaussian process (GP) models, it can be helpful to be able to generate data from any GP. If we are interested in how approximate methods perform at scale, we may wish to generate very large synthetic datasets to evaluate them. Na\"{i}vely doing so would cost \(\mathcal{O}(n^3)\) flops and \(\mathcal{O}(n^2)\) memory to generate a size \(n\) sample. We demonstrate how to scale such data generation to large \(n\) whilst still providing guarantees that, with high probability, the sample is indistinguishable from a sample from the desired GP.
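    The naive baseline being improved on looks like this (a sketch; the kernel and sizes are illustrative): a Cholesky factorization makes exact sampling cost \(\mathcal{O}(n^3)\) time and \(\mathcal{O}(n^2)\) memory, which is what makes large-\(n\) synthetic benchmarks infeasible without approximation.
```python
# Naive exact GP sampling via Cholesky: O(n^2) memory, O(n^3) time.
import numpy as np

def sample_gp(x, kernel, jitter=1e-6, rng=None):
    rng = rng or np.random.default_rng()
    K = kernel(x[:, None], x[None, :])                    # O(n^2) memory
    L = np.linalg.cholesky(K + jitter * np.eye(len(x)))   # O(n^3) time
    return L @ rng.standard_normal(len(x))

rbf = lambda a, b: np.exp(-0.5 * (a - b) ** 2)            # unit-lengthscale RBF kernel
f = sample_gp(np.linspace(0, 10, 2_000), rbf)             # already slow at modest n
```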
    Gen\'eLive! Generating Rhythm Actions in Love Live!. (arXiv:2202.12823v2 [cs.LG] UPDATED)
    This article presents our generative model for rhythm action games together with applications in business operations. Rhythm action games are video games in which the player is challenged to issue commands at the right timings during a music session. The timings are rendered in the chart, which consists of visual symbols, called notes, flying through the screen. We introduce our deep generative model, Gen\'eLive!, which outperforms the state-of-the-art model by taking into account musical structures through beats and temporal scales. Thanks to its favorable performance, Gen\'eLive! was put into operation at KLab Inc., a Japan-based video game developer, and reduced the business cost of chart generation by as much as half. The application target included the phenomenal "Love Live!," which has more than 10 million users across Asia and beyond, and is one of the few rhythm action franchises that has led the online era of the genre. In this article, we evaluate the generative performance of Gen\'eLive! using production datasets at KLab as well as open datasets for reproducibility, while the model continues to operate in their business. Our code and the model, tuned and trained using a supercomputer, are publicly available.
    Relative Importance Sampling For Off-Policy Actor-Critic in Deep Reinforcement Learning. (arXiv:1810.12558v7 [cs.LG] UPDATED)
    Off-policy learning is more unstable than on-policy learning in reinforcement learning (RL). One reason for this instability is the discrepancy between the target ($\pi$) and behavior ($b$) policy distributions. This discrepancy can be alleviated by employing a smooth variant of importance sampling (IS), such as relative importance sampling (RIS). RIS has a parameter $\beta\in[0, 1]$ which controls smoothness. To cope with instability, we present the first relative importance sampling off-policy actor-critic (RIS-Off-PAC) model-free algorithms in RL. In our method, the network yields a target policy (the actor) and a value function (the critic) assessing the current policy ($\pi$) using samples drawn from the behavior policy. We use the action value generated from the behavior policy, rather than from the target policy, in the reward function to train our algorithm. We also use deep neural networks to train both actor and critic. We evaluated our algorithm on a number of OpenAI Gym benchmark problems and demonstrate better or comparable performance to several state-of-the-art RL baselines.
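    As a loose illustration only (the paper's exact RIS definition may differ in form; we state one natural smooth variant as an assumption), a weight of the shape $w_\beta = \pi(a|s) / (\beta\, b(a|s) + (1-\beta)\, \pi(a|s))$ recovers ordinary IS at $\beta = 1$ and the uniform weight $1$ at $\beta = 0$:
```python
# Hypothetical smooth importance weight interpolating between ordinary IS
# (beta=1) and no correction (beta=0); an assumed form, not necessarily the
# paper's exact RIS definition.
import numpy as np

def ris_weight(pi_prob, b_prob, beta):
    return pi_prob / (beta * b_prob + (1.0 - beta) * pi_prob)

pi_p, b_p = 0.6, 0.2
for beta in (0.0, 0.5, 1.0):
    print(beta, ris_weight(pi_p, b_p, beta))   # 1.0, 1.5, 3.0 (= pi/b at beta=1)
```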
    Polynomial-Time Reachability for LTI Systems with Two-Level Lattice Neural Network Controllers. (arXiv:2209.09400v2 [cs.LG] UPDATED)
    In this paper, we consider the computational complexity of bounding the reachable set of a Linear Time-Invariant (LTI) system controlled by a Rectified Linear Unit (ReLU) Two-Level Lattice (TLL) Neural Network (NN) controller. In particular, we show that for such a system and controller, it is possible to compute the exact one-step reachable set in polynomial time in the size of the TLL NN controller (number of neurons). Additionally, we show that a tight bounding box of the reachable set is computable via two polynomial-time methods: one with polynomial complexity in the size of the TLL and the other with polynomial complexity in the Lipschitz constant of the controller and other problem parameters. Finally, we propose a pragmatic algorithm that adaptively combines the benefits of (semi-)exact reachability and approximate reachability, which we call L-TLLBox. We evaluate L-TLLBox with an empirical comparison to a state-of-the-art NN controller reachability tool. In our experiments, L-TLLBox completed reachability analysis as much as 5000x faster than this tool on the same network/system, while producing reach boxes that were from 0.08 to 1.42 times the area.
    FedDAG: Federated DAG Structure Learning. (arXiv:2112.03555v2 [cs.LG] UPDATED)
    To date, most directed acyclic graphs (DAGs) structure learning approaches require data to be stored in a central server. However, due to the consideration of privacy protection, data owners gradually refuse to share their personalized raw data to avoid private information leakage, making this task more troublesome by cutting off the first step. Thus, a puzzle arises: \textit{how do we discover the underlying DAG structure from decentralized data?} In this paper, focusing on the additive noise models (ANMs) assumption of data generation, we take the first step in developing a gradient-based learning framework named FedDAG, which can learn the DAG structure without directly touching the local data and also can naturally handle the data heterogeneity. Our method benefits from a two-level structure of each local model. The first level structure learns the edges and directions of the graph and communicates with the server to get the model information from other clients during the learning procedure, while the second level structure approximates the mechanisms among variables and personally updates on its own data to accommodate the data heterogeneity. Moreover, FedDAG formulates the overall learning task as a continuous optimization problem by taking advantage of an equality acyclicity constraint, which can be solved by gradient descent methods to boost the searching efficiency. Extensive experiments on both synthetic and real-world datasets verify the efficacy of the proposed method.
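    The "equality acyclicity constraint" in gradient-based DAG learners of this family is typically the continuous characterization popularized by NOTEARS; we sketch that version here as an assumption about the form used: $h(W) = \operatorname{tr}(e^{W \circ W}) - d$ equals $0$ exactly when the weighted adjacency matrix $W$ encodes a DAG.
```python
# NOTEARS-style continuous acyclicity function: 0 iff W has no directed cycles.
import numpy as np
from scipy.linalg import expm

def acyclicity(W):
    d = W.shape[0]
    return np.trace(expm(W * W)) - d      # trace counts weighted closed walks

dag = np.array([[0.0, 1.2], [0.0, 0.0]])  # 0 -> 1 only: acyclic
cyc = np.array([[0.0, 1.2], [0.7, 0.0]])  # 0 -> 1 -> 0: a cycle
print(acyclicity(dag), acyclicity(cyc))   # ~0.0 vs strictly positive
```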
    Strong uniform convergence of Laplacians of random geometric and directed kNN graphs on compact manifolds. (arXiv:2212.10287v1 [math.PR] CROSS LISTED)
    Consider $n$ points independently sampled from a density $p$ of class $\mathcal{C}^2$ on a smooth compact $d$-dimensional sub-manifold $\mathcal{M}$ of $\mathbb{R}^m$, and consider the generator of a random walk visiting these points according to a transition kernel $K$. We study the almost sure uniform convergence of this operator to the diffusive Laplace-Beltrami operator when $n$ tends to infinity. This work extends known results of the past 15 years. In particular, our result does not require the kernel $K$ to be continuous, which covers the cases of walks exploring $k$NN-random and geometric graphs, and convergence rates are given. The distance between the random walk generator and the limiting operator is separated into several terms: a statistical term, related to the law of large numbers, is treated with concentration tools and an approximation term that we control with tools from differential geometry. The convergence of $k$NN Laplacians is detailed.
    A Nearly Tight Bound for Fitting an Ellipsoid to Gaussian Random Points. (arXiv:2212.11221v1 [math.PR])
    We prove that for $c>0$ a sufficiently small universal constant that a random set of $c d^2/\log^4(d)$ independent Gaussian random points in $\mathbb{R}^d$ lie on a common ellipsoid with high probability. This nearly establishes a conjecture of~\cite{SaundersonCPW12}, within logarithmic factors. The latter conjecture has attracted significant attention over the past decade, due to its connections to machine learning and sum-of-squares lower bounds for certain statistical problems.
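    The feasibility question behind the conjecture can be probed numerically (a sketch, not the paper's proof technique): the points lie on a common ellipsoid iff the linear system $x_i^\top A x_i = 1$ has a positive semidefinite symmetric solution $A$. Below we solve the linear system by least squares and inspect the residual and the eigenvalues.
```python
# Probe ellipsoid-fitting feasibility for Gaussian points numerically.
import numpy as np

rng = np.random.default_rng(0)
d, n = 10, 40                                        # n well below ~ d^2 / log^4(d)
X = rng.standard_normal((n, d))

M = np.stack([np.outer(x, x).ravel() for x in X])    # row i: vec(x_i x_i^T)
a, *_ = np.linalg.lstsq(M, np.ones(n), rcond=None)   # solve <vec(A), vec(x x^T)> = 1
A = a.reshape(d, d)
A = 0.5 * (A + A.T)                                  # symmetrize (quadratic form unchanged)

residual = np.abs(M @ A.ravel() - 1).max()
print("max residual:", residual, "min eigenvalue:", np.linalg.eigvalsh(A).min())
```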
    Multi-modal Molecule Structure-text Model for Text-based Retrieval and Editing. (arXiv:2212.10789v1 [cs.LG])
    There is increasing adoption of artificial intelligence in drug discovery. However, existing works use machine learning mainly to exploit the chemical structures of molecules, ignoring the vast textual knowledge available in chemistry. Incorporating textual knowledge enables us to realize new drug design objectives, adapt to text-based instructions, and predict complex biological activities. We present a multi-modal molecule structure-text model, MoleculeSTM, which jointly learns molecules' chemical structures and textual descriptions via a contrastive learning strategy. To train MoleculeSTM, we construct the largest multi-modal dataset to date, namely PubChemSTM, with over 280K chemical structure-text pairs. To demonstrate the effectiveness and utility of MoleculeSTM, we design two challenging zero-shot tasks based on text instructions: structure-text retrieval and molecule editing. MoleculeSTM possesses two main properties: open vocabulary and compositionality via natural language. In experiments, MoleculeSTM obtains state-of-the-art generalization ability to novel biochemical concepts across various benchmarks.
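    The contrastive strategy is, in spirit, a CLIP-style symmetric objective over paired embeddings; a minimal sketch is below (the structure and text encoders themselves are assumed, e.g. a GNN for structures and a language model for descriptions, and the temperature is illustrative).
```python
# CLIP-style symmetric contrastive loss over structure/text embedding pairs.
import torch
import torch.nn.functional as F

def structure_text_contrastive_loss(z_struct, z_text, temperature=0.07):
    """z_struct, z_text: (batch, dim) embeddings of paired molecules/descriptions."""
    z_s = F.normalize(z_struct, dim=-1)
    z_t = F.normalize(z_text, dim=-1)
    logits = z_s @ z_t.T / temperature             # all pairwise similarities
    labels = torch.arange(len(z_s))                # matching pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.T, labels))
```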
    Crab: Learning Certifiably Fair Predictive Models in the Presence of Selection Bias. (arXiv:2212.10839v1 [cs.LG])
    A recent explosion of research focuses on developing methods and tools for building fair predictive models. However, most of this work relies on the assumption that the training and testing data are representative of the target population on which the model will be deployed. In practice, real-world training data often suffer from selection bias and are not representative of the target population, for many reasons including the cost and feasibility of collecting and labeling data, historical discrimination, and individual biases. In this paper, we introduce a new framework for certifying and ensuring the fairness of predictive models trained on biased data. We take inspiration from query answering over incomplete and inconsistent databases to present and formalize the problem of consistent range approximation (CRA) of answers to queries about aggregate information for the target population. We aim to leverage background knowledge about the data collection process, biased data, and limited or no auxiliary data sources to compute a range of answers for aggregate queries over the target population that are consistent with the available information. We then develop methods that use CRA of such aggregate queries to build predictive models that are certifiably fair on the target population, even when no external information about that population is available during training. We evaluate our methods on real data and demonstrate improvements over the state of the art. Significantly, we show that enforcing fairness using our methods can lead to predictive models that are not only fair, but also more accurate on the target population.
    Is it easier to count communities than find them?. (arXiv:2212.10872v1 [math.ST])
    Random graph models with community structure have been studied extensively in the literature. For both the problems of detecting and recovering community structure, an interesting landscape of statistical and computational phase transitions has emerged. A natural unanswered question is: might it be possible to infer properties of the community structure (for instance, the number and sizes of communities) even in situations where actually finding those communities is believed to be computationally hard? We show the answer is no. In particular, we consider certain hypothesis testing problems between models with different community structures, and we show (in the low-degree polynomial framework) that testing between two options is as hard as finding the communities. In addition, our methods give the first computational lower bounds for testing between two different `planted' distributions, whereas previous results have considered testing between a planted distribution and an i.i.d. `null' distribution.
    An energy-based deep splitting method for the nonlinear filtering problem. (arXiv:2203.17153v3 [stat.CO] UPDATED)
    The purpose of this paper is to explore the use of deep learning for the solution of the nonlinear filtering problem. This is achieved by solving the Zakai equation by a deep splitting method, previously developed for approximate solution of (stochastic) partial differential equations. This is combined with an energy-based model for the approximation of functions by a deep neural network. This results in a computationally fast filter that takes observations as input and that does not require re-training when new observations are received. The method is tested on four examples, two linear in one and twenty dimensions and two nonlinear in one dimension. The method shows promising performance when benchmarked against the Kalman filter and the bootstrap particle filter.
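    For reference, the bootstrap particle filter used as a benchmark can be sketched in a few lines (a generic 1D model with Gaussian noises and an assumed standard-normal prior; all names are illustrative):
```python
# Bootstrap particle filter: propagate, weight by the likelihood, resample.
import numpy as np

def bootstrap_pf(ys, f, g, q_std, r_std, n_particles=1000, rng=None):
    """ys: observations of x_t = f(x_{t-1}) + N(0, q_std^2), y_t = g(x_t) + N(0, r_std^2)."""
    rng = rng or np.random.default_rng()
    particles = rng.standard_normal(n_particles)          # assumed N(0, 1) prior
    estimates = []
    for y in ys:
        particles = f(particles) + q_std * rng.standard_normal(n_particles)
        log_w = -0.5 * ((y - g(particles)) / r_std) ** 2  # Gaussian log-likelihood
        w = np.exp(log_w - log_w.max())
        w /= w.sum()
        idx = rng.choice(n_particles, size=n_particles, p=w)  # multinomial resampling
        particles = particles[idx]
        estimates.append(particles.mean())                # posterior-mean estimate
    return np.array(estimates)
```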
    LogAnMeta: Log Anomaly Detection Using Meta Learning. (arXiv:2212.10992v1 [cs.LG])
    Modern telecom systems are monitored with performance and system logs from multiple application layers and components. Detecting anomalous events from these logs is key to identifying security breaches, resource over-utilization, critical/fatal errors, etc. Current supervised log anomaly detection frameworks tend to perform poorly on new types or signatures of anomalies with few or no samples in the training data. In this work, we propose a meta-learning-based log anomaly detection framework (LogAnMeta) for detecting anomalies from sequences of log events with few samples. LogAnMeta trains a hybrid few-shot classifier in an episodic manner. The experimental results demonstrate the efficacy of our proposed method.
    Prediction Sets Adaptive to Unknown Covariate Shift. (arXiv:2203.06126v5 [stat.ME] UPDATED)
    Predicting sets of outcomes -- instead of unique outcomes -- is a promising solution to uncertainty quantification in statistical learning. Despite a rich literature on constructing prediction sets with statistical guarantees, adapting to unknown covariate shift -- a prevalent issue in practice -- poses a serious unsolved challenge. In this paper, we show that prediction sets with finite-sample coverage guarantee are uninformative and propose a novel flexible distribution-free method, PredSet-1Step, to efficiently construct prediction sets with an asymptotic coverage guarantee under unknown covariate shift. We formally show that our method is \textit{asymptotically probably approximately correct}, having well-calibrated coverage error with high confidence for large samples. We illustrate that it achieves nominal coverage in a number of experiments and a data set concerning HIV risk prediction in a South African cohort study. Our theory hinges on a new bound for the convergence rate of the coverage of Wald confidence intervals based on general asymptotically linear estimators.
    Expander Graph Propagation. (arXiv:2210.02997v2 [cs.LG] UPDATED)
    Deploying graph neural networks (GNNs) on whole-graph classification or regression tasks is known to be challenging: it often requires computing node features that are mindful of both local interactions in their neighbourhood and the global context of the graph structure. GNN architectures that navigate this space need to avoid pathological behaviours, such as bottlenecks and oversquashing, while ideally having linear time and space complexity requirements. In this work, we propose an elegant approach based on propagating information over expander graphs. We leverage an efficient method for constructing expander graphs of a given size, and use this insight to propose the EGP model. We show that EGP is able to address all of the above concerns, while requiring minimal effort to set up, and provide evidence of its empirical utility on relevant graph classification datasets and baselines in the Open Graph Benchmark. Importantly, using expander graphs as a template for message passing necessarily gives rise to negative curvature. While this appears to be counterintuitive in light of recent related work on oversquashing, we theoretically demonstrate that negatively curved edges are likely to be required to obtain scalable message passing without bottlenecks. To the best of our knowledge, this is a previously unstudied result in the context of graph representation learning, and we believe our analysis paves the way to a novel class of scalable methods to counter oversquashing in GNNs.  ( 2 min )
    Sequential Training of Neural Networks with Gradient Boosting. (arXiv:1909.12098v3 [cs.LG] UPDATED)
    This paper presents a novel technique based on gradient boosting to train the final layers of a neural network (NN). Gradient boosting is an additive expansion algorithm in which a series of models are trained sequentially to approximate a given function. A neural network can also be seen as an additive expansion in which the scalar product of the responses of the last hidden layer and its weights provides the final output of the network. Instead of training the network as a whole, the proposed algorithm trains the network sequentially in $T$ steps. First, the bias term of the network is initialized with a constant approximation that minimizes the average loss of the data. Then, at each step, a portion of the network, composed of $J$ neurons, is trained to approximate the pseudo-residuals on the training data computed from the previous iterations. Finally, the $T$ partial models and bias are integrated as a single NN with $T \times J$ neurons in the hidden layer. Extensive experiments in classification and regression tasks, as well as in combination with deep neural networks, show competitive generalization performance with respect to neural networks trained with standard solvers such as Adam, L-BFGS, and SGD, and with deep models. Furthermore, we show that the proposed design permits switching off a number of hidden units at test time (the units that were trained last) without a significant reduction in generalization ability, allowing the model to adapt to different classification speed requirements on the fly.  ( 2 min )
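    A hedged sketch of the scheme for regression with squared loss, where pseudo-residuals are plain residuals (hyperparameters are illustrative; per the abstract, the $T$ blocks would finally be merged into a single network with $T \times J$ hidden neurons):
```python
# Sequentially fit T small hidden blocks of J neurons on the running residuals.
import numpy as np
from sklearn.neural_network import MLPRegressor

def fit_boosted_nn(X, y, T=10, J=8):
    bias = y.mean()                       # constant minimizing average squared loss
    residual = y - bias
    blocks = []
    for _ in range(T):
        block = MLPRegressor(hidden_layer_sizes=(J,), max_iter=500).fit(X, residual)
        residual = residual - block.predict(X)   # pseudo-residuals for the next step
        blocks.append(block)
    return bias, blocks

def predict_boosted_nn(X, bias, blocks):
    # Dropping the last-trained blocks here mimics "switching off" hidden units.
    return bias + sum(b.predict(X) for b in blocks)
```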
    Adapting to Latent Subgroup Shifts via Concepts and Proxies. (arXiv:2212.11254v1 [stat.ML])
    We address the problem of unsupervised domain adaptation when the source domain differs from the target domain because of a shift in the distribution of a latent subgroup. When this subgroup confounds all observed data, neither covariate shift nor label shift assumptions apply. We show that the optimal target predictor can be non-parametrically identified with the help of concept and proxy variables available only in the source domain, and unlabeled data from the target. The identification results are constructive, immediately suggesting an algorithm for estimating the optimal predictor in the target. For continuous observations, when this algorithm becomes impractical, we propose a latent variable model specific to the data generation process at hand. We show how the approach degrades as the size of the shift changes, and verify that it outperforms both covariate and label shift adjustment.  ( 2 min )
    A Tunable Loss Function for Robust Classification: Calibration, Landscape, and Generalization. (arXiv:1906.02314v6 [cs.LG] UPDATED)
    We introduce a tunable loss function called $\alpha$-loss, parameterized by $\alpha \in (0,\infty]$, which interpolates between the exponential loss ($\alpha = 1/2$), the log-loss ($\alpha = 1$), and the 0-1 loss ($\alpha = \infty$), for the machine learning setting of classification. Theoretically, we illustrate a fundamental connection between $\alpha$-loss and Arimoto conditional entropy, verify the classification-calibration of $\alpha$-loss in order to demonstrate asymptotic optimality via Rademacher complexity generalization techniques, and build-upon a notion called strictly local quasi-convexity in order to quantitatively characterize the optimization landscape of $\alpha$-loss. Practically, we perform class imbalance, robustness, and classification experiments on benchmark image datasets using convolutional-neural-networks. Our main practical conclusion is that certain tasks may benefit from tuning $\alpha$-loss away from log-loss ($\alpha = 1$), and to this end we provide simple heuristics for the practitioner. In particular, navigating the $\alpha$ hyperparameter can readily provide superior model robustness to label flips ($\alpha > 1$) and sensitivity to imbalanced classes ($\alpha < 1$).  ( 2 min )
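    The closed form consistent with the three endpoints above is $\ell_\alpha(p) = \frac{\alpha}{\alpha-1}\left(1 - p^{1-1/\alpha}\right)$ for the probability $p$ assigned to the true class; a small sketch:
```python
# Alpha-loss as a function of the true-class probability p:
# alpha=1 recovers log-loss, alpha=1/2 the exponential loss 1/p - 1, alpha=inf gives 1 - p.
import numpy as np

def alpha_loss(p, alpha):
    p = np.asarray(p, dtype=float)
    if np.isclose(alpha, 1.0):
        return -np.log(p)                 # limiting case: log-loss
    if np.isinf(alpha):
        return 1.0 - p                    # limiting case: soft 0-1 loss
    return (alpha / (alpha - 1.0)) * (1.0 - p ** (1.0 - 1.0 / alpha))

for a in (0.5, 1.0, np.inf):
    print(a, alpha_loss(0.8, a))          # 0.25 (=1/0.8 - 1), -ln(0.8), 0.2
```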

  • Open

    What are some of the best ways to get involved in the AI industry?
    I am looking to get ahead of the curve. Right now I'm working toward a career in web development, which is still my goal. But I really do want to learn more about AI, get involved in this field, and join different communities. Things like ChatGPT are so exciting, and I know this is only the beginning! Being a beginner, what would be some of the first steps I should explore and communities to join? submitted by /u/pistolpeter1111 [link] [comments]  ( 48 min )
    AI advocates, what is your future vision for human flourishing?
    Been following the AI art debate a little. AI has always bothered me a little over the years. I've always had two questions: For advocates: 1.) How do you believe humans will find meaning as AI continues to be able to do more and more things that humans once did? Maybe there is some ideal that people will be free to do whatever they want. But I think deriving value from "personal projects" only goes so far. I think humans flourish when we are actually needed as integral parts of a community. Just doing things for ourselves will become an empty pursuit quickly. It's hard for me to see how we don't end up in a world like in Wall-E, except the real life version has a lot more suicides. 2.) What is your ultimate goal/hope with this stuff? To me it has always seemed like a lot of the technology has been motivated by nothing more than, "we just wanted to see if we could do it". Which feels pretty irresponsible. I'm sure I'm off here. But I'm thinking of the IBM computer that beat a chess master. Or the art AI software. What value does that really add except, "Hey bro wouldn't it be so cool if all I had to do was type what kind of picture I wanted, and it would appear on screen?" That is all. submitted by /u/BurntTurkeyLeg1399 [link] [comments]  ( 51 min )
    How is image generating Ai so good at perfect symmetry in 3D space?
    It's able to create perfect symmetry of really complicated shapes submitted by /u/district999 [link] [comments]  ( 47 min )
    Ultra realistic Deepfake of Elon Musk
    submitted by /u/Microsis [link] [comments]  ( 49 min )
    Channel 4’s Christmas message to be written and read by a robot
    PLEASE READ THIS IN A BRITISH ACCENT FOR OPTIMUM FUN Channel 4’s alternative Christmas message will be generated and read by an AI, The message will be delivered by Ameca, one of the world’s most advanced robots. Ameca will speak about the highs and lows of 2022, after the King’s annual Christmas message at 3 pm. Know what's even crazier??? During the address, Ameca will also respond to questions about humans. She is due to say that humankind should be “neither happy nor sad” about the past year and “take it as a learning opportunity, a chance to change the way we think about the world and a reminder to help those in need whenever we can”. END OF BRITISH ACCENT, Hope that was fun I actually really hope Ameca has a British accent, that will just take whatever she has to say from a 7 to a freaking 11 over 10! ​ This is from the AI With Vibes Newsletter, read the full issue here: https://aiwithvibes.beehiiv.com/p/google-issues-code-red-response-rise-chatgpt submitted by /u/Mk_Makanaki [link] [comments]  ( 50 min )
    I found an AI called DREAMPRESS, which is an image and fanfic generator, and I wanted to know if it is safe
    submitted by /u/Any-Mammoth-4322 [link] [comments]  ( 54 min )
    Using Machine Learning to improve your Retrospectives (Including Template & Instructions)
    submitted by /u/BLECQ1 [link] [comments]  ( 49 min )
    ChatGPT's Most Absurd Product Ideas
    submitted by /u/flambok [link] [comments]  ( 47 min )
    Why are people scared of CHATGPT that it will create better stories and worlds and basically writers are going to lose their jobs when AI Dungeon has been doing this for years?
    just a dumb question submitted by /u/slavi13222 [link] [comments]  ( 51 min )
    AI safety probems are generally...
    Taking the blood type of this sub and others. Might publish a diagram later idk View Poll submitted by /u/ouaisouais2_2 [link] [comments]  ( 51 min )
    A Q&A interview about artificial intelligence with OpenAI's ChatGPT
    submitted by /u/SimonThalmann [link] [comments]  ( 47 min )
    Code Red
    submitted by /u/FapSimulator2016 [link] [comments]  ( 47 min )
    is there any good and up to date downloadable AI image rendering tool?
    Hello! I am looking for any type of AI image generator that can be downloaded offline so I can generate images with my PC instead of cloud platforms In the perfect world situation I would like to have something that can be also updated to renew and refresh it's sample library Thank you! submitted by /u/Greedy_Environment98 [link] [comments]  ( 48 min )
    AI-based interpolation tools for animation?
    Hey all, A friend and I are trying to make an animation for a song but have been struggling to find an animator to work with us. What we have is the completed storyboard (2D, drawn on iPad), drawn such that each frame of the storyboard could work as a keyframe. Is there an AI-based tool which would interpolate the frames in between using any 2 keyframes? It's alright if it's paid too but would prefer a free one. Thanks in advance :) submitted by /u/yipra97 [link] [comments]  ( 48 min )
    Code Red: Google Scrambles to Build Out AI Products
    submitted by /u/liquidocelotYT [link] [comments]  ( 46 min )
How to process any type of data collected over time in ML-EDM?
    submitted by /u/ML-EDM [link] [comments]  ( 54 min )
    AI Dream 97 - Trippy Mushroom Race by AI
    submitted by /u/LordPewPew777 [link] [comments]  ( 48 min )
    As an AI researcher, I am quite confused.
As an AI researcher, after talking with researchers from other domains, I am wondering why, in most cases, we need to implement tons of other baselines to compare against our own approach. Painfully, some of the baselines are not open-sourced, not well-organized, or poorly documented. To quantify this, I set a rough timer during one research project and found that more than 60% of the time went into re-implementing other people's work. I have no idea whether this is just my personal experience or not. submitted by /u/Jack_Wang_1107 [link] [comments]  ( 49 min )
    How to Use Artificial Intelligence in Schools
For years, teachers have struggled to meet every student's individualized educational needs. This gets even tougher in a class of twenty, thirty, or forty students in which every student has to pass the same tests, irrespective of their personalized needs. The schoolrooms of today have not changed much in the past 50 years. Students sit in a room together and complete the same lessons, typically using the same textbooks, no matter their learning skills or expertise in a particular subject. Some students get left behind. Others are left unchallenged and bored by this one-size-fits-all approach. AI can change this. AI today is helping teachers create smart content programs and intelligent tutoring systems that help students learn in a more customized way. Students will be able to learn new things much more effectively with advanced tutor apps, which can become education mentors for students and provide engaging learning opportunities. AI can now track an individual student's performance based on previous grades, participation, and performance, and help a student realize their maximum potential. The rest of this article covers some ways in which AI is making education smarter, cheaper, and more accessible to all: AI as a tutor; AI can automate grading; AI can provide cognitive insights in classrooms; AI can improve the education system. Read more... https://owlcation.com/academia/4-Amazing-Benefits-of-Artificial-Intelligence-in-Schools submitted by /u/IcyCartoonist1955 [link] [comments]  ( 51 min )
    Drakengard X Alice Liddell
    submitted by /u/Vindaya_ [link] [comments]  ( 48 min )
  • Open

    [D] Nick Bostrom “Superintelligence” (2014) and ChatGPT
I read Superintelligence several years ago and can't stop thinking about it in relation to the latest buzz around GPT-3. One passage in particular stands out to me: Chapter 10: “An oracle is a question answering system. It might accept questions as natural language and present its answers as text... building an oracle that has a fully domain general ability to answer natural language questions is an AI-complete problem... If one could do that, one could probably also build an AI that has a decent ability to understand human intentions as well as human words.” When I first read this passage it seemed like a concept that was definitely possible but basically just Google, and uninteresting. GPT meets this description of a generally intelligent AI almost to the letter. It can even go a step further and be used to generate functioning code with real applications. Generating real code moves GPT from being an “oracle” to being a “genie” by Bostrom’s classification, as the AI now has a means to interact with the world. The question I then have for discussion is: “have we reached human-level general AI?” The current capabilities we have are starting to read eerily close to the descriptions in Superintelligence. One additional question to think about: “What if GPT is asked to write a program which violates a law or is otherwise harmful?” GPT refuses to answer some questions for ethical reasons. On one hand it feels wrong for an AI to give out code which it can identify as violating a law or as a virus. On the other hand, an ethical instinct for real-world code queries sounds ripe for “malignant failure”. Is it ethical to have GPT purposely lie to people it can confidently identify as malicious, or report illegal activity to the police? Is it ethical to allow clearly identifiable criminals to go unreported? submitted by /u/LanchestersLaw [link] [comments]  ( 68 min )
    Study: AI Behind ChatGPT Could Help Spot Early Signs of Alzheimer's Disease
    submitted by /u/supremogh [link] [comments]  ( 65 min )
    [D] Non-deep Q learning with OpenAI gym lunar lander - anyone?
    Every lunar lander tutorial or example I've found so far uses deep RL. Is classical Q learning such an obviously bad idea that no-one bothers with it? I've had some success recently applying Q learning to lunar lander (converting the continuous observations into discrete values) and am surprised there aren't more tutorials about this approach. Am I missing something? submitted by /u/verbigratia [link] [comments]  ( 68 min )
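For anyone curious, a minimal sketch of the discretize-then-tabular approach the poster describes; the bin edges and hyperparameters are illustrative and untuned, and it uses the pre-0.26 gym step/reset API (newer gym/gymnasium returns (obs, info) from reset and a 5-tuple from step):

    # Tabular Q-learning on LunarLander-v2 by binning the 8-dim observation.
    import gym
    import numpy as np
    from collections import defaultdict

    env = gym.make("LunarLander-v2")
    bins = [np.linspace(-1.0, 1.0, 10) for _ in range(8)]  # crude per-dim grid

    def discretize(obs):
        return tuple(int(np.digitize(x, b)) for x, b in zip(obs, bins))

    Q = defaultdict(lambda: np.zeros(env.action_space.n))
    alpha, gamma, eps = 0.1, 0.99, 0.1

    for episode in range(5000):
        s = discretize(env.reset())        # pre-0.26 API: reset() -> obs
        done = False
        while not done:
            if np.random.rand() < eps:     # epsilon-greedy exploration
                a = env.action_space.sample()
            else:
                a = int(np.argmax(Q[s]))
            obs, r, done, _ = env.step(a)  # pre-0.26 API: 4-tuple
            s2 = discretize(obs)
            # standard tabular Q-learning update
            Q[s][a] += alpha * (r + gamma * np.max(Q[s2]) - Q[s][a])
            s = s2

The likely reason tutorials skip this: with 8 continuous dimensions the table grows as bins^8, so state coverage is sparse and results usually lag deep RL. The poster's success suggests coarse bins can still work.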
    [D] When chatGPT stops being free: Run SOTA LLM in cloud
TL;DR: I found GPU compute to be generally cheap; spot or on-demand instances with over 100GB of vRAM can be launched on AWS for a few USD/hour. So I thought it would make sense to run your own SOTA LLM, like Bloomz 176B, as an inference endpoint whenever you need it for a few questions. That would still make more sense than shoving money into a closed walled garden like "not-so-OpenAI" when they make ChatGPT or GPT-4 available for $$$. But I struggle due to a lack of tutorials/resources. Therefore, I carefully checked benchmarks, model parameters and sizes, as well as training sources for all SOTA LLMs here. I've known since reading the Chinchilla paper that OpenAI's model-scaling picture was wrong and that more params != better quality generation. So I was looking for the best per…  ( 74 min )
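A hedged sketch of the usual open-source route with Hugging Face transformers + accelerate: device_map="auto" shards a checkpoint across whatever GPUs and CPU RAM the instance has. The checkpoint below is a smaller Bloomz sibling, purely illustrative; the 176B model needs a multi-GPU node:

    # Sketch: self-hosted inference for an open LLM on a cloud GPU box.
    # pip install transformers accelerate
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = "bigscience/bloomz-7b1"     # example; 176B needs many GPUs
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(
        name, device_map="auto", torch_dtype=torch.bfloat16
    )

    inputs = tok("Explain spot instances in one paragraph.",
                 return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=128)
    print(tok.decode(out[0], skip_special_tokens=True))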
    [P] App that Determines Whether You've Been Naughty or Nice Based on Your Reddit Comments
Hex Application Since we are heading into the holiday season, I thought it would be interesting to see whether you could create a model that judges morality from a user's Reddit comments. I used Scikit-Learn's Logistic Regression model for this. I started by downloading around 750 comments from Social Grep's website. They have pulled Reddit comments from different sets of subreddits. I pulled from their datasets for confession-like subreddits, the irl subreddits, and the dataset subreddit. I classified the comments manually by a set rule of morality. Once they were scored, I trained/tested the Logistic Regression model with those comments. For the user-specific testing, I used PRAW to pull the most recent 50 comments from the username provided in the Hex Application. I ran the trained model, output the probability of each comment being nice, took an average of the probabilities, and used that value to determine whether the user was naughty or nice. I use a script to email a CSV with all of the tested comments and the final score to the user. Based on the results that have come through so far, the model is definitely biased towards giving the user a "nice" decision. I believe that is because the training data is around 70% nice versus naughty. Does anyone have a way to keep the model from being biased like that? Feel free to try the app out and let me know what you think! submitted by /u/Steven_Johnson34 [link] [comments]  ( 71 min )
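One standard answer to the bias question is to reweight classes inversely to their frequency, which in scikit-learn is a one-flag change. A sketch; the TF-IDF features and the tiny stand-in arrays below are hypothetical, not the poster's pipeline:

    # Counteract the ~70/30 "nice"-heavy training set with class weights.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    train_comments = ["you are wonderful", "everyone here is awful"]  # stand-in
    train_labels = [1, 0]                                             # 1 = nice

    model = make_pipeline(
        TfidfVectorizer(),
        LogisticRegression(class_weight="balanced", max_iter=1000),
    )
    model.fit(train_comments, train_labels)
    print(model.predict_proba(["happy holidays!"])[:, 1])   # P(nice)

Undersampling the majority class or collecting more "naughty" examples are the other usual fixes.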
    [D] Any suggested data labeling services for side by side comparison?
Hi! I am working on a project where I need side-by-side image comparison data labeled. The task is to check whether two images show the same object or not. However, I could not find any human labeling services that offer this task format. Any suggestions? submitted by /u/Dust-Level [link] [comments]  ( 68 min )
    [P] Regression Model With Added Constraint
I have a data set with three columns and want to predict a numerical value. The data set is divided into groups of 50 rows each. There is a necessary constraint: the sum of the predicted values in each group of 50 rows must equal the value in one column for that group. What model can I use for this, if any? submitted by /u/rapp17 [link] [comments]  ( 71 min )
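One simple recipe: fit any unconstrained regressor, then project each group's predictions onto the constraint by adding a per-group constant, which is the least-squares projection onto the hyperplane where the group's predictions sum to the target. A sketch on synthetic stand-in data:

    # Fit, then shift each group's predictions so they sum to group_total.
    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    n_groups, rows = 4, 50
    df = pd.DataFrame(rng.normal(size=(n_groups * rows, 3)),
                      columns=["x1", "x2", "x3"])
    df["group"] = np.repeat(np.arange(n_groups), rows)
    df["y"] = df[["x1", "x2", "x3"]].sum(axis=1) \
        + rng.normal(scale=0.1, size=len(df))
    df["group_total"] = df.groupby("group")["y"].transform("sum")

    model = LinearRegression().fit(df[["x1", "x2", "x3"]], df["y"])
    df["pred"] = model.predict(df[["x1", "x2", "x3"]])

    # Least-squares projection: a constant shift per group restores the sum.
    gap = df["group_total"] - df.groupby("group")["pred"].transform("sum")
    df["pred"] += gap / rows
    assert np.allclose(df.groupby("group")["pred"].sum(),
                       df.groupby("group")["group_total"].first())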
    [D] Diversifying your pretraining dataset
Hello everyone, I'm currently very curious about the performance of self-supervised training in small models. I've tried most of the implementations provided by Lightly. My question is: should we use all available data for pretraining, or could our model benefit from removing similar images? If so, how would you test it? Finally, sampling for pretraining seems to be overlooked in the literature, which seems wrong; I feel like I didn't search for it correctly, so any papers that discuss sampling would be great. submitted by /u/sad_potato00 [link] [comments]  ( 70 min )
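On the "removing similar images" question, one cheap experiment is to drop near-duplicates by embedding cosine similarity and compare a downstream linear probe trained with and without the removal. A greedy sketch; the random matrix stands in for real features from any small pretrained encoder:

    # Greedy near-duplicate filtering over an (N, D) embedding matrix.
    import numpy as np

    def dedup(embeddings, threshold=0.95):
        emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
        keep = []
        for i, e in enumerate(emb):
            # keep image i only if nothing kept so far is too similar
            if not keep or float(np.max(emb[keep] @ e)) < threshold:
                keep.append(i)
        return keep

    embeddings = np.random.rand(1000, 128)   # stand-in for encoder features
    kept = dedup(embeddings)
    print(f"kept {len(kept)} of {len(embeddings)} images")

This is O(N^2) in the worst case; approximate nearest-neighbour libraries scale the same idea up.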
    [D] L2 - Is higher always better?
I am wondering: if you have several networks with similar performance on all your available datasets but with different hyperparameters, what would your criteria be to choose one of the networks? E.g., would you choose the one with the highest L2 penalty, hoping to improve generalisation? submitted by /u/Wedrux [link] [comments]  ( 69 min )
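For what it's worth, a higher L2 penalty is not monotonically better: too much of it underfits. The usual criterion is held-out validation loss, swept over the penalty. A self-contained PyTorch sketch on synthetic data:

    # Pick weight_decay (L2) by validation loss, not "bigger is better".
    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader, TensorDataset

    X = torch.randn(512, 16)
    y = X @ torch.randn(16, 1) + 0.1 * torch.randn(512, 1)
    train = DataLoader(TensorDataset(X[:400], y[:400]), batch_size=32)
    val = DataLoader(TensorDataset(X[400:], y[400:]), batch_size=32)

    def val_loss(weight_decay):
        model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))
        opt = torch.optim.Adam(model.parameters(), lr=1e-3,
                               weight_decay=weight_decay)  # the L2 strength
        for _ in range(10):                                # illustrative epochs
            for xb, yb in train:
                opt.zero_grad()
                nn.functional.mse_loss(model(xb), yb).backward()
                opt.step()
        with torch.no_grad():
            return sum(nn.functional.mse_loss(model(xb), yb).item()
                       for xb, yb in val)

    scores = {wd: val_loss(wd) for wd in [0.0, 1e-5, 1e-4, 1e-3, 1e-2]}
    print(min(scores, key=scores.get))   # best penalty, often not the largest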
    [P] A self-driving car using Nvidia Jetson Nano, with movement controlled by a pre-trained convolution neural network (CNN) written in Taichi
Intro & source code: https://github.com/houkensjtu/taichi-hackathon-akinasan The circuit of an ordinary RC toy car is modified so that the Jetson Nano can control the movement of the car through its GPIO port. Of course, we need to use a motor drive controller here, because the upper limit of the Jetson Nano's output current is not enough to drive the car motor directly. The convolutional neural network (CNN) is implemented in the Taichi programming language. Road data was collected, classified, and labeled, and finally used to train the CNN models. The pre-trained model is imported into the Jetson Nano, and action predictions are made for the images captured during driving. Demo: https://reddit.com/link/zshrlv/video/pcm3f6id3f7a1/player submitted by /u/TaichiOfficial [link] [comments]  ( 67 min )
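For readers wondering what "control through the GPIO port" can look like in practice, here is a hedged sketch of one common wiring (an L298N-style motor driver); the pin numbers and the class-to-command mapping are assumptions, not taken from the project:

    # Hypothetical GPIO side of the car; Jetson.GPIO mirrors the RPi.GPIO API.
    import Jetson.GPIO as GPIO

    IN1, IN2 = 11, 13                    # hypothetical board pins -> driver
    GPIO.setmode(GPIO.BOARD)
    GPIO.setup([IN1, IN2], GPIO.OUT, initial=GPIO.LOW)

    def forward():
        GPIO.output(IN1, GPIO.HIGH)
        GPIO.output(IN2, GPIO.LOW)

    def backward():
        GPIO.output(IN1, GPIO.LOW)
        GPIO.output(IN2, GPIO.HIGH)

    def stop():
        GPIO.output([IN1, IN2], GPIO.LOW)

    # e.g. dispatch on the CNN's predicted class for the current camera frame:
    # {0: forward, 1: backward, 2: stop}[predicted_class]()
    GPIO.cleanup()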
    [P] question about the generate method in a model() from huggingface
I am working on an NMT model. I'm a newbie at this, but so far I have a decent result. I'm translating from a high-resource language to a low-resource one. Changing the decoding strategy in the generate function (e.g., I've tried greedy decoding and beam search) can ostensibly improve the results. What I wonder is: has anybody worked with these for an NMT project involving agglutinative languages? If so, any recommendations? I'm currently searching for options but honestly don't find much, and know less. submitted by /u/Proxify [link] [comments]  ( 68 min )
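For reference, the knobs most often compared on model.generate in transformers look like the sketch below; the Helsinki-NLP English-Finnish checkpoint is just an example of a rich-to-agglutinative pair, not the poster's model:

    # Beam search with mild anti-repetition settings for seq2seq NMT.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    name = "Helsinki-NLP/opus-mt-en-fi"    # example en -> fi (agglutinative)
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSeq2SeqLM.from_pretrained(name)

    batch = tok("The weather is beautiful today.", return_tensors="pt")
    out = model.generate(
        **batch,
        num_beams=5,               # beam search instead of greedy
        length_penalty=1.0,        # >1 favors longer outputs, <1 shorter
        no_repeat_ngram_size=3,    # curbs repetition loops
        max_new_tokens=64,
        early_stopping=True,
    )
    print(tok.decode(out[0], skip_special_tokens=True))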
    [D] Using "duplicates" during training?
I have collected experimental data for various conditions. To ensure repeatability, each test is replicated 5 times, which means the same input but slightly different outputs due to experimental variability. If you were to build a machine learning algorithm, would you use all 5 data points for each given test, hoping that your algorithm will learn to converge towards the mean response? Or is it advisable to pre-compute the means and only feed these to the model (so that one input can only have one output)? I can see pros and cons to both approaches and would welcome feedback. Thank you. submitted by /u/DreamyPen [link] [comments]  ( 70 min )
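The two options are easy to compare empirically. With a squared-error loss, training on all replicates already pulls the model toward the conditional mean, while pre-averaging gives cleaner targets but hides the noise level. A tiny sketch with hypothetical columns:

    # Option A: fit on every replicate. Option B: fit on per-condition means.
    import pandas as pd
    from sklearn.linear_model import LinearRegression

    df = pd.DataFrame({
        "condition": [0, 0, 0, 1, 1, 1],             # stand-in replicates
        "x":         [1.0, 1.0, 1.0, 2.0, 2.0, 2.0],
        "y":         [0.9, 1.1, 1.0, 2.2, 1.9, 2.0],
    })

    raw_model = LinearRegression().fit(df[["x"]], df["y"])        # option A

    means = df.groupby("condition", as_index=False).mean()        # option B
    mean_model = LinearRegression().fit(means[["x"]], means["y"])

Comparing both on a held-out set of conditions settles which works better for the data at hand.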
    [D] Tensorflow vs Pytorch for LSTM stock bot
    Dabbling with machine learning at the moment, wanting to create a stock bot that uses an LSTM ANN. Not expecting this thing to be profitable (to start with anyway), until I at least later add more datasets other than just historical price data, but the idea is that I'll need to work with something that's flexible enough to incorporate expanded data sets in the future to draw statistical correlations off (such as sentiment data, financial earnings, etc). But I also want to work with something that's not super-complex and difficult to use since I'm new to both Python (software dev for 10 yrs though so should be able to pick it up fairly easily) and machine learning. After a bit of research, seems like my best options are either Tensorflow or Pytorch (?). Keras was mentioned, but apparently Tensorflow v2 basically has Keras built in, so it's not needed anymore (?). Can someone with a bit of experience please point me in a good starting direction in terms of what ML software/libs to settle on. Thanks! submitted by /u/Careful-Temporary388 [link] [comments]  ( 68 min )
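Since TensorFlow 2 does bundle Keras, a first LSTM can be this short; the shapes and random arrays below are illustrative placeholders for windowed price features:

    # Minimal Keras LSTM: windows of past values in, next value out.
    import numpy as np
    import tensorflow as tf

    window, n_features = 30, 1
    X = np.random.rand(1000, window, n_features).astype("float32")
    y = np.random.rand(1000, 1).astype("float32")

    model = tf.keras.Sequential([
        tf.keras.layers.LSTM(64, input_shape=(window, n_features)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    model.fit(X, y, epochs=5, batch_size=32, validation_split=0.2)

PyTorch's torch.nn.LSTM is just as capable and equally extensible later; for a first project the Keras compile/fit loop is simply less code.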
  • Open

    Cognitive scientists develop new model explaining difficulty in language comprehension
    Built on recent advances in machine learning, the model predicts how well individuals will produce and comprehend sentences.  ( 10 min )
  • Open

    Get to production-grade data faster by using new built-in interfaces with Amazon SageMaker Ground Truth Plus
    Launched at AWS re:Invent 2021, Amazon SageMaker Ground Truth Plus helps you create high-quality training datasets by removing the undifferentiated heavy lifting associated with building data labeling applications and managing the labeling workforce. All you do is share data along with labeling requirements, and Ground Truth Plus sets up and manages your data labeling workflow […]  ( 9 min )
  • Open

    Top Food Stories From 2022: Meet 4 Startups Putting AI on the Plate
    This holiday season, feast on the bounty of food-themed stories NVIDIA Blog readers gobbled up in 2022. Startups in the retail industry — and particularly in quick-service restaurants — are using NVIDIA AI and robotics technology to make it easier to order food in drive-thrus, find beverages on store shelves and have meals delivered. They’re Read article >  ( 5 min )
    Toy Jensen Rings in Holidays With AI-Powered ‘Jingle Bells’
    In a moment of pure serendipity, Lah Yileh Lee and Xinting Lee, a pair of talented singers who often stream their performances online, found themselves performing in a public square in Taipei when NVIDIA founder and CEO Jensen Huang happened upon them. Huang couldn’t resist joining in, cheering on their serenade as they recorded Lady Read article >  ( 5 min )
    Make Your Spirit Merry and Bright With Hit Games on GeForce NOW This Holiday Season
    Gear up for some festive fun this GFN Thursday with some of the GeForce NOW community’s top picks of games to play during the holidays, as well as a new title joining the GeForce NOW library this week. And, following the recent update that enabled Ubisoft Connect account syncing with GeForce NOW, select Ubisoft+ Multi-Access Read article >  ( 6 min )
  • Open

    Game characters BUT in anime neural network...
I created pictures of game characters with a neural network that turns any picture into anime. Take a look, and if it's not too much trouble, rate them here or on YouTube: https://youtu.be/ZDdzG333x9Q submitted by /u/Crazy-Assistant5399 [link] [comments]  ( 52 min )
    Obtaining Taxonomies for Textual Data - Father of the Data Warehouse Bill Inmon
    submitted by /u/Valuable-Panic-7793 [link] [comments]  ( 49 min )
    Legendary footballers BUT in anime...
I created pictures of legendary footballers with a neural network that turns any picture into anime. Take a look, and if it's not too much trouble, rate them here or on YouTube: https://youtu.be/kqGBQ_0BXc0 submitted by /u/Crazy-Assistant5399 [link] [comments]  ( 48 min )
  • Open

    Coding theory posts
    Here are some posts I’ve written that fall under the general heading of coding theory. Although coding theory can overlap with secret codes, it’s more concerned with things like Morse code, Reed-Solomon codes, and Unicode. Radio related Frequency Shift Keying Morse code numbers and abbreviations How efficient is Morse code? Algebraic coding theory Prefix codes […] Coding theory posts first appeared on John D. Cook.  ( 4 min )
  • Open

    Remapping the action can improve the learning?
For example, consider a robot that has to open a door… I would expect it to be more difficult for an agent to learn the joint torques directly than to learn target joint positions (which are then mapped into the required torques by a PID controller for the robot). Is there any work that discusses this topic? Can you link me a paper? submitted by /u/riccardogauss [link] [comments]  ( 61 min )
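The remapping itself is easy to prototype as a gym ActionWrapper. A hedged sketch where the agent emits target joint positions and a PD loop turns them into torques; the joint_pos/joint_vel attributes and the gains are assumptions about the env, not a standard interface:

    # Position-action interface over a torque-controlled env via PD control.
    import gym
    import numpy as np

    class PositionToTorque(gym.ActionWrapper):
        def __init__(self, env, kp=50.0, kd=2.0):
            super().__init__(env)
            self.kp, self.kd = kp, kd      # PD gains, tuned per robot

        def action(self, target_pos):
            # torque = Kp * position error - Kd * joint velocity
            err = np.asarray(target_pos) - self.env.joint_pos
            torque = self.kp * err - self.kd * self.env.joint_vel
            return np.clip(torque, self.action_space.low,
                           self.action_space.high)

On the literature question, one relevant paper is Peng and van de Panne, "Learning Locomotion Skills Using DeepRL: Does the Choice of Action Space Matter?"; "action space shaping" is a useful search phrase more generally.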
PettingZoo and Stable Baselines 3
    Hi! I would like to (independently) train the agents of a multi-agent environment using some popular single-agent RL algorithms, such as PPO. Namely, I would like to train each agent as if it were acting in a single-agent MDP and see what happens. Is there a way to directly use the algorithms implemented in Stable Baselines 3 to train agents in a PettingZoo environment? submitted by /u/Tabunamok [link] [comments]  ( 57 min )
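The route usually pointed to is SuperSuit, which converts a PettingZoo parallel env into an SB3-compatible vector env. Note this trains one policy shared across agents (each agent becomes a parallel copy) rather than fully independent learners. A sketch written from memory of the docs, so exact function versions may differ across releases:

    # PettingZoo -> Stable Baselines 3 via SuperSuit (parameter sharing).
    import supersuit as ss
    from stable_baselines3 import PPO
    from pettingzoo.mpe import simple_spread_v2   # example cooperative env

    env = simple_spread_v2.parallel_env()
    env = ss.pettingzoo_env_to_vec_env_v1(env)
    env = ss.concat_vec_envs_v1(env, 4, base_class="stable_baselines3")

    model = PPO("MlpPolicy", env, verbose=1)
    model.learn(total_timesteps=100_000)

For truly independent per-agent learners you would instead keep one model per agent and step the parallel env manually, feeding each agent its own observation.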
    [D] Performance of DQN model not consistent
    I use DQN (from stable baselines 3) and have the following problem. If I run training multiple times the performance of the trained models is vastly different in testing even though they are trained with the same hyperparameters and seem to converge to the same episode rewards in training. The environment is custom but deterministic (right now). I split the environment into a training part and a test part. So the agent sees new data during the test phase. Trying to illustrate the situation: - Training model A with hyperparameters P. Train on training data TRAIN: good performance on test data TEST - Training model B with hyperparameters P. Train on training data TRAIN: bad performance on test data TEST - Training model C with hyperparameters P. Train on training data TRAIN: bad performance on test data TEST P, TRAIN, and TEST are the same across all 3 Examples of episode reward plots are here: https://imgur.com/a/wT698qK Do you have any ideas of what I could be doing wrong or what is going on? I am doing RL as a hobby and personal learning project, so you can just assume I missed something obvious. Thank you in advance. submitted by /u/---___---___---___ [link] [comments]  ( 64 min )
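Before anything else, it's worth ruling out unseeded randomness (network init, exploration, replay sampling), which alone can produce exactly this spread. A minimal sketch with SB3's seeding utilities, where CartPole stands in for the custom env:

    # Make DQN runs reproducible so real sensitivity separates from noise.
    import gym
    from stable_baselines3 import DQN
    from stable_baselines3.common.utils import set_random_seed

    set_random_seed(0)                       # seeds python, numpy, torch
    env = gym.make("CartPole-v1")            # stand-in for the custom env
    model = DQN("MlpPolicy", env, seed=0)
    model.learn(total_timesteps=50_000)

If performance still varies wildly across seeds, that is genuine seed sensitivity, which DQN is known for; the standard remedy is reporting the mean and spread over several seeds rather than a single run.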
  • Open

    Microgrid Optimal Energy Scheduling Considering Neural Network based Battery Degradation. (arXiv:2202.12416v4 [eess.SP] UPDATED)
Battery energy storage system (BESS) can effectively mitigate the uncertainty of variable renewable generation. Degradation is unpreventable and hard to model and predict for batteries such as the most popular Lithium-ion battery (LiB). In this paper, we propose a data driven method to predict the battery degradation per a given scheduled battery operational profile. Particularly, a neural network based battery degradation (NNBD) model is proposed to quantify the battery degradation with inputs of major battery degradation factors. When incorporating the proposed NNBD model into microgrid day-ahead scheduling (MDS), we can establish a battery degradation based MDS (BDMDS) model that can consider the equivalent battery degradation cost precisely with the proposed cycle based battery usage processing (CBUP) method for the NNBD model. Since the proposed NNBD model is highly non-linear and non-convex, BDMDS would be very hard to solve. To address this issue, a neural network and optimization decoupled heuristic (NNODH) algorithm is proposed in this paper to effectively solve this neural network embedded optimization problem. Simulation results demonstrate that the proposed NNODH algorithm is able to obtain the optimal solution with lowest total cost including normal operation cost and battery degradation cost.
    Application-Driven Learning: A Closed-Loop Prediction and Optimization Approach Applied to Dynamic Reserves and Demand Forecasting. (arXiv:2102.13273v4 [math.OC] CROSS LISTED)
    Forecasting and decision-making are generally modeled as two sequential steps with no feedback, following an open-loop approach. In this paper, we present application-driven learning, a new closed-loop framework in which the processes of forecasting and decision-making are merged and co-optimized through a bilevel optimization problem. We present our methodology in a general format and prove that the solution converges to the best estimator in terms of the expected cost of the selected application. Then, we propose two solution methods: an exact method based on the KKT conditions of the second-level problem and a scalable heuristic approach suitable for decomposition methods. The proposed methodology is applied to the relevant problem of defining dynamic reserve requirements and conditional load forecasts, offering an alternative approach to current \emph{ad hoc} procedures implemented in industry practices. We benchmark our methodology with the standard sequential least-squares forecast and dispatch planning process. We apply the proposed methodology to an illustrative system and to a wide range of instances, from dozens of buses to large-scale realistic systems with thousands of buses. Our results show that the proposed methodology is scalable and yields consistently better performance than the standard open-loop approach.
    Walking Noise: Understanding Implications of Noisy Computations on Classification Tasks. (arXiv:2212.10430v1 [cs.LG])
Machine learning methods like neural networks are extremely successful and popular in a variety of applications, however, they come at substantial computational costs, accompanied by high energy demands. In contrast, hardware capabilities are limited and there is evidence that technology scaling is stuttering, therefore, new approaches to meet the performance demands of increasingly complex model architectures are required. As an unsafe optimization, noisy computations are more energy efficient, and given a fixed power budget also more time efficient. However, any kind of unsafe optimization requires counter measures to ensure functionally correct results. This work considers noisy computations in an abstract form, and aims to understand the implications of such noise on the accuracy of neural-network-based classifiers as an exemplary workload. We propose a methodology called "Walking Noise" that allows one to assess the robustness of different layers of deep architectures by means of a so-called "midpoint noise level" metric. We then investigate the implications of additive and multiplicative noise for different classification tasks and model architectures, with and without batch normalization. While noisy training significantly increases robustness for both noise types, we observe a clear trend to increase weights and thus increase the signal-to-noise ratio for additive noise injection. For the multiplicative case, we find that some networks, with suitably simple tasks, automatically learn an internal binary representation, hence becoming extremely robust. Overall this work proposes a method to measure the layer-specific robustness and shares first insights on how networks learn to compensate injected noise, and thus, contributes to understand robustness against noisy computations.
    Method to Classify Skin Lesions using Dermoscopic images. (arXiv:2008.09418v2 [eess.IV] UPDATED)
Skin cancer is the most common cancer in the world, constituting one-third of all cancer cases. Benign skin cancers are not fatal and can be cured with proper medication. But the same is not true of malignant skin cancers. In the case of malignant melanoma at its peak stage, the maximum life expectancy is less than or equal to 5 years. But it can be cured if detected in its early stages. Though there are numerous clinical procedures, the accuracy of diagnosis falls between 49% and 81% and is time-consuming. So, dermoscopy has been brought into the picture. It helped in increasing the accuracy of diagnosis but could not eliminate the error-prone behaviour. A quick and less error-prone solution is needed to diagnose this rapidly growing skin cancer. This project deals with the usage of deep learning in skin lesion classification. In this project, an automated model for skin lesion classification using dermoscopic images has been developed with CNNs (Convolutional Neural Networks) as the training model. Convolutional neural networks are known for capturing the features of an image, so they are preferred for analyzing medical images to find the characteristics that drive the model towards success. Techniques like data augmentation for tackling class imbalance, segmentation for focusing on the region of interest, and 10-fold cross-validation to make the model robust have been brought into the picture. This project also includes the usage of certain preprocessing techniques, like brightening the images using a piece-wise linear transformation function, grayscale conversion, and resizing of the image. This project offers a set of valuable insights into how the accuracy of the model improves with the introduction of new input strategies and preprocessing techniques. The best accuracy this model could achieve is 0.886.
    Almost Cost-Free Communication in Federated Best Arm Identification. (arXiv:2208.09215v2 [cs.LG] UPDATED)
    We study the problem of best arm identification in a federated learning multi-armed bandit setup with a central server and multiple clients. Each client is associated with a multi-armed bandit in which each arm yields {\em i.i.d.}\ rewards following a Gaussian distribution with an unknown mean and known variance. The set of arms is assumed to be the same at all the clients. We define two notions of best arm -- local and global. The local best arm at a client is the arm with the largest mean among the arms local to the client, whereas the global best arm is the arm with the largest average mean across all the clients. We assume that each client can only observe the rewards from its local arms and thereby estimate its local best arm. The clients communicate with a central server on uplinks that entail a cost of $C\ge0$ units per usage per uplink. The global best arm is estimated at the server. The goal is to identify the local best arms and the global best arm with minimal total cost, defined as the sum of the total number of arm selections at all the clients and the total communication cost, subject to an upper bound on the error probability. We propose a novel algorithm {\sc FedElim} that is based on successive elimination and communicates only in exponential time steps and obtain a high probability instance-dependent upper bound on its total cost. The key takeaway from our paper is that for any $C\geq 0$ and error probabilities sufficiently small, the total number of arm selections (resp.\ the total cost) under {\sc FedElim} is at most~$2$ (resp.~$3$) times the maximum total number of arm selections under its variant that communicates in every time step. Additionally, we show that the latter is optimal in expectation up to a constant factor, thereby demonstrating that communication is almost cost-free in {\sc FedElim}. We numerically validate the efficacy of {\sc FedElim}.
    Are Deep Neural Networks SMARTer than Second Graders?. (arXiv:2212.09993v1 [cs.AI])
    Recent times have witnessed an increasing number of applications of deep neural networks towards solving tasks that require superior cognitive abilities, e.g., playing Go, generating art, question answering (such as ChatGPT), etc. Such a dramatic progress raises the question: how generalizable are neural networks in solving problems that demand broad skills? To answer this question, we propose SMART: a Simple Multimodal Algorithmic Reasoning Task and the associated SMART-101 dataset, for evaluating the abstraction, deduction, and generalization abilities of neural networks in solving visuo-linguistic puzzles designed specifically for children in the 6-8 age group. Our dataset consists of 101 unique puzzles; each puzzle comprises a picture and a question, and their solution needs a mix of several elementary skills, including arithmetic, algebra, and spatial reasoning, among others. To scale our dataset towards training deep neural networks, we programmatically generate entirely new instances for each puzzle while retaining their solution algorithm. To benchmark the performance on the SMART-101 dataset, we propose a vision and language meta-learning model using varied state-of-the-art backbone neural networks. Our experiments reveal that while powerful deep models offer reasonable performances on puzzles that they are trained on, they are not better than random accuracy when analyzed for generalization. We also evaluate the recent ChatGPT large language model on a subset of our dataset and find that while ChatGPT produces convincing reasoning abilities, the answers are often incorrect.
    Managing Large Dataset Gaps in Urban Air Quality Prediction: DCU-Insight-AQ at MediaEval 2022. (arXiv:2212.10273v1 [cs.LG])
    Calculating an Air Quality Index (AQI) typically uses data streams from air quality sensors deployed at fixed locations and the calculation is a real time process. If one or a number of sensors are broken or offline, then the real time AQI value cannot be computed. Estimating AQI values for some point in the future is a predictive process and uses historical AQI values to train and build models. In this work we focus on gap filling in air quality data where the task is to predict the AQI at 1, 5 and 7 days into the future. The scenario is where one or a number of air, weather and traffic sensors are offline and explores prediction accuracy under such situations. The work is part of the MediaEval'2022 Urban Air: Urban Life and Air Pollution task submitted by the DCU-Insight-AQ team and uses multimodal and crossmodal data consisting of AQI, weather and CCTV traffic images for air pollution prediction.
    A Survey of Deep Learning for Mathematical Reasoning. (arXiv:2212.10535v1 [cs.AI])
    Mathematical reasoning is a fundamental aspect of human intelligence and is applicable in various fields, including science, engineering, finance, and everyday life. The development of artificial intelligence (AI) systems capable of solving math problems and proving theorems has garnered significant interest in the fields of machine learning and natural language processing. For example, mathematics serves as a testbed for aspects of reasoning that are challenging for powerful deep learning models, driving new algorithmic and modeling advances. On the other hand, recent advances in large-scale neural language models have opened up new benchmarks and opportunities to use deep learning for mathematical reasoning. In this survey paper, we review the key tasks, datasets, and methods at the intersection of mathematical reasoning and deep learning over the past decade. We also evaluate existing benchmarks and methods, and discuss future research directions in this domain.
    HyperBO+: Pre-training a universal prior for Bayesian optimization with hierarchical Gaussian processes. (arXiv:2212.10538v1 [cs.LG])
    Bayesian optimization (BO), while proved highly effective for many black-box function optimization tasks, requires practitioners to carefully select priors that well model their functions of interest. Rather than specifying by hand, researchers have investigated transfer learning based methods to automatically learn the priors, e.g. multi-task BO (Swersky et al., 2013), few-shot BO (Wistuba and Grabocka, 2021) and HyperBO (Wang et al., 2022). However, those prior learning methods typically assume that the input domains are the same for all tasks, weakening their ability to use observations on functions with different domains or generalize the learned priors to BO on different search spaces. In this work, we present HyperBO+: a pre-training approach for hierarchical Gaussian processes that enables the same prior to work universally for Bayesian optimization on functions with different domains. We propose a two-step pre-training method and analyze its appealing asymptotic properties and benefits to BO both theoretically and empirically. On real-world hyperparameter tuning tasks that involve multiple search spaces, we demonstrate that HyperBO+ is able to generalize to unseen search spaces and achieves lower regrets than competitive baselines.
    lilGym: Natural Language Visual Reasoning with Reinforcement Learning. (arXiv:2211.01994v2 [cs.LG] UPDATED)
    We present lilGym, a new benchmark for language-conditioned reinforcement learning in visual environments. lilGym is based on 2,661 highly-compositional human-written natural language statements grounded in an interactive visual environment. We introduce a new approach for exact reward computation in every possible world state by annotating all statements with executable Python programs. Each statement is paired with multiple start states and reward functions to form thousands of distinct Markov Decision Processes of varying difficulty. We experiment with lilGym with different models and learning regimes. Our results and analysis show that while existing methods are able to achieve non-trivial performance, lilGym forms a challenging open problem. lilGym is available at https://lil.nlp.cornell.edu/lilgym/.
    GLUE-X: Evaluating Natural Language Understanding Models from an Out-of-distribution Generalization Perspective. (arXiv:2211.08073v2 [cs.CL] UPDATED)
    Pre-trained language models (PLMs) are known to improve the generalization performance of natural language understanding models by leveraging large amounts of data during the pre-training phase. However, the out-of-distribution (OOD) generalization problem remains a challenge in many NLP tasks, limiting the real-world deployment of these methods. This paper presents the first attempt at creating a unified benchmark named GLUE-X for evaluating OOD robustness in NLP models, highlighting the importance of OOD robustness and providing insights on how to measure the robustness of a model and how to improve it. The benchmark includes 13 publicly available datasets for OOD testing, and evaluations are conducted on 8 classic NLP tasks over 19 popularly used PLMs. Our findings confirm the need for improved OOD accuracy in NLP tasks, as significant performance degradation was observed in all settings compared to in-distribution (ID) accuracy.
    In-situ animal behavior classification using knowledge distillation and fixed-point quantization. (arXiv:2209.04130v2 [cs.LG] UPDATED)
    We explore the use of knowledge distillation (KD) for learning compact and accurate models that enable classification of animal behavior from accelerometry data on wearable devices. To this end, we take a deep and complex convolutional neural network, known as residual neural network (ResNet), as the teacher model. ResNet is specifically designed for multivariate time-series classification. We use ResNet to distill the knowledge of animal behavior classification datasets into soft labels, which consist of the predicted pseudo-probabilities of every class for each datapoint. We then use the soft labels to train our significantly less complex student models, which are based on the gated recurrent unit (GRU) and multilayer perceptron (MLP). The evaluation results using two real-world animal behavior classification datasets show that the classification accuracy of the student GRU-MLP models improves appreciably through KD, approaching that of the teacher ResNet model. To further reduce the computational and memory requirements of performing inference using the student models trained via KD, we utilize dynamic fixed-point quantization (DQ) through an appropriate modification of the computational graph of the considered models. We implement both unquantized and quantized versions of the developed KD-based models on the embedded systems of our purpose-built collar and ear tag devices to classify animal behavior in situ and in real time. Our evaluations corroborate the effectiveness of KD and DQ in improving the accuracy and efficiency of in-situ animal behavior classification.
    Well-definedness of Physical Law Learning: The Uniqueness Problem. (arXiv:2210.08342v7 [cs.LG] UPDATED)
Physical law learning is the ambitious attempt at automating the derivation of governing equations with the use of machine learning techniques. The current literature focuses however solely on the development of methods to achieve this goal, and a theoretical foundation is at present missing. This paper shall thus serve as a first step to build a comprehensive theoretical framework for learning physical laws, aiming to provide reliability to according algorithms. One key problem consists in the fact that the governing equations might not be uniquely determined by the given data. We will study this problem in the common situation that a physical law is described by an ordinary or partial differential equation. For various different classes of differential equations, we provide both necessary and sufficient conditions for a function to uniquely determine the differential equation which is governing the phenomenon. We then use our results to devise numerical algorithms to determine whether a function solves a differential equation uniquely. Finally, we provide extensive numerical experiments showing that our algorithms in combination with common approaches for learning physical laws indeed allow to guarantee that a unique governing differential equation is learnt, without assuming any knowledge about the function, thereby ensuring reliability.
    AskewSGD : An Annealed interval-constrained Optimisation method to train Quantized Neural Networks. (arXiv:2211.03741v2 [stat.ML] UPDATED)
    In this paper, we develop a new algorithm, Annealed Skewed SGD - AskewSGD - for training deep neural networks (DNNs) with quantized weights. First, we formulate the training of quantized neural networks (QNNs) as a smoothed sequence of interval-constrained optimization problems. Then, we propose a new first-order stochastic method, AskewSGD, to solve each constrained optimization subproblem. Unlike algorithms with active sets and feasible directions, AskewSGD avoids projections or optimization under the entire feasible set and allows iterates that are infeasible. The numerical complexity of AskewSGD is comparable to existing approaches for training QNNs, such as the straight-through gradient estimator used in BinaryConnect, or other state of the art methods (ProxQuant, LUQ). We establish convergence guarantees for AskewSGD (under general assumptions for the objective function). Experimental results show that the AskewSGD algorithm performs better than or on par with state of the art methods in classical benchmarks.
    Uncertainty Quantification for Deep Unrolling-Based Computational Imaging. (arXiv:2207.00698v2 [eess.IV] UPDATED)
    Deep unrolling is an emerging deep learning-based image reconstruction methodology that bridges the gap between model-based and purely deep learning-based image reconstruction methods. Although deep unrolling methods achieve state-of-the-art performance for imaging problems and allow the incorporation of the observation model into the reconstruction process, they do not provide any uncertainty information about the reconstructed image, which severely limits their use in practice, especially for safety-critical imaging applications. In this paper, we propose a learning-based image reconstruction framework that incorporates the observation model into the reconstruction task and that is capable of quantifying epistemic and aleatoric uncertainties, based on deep unrolling and Bayesian neural networks. We demonstrate the uncertainty characterization capability of the proposed framework on magnetic resonance imaging and computed tomography reconstruction problems. We investigate the characteristics of the epistemic and aleatoric uncertainty information provided by the proposed framework to motivate future research on utilizing uncertainty information to develop more accurate, robust, trustworthy, uncertainty-aware, learning-based image reconstruction and analysis methods for imaging problems. We show that the proposed framework can provide uncertainty information while achieving comparable reconstruction performance to state-of-the-art deep unrolling methods.
    FRAME: Evaluating Rationale-Label Consistency Metrics for Free-Text Rationales. (arXiv:2207.00779v2 [cs.CL] CROSS LISTED)
    Following how humans communicate, free-text rationales aim to use natural language to explain neural language model (LM) behavior. However, free-text rationales' unconstrained nature makes them prone to hallucination, so it is important to have metrics for free-text rationale quality. Existing free-text rationale metrics measure how consistent the rationale is with the LM's predicted label, but there is no protocol for assessing such metrics' reliability. Thus, we propose FRAME, a framework for evaluating rationale-label consistency (RLC) metrics for free-text rationales. FRAME is based on three axioms: (1) good metrics should yield highest scores for reference rationales, which maximize RLC by construction; (2) good metrics should be appropriately sensitive to semantic perturbation of rationales; and (3) good metrics should be robust to variation in the LM's task performance. Across three text classification datasets, we show that existing RLC metrics cannot satisfy all three FRAME axioms, since they are implemented via model pretraining which muddles the metric's signal. Then, we introduce a non-pretraining RLC metric that greatly outperforms baselines on (1) and (3), while performing competitively on (2). Finally, we discuss the limitations of using RLC to evaluate free-text rationales.
    Physics Informed Symbolic Networks. (arXiv:2207.06240v2 [cs.LG] UPDATED)
We introduce Physics Informed Symbolic Networks (PISN) which utilize physics-informed loss to obtain a symbolic solution for a system of Partial Differential Equations (PDE). Given a context-free grammar to describe the language of symbolic expressions, we propose to use weighted sum as continuous approximation for selection of a production rule. We use this approximation to define multilayer symbolic networks. We consider Kovasznay flow (Navier-Stokes) and two-dimensional viscous Burgers' equations to illustrate that PISN are able to provide a performance comparable to PINNs across various state-of-the-art advances: multiple outputs and governing equations, domain-decomposition, hypernetworks. Furthermore, we propose Physics-informed Neurosymbolic Networks (PINSN) which employ a multilayer perceptron (MLP) operator to model the residue of symbolic networks. PINSNs are observed to give 2-3 orders of performance gain over standard PINN.
    SciMED: A Computational Framework For Physics-Informed Symbolic Regression with Scientist-In-The-Loop. (arXiv:2209.06257v2 [cs.LG] UPDATED)
    Discovering a meaningful symbolic expression that explains experimental data is a fundamental challenge in many scientific fields. We present a novel, open-source computational framework called Scientist-Machine Equation Detector (SciMed), which integrates scientific discipline wisdom in a scientist-in-the-loop approach, with state-of-the-art symbolic regression (SR) methods. SciMed combines a wrapper selection method, that is based on a genetic algorithm, with automatic machine learning and two levels of SR methods. We test SciMed on five configurations of a settling sphere, with and without aerodynamic non-linear drag force, and with excessive noise in the measurements. We show that SciMed is sufficiently robust to discover the correct physically meaningful symbolic expressions from the data, and demonstrate how the integration of domain knowledge enhances its performance. Our results indicate better performance on these tasks than the state-of-the-art SR software packages, even in cases where no knowledge is integrated. Moreover, we demonstrate how SciMed can alert the user about possible missing features, unlike the majority of current SR systems.
    A general approximation lower bound in $L^p$ norm, with applications to feed-forward neural networks. (arXiv:2206.04360v2 [cs.LG] UPDATED)
We study the fundamental limits to the expressive power of neural networks. Given two sets $F$, $G$ of real-valued functions, we first prove a general lower bound on how well functions in $F$ can be approximated in $L^p(\mu)$ norm by functions in $G$, for any $p \geq 1$ and any probability measure $\mu$. The lower bound depends on the packing number of $F$, the range of $F$, and the fat-shattering dimension of $G$. We then instantiate this bound to the case where $G$ corresponds to a piecewise-polynomial feed-forward neural network, and describe in details the application to two sets $F$: H{\"o}lder balls and multivariate monotonic functions. Besides matching (known or new) upper bounds up to log factors, our lower bounds shed some light on the similarities or differences between approximation in $L^p$ norm or in sup norm, solving an open question by DeVore et al. (2021). Our proof strategy differs from the sup norm case and uses a key probability result of Mendelson (2002).
    DDIPNet and DDIPNet+: Discriminant Deep Image Prior Networks for Remote Sensing Image Classification. (arXiv:2212.10411v1 [cs.CV])
    Research on remote sensing image classification significantly impacts essential human routine tasks such as urban planning and agriculture. Nowadays, the rapid advance in technology and the availability of many high-quality remote sensing images create a demand for reliable automation methods. The current paper proposes two novel deep learning-based architectures for image classification purposes, i.e., the Discriminant Deep Image Prior Network and the Discriminant Deep Image Prior Network+, which combine Deep Image Prior and Triplet Networks learning strategies. Experiments conducted over three well-known public remote sensing image datasets achieved state-of-the-art results, evidencing the effectiveness of using deep image priors for remote sensing image classification.
    Investigating Bayesian optimization for expensive-to-evaluate black box functions: Application in fluid dynamics. (arXiv:2207.09154v2 [cs.LG] UPDATED)
    Bayesian optimization provides an effective method to optimize expensive-to-evaluate black box functions. It has been widely applied to problems in many fields, including notably in computer science, e.g. in machine learning to optimize hyperparameters of neural networks, and in engineering, e.g. in fluid dynamics to optimize control strategies that maximize drag reduction. This paper empirically studies and compares the performance and the robustness of common Bayesian optimization algorithms on a range of synthetic test functions to provide general guidance on the design of Bayesian optimization algorithms for specific problems. It investigates the choice of acquisition function, the effect of different numbers of training samples, the exact and Monte Carlo based calculation of acquisition functions, and both single-point and multi-point optimization. The test functions considered cover a wide selection of challenges and therefore serve as an ideal test bed to understand the performance of Bayesian optimization to specific challenges, and in general. To illustrate how these findings can be used to inform a Bayesian optimization setup tailored to a specific problem, two simulations in the area of computational fluid dynamics are optimized, giving evidence that suitable solutions can be found in a small number of evaluations of the objective function for complex, real problems. The results of our investigation can similarly be applied to other areas, such as machine learning and physical experiments, where objective functions are expensive to evaluate and their mathematical expressions are unknown.
    Diffusion Models in Vision: A Survey. (arXiv:2209.04747v3 [cs.CV] UPDATED)
    Denoising diffusion models represent a recent emerging topic in computer vision, demonstrating remarkable results in the area of generative modeling. A diffusion model is a deep generative model that is based on two stages, a forward diffusion stage and a reverse diffusion stage. In the forward diffusion stage, the input data is gradually perturbed over several steps by adding Gaussian noise. In the reverse stage, a model is tasked at recovering the original input data by learning to gradually reverse the diffusion process, step by step. Diffusion models are widely appreciated for the quality and diversity of the generated samples, despite their known computational burdens, i.e. low speeds due to the high number of steps involved during sampling. In this survey, we provide a comprehensive review of articles on denoising diffusion models applied in vision, comprising both theoretical and practical contributions in the field. First, we identify and present three generic diffusion modeling frameworks, which are based on denoising diffusion probabilistic models, noise conditioned score networks, and stochastic differential equations. We further discuss the relations between diffusion models and other deep generative models, including variational auto-encoders, generative adversarial networks, energy-based models, autoregressive models and normalizing flows. Then, we introduce a multi-perspective categorization of diffusion models applied in computer vision. Finally, we illustrate the current limitations of diffusion models and envision some interesting directions for future research.
    Asynchronous Distributed Bilevel Optimization. (arXiv:2212.10048v1 [cs.LG])
    Bilevel optimization plays an essential role in many machine learning tasks, ranging from hyperparameter optimization to meta-learning. Existing studies on bilevel optimization, however, focus on either centralized or synchronous distributed setting. The centralized bilevel optimization approaches require collecting massive amount of data to a single server, which inevitably incur significant communication expenses and may give rise to data privacy risks. Synchronous distributed bilevel optimization algorithms, on the other hand, often face the straggler problem and will immediately stop working if a few workers fail to respond. As a remedy, we propose Asynchronous Distributed Bilevel Optimization (ADBO) algorithm. The proposed ADBO can tackle bilevel optimization problems with both nonconvex upper-level and lower-level objective functions, and its convergence is theoretically guaranteed. Furthermore, it is revealed through theoretic analysis that the iteration complexity of ADBO to obtain the $\epsilon$-stationary point is upper bounded by $\mathcal{O}(\frac{1}{{{\epsilon ^2}}})$. Thorough empirical studies on public datasets have been conducted to elucidate the effectiveness and efficiency of the proposed ADBO.
    Deterministic Sequencing of Exploration and Exploitation for Reinforcement Learning. (arXiv:2209.05408v3 [cs.LG] UPDATED)
    We propose Deterministic Sequencing of Exploration and Exploitation (DSEE) algorithm with interleaving exploration and exploitation epochs for model-based RL problems that aim to simultaneously learn the system model, i.e., a Markov decision process (MDP), and the associated optimal policy. During exploration, DSEE explores the environment and updates the estimates for expected reward and transition probabilities. During exploitation, the latest estimates of the expected reward and transition probabilities are used to obtain a robust policy with high probability. We design the lengths of the exploration and exploitation epochs such that the cumulative regret grows as a sub-linear function of time.
    Analyzing Transformers in Embedding Space. (arXiv:2209.02535v2 [cs.CL] UPDATED)
    Understanding Transformer-based models has attracted significant attention, as they lie at the heart of recent technological advances across machine learning. While most interpretability methods rely on running models over inputs, recent work has shown that a zero-pass approach, where parameters are interpreted directly without a forward/backward pass is feasible for some Transformer parameters, and for two-layer attention networks. In this work, we present a theoretical analysis where all parameters of a trained Transformer are interpreted by projecting them into the embedding space, that is, the space of vocabulary items they operate on. We derive a simple theoretical framework to support our arguments and provide ample evidence for its validity. First, an empirical analysis showing that parameters of both pretrained and fine-tuned models can be interpreted in embedding space. Second, we present two applications of our framework: (a) aligning the parameters of different models that share a vocabulary, and (b) constructing a classifier without training by ``translating'' the parameters of a fine-tuned classifier to parameters of a different model that was only pretrained. Overall, our findings open the door to interpretation methods that, at least in part, abstract away from model specifics and operate in the embedding space only.
    Analysis of Distributed Deep Learning in the Cloud. (arXiv:2208.14344v2 [cs.LG] UPDATED)
    We aim to resolve this problem by introducing a comprehensive distributed deep learning (DDL) profiler, which can determine the various execution "stalls" that DDL suffers from while running on a public cloud. We have implemented the profiler by extending prior work to additionally estimate two types of communication stalls - interconnect and network stalls. We train popular DNN models using the profiler to characterize various AWS GPU instances and list their advantages and shortcomings for users to make an informed decision. We observe that the more expensive GPU instances may not be the most performant for all DNN models and AWS may sub-optimally allocate hardware interconnect resources. Specifically, the intra-machine interconnect can introduce communication overheads up to 90% of DNN training time and network-connected instances can suffer from up to 5x slowdown compared to training on a single instance. Further, we model the impact of DNN macroscopic features such as the number of layers and the number of gradients on communication stalls. Finally, we propose a measurement-based recommendation model for users to lower their public cloud monetary costs for DDL, given a time budget.
    MILAN: Masked Image Pretraining on Language Assisted Representation. (arXiv:2208.06049v3 [cs.CV] UPDATED)
    Self-attention based transformer models have been dominating many computer vision tasks in the past few years. Their superb model qualities heavily depend on the excessively large labeled image datasets. In order to reduce the reliance on large labeled datasets, reconstruction based masked autoencoders are gaining popularity, which learn high quality transferable representations from unlabeled images. For the same purpose, recent weakly supervised image pretraining methods explore language supervision from text captions accompanying the images. In this work, we propose masked image pretraining on language assisted representation, dubbed as MILAN. Instead of predicting raw pixels or low level features, our pretraining objective is to reconstruct the image features with substantial semantic signals that are obtained using caption supervision. Moreover, to accommodate our reconstruction target, we propose a more effective prompting decoder architecture and a semantic aware mask sampling mechanism, which further advance the transfer performance of the pretrained model. Experimental results demonstrate that MILAN delivers higher accuracy than the previous works. When the masked autoencoder is pretrained and finetuned on ImageNet-1K dataset with an input resolution of 224x224, MILAN achieves a top-1 accuracy of 85.4% on ViT-Base, surpassing previous state-of-the-arts by 1%. In the downstream semantic segmentation task, MILAN achieves 52.7 mIoU using ViT-Base on ADE20K dataset, outperforming previous masked pretraining results by 4 points.
    Deep Reinforcement Learning for Turbulence Modeling in Large Eddy Simulations. (arXiv:2206.11038v2 [physics.flu-dyn] UPDATED)
    Over the last years, supervised learning (SL) has established itself as the state-of-the-art for data-driven turbulence modeling. In the SL paradigm, models are trained based on a dataset, which is typically computed a priori from a high-fidelity solution by applying the respective filter function, which separates the resolved and the unresolved flow scales. For implicitly filtered large eddy simulation (LES), this approach is infeasible, since here, the employed discretization itself acts as an implicit filter function. As a consequence, the exact filter form is generally not known and thus, the corresponding closure terms cannot be computed even if the full solution is available. The reinforcement learning (RL) paradigm can be used to avoid this inconsistency by training not on a previously obtained training dataset, but instead by interacting directly with the dynamical LES environment itself. This allows to incorporate the potentially complex implicit LES filter into the training process by design. In this work, we apply a reinforcement learning framework to find an optimal eddy-viscosity for implicitly filtered large eddy simulations of forced homogeneous isotropic turbulence. For this, we formulate the task of turbulence modeling as an RL task with a policy network based on convolutional neural networks that adapts the eddy-viscosity in LES dynamically in space and time based on the local flow state only. We demonstrate that the trained models can provide long-term stable simulations and that they outperform established analytical models in terms of accuracy. In addition, the models generalize well to other resolutions and discretizations. We thus demonstrate that RL can provide a framework for consistent, accurate and stable turbulence modeling especially for implicitly filtered LES.
    Learned Systems Security. (arXiv:2212.10318v1 [cs.CR])
    A learned system uses machine learning (ML) internally to improve performance. We can expect such systems to be vulnerable to some adversarial-ML attacks. Often, the learned component is shared between mutually-distrusting users or processes, much like microarchitectural resources such as caches, potentially giving rise to highly-realistic attacker models. However, compared to attacks on other ML-based systems, attackers face a level of indirection as they cannot interact directly with the learned model. Additionally, the difference between the attack surface of learned and non-learned versions of the same system is often subtle. These factors obfuscate the de-facto risks that the incorporation of ML carries. We analyze the root causes of potentially-increased attack surface in learned systems and develop a framework for identifying vulnerabilities that stem from the use of ML. We apply our framework to a broad set of learned systems under active development. To empirically validate the many vulnerabilities surfaced by our framework, we choose 3 of them and implement and evaluate exploits against prominent learned-system instances. We show that the use of ML caused leakage of past queries in a database, enabled a poisoning attack that causes exponential memory blowup in an index structure and crashes it in seconds, and enabled index users to snoop on each other's key distributions by timing queries over their own keys. We find that adversarial ML is a universal threat against learned systems, point to open research gaps in our understanding of learned-systems security, and conclude by discussing mitigations, while noting that data leakage is inherent in systems whose learned component is shared between multiple parties.
    Generalized Simultaneous Perturbation Stochastic Approximation with Reduced Estimator Bias. (arXiv:2212.10477v1 [cs.LG])
    We present in this paper a family of generalized simultaneous perturbation stochastic approximation (G-SPSA) estimators that estimate the gradient of the objective using noisy function measurements, but where the number of function measurements and the form of the gradient estimator are guided by the desired estimator bias. In particular, estimators with more function measurements are seen to result in lower bias. We provide a convergence analysis of the generalized SPSA algorithm, and point to possible future directions.
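    For context, the classical two-measurement SPSA estimator that this family generalizes can be sketched in a few lines (a minimal illustration; variable names and step-size schedules are ours, not the paper's):

        import numpy as np

        def spsa_gradient(f, theta, c=0.01, rng=None):
            # Two-measurement SPSA: perturb all coordinates at once with a
            # Rademacher vector and form a finite-difference estimate.
            rng = rng or np.random.default_rng()
            delta = rng.choice([-1.0, 1.0], size=theta.shape)
            return (f(theta + c * delta) - f(theta - c * delta)) / (2.0 * c * delta)

        # Illustrative use on a noisy quadratic objective.
        f = lambda th: float(np.sum(th ** 2) + 0.01 * np.random.randn())
        theta = np.ones(5)
        for t in range(1, 1001):
            theta -= (0.1 / t) * spsa_gradient(f, theta, c=0.01 / t ** 0.101)

    Estimators in the generalized family trade additional function measurements per iteration for lower bias in this estimate.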
    Tackling Data Scarcity with Transfer Learning: A Case Study of Thickness Characterization from Optical Spectra of Perovskite Thin Films. (arXiv:2207.02209v2 [cs.LG] UPDATED)
    Transfer learning is increasingly becoming an important tool for handling the data scarcity often encountered in machine learning. For high-throughput thickness characterization as a downstream process of the high-throughput optimization of optoelectronic thin films with autonomous workflows, data scarcity occurs especially for new materials. To achieve high-throughput thickness characterization, we propose a machine learning model called thicknessML that predicts thickness from UV-Vis spectrophotometry input, together with an overarching transfer learning workflow. We demonstrate the transfer learning workflow from a generic source domain of band-gapped materials to a specific target domain of perovskite materials, where the target domain data come only from a limited number (18) of refractive indices from the literature. The target domain can be easily extended to other material classes with a few literature data. Defining thickness prediction accuracy as within-10% deviation, thicknessML achieves 92.2% accuracy (with a standard deviation of 3.6%) with transfer learning, compared to 81.8% (with a standard deviation of 11.7%) without (lower mean and larger standard deviation). Experimental validation on six deposited perovskite films also corroborates the efficacy of the proposed workflow, yielding a 10.5% mean absolute percentage error (MAPE).
    PoissonMat: Remodeling Matrix Factorization using Poisson Distribution and Solving the Cold Start Problem without Input Data. (arXiv:2212.10460v1 [cs.IR])
    Matrix factorization has been one of the most successful recommender system techniques over the past decade. However, the classic probabilistic framework for matrix factorization is modeled using normal distributions. To find better probabilistic models, algorithms such as RankMat, ZeroMat and DotMat have been invented in recent years. In this paper, we model user rating behavior in recommender systems as a Poisson process, and design an algorithm that relies on no input data to solve the recommendation problem and the cold start issue at the same time. We demonstrate the superiority of our algorithm in comparison with matrix factorization, random placement, Zipf placement, ZeroMat, DotMat, etc.
    Does unsupervised grammar induction need pixels?. (arXiv:2212.10564v1 [cs.CL])
    Are extralinguistic signals such as image pixels crucial for inducing constituency grammars? While past work has shown substantial gains from multimodal cues, we investigate whether such gains persist in the presence of rich information from large language models (LLMs). We find that our approach, LLM-based C-PCFG (LC-PCFG), outperforms previous multi-modal methods on the task of unsupervised constituency parsing, achieving state-of-the-art performance on a variety of datasets. Moreover, LC-PCFG results in an over 50% reduction in parameter count, and speedups in training time of 1.7x for image-aided models and of more than 5x for video-aided models. These results challenge the notion that extralinguistic signals such as image pixels are needed for unsupervised grammar induction, and point to the need for better text-only baselines when evaluating the need for multi-modality in this task.
    Mini-Model Adaptation: Efficiently Extending Pretrained Models to New Languages via Aligned Shallow Training. (arXiv:2212.10503v1 [cs.CL])
    Prior work has shown that it is possible to expand pretrained Masked Language Models (MLMs) to new languages by learning a new set of embeddings, while keeping the transformer body frozen. Despite learning a small subset of parameters, this approach is not compute-efficient, as training the new embeddings requires a full forward and backward pass over the entire model. In this work, we propose mini-model adaptation, a compute-efficient alternative that builds a shallow mini-model from a fraction of a large model's parameters. New language-specific embeddings can then be efficiently trained over the mini-model, and plugged into the aligned large model for rapid cross-lingual transfer. We explore two approaches to learn mini-models: MiniJoint, which jointly pretrains the primary model and the mini-model using a single transformer with a secondary MLM head at a middle layer; and MiniPost, where we start from a regular pretrained model and build a mini-model by extracting and freezing a few layers and learning a small number of parameters on top. Experiments on XNLI, MLQA and PAWS-X show that mini-model adaptation matches the performance of the standard approach using up to 2.4x less compute.
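    A rough sketch of the MiniPost construction as we read it, using generic PyTorch modules rather than an actual pretrained MLM (the layer count, dimensions, and function names are illustrative assumptions):

        import torch.nn as nn

        def build_mini_model(encoder_layers, n_shallow=4, d_model=768, vocab_size=32000):
            # Reuse and freeze the bottom layers of the pretrained encoder ...
            shallow = nn.ModuleList(list(encoder_layers)[:n_shallow])
            for p in shallow.parameters():
                p.requires_grad = False
            # ... and train only new-language embeddings plus an MLM head on top.
            embeddings = nn.Embedding(vocab_size, d_model)
            mlm_head = nn.Linear(d_model, vocab_size)
            return embeddings, shallow, mlm_head

    Because forward and backward passes only traverse the shallow mini-model, training the new embeddings costs a fraction of a full-model pass; the learned embeddings are then plugged into the aligned large model.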
    Federated Learning via Inexact ADMM. (arXiv:2204.10607v2 [math.OC] UPDATED)
    One of the crucial issues in federated learning is how to develop efficient optimization algorithms. Most current ones require full device participation and/or impose strong assumptions for convergence. Departing from the widely-used gradient descent-based algorithms, in this paper we develop an inexact alternating direction method of multipliers (ADMM), which is both computation- and communication-efficient, capable of combating the stragglers' effect, and convergent under mild conditions. Furthermore, it shows strong numerical performance compared with several state-of-the-art algorithms for federated learning.
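    As a hedged illustration of the general idea (not the paper's exact updates), one round of inexact consensus ADMM in which each device takes only a few local gradient steps on its augmented Lagrangian:

        import numpy as np

        def admm_round(local_grads, x, lam, z, rho=1.0, lr=0.1, local_steps=3):
            # local_grads: list of K gradient oracles for the local losses f_k.
            # x: (K, d) local models; lam: (K, d) duals; z: (d,) global model.
            for k, grad_fk in enumerate(local_grads):
                for _ in range(local_steps):  # inexact primal update: a few gradient steps
                    g = grad_fk(x[k]) + lam[k] + rho * (x[k] - z)
                    x[k] -= lr * g
            z = (x + lam / rho).mean(axis=0)  # server aggregation
            lam += rho * (x - z)              # dual ascent
            return x, lam, z

    The inexact local solve is what keeps per-round computation low; stragglers can simply skip a round without breaking the dual updates.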
    ADAS: A Simple Active-and-Adaptive Baseline for Cross-Domain 3D Semantic Segmentation. (arXiv:2212.10390v1 [cs.CV])
    State-of-the-art 3D semantic segmentation models are trained on off-the-shelf public benchmarks, but often face a major challenge when deployed to a new domain. In this paper, we propose an Active-and-Adaptive Segmentation (ADAS) baseline to enhance the weak cross-domain generalization ability of a well-trained 3D segmentation model and bridge the point distribution gap between domains. Specifically, before the cross-domain adaptation stage begins, ADAS performs an active sampling operation to select a maximally-informative subset from both source and target domains for effective adaptation, reducing the adaptation difficulty under 3D scenarios. Benefiting from the rise of multi-modal 2D-3D datasets, ADAS utilizes a cross-modal attention-based feature fusion module that extracts a representative pair of image features and point features to achieve bi-directional image-point feature interaction for safer adaptation. Experimentally, ADAS is verified to be effective in many cross-domain settings, including: 1) Unsupervised Domain Adaptation (UDA), where all samples from the target domain are unlabeled; 2) Unsupervised Few-shot Domain Adaptation (UFDA), where only a few unlabeled samples are available from the target domain; 3) Active Domain Adaptation (ADA), where the target samples selected by ADAS are manually annotated. The results demonstrate that ADAS achieves a significant accuracy gain and can be easily coupled with self-training methods or off-the-shelf UDA works.
    Next Period Recommendation Reality Check. (arXiv:2110.05589v3 [cs.LG] UPDATED)
    Over the past decade, tremendous progress has been made in Recommender Systems (RecSys) for well-known tasks such as next-item and next-basket prediction. By contrast, the recently proposed next-period recommendation (NPR) task has received far less attention. Current NPR works are mostly built around distinct problem formulations, methods, and proprietary datasets, making solutions difficult to reproduce. In this article, we aim to fill the gap in RecSys methods evaluation on the NPR task using publicly available datasets: we (1) introduce TTRS, a large-scale financial transactions dataset suitable for RecSys methods evaluation; and (2) benchmark popular RecSys approaches on several datasets for the NPR task. In our analysis, we found a strong repetitive consumption pattern in several real-world datasets. With this setup, our results suggest that the evaluated RecSys methods still struggle to generalize beyond the repetitive nature of the data, and their performance on novel-item prediction remains questionable.
    Parsel: A Unified Natural Language Framework for Algorithmic Reasoning. (arXiv:2212.10561v1 [cs.CL])
    Despite recent success in large language model (LLM) reasoning, LLMs still struggle with hierarchical multi-step reasoning, like generating complex programs. In these cases, humans often start with a high-level algorithmic design and implement each part gradually. We introduce Parsel, a framework enabling automatic implementation and validation of complex algorithms with code LLMs, based on hierarchical function descriptions in natural language. Parsel can be used across domains requiring hierarchical reasoning, e.g., code synthesis, theorem proving, and robotic planning. We demonstrate Parsel's capabilities by using it to generate complex programs that cannot currently be automatically implemented from a single description, and by backtranslating Python programs in the APPS dataset. Beyond modeling capabilities, Parsel allows problem-solving with high-level algorithmic designs, benefiting both students and professional programmers.
    Data Augmentation on Graphs: A Survey. (arXiv:2212.09970v1 [cs.LG])
    In recent years, graph representation learning has achieved remarkable success while suffering from low-quality data problems. As a mature technology for improving data quality in computer vision, data augmentation has also attracted increasing attention in the graph domain. To promote the development of this emerging research direction, in this survey we comprehensively review and summarize existing graph data augmentation (GDAug) techniques. Specifically, we first summarize a variety of feasible taxonomies and then classify existing GDAug studies based on fine-grained graph elements. Furthermore, for each type of GDAug technique, we formalize the general definition, discuss the technical details, and give schematic illustrations. In addition, we summarize common performance metrics and specific design metrics for constructing a GDAug evaluation system. Finally, we summarize the applications of GDAug at both the data and model levels, as well as future directions.
    Scene Change Detection Using Multiscale Cascade Residual Convolutional Neural Networks. (arXiv:2212.10417v1 [cs.CV])
    Scene change detection is an image processing problem concerned with partitioning the pixels of a digital image into foreground and background regions. Many visual knowledge-based computer intelligent systems, such as traffic monitoring, video surveillance, and anomaly detection, rely on change detection techniques. Amongst the most prominent detection methods are the learning-based ones, which, besides sharing similar training and testing protocols, differ from each other in their architecture design strategies. Such architecture design directly impacts the quality of the detection results, as well as device resource requirements, such as memory. In this work, we propose a novel Multiscale Cascade Residual Convolutional Neural Network that integrates a multiscale processing strategy, through a Residual Processing Module, with a Segmentation Convolutional Neural Network. Experiments conducted on two different datasets support the effectiveness of the proposed approach, achieving average overall $\boldsymbol{F\text{-}measure}$ results of $\boldsymbol{0.9622}$ and $\boldsymbol{0.9664}$ over the Change Detection 2014 and PetrobrasROUTES datasets respectively, while comprising approximately eight times fewer parameters. These results place the proposed technique amongst the top four state-of-the-art scene change detection methods.
    Counterfactual Risk Assessments under Unmeasured Confounding. (arXiv:2212.09844v1 [econ.EM])
    Statistical risk assessments inform consequential decisions such as pretrial release in criminal justice, and loan approvals in consumer finance. Such risk assessments make counterfactual predictions, predicting the likelihood of an outcome under a proposed decision (e.g., what would happen if we approved this loan?). A central challenge, however, is that there may have been unmeasured confounders that jointly affected past decisions and outcomes in the historical data. This paper proposes a tractable mean outcome sensitivity model that bounds the extent to which unmeasured confounders could affect outcomes on average. The mean outcome sensitivity model partially identifies the conditional likelihood of the outcome under the proposed decision, popular predictive performance metrics (e.g., accuracy, calibration, TPR, FPR), and commonly-used predictive disparities. We derive their sharp identified sets, and we then solve three tasks that are essential to deploying statistical risk assessments in high-stakes settings. First, we propose a doubly-robust learning procedure for the bounds on the conditional likelihood of the outcome under the proposed decision. Second, we translate our estimated bounds on the conditional likelihood of the outcome under the proposed decision into a robust, plug-in decision-making policy. Third, we develop doubly-robust estimators of the bounds on the predictive performance of an existing risk assessment.
    Recycling diverse models for out-of-distribution generalization. (arXiv:2212.10445v1 [cs.LG])
    Foundation models are redefining how AI systems are built. Practitioners now follow a standard procedure to build their machine learning solutions: download a copy of a foundation model, and fine-tune it using some in-house data about the target task of interest. Consequently, the Internet is now awash with copies of a handful of foundation models, each fine-tuned on one of many diverse tasks. Yet, these individual fine-tunings often lack strong generalization and exist in isolation without benefiting from each other. In our opinion, this is a missed opportunity, as these specialized models contain diverse features. Based on this insight, we propose model recycling, a simple strategy that leverages multiple fine-tunings of the same foundation model on diverse auxiliary tasks, and repurposes them as rich and diverse initializations for the target task. Specifically, model recycling fine-tunes each specialized model in parallel on the target task, and then averages the weights of all target fine-tunings into a final model. Empirically, we show that model recycling maximizes model diversity by benefiting from diverse auxiliary tasks, and achieves a new state of the art on the reference DomainBed benchmark for out-of-distribution generalization. Looking forward, model recycling is a contribution to the emerging paradigm of updatable machine learning where, akin to open-source software development, the community collaborates to incrementally and reliably update machine learning models.
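    The averaging step at the heart of model recycling can be sketched in a few lines of PyTorch (a simplified reading of the procedure; the paper's handling of fine-tuning and model selection may differ):

        import copy
        import torch

        def recycle(finetuned_models):
            # Average the weights of several models fine-tuned on the target
            # task, all derived from the same foundation model (so parameters
            # are aligned and share names and shapes).
            avg = copy.deepcopy(finetuned_models[0])
            with torch.no_grad():
                for name, param in avg.named_parameters():
                    stacked = torch.stack(
                        [dict(m.named_parameters())[name] for m in finetuned_models])
                    param.copy_(stacked.mean(dim=0))
            return avg

    Weight averaging only makes sense here because all fine-tunings start from the same foundation model, keeping the parameters in a shared basin.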
    RangeAugment: Efficient Online Augmentation with Range Learning. (arXiv:2212.10553v1 [cs.CV])
    State-of-the-art automatic augmentation methods (e.g., AutoAugment and RandAugment) for visual recognition tasks diversify training data using a large set of augmentation operations. The range of magnitudes of many augmentation operations (e.g., brightness and contrast) is continuous. Therefore, to make search computationally tractable, these methods use fixed and manually-defined magnitude ranges for each operation, which may lead to sub-optimal policies. To answer the open question on the importance of magnitude ranges for each augmentation operation, we introduce RangeAugment, which allows us to efficiently learn the range of magnitudes for individual as well as composite augmentation operations. RangeAugment uses an auxiliary loss based on image similarity as a measure to control the range of magnitudes of augmentation operations. As a result, RangeAugment has a single scalar parameter for search, image similarity, which we simply optimize via linear search. RangeAugment integrates seamlessly with any model and learns model- and task-specific augmentation policies. With extensive experiments on the ImageNet dataset across different networks, we show that RangeAugment achieves performance competitive with state-of-the-art automatic augmentation methods while using 4-5 times fewer augmentation operations. Experimental results on semantic segmentation, object detection, foundation models, and knowledge distillation further show RangeAugment's effectiveness.
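    A minimal sketch of the single-parameter control idea as we read it; the cosine similarity measure, penalty form, and all names below are our assumptions, not the paper's exact formulation:

        import torch.nn.functional as F

        def rangeaugment_loss(model, x, x_aug, y, target_sim=0.7, weight=0.1):
            # Task loss plus an auxiliary penalty that steers augmented images
            # toward a target similarity with their clean versions, implicitly
            # controlling the magnitude range of the augmentation operations.
            task_loss = F.cross_entropy(model(x_aug), y)
            sim = F.cosine_similarity(x_aug.flatten(1), x.flatten(1)).mean()
            return task_loss + weight * (sim - target_sim) ** 2

    The single scalar target_sim then plays the role of the one search parameter optimized via linear search.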
    ReCode: Robustness Evaluation of Code Generation Models. (arXiv:2212.10264v1 [cs.LG])
    Code generation models have achieved impressive performance. However, they tend to be brittle as slight edits to a prompt could lead to very different generations; these robustness properties, critical for user experience when deployed in real-life applications, are not well understood. Most existing works on robustness in text or code tasks have focused on classification, while robustness in generation tasks is an uncharted area and to date there is no comprehensive benchmark for robustness in code generation. In this paper, we propose ReCode, a comprehensive robustness evaluation benchmark for code generation models. We customize over 30 transformations specifically for code on docstrings, function and variable names, code syntax, and code format. They are carefully designed to be natural in real-life coding practice, preserve the original semantic meaning, and thus provide multifaceted assessments of a model's robustness performance. With human annotators, we verified that over 90% of the perturbed prompts do not alter the semantic meaning of the original prompt. In addition, we define robustness metrics for code generation models considering the worst-case behavior under each type of perturbation, taking advantage of the fact that executing the generated code can serve as objective evaluation. We demonstrate ReCode on SOTA models using HumanEval, MBPP, as well as function completion tasks derived from them. Interesting observations include: better robustness for CodeGen over InCoder and GPT-J; models are most sensitive to syntax perturbations; more challenging robustness evaluation on MBPP over HumanEval.
    Cell-Free Data Power Control Via Scalable Multi-Objective Bayesian Optimisation. (arXiv:2212.10299v1 [eess.SY])
    Cell-free multi-user multiple input multiple output networks are a promising alternative to classical cellular architectures, since they have the potential to provide uniform service quality and high resource utilisation over the entire coverage area of the network. To realise this potential, previous works have developed radio resource management mechanisms using various optimisation engines. In this work, we consider the problem of overall ergodic spectral efficiency maximisation in the context of uplink-downlink data power control in cell-free networks. To solve this problem in large networks, and to address convergence-time limitations, we apply scalable multi-objective Bayesian optimisation. Furthermore, we discuss how an intersection of multi-fidelity emulation and Bayesian optimisation can improve radio resource management in cell-free networks.
    ModularFed: Leveraging Modularity in Federated Learning Frameworks. (arXiv:2212.10427v1 [cs.DC])
    Numerous recent studies have proposed integrating Federated Learning (FL) to address the privacy concerns of using machine learning in privacy-sensitive firms. However, the standards of the available frameworks can no longer sustain the rapid advancement of the field and hinder the integration of FL solutions. In this paper, we propose ModularFed, a research-focused framework that addresses the complexity of FL implementations and the lack of adaptability and extendability in the available frameworks. We provide a comprehensive architecture that assists FL approaches through well-defined protocols covering three dominant FL paradigms: adaptable workflow, dataset distribution, and third-party application support. Within this architecture, protocols are blueprints that strictly define the design of the framework's components, contribute to its flexibility, and strengthen its infrastructure. Further, our protocols aim to enable modularity in FL, supporting third-party plug-and-play architecture and dynamic simulators coupled with major built-in data distributors in the field. Additionally, the framework supports wrapping multiple approaches in a single environment to enable consistent replication of FL issues such as client deficiency, data distribution, and network latency, enabling a fair comparison of the techniques underlying FL technologies. In our evaluation, we examine the applicability of our framework in three major FL domains, including statistical distribution and modular-based approaches for resource monitoring and client selection.
    Open-World Entity Segmentation. (arXiv:2107.14228v3 [cs.CV] UPDATED)
    We introduce a new image segmentation task, called Entity Segmentation (ES), which aims to segment all visual entities (objects and stuff) in an image without predicting their semantic labels. By removing the need for class label prediction, models trained for this task can focus more on improving segmentation quality. It has many practical applications, such as image manipulation and editing, where the quality of segmentation masks is crucial but class labels are less important. We conduct the first-ever study to investigate the feasibility of convolutional center-based representation to segment things and stuff in a unified manner, and show that such representation fits exceptionally well in the context of ES. More specifically, we propose a CondInst-like fully-convolutional architecture with two novel modules specifically designed to exploit the class-agnostic and non-overlapping requirements of ES. Experiments show that the models designed and trained for ES significantly outperform popular class-specific panoptic segmentation models in terms of segmentation quality. Moreover, an ES model can be easily trained on a combination of multiple datasets without the need to resolve label conflicts in dataset merging, and the model trained for ES on one or more datasets can generalize very well to other test datasets of unseen domains. The code has been released at https://github.com/dvlab-research/Entity/.
    Improving the Robustness of Summarization Models by Detecting and Removing Input Noise. (arXiv:2212.09928v1 [cs.CL])
    The evaluation of abstractive summarization models typically uses test data that is identically distributed as training data. In real-world practice, documents to be summarized may contain input noise caused by text extraction artifacts or data pipeline bugs. The robustness of model performance under distribution shift caused by such noise is relatively under-studied. We present a large empirical study quantifying the sometimes severe loss in performance (up to 12 ROUGE-1 points) from different types of input noise for a range of datasets and model sizes. We then propose a light-weight method for detecting and removing such noise in the input during model inference without requiring any extra training, auxiliary models, or even prior knowledge of the type of noise. Our proposed approach effectively mitigates the loss in performance, recovering a large fraction of the performance drop, sometimes as large as 11 ROUGE-1 points.
    Quantum causal inference in the presence of hidden common causes: An entropic approach. (arXiv:2104.13227v2 [quant-ph] UPDATED)
    Quantum causality is an emerging field of study which has the potential to greatly advance our understanding of quantum systems. In this paper, we put forth a theoretical framework for merging quantum information science and causal inference by exploiting entropic principles. For this purpose, we leverage the tradeoff between the entropy of hidden cause and the conditional mutual information of observed variables to develop a scalable algorithmic approach for inferring causality in the presence of latent confounders (common causes) in quantum systems. As an application, we consider a system of three entangled qubits and transmit the second and third qubits over separate noisy quantum channels. In this model, we validate that the first qubit is a latent confounder and the common cause of the second and third qubits. In contrast, when two entangled qubits are prepared and one of them is sent over a noisy channel, there is no common confounder. We also demonstrate that the proposed approach outperforms the results of classical causal inference for the Tübingen database when the variables are classical by exploiting quantum dependence between variables through density matrices rather than joint probability distributions. Thus, the proposed approach unifies classical and quantum causal inference in a principled way.
    Dynamic Sparse Network for Time Series Classification: Learning What to "see". (arXiv:2212.09840v1 [cs.LG])
    The receptive field (RF), which determines the region of a time series to be "seen" and used, is critical for improving the performance of time series classification (TSC). However, the variation of signal scales across and within time series data makes it challenging to decide on proper RF sizes for TSC. In this paper, we propose a dynamic sparse network (DSN) with sparse connections for TSC, which can learn to cover various RF sizes without cumbersome hyperparameter tuning. The kernels in each sparse layer are sparse and can be explored under constraint regions by dynamic sparse training, which makes it possible to reduce the resource cost. The experimental results show that the proposed DSN model can achieve state-of-the-art performance on both univariate and multivariate TSC datasets with less than 50\% of the computational cost of recent baseline methods, opening the path towards more accurate resource-aware methods for time series analyses. Our code is publicly available at: https://github.com/QiaoXiao7282/DSN.
    Panoptic Lifting for 3D Scene Understanding with Neural Fields. (arXiv:2212.09802v1 [cs.CV])
    We propose Panoptic Lifting, a novel approach for learning panoptic 3D volumetric representations from images of in-the-wild scenes. Once trained, our model can render color images together with 3D-consistent panoptic segmentation from novel viewpoints. Unlike existing approaches which use 3D input directly or indirectly, our method requires only machine-generated 2D panoptic segmentation masks inferred from a pre-trained network. Our core contribution is a panoptic lifting scheme based on a neural field representation that generates a unified and multi-view consistent, 3D panoptic representation of the scene. To account for inconsistencies of 2D instance identifiers across views, we solve a linear assignment with a cost based on the model's current predictions and the machine-generated segmentation masks, thus enabling us to lift 2D instances to 3D in a consistent way. We further propose and ablate contributions that make our method more robust to noisy, machine-generated labels, including test-time augmentations for confidence estimates, segment consistency loss, bounded segmentation fields, and gradient stopping. Experimental results validate our approach on the challenging Hypersim, Replica, and ScanNet datasets, improving by 8.4, 13.8, and 10.6% in scene-level PQ over state of the art.
    Memory-efficient NLLB-200: Language-specific Expert Pruning of a Massively Multilingual Machine Translation Model. (arXiv:2212.09811v1 [cs.CL])
    Compared to conventional bilingual translation systems, massively multilingual machine translation is appealing because a single model can translate into multiple languages and benefit from knowledge transfer for low resource languages. On the other hand, massively multilingual models suffer from the curse of multilinguality unless their size is scaled up massively, which increases their training and inference costs. Sparse Mixture-of-Experts models are a way to drastically increase model capacity without the need for a proportional amount of computing. The recently released NLLB-200 is an example of such a model. It covers 202 languages but requires at least four 32GB GPUs just for inference. In this work, we propose a pruning method that allows the removal of up to 80\% of experts with a negligible loss in translation quality, which makes it feasible to run the model on a single 32GB GPU. Further analysis suggests that our pruning metrics allow us to identify language-specific experts and prune the experts that are not relevant for a given language pair.
    An Information-Theoretic Approach to Transferability in Task Transfer Learning. (arXiv:2212.10082v1 [cs.LG])
    Task transfer learning is a popular technique in image processing applications that uses pre-trained models to reduce the supervision cost of related tasks. An important question is to determine task transferability, i.e. given a common input domain, estimating to what extent representations learned from a source task can help in learning a target task. Typically, transferability is either measured experimentally or inferred through task relatedness, which is often defined without a clear operational meaning. In this paper, we present a novel metric, H-score, an easily-computable evaluation function that estimates the performance of transferred representations from one task to another in classification problems using statistical and information theoretic principles. Experiments on real image data show that our metric is not only consistent with the empirical transferability measurement, but also useful to practitioners in applications such as source model selection and task transfer curriculum learning.
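    The H-score admits a compact implementation; below is a sketch following the standard definition (trace of the inverse feature covariance times the covariance of class-conditional feature means), with variable names of our own choosing:

        import numpy as np

        def h_score(features, labels):
            # H(f) = tr( cov(f)^-1 cov(E[f|y]) ) for features f(X) and labels y.
            f = features - features.mean(axis=0)
            cov_f = np.cov(f, rowvar=False)
            # Replace each sample's features by its class-conditional mean.
            g = np.zeros_like(f)
            for y in np.unique(labels):
                g[labels == y] = f[labels == y].mean(axis=0)
            cov_g = np.cov(g, rowvar=False)
            return np.trace(np.linalg.pinv(cov_f) @ cov_g)

    A higher H-score for source features evaluated on target labels suggests better transferability of the source representation to the target task.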
    AdverSAR: Adversarial Search and Rescue via Multi-Agent Reinforcement Learning. (arXiv:2212.10064v1 [cs.RO])
    Search and Rescue (SAR) missions in remote environments often employ autonomous multi-robot systems that learn, plan, and execute a combination of local single-robot control actions, group primitives, and global mission-oriented coordination and collaboration. Often, SAR coordination strategies are manually designed by human experts who can remotely control the multi-robot system and enable semi-autonomous operations. However, in remote environments where connectivity is limited and human intervention is often not possible, decentralized collaboration strategies are needed for fully-autonomous operations. Nevertheless, decentralized coordination may be ineffective in adversarial environments due to sensor noise, actuation faults, or manipulation of inter-agent communication data. In this paper, we propose an algorithmic approach based on adversarial multi-agent reinforcement learning (MARL) that allows robots to efficiently coordinate their strategies in the presence of adversarial inter-agent communications. In our setup, the objective of the multi-robot team is to discover targets strategically in an obstacle-strewn geographical area by minimizing the average time needed to find the targets. It is assumed that the robots have no prior knowledge of the target locations, and they can interact with only a subset of neighboring robots at any time. Based on the centralized training with decentralized execution (CTDE) paradigm in MARL, we utilize a hierarchical meta-learning framework to learn dynamic team-coordination modalities and discover emergent team behavior under complex cooperative-competitive scenarios. The effectiveness of our approach is demonstrated on a collection of prototype grid-world environments with different specifications of benign and adversarial agents, target locations, and agent rewards.
    Denoising instrumented mouthguard measurements of head impact kinematics with a convolutional neural network. (arXiv:2212.09832v1 [cs.LG])
    Wearable sensors for measuring head kinematics can be noisy due to imperfect interfaces with the body. Mouthguards are used to measure head kinematics during impacts in traumatic brain injury (TBI) studies, but deviations from reference kinematics can still occur due to potential looseness. In this study, deep learning is used to compensate for the imperfect interface and improve measurement accuracy. A set of one-dimensional convolutional neural network (1D-CNN) models was developed to denoise mouthguard kinematics measurements along three spatial axes of linear acceleration and angular velocity. The denoised kinematics had significantly reduced errors compared to reference kinematics, and reduced errors in brain injury criteria and tissue strain and strain rate calculated via finite element modeling. The 1D-CNN models were also tested on an on-field dataset of college football impacts and a post-mortem human subject dataset, with similar denoising effects observed. The models can be used to improve detection of head impacts and TBI risk evaluation, and potentially extended to other sensors measuring kinematics.
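    A minimal PyTorch sketch of a 1D-CNN denoiser of the kind described, mapping a noisy kinematics channel to its denoised counterpart (the layer sizes and kernel widths are illustrative, not the paper's):

        import torch.nn as nn

        # One model per kinematics channel (e.g., one axis of linear acceleration);
        # input and output are fixed-length time series of shape (batch, 1, T).
        denoiser = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=7, padding=3), nn.ReLU(),
            nn.Conv1d(32, 32, kernel_size=7, padding=3), nn.ReLU(),
            nn.Conv1d(32, 1, kernel_size=7, padding=3),  # denoised signal
        )

    Training would minimize, e.g., mean squared error between the model output and reference kinematics recorded alongside the mouthguard measurements.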
    Posterior and Computational Uncertainty in Gaussian Processes. (arXiv:2205.15449v3 [cs.LG] UPDATED)
    Gaussian processes scale prohibitively with the size of the dataset. In response, many approximation methods have been developed, which inevitably introduce approximation error. This additional source of uncertainty, due to limited computation, is entirely ignored when using the approximate posterior. Therefore in practice, GP models are often as much about the approximation method as they are about the data. Here, we develop a new class of methods that provides consistent estimation of the combined uncertainty arising from both the finite number of data observed and the finite amount of computation expended. The most common GP approximations map to an instance in this class, such as methods based on the Cholesky factorization, conjugate gradients, and inducing points. For any method in this class, we prove (i) convergence of its posterior mean in the associated RKHS, (ii) decomposability of its combined posterior covariance into mathematical and computational covariances, and (iii) that the combined variance is a tight worst-case bound for the squared error between the method's posterior mean and the latent function. Finally, we empirically demonstrate the consequences of ignoring computational uncertainty and show how implicitly modeling it improves generalization performance on benchmark datasets.  ( 2 min )
    Group Meritocratic Fairness in Linear Contextual Bandits. (arXiv:2206.03150v3 [stat.ML] UPDATED)
    We study the linear contextual bandit problem where an agent has to select one candidate from a pool and each candidate belongs to a sensitive group. In this setting, candidates' rewards may not be directly comparable between groups, for example when the agent is an employer hiring candidates from different ethnic groups and some groups have a lower reward due to discriminatory bias and/or social injustice. We propose a notion of fairness that states that the agent's policy is fair when it selects a candidate with highest relative rank, which measures how good the reward is when compared to candidates from the same group. This is a very strong notion of fairness, since the relative rank is not directly observed by the agent and depends on the underlying reward model and on the distribution of rewards. Thus we study the problem of learning a policy which approximates a fair policy under the condition that the contexts are independent between groups and the distribution of rewards of each group is absolutely continuous. In particular, we design a greedy policy which at each round constructs a ridge regression estimate from the observed context-reward pairs, and then computes an estimate of the relative rank of each candidate using the empirical cumulative distribution function. We prove that, despite its simplicity and the lack of an initial exploration phase, the greedy policy achieves, up to log factors and with high probability, a fair pseudo-regret of order $\sqrt{dT}$ after $T$ rounds, where $d$ is the dimension of the context vectors. The policy also satisfies demographic parity at each round when averaged over all possible information available before the selection. Finally, we use simulated settings and experiments on the US census data to show that our policy achieves sub-linear fair pseudo-regret also in practice.  ( 3 min )
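    One round of the greedy policy as described (a ridge estimate plus an empirical-CDF relative rank) can be sketched as follows; the variable names and data layout are illustrative assumptions:

        import numpy as np

        def select_candidate(contexts, groups, X_hist, y_hist, rewards_by_group, alpha=1.0):
            # Ridge regression on all observed context-reward pairs.
            d = X_hist.shape[1]
            theta = np.linalg.solve(X_hist.T @ X_hist + alpha * np.eye(d), X_hist.T @ y_hist)
            ranks = []
            for x, g in zip(contexts, groups):
                r_hat = x @ theta
                past = rewards_by_group[g]  # rewards previously observed in group g
                # Empirical CDF value = estimated relative rank within the group.
                ranks.append((past <= r_hat).mean())
            return int(np.argmax(ranks))

    Selecting by within-group rank rather than raw estimated reward is what makes candidates comparable across groups with different reward distributions.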
    Multi-head Uncertainty Inference for Adversarial Attack Detection. (arXiv:2212.10006v1 [cs.LG])
    Deep neural networks (DNNs) are sensitive and susceptible to tiny perturbations from adversarial attacks, which cause erroneous predictions. Various methods, including adversarial defense and uncertainty inference (UI), have been developed in recent years to counter adversarial attacks. In this paper, we propose a multi-head uncertainty inference (MH-UI) framework for detecting adversarial attack examples. We adopt a multi-head architecture with multiple prediction heads (i.e., classifiers) to obtain predictions from different depths in the DNN, introducing shallow information for the UI. Using independent heads at different depths, the normalized predictions are assumed to follow the same Dirichlet distribution, and we estimate its distribution parameters by moment matching. Cognitive uncertainty introduced by adversarial attacks is reflected and amplified in this distribution. Experimental results show that the proposed MH-UI framework outperforms all the referenced UI methods in the adversarial attack detection task under different settings.  ( 2 min )
    When Federated Learning Meets Pre-trained Language Models' Parameter-Efficient Tuning Methods. (arXiv:2212.10025v1 [cs.LG])
    With increasing privacy concerns on data, recent studies have made significant progress using federated learning (FL) on privacy-sensitive natural language processing (NLP) tasks. Much literature suggests fully fine-tuning pre-trained language models (PLMs) in the FL paradigm can mitigate the data heterogeneity problem and close the performance gap with centralized training. However, large PLMs bring the curse of prohibitive communication overhead and local model adaptation costs for the FL system. To this end, we introduce various parameter-efficient tuning (PETuning) methods into federated learning. Specifically, we provide a holistic empirical study of representative PLMs tuning methods in FL. The experimental results cover the analysis of data heterogeneity levels, data scales, and different FL scenarios. Overall communication overhead can be significantly reduced by locally tuning and globally aggregating lightweight model parameters while maintaining acceptable performance in various FL settings. To facilitate the research of PETuning in FL, we also develop a federated tuning framework FedPETuning, which allows practitioners to exploit different PETuning methods under the FL training paradigm conveniently. The source code is available at \url{https://github.com/iezhuozhuo/FedETuning/tree/deltaTuning}.  ( 2 min )
    Variational Factorization Machines for Preference Elicitation in Large-Scale Recommender Systems. (arXiv:2212.09920v1 [cs.LG])
    Factorization machines (FMs) are a powerful tool for regression and classification in the context of sparse observations, which has been successfully applied to collaborative filtering, especially when side information over users or items is available. Bayesian formulations of FMs have been proposed to provide confidence intervals over the predictions made by the model; however, they usually involve Markov chain Monte Carlo methods that require many samples to provide accurate predictions, resulting in slow training in the context of large-scale data. In this paper, we propose a variational formulation of factorization machines that allows us to derive a simple objective that can be easily optimized using standard mini-batch stochastic gradient descent, making it amenable to large-scale data. Our algorithm learns an approximate posterior distribution over the user and item parameters, which leads to confidence intervals over the predictions. We show, using several datasets, that it has comparable or better performance than existing methods in terms of prediction accuracy, and provide some applications in active learning strategies, e.g., preference elicitation techniques.  ( 2 min )
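    For reference, the second-order FM predictor that the variational treatment builds on, written with the standard pairwise-interaction identity (the variational posterior over the parameters is not shown):

        import numpy as np

        def fm_predict(X, w0, w, V):
            # y = w0 + X w + 0.5 * sum_f [ (X V)_f^2 - (X^2)(V^2)_f ]
            # X: (n, d) sparse-ish features; w0: scalar; w: (d,); V: (d, k) factors.
            linear = w0 + X @ w
            pairwise = 0.5 * (((X @ V) ** 2) - (X ** 2) @ (V ** 2)).sum(axis=1)
            return linear + pairwise

    The identity turns the naive O(d^2) sum over feature pairs into an O(dk) computation, which is what makes mini-batch SGD over large sparse data practical.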
    Fixed-Weight Difference Target Propagation. (arXiv:2212.10352v1 [cs.NE])
    Target Propagation (TP) is a biologically more plausible algorithm than error backpropagation (BP) for training deep networks, and improving the practicality of TP is an open issue. TP methods require the feedforward and feedback networks to form layer-wise autoencoders for propagating the target values generated at the output layer. However, this causes certain drawbacks; e.g., careful hyperparameter tuning is required to synchronize the feedforward and feedback training, and the feedback path usually requires more frequent updates than the feedforward path. Learning both the feedforward and feedback networks is sufficient to make TP methods capable of training, but are these layer-wise autoencoders a necessary condition for TP to work? We answer this question by presenting Fixed-Weight Difference Target Propagation (FW-DTP), which keeps the feedback weights constant during training. We confirm that this simple method, which naturally resolves the abovementioned problems of TP, can still deliver informative target values to hidden layers for a given task; indeed, FW-DTP consistently achieves higher test performance than a baseline, Difference Target Propagation (DTP), on four classification datasets. We also present a novel propagation architecture that explains the exact form of the feedback function of DTP to analyze FW-DTP.
    Learning Subgrid-scale Models with Neural Ordinary Differential Equations. (arXiv:2212.09967v1 [math.NA])
    We propose a new approach, based on neural ordinary differential equations (NODEs), to learning subgrid-scale model effects when simulating partial differential equations (PDEs) solved by the method of lines and their representation in chaotic ordinary differential equations. Solving systems with fine temporal and spatial grid scales is an ongoing computational challenge, and closure models are generally difficult to tune. Machine learning approaches have increased the accuracy and efficiency of computational fluid dynamics solvers. In this approach, neural networks are used to learn the coarse- to fine-grid map, which can be viewed as a subgrid-scale parameterization. We propose a strategy that uses the NODE and partial knowledge to learn the source dynamics at a continuous level. Our method inherits the advantages of NODEs and can be used to parameterize subgrid scales, approximate coupling operators, and improve the efficiency of low-order solvers. Numerical results using the two-scale Lorenz 96 ODE and the convection-diffusion PDE illustrate this approach.
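    A minimal sketch of the NODE ingredient using the torchdiffeq package; the toy coarse dynamics, network sizes, and names below are placeholders of our own, not the paper's setup:

        import torch
        import torch.nn as nn
        from torchdiffeq import odeint

        def coarse_rhs(y):
            # Stand-in for a known low-order model term (illustrative only).
            return -0.5 * y

        class SubgridDynamics(nn.Module):
            # Learned right-hand side: known coarse dynamics plus a neural
            # correction playing the role of the subgrid-scale parameterization.
            def __init__(self, dim, hidden=64):
                super().__init__()
                self.correction = nn.Sequential(
                    nn.Linear(dim, hidden), nn.Tanh(), nn.Linear(hidden, dim))

            def forward(self, t, y):
                return coarse_rhs(y) + self.correction(y)

        # Integrate through the NODE; training would match fine-grid trajectories.
        func = SubgridDynamics(dim=8)
        y0 = torch.randn(8)
        t_grid = torch.linspace(0.0, 1.0, 20)
        y_pred = odeint(func, y0, t_grid)  # shape: (20, 8)

    Because the correction lives inside the ODE right-hand side, the learned source dynamics are continuous in time rather than tied to a fixed solver step.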
    Choice of training label matters: how to best use deep learning for quantitative MRI parameter estimation. (arXiv:2205.05587v2 [physics.med-ph] UPDATED)
    Deep learning (DL) is gaining popularity as a parameter estimation method for quantitative MRI. A range of competing implementations have been proposed, relying on either supervised or self-supervised learning. Self-supervised approaches, sometimes referred to as unsupervised, have been loosely based on auto-encoders, whereas supervised methods have, to date, been trained on groundtruth labels. These two learning paradigms have been shown to have distinct strengths. Notably, self-supervised approaches have offered lower-bias parameter estimates than their supervised alternatives. This result is counterintuitive - incorporating prior knowledge with supervised labels should, in theory, lead to improved accuracy. In this work, we show that this apparent limitation of supervised approaches stems from the naive choice of groundtruth training labels. By training on labels which are deliberately not groundtruth, we show that the low-bias parameter estimation previously associated with self-supervised methods can be replicated - and improved on - within a supervised learning framework. This approach sets the stage for a single, unifying, deep learning parameter estimation framework, based on supervised learning, where trade-offs between bias and variance are made by careful adjustment of training labels.  ( 2 min )
    A Meta-Learning Approach for Training Explainable Graph Neural Networks. (arXiv:2109.09426v2 [cs.LG] UPDATED)
    In this paper, we investigate the degree of explainability of graph neural networks (GNNs). Existing explainers work by finding global/local subgraphs to explain a prediction, but they are applied after a GNN has already been trained. Here, we propose a meta-learning framework for improving the level of explainability of a GNN directly at training time, by steering the optimization procedure towards what we call `interpretable minima'. Our framework (called MATE, MetA-Train to Explain) jointly trains a model to solve the original task, e.g., node classification, and to provide easily processable outputs for downstream algorithms that explain the model's decisions in a human-friendly way. In particular, we meta-train the model's parameters to quickly minimize the error of an instance-level GNNExplainer trained on-the-fly on randomly sampled nodes. The final internal representation relies upon a set of features that can be `better' understood by an explanation algorithm, e.g., another instance of GNNExplainer. Our model-agnostic approach can improve the explanations produced for different GNN architectures and use any instance-based explainer to drive this process. Experiments on synthetic and real-world datasets for node and graph classification show that we can produce models that are consistently easier to explain by different algorithms. Furthermore, this increase in explainability comes at no cost for the accuracy of the model.  ( 2 min )
    Delving into the Openness of CLIP. (arXiv:2206.01986v2 [cs.CV] UPDATED)
    Contrastive Language-Image Pre-training (CLIP) has demonstrated great potential in realizing open-vocabulary visual recognition in a matching style, due to its holistic use of natural language supervision that covers unconstrained real-world visual concepts. However, it is, in turn, also difficult to evaluate and analyze the openness of CLIP-like models, since they are in theory open to any vocabulary but the actual accuracy varies. To address the insufficiency of conventional studies on openness, we resort to an incremental perspective and define the extensibility, which essentially approximates the model's ability to deal with new visual concepts, by evaluating openness through vocabulary expansions. Our evaluation based on extensibility shows that CLIP-like models are hardly truly open and their performance degrades as the vocabulary expands to different degrees. Further analysis reveals that the over-estimation of openness is not because CLIP-like models fail to capture the general similarity of image and text features of novel visual concepts, but because of the confusion among competing text features, that is, they are not stable with respect to the vocabulary. In light of this, we propose to improve the openness of CLIP in the feature space by enforcing the distinguishability of text features. Our method retrieves relevant texts from the pre-training corpus to enhance prompts for inference, which boosts the extensibility and stability of CLIP even without fine-tuning.  ( 2 min )
    Revisiting Priority $k$-Center: Fairness and Outliers. (arXiv:2103.03337v2 [cs.DS] UPDATED)
    In the Priority $k$-Center problem, the input consists of a metric space $(X,d)$, an integer $k$, and for each point $v \in X$ a priority radius $r(v)$. The goal is to choose $k$ centers $S \subseteq X$ to minimize $\max_{v \in X} \frac{1}{r(v)} d(v,S)$. If all $r(v)$'s are uniform, one obtains the $k$-Center problem. Plesník [Plesník, Disc. Appl. Math. 1987] introduced the Priority $k$-Center problem and gave a $2$-approximation algorithm matching the best possible algorithm for $k$-Center. We show how the problem is related to two different notions of fair clustering [Harris et al., NeurIPS 2018; Jung et al., FORC 2020]. Motivated by these developments we revisit the problem and, in our main technical contribution, develop a framework that yields constant factor approximation algorithms for Priority $k$-Center with outliers. Our framework extends to generalizations of Priority $k$-Center to matroid and knapsack constraints, and as a corollary, also yields algorithms with fairness guarantees in the lottery model of Harris et al. [Harris et al., JMLR 2019].  ( 2 min )
    Machine Learning based Framework for Robust Price-Sensitivity Estimation with Application to Airline Pricing. (arXiv:2205.01875v2 [stat.ML] UPDATED)
    We consider the problem of dynamic pricing of a product in the presence of feature-dependent price sensitivity. Developing practical algorithms that can robustly estimate price elasticities to drive such automated pricing systems, especially when information about no purchases (losses) is not available, is a challenge faced by many industries. Based on the Poisson semi-parametric approach, we construct a flexible yet interpretable demand model where the price-related part is parametric while the remaining (nuisance) part of the model is non-parametric and can be modeled via sophisticated machine learning (ML) techniques. Estimating the price-sensitivity parameters of this model via direct one-stage regression techniques may lead to biased estimates due to regularization. To address this concern, we propose a two-stage estimation methodology which makes the estimation of the price-sensitivity parameters robust to biases in the estimators of the nuisance parameters of the model. In the first stage we construct estimators of observed purchases and prices given the feature vector using sophisticated ML estimators such as deep neural networks. Utilizing the estimators from the first stage, in the second stage we leverage a Bayesian dynamic generalized linear model to estimate the price-sensitivity parameters. We test the performance of the proposed estimation schemes on simulated and real sales transaction data from the airline industry. Our numerical studies demonstrate that the proposed two-stage approach reduces the estimation error in price-sensitivity parameters from 25\% to 4\% in realistic simulation settings. The two-stage estimation techniques proposed in this work allow practitioners to leverage modern ML techniques to robustly estimate price-sensitivities while still maintaining interpretability and allowing ease of validation of its various constituent parts.  ( 2 min )
    Out-of-sample scoring and automatic selection of causal estimators. (arXiv:2212.10076v1 [cs.LG])
    Recently, many causal estimators for Conditional Average Treatment Effect (CATE) and instrumental variable (IV) problems have been published and open sourced, allowing estimation of the granular impact of both randomized treatments (such as A/B tests) and of user choices on the outcomes of interest. However, the practical application of such models has been hampered by the lack of a valid way to score the performance of such models out of sample, in order to select the best one for a given application. We address that gap by proposing novel scoring approaches for both the CATE case and an important subset of instrumental variable problems, namely those where the instrumental variable is customer access to a product feature, and the treatment is the customer's choice to use that feature. Being able to score model performance out of sample allows us to apply hyperparameter optimization methods to causal model selection and tuning. We implement that in an open source package that relies on the DoWhy and EconML libraries for implementation of causal inference models (and also includes a Transformed Outcome model implementation), and on FLAML for hyperparameter optimization and for component models used in the causal models. We demonstrate on synthetic data that optimizing the proposed scores is a reliable method for choosing the model and its hyperparameter values, whose estimates are close to the true impact, in the randomized CATE and IV cases. Further, we provide examples of applying these methods to real customer data from Wise.  ( 2 min )
    Deduplicating Training Data Mitigates Privacy Risks in Language Models. (arXiv:2202.06539v3 [cs.CR] UPDATED)
    Past work has shown that large language models are susceptible to privacy attacks, where adversaries generate sequences from a trained model and detect which sequences are memorized from the training set. In this work, we show that the success of these attacks is largely due to duplication in commonly used web-scraped training sets. We first show that the rate at which language models regenerate training sequences is superlinearly related to a sequence's count in the training set. For instance, a sequence that is present 10 times in the training data is on average generated ~1000 times more often than a sequence that is present only once. We next show that existing methods for detecting memorized sequences have near-chance accuracy on non-duplicated training sequences. Finally, we find that after applying methods to deduplicate training data, language models are considerably more secure against these types of privacy attacks. Taken together, our results motivate an increased focus on deduplication in privacy-sensitive applications and a reevaluation of the practicality of existing privacy attacks.  ( 2 min )
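    The mitigation itself is simple to sketch; below is a toy exact-match deduplication over training sequences (production pipelines typically use approximate, substring-level matching rather than whole-sequence equality):

        from collections import Counter

        def deduplicate(sequences):
            # Keep a single copy of each exact-duplicate training sequence,
            # returning the unique sequences and their original counts.
            counts = Counter(sequences)
            return list(counts.keys()), counts

        train_seqs = ["the cat sat", "the cat sat", "a dog ran"]
        unique, counts = deduplicate(train_seqs)  # "the cat sat" appeared twice

    The superlinear relationship between a sequence's count and its regeneration rate is why removing even moderate duplication substantially reduces memorization-based attacks.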
    KINet: Keypoint Interaction Networks for Unsupervised Forward Modeling. (arXiv:2202.09006v2 [cs.CV] UPDATED)
    Object-centric representation is an essential abstraction for forward prediction. Most existing forward models learn this representation through extensive supervision (e.g., object class and bounding box) although such ground-truth information is not readily accessible in reality. To address this, we introduce KINet (Keypoint Interaction Network) -- an end-to-end unsupervised framework to reason about object interactions based on a keypoint representation. Using visual observations, our model learns to associate objects with keypoint coordinates and discovers a graph representation of the system as a set of keypoint embeddings and their relations. It then learns an action-conditioned forward model using contrastive estimation to predict future keypoint states. By learning to perform physical reasoning in the keypoint space, our model automatically generalizes to scenarios with a different number of objects, novel backgrounds, and unseen object geometries. Experiments demonstrate the effectiveness of our model in accurately performing forward prediction and learning plannable object-centric representations which can also be used in downstream robotic manipulation tasks.
    Visual Transformers for Primates Classification and Covid Detection. (arXiv:2212.10093v1 [cs.SD])
    We apply the vision transformer, a deep machine learning model built around the attention mechanism, to mel-spectrogram representations of raw audio recordings. When adding mel-based data augmentation techniques and sample-weighting, we achieve comparable performance on both (PRS and CCS challenge) tasks of ComParE21, outperforming most single-model baselines. We further introduce overlapping vertical patching and evaluate the influence of parameter configurations. Index Terms: audio classification, attention, mel-spectrogram, unbalanced datasets, computational paralinguistics
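    A minimal sketch of such an input pipeline, assuming librosa for the mel-spectrogram; the file name, patch width, and stride are illustrative placeholders rather than the authors' configuration.

        import numpy as np
        import librosa

        y, sr = librosa.load("recording.wav", sr=16000)    # hypothetical file
        mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
        log_mel = librosa.power_to_db(mel)                 # (n_mels, n_frames)

        def vertical_patches(spec, width=16, stride=8):
            """Full-height patches that overlap 50% along the time axis."""
            return np.stack([spec[:, i:i + width]
                             for i in range(0, spec.shape[1] - width + 1, stride)])

        patches = vertical_patches(log_mel)   # tokens for the transformer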
    Machine Learning and Polymer Self-Consistent Field Theory in Two Spatial Dimensions. (arXiv:2212.10478v1 [cond-mat.mtrl-sci])
    A computational framework that leverages data from self-consistent field theory simulations with deep learning to accelerate the exploration of parameter space for block copolymers is presented. This is a substantial two-dimensional extension of the framework introduced in [1]. Several innovations and improvements are proposed. (1) A Sobolev-space-trained convolutional neural network (CNN) is employed to handle the exponential dimension increase of the discretized local average monomer density fields and to strongly enforce both spatial translation and rotation invariance of the predicted field-theoretic intensive Hamiltonian. (2) A generative adversarial network (GAN) is introduced to efficiently and accurately predict saddle-point local average monomer density fields without resorting to gradient descent methods that employ the training set. This GAN approach yields important savings of both memory and computational cost. (3) The proposed machine learning framework is successfully applied to 2D cell size optimization as a clear illustration of its broad potential to accelerate the exploration of parameter space for discovering polymer nanostructures. Extensions to three-dimensional phase discovery appear to be feasible.
    Scheduling with Predictions. (arXiv:2212.10433v1 [cs.DS])
    There is significant interest in deploying machine learning algorithms for diagnostic radiology, as modern learning techniques have made it possible to detect abnormalities in medical images within minutes. While machine-assisted diagnoses cannot yet reliably replace human reviews of images by a radiologist, they could inform prioritization rules for determining the order by which to review patient cases so that patients with time-sensitive conditions could benefit from early intervention. We study this scenario by formulating it as a learning-augmented online scheduling problem. We are given information about each arriving patient's urgency level in advance, but these predictions are inevitably error-prone. In this formulation, we face the challenges of decision making under imperfect information, and of responding dynamically to prediction error as we observe better data in real-time. We propose a simple online policy and show that this policy is in fact the best possible in certain stylized settings. We also demonstrate that our policy achieves the two desiderata of online algorithms with predictions: consistency (performance improvement with prediction accuracy) and robustness (protection against the worst case). We complement our theoretical findings with empirical evaluations of the policy under settings that more accurately reflect clinical scenarios in the real world.
    Sophisticated deep learning with on-chip optical diffractive tensor processing. (arXiv:2212.09975v1 [cs.ET])
    The ever-growing deep learning technologies are making revolutionary changes for modern life. However, conventional computing architectures are designed to process sequential and digital programs and are extremely burdened by massive parallel and adaptive deep learning applications. Photonic integrated circuits provide an efficient approach to mitigate the bandwidth limitations and the power wall brought by their electronic counterparts, showing great potential in ultrafast and energy-efficient high-performance computing. Here, we propose an optical computing architecture enabled by on-chip diffraction to implement convolutional acceleration, termed the optical convolution unit (OCU). We demonstrate that any real-valued convolution kernels can be exploited by the OCU with a prominent computational throughput boost via the concept of structural re-parameterization. With the OCU as the fundamental unit, we build an optical convolutional neural network (oCNN) to implement two popular deep learning tasks: classification and regression. For classification, the Fashion-MNIST and CIFAR-4 datasets are tested with accuracies of 91.63% and 86.25%, respectively. For regression, we build an optical denoising convolutional neural network (oDnCNN) to handle Gaussian noise in grayscale images with noise level {\sigma} = 10, 15, 20, resulting in clean images with average PSNR of 31.70dB, 29.39dB and 27.72dB, respectively. The proposed OCU presents remarkable performance of low energy consumption and high information density due to its fully passive nature and compact footprint, providing a highly parallel yet lightweight solution for future computing architectures to handle high-dimensional tensors in deep learning.
    Neural Model Reprogramming with Similarity Based Mapping for Low-Resource Spoken Command Classification. (arXiv:2110.03894v3 [eess.AS] UPDATED)
    In this study, we propose a novel adversarial reprogramming (AR) approach for low-resource spoken command recognition (SCR), and build an AR-SCR system. The AR procedure aims to modify the acoustic signals (from the target domain) to repurpose a pretrained SCR model (from the source domain). To solve the label mismatches between source and target domains, and further improve the stability of AR, we propose a novel similarity-based label mapping technique to align classes. In addition, the transfer learning (TL) technique is combined with the original AR process to improve the model adaptation capability. We evaluate the proposed AR-SCR system on three low-resource SCR datasets, including Arabic, Lithuanian, and dysarthric Mandarin speech. Experimental results show that with a pretrained AM trained on a large-scale English dataset, the proposed AR-SCR system outperforms the current state-of-the-art results on Arabic and Lithuanian speech commands datasets, with only a limited amount of training data.
    Large Language Models Are Reasoning Teachers. (arXiv:2212.10071v1 [cs.CL])
    Language models (LMs) have demonstrated remarkable performance on downstream tasks, using in-context exemplars or human instructions. Recent works have shown that chain-of-thought (CoT) prompting can elicit models to solve complex reasoning tasks, step-by-step. However, the efficacy of prompt-based CoT methods is restricted to very large LMs such as GPT-3 (175B), thus limiting deployability. In this paper, we revisit the fine-tuning approach to enable complex reasoning in smaller LMs, optimized to efficiently perform a specific task. We propose Fine-tune-CoT, a method that leverages the capabilities of very large LMs to generate reasoning samples and teach smaller models via fine-tuning. We evaluate our method on publicly available LMs across a wide range of complex tasks and model sizes. We find that Fine-tune-CoT enables substantial reasoning capability in small models, whereas previous prompt-based baselines exhibit near-random performance. Student models can even outperform the teacher in some tasks while reducing model size requirements by several orders of magnitude. We conduct extensive ablations and sample studies to understand the reasoning capabilities of student models. We also identify several important nuances that have been overlooked in concurrent fine-tuning works on CoT and address them in our analysis.
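    Schematically, the Fine-tune-CoT recipe can be sketched as below; teacher_generate is a hypothetical wrapper around a very large LM, and the data format is illustrative only.

        def build_finetune_cot_data(questions, answers, teacher_generate):
            """Distill chain-of-thought from a large teacher into fine-tuning
            pairs for a small student; `teacher_generate` is a hypothetical
            function that returns (rationale, predicted_answer)."""
            samples = []
            for q, a in zip(questions, answers):
                rationale, prediction = teacher_generate(
                    q + "\nLet's think step by step.")
                if prediction == a:          # keep only correct reasoning
                    samples.append({"prompt": q,
                                    "completion": rationale + " --> " + a})
            return samples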
    Shared Certificates for Neural Network Verification. (arXiv:2109.00542v3 [cs.LG] UPDATED)
    Existing neural network verifiers compute a proof that each input is handled correctly under a given perturbation by propagating a symbolic abstraction of reachable values at each layer. This process is repeated from scratch independently for each input (e.g., image) and perturbation (e.g., rotation), leading to an expensive overall proof effort when handling an entire dataset. In this work, we introduce a new method for reducing this verification cost without losing precision based on a key insight that abstractions obtained at intermediate layers for different inputs and perturbations can overlap or contain each other. Leveraging our insight, we introduce the general concept of shared certificates, enabling proof effort reuse across multiple inputs to reduce overall verification costs. We perform an extensive experimental evaluation to demonstrate the effectiveness of shared certificates in reducing the verification cost on a range of datasets and attack specifications on image classifiers including the popular patch and geometric perturbations. We release our implementation at https://github.com/eth-sri/proof-sharing.
    Roto-translated Local Coordinate Frames For Interacting Dynamical Systems. (arXiv:2110.14961v2 [cs.LG] UPDATED)
    Modelling interactions is critical in learning complex dynamical systems, namely systems of interacting objects with highly non-linear and time-dependent behaviour. A large class of such systems can be formalized as $\textit{geometric graphs}$, $\textit{i.e.}$, graphs with nodes positioned in the Euclidean space given an $\textit{arbitrarily}$ chosen global coordinate system, for instance vehicles in a traffic scene. Notwithstanding the arbitrary global coordinate system, the governing dynamics of the respective dynamical systems are invariant to rotations and translations, also known as $\textit{Galilean invariance}$. As ignoring these invariances leads to worse generalization, in this work we propose local coordinate frames per node-object to induce roto-translation invariance to the geometric graph of the interacting dynamical system. Further, the local coordinate frames allow for a natural definition of anisotropic filtering in graph neural networks. Experiments in traffic scenes, 3D motion capture, and colliding particles demonstrate that the proposed approach comfortably outperforms the recent state-of-the-art.
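    The core construction can be sketched in 2D: express all positions in a frame that is translated to a node and rotated to align with its velocity, which removes the dependence on the arbitrary global coordinate system. Dimensionality and the integration into the GNN are simplified here.

        import numpy as np

        def to_local_frame(pos, vel, i):
            """All node positions in node i's roto-translated frame (2D)."""
            theta = np.arctan2(vel[i, 1], vel[i, 0])   # heading of node i
            c, s = np.cos(-theta), np.sin(-theta)
            R = np.array([[c, -s], [s, c]])            # rotation by -theta
            return (pos - pos[i]) @ R.T                # translate, then rotate

        pos, vel = np.random.randn(5, 2), np.random.randn(5, 2)
        local = to_local_frame(pos, vel, i=0)   # node 0's invariant view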
    PairReranker: Pairwise Reranking for Natural Language Generation. (arXiv:2212.10555v1 [cs.CL])
    Pre-trained language models have been successful in natural language generation (NLG) tasks. While various decoding methods have been employed, they often produce suboptimal results. We first present an empirical analysis of three NLG tasks: summarization, machine translation, and constrained text generation. We find that selecting the best output from the results of multiple decoding methods can significantly improve performance. To further improve reranking for NLG tasks, we propose a novel method, \textsc{PairReranker}, which uses a single encoder and a pairwise loss function to jointly encode a source input and a pair of candidates and compare them. Experiments on three NLG tasks demonstrate the effectiveness and flexibility of \textsc{PairReranker}, showing strong results compared with previous baselines. In addition, our \textsc{PairReranker} can generalize to significantly improve GPT-3 (text-davinci-003) results (e.g., 24.55\% on CommonGen and 11.35\% on WMT18 zh-en), even though our rerankers are not trained with any GPT-3 candidates.
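    A schematic of the pairwise comparison, with the text encoder left as a placeholder; the joint input layout and the scoring head here are illustrative assumptions rather than the paper's exact architecture.

        import torch
        import torch.nn as nn

        class PairwiseReranker(nn.Module):
            """Jointly encode (source, candidate_a, candidate_b) and predict
            which candidate is better; `encoder` is any text encoder that
            pools a token sequence into a vector (placeholder here)."""
            def __init__(self, encoder, hidden):
                super().__init__()
                self.encoder, self.head = encoder, nn.Linear(hidden, 1)

            def forward(self, joint_inputs):   # e.g. "src [SEP] a [SEP] b"
                return self.head(self.encoder(joint_inputs)).squeeze(-1)

        def pairwise_loss(logits, labels):
            """labels[i] = 1.0 if candidate a beats candidate b on the metric."""
            return nn.functional.binary_cross_entropy_with_logits(logits, labels)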
    Is Semantic Communications Secure? A Tale of Multi-Domain Adversarial Attacks. (arXiv:2212.10438v1 [cs.CR])
    Semantic communications seeks to transfer information from a source while conveying a desired meaning to its destination. We model the transmitter-receiver functionalities as an autoencoder followed by a task classifier that evaluates the meaning of the information conveyed to the receiver. The autoencoder consists of an encoder at the transmitter to jointly model source coding, channel coding, and modulation, and a decoder at the receiver to jointly model demodulation, channel decoding and source decoding. By augmenting the reconstruction loss with a semantic loss, the two deep neural networks (DNNs) of this encoder-decoder pair are interactively trained with the DNN of the semantic task classifier. This approach effectively captures the latent feature space and reliably transfers compressed feature vectors with a small number of channel uses while keeping the semantic loss low. We identify the multi-domain security vulnerabilities of using DNNs for semantic communications. Based on adversarial machine learning, we introduce test-time (targeted and non-targeted) adversarial attacks on the DNNs by manipulating their inputs at different stages of semantic communications. As a computer vision attack, small perturbations are injected into the images at the input of the transmitter's encoder. As a wireless attack, small perturbation signals are transmitted to interfere with the input of the receiver's decoder. By launching these stealth attacks individually or, more effectively, in a combined form as a multi-domain attack, we show that it is possible to change the semantics of the transferred information even when the reconstruction loss remains low. These multi-domain adversarial attacks pose a serious threat to the semantics of information transfer (with larger impact than conventional jamming) and raise the need for defense methods for the safe adoption of semantic communications.
    Adaptivity for clustering-based reduced-order modeling of localized history-dependent phenomena. (arXiv:2109.11897v2 [math.NA] UPDATED)
    This paper proposes a novel Adaptive Clustering-based Reduced-Order Modeling (ACROM) framework to significantly improve and extend the recent family of clustering-based reduced-order models (CROMs). This adaptive framework enables the clustering-based domain decomposition to evolve dynamically throughout the problem solution, ensuring optimum refinement in regions where the relevant fields present steeper gradients. It offers a new route to fast and accurate material modeling of history-dependent nonlinear problems involving highly localized plasticity and damage phenomena. The overall approach is composed of three main building blocks: target clusters selection criterion, adaptive cluster analysis, and computation of cluster interaction tensors. In addition, an adaptive clustering solution rewinding procedure and a dynamic adaptivity split factor strategy are suggested to further enhance the adaptive process. The coined Adaptive Self-Consistent Clustering Analysis (ASCA) is shown to perform better than its static counterpart when capturing the multi-scale elasto-plastic behavior of a particle-matrix composite and predicting the associated fracture and toughness. Given the encouraging results shown in this paper, the ACROM framework sets the stage and opens new avenues to explore adaptivity in the context of CROMs.
    On the Convergence of Policy Gradient in Robust MDPs. (arXiv:2212.10439v1 [cs.LG])
    Robust Markov decision processes (RMDPs) are promising models that provide reliable policies under ambiguities in model parameters. As opposed to nominal Markov decision processes (MDPs), however, the state-of-the-art solution methods for RMDPs are limited to value-based methods, such as value iteration and policy iteration. This paper proposes Double-Loop Robust Policy Gradient (DRPG), the first generic policy gradient method for RMDPs with a global convergence guarantee in tabular problems. Unlike value-based methods, DRPG does not rely on dynamic programming techniques. In particular, the inner-loop robust policy evaluation problem is solved via projected gradient descent. Finally, our experimental results demonstrate the performance of our algorithm and verify our theoretical guarantees.
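    To illustrate the inner loop, the sketch below takes one projected-gradient step that makes a next-state distribution more adversarial with respect to the current value estimates, projecting back onto the probability simplex; the paper's actual ambiguity sets and projections differ.

        import numpy as np

        def project_to_simplex(v):
            """Euclidean projection of v onto the probability simplex."""
            u = np.sort(v)[::-1]
            css = np.cumsum(u)
            rho = np.nonzero(u + (1.0 - css) / np.arange(1, len(v) + 1) > 0)[0][-1]
            theta = (1.0 - css[rho]) / (rho + 1.0)
            return np.maximum(v + theta, 0.0)

        def worst_case_step(p, value, lr=0.1):
            """One projected-gradient step: make the next-state distribution
            p more adversarial w.r.t. the current state values."""
            return project_to_simplex(p - lr * value)   # descend on E_p[value]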
    Cross-modal Attention Congruence Regularization for Vision-Language Relation Alignment. (arXiv:2212.10549v1 [cs.CL])
    Despite recent progress towards scaling up multimodal vision-language models, these models are still known to struggle on compositional generalization benchmarks such as Winoground. We find that a critical component lacking from current vision-language models is relation-level alignment: the ability to match directional semantic relations in text (e.g., "mug in grass") with spatial relationships in the image (e.g., the position of the mug relative to the grass). To tackle this problem, we show that relation alignment can be enforced by encouraging the directed language attention from 'mug' to 'grass' (capturing the semantic relation 'in') to match the directed visual attention from the mug to the grass. Tokens and their corresponding objects are softly identified using the cross-modal attention. We prove that this notion of soft relation alignment is equivalent to enforcing congruence between vision and language attention matrices under a 'change of basis' provided by the cross-modal attention matrix. Intuitively, our approach projects visual attention into the language attention space to calculate its divergence from the actual language attention, and vice versa. We apply our Cross-modal Attention Congruence Regularization (CACR) loss to UNITER and improve on the state-of-the-art approach to Winoground.
    QuantArt: Quantizing Image Style Transfer Towards High Visual Fidelity. (arXiv:2212.10431v1 [cs.CV])
    Existing style transfer algorithms work by minimizing a hybrid loss function to push the generated image toward high similarities in both content and style. However, this type of approach cannot guarantee visual fidelity, i.e., that the generated artworks are indistinguishable from real ones. In this paper, we devise a new style transfer framework called QuantArt for high visual-fidelity stylization. QuantArt pushes the latent representation of the generated artwork toward the centroids of the real artwork distribution with vector quantization. By fusing the quantized and continuous latent representations, QuantArt allows flexible control over the generated artworks in terms of content preservation, style similarity, and visual fidelity. Experiments on various style transfer settings show that our QuantArt framework achieves significantly higher visual fidelity compared with the existing style transfer methods.
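    The quantization step at the heart of this framework can be sketched in the standard VQ-VAE style (nearest-centroid lookup with a straight-through gradient); the fusion of quantized and continuous latents is omitted here.

        import torch

        def quantize(z, codebook):
            """z: (batch, dim) latents; codebook: (K, dim) centroids fit to
            the real-artwork distribution. The straight-through estimator
            keeps gradients flowing to the encoder."""
            idx = torch.cdist(z, codebook).argmin(dim=1)   # nearest centroid
            z_q = codebook[idx]
            return z + (z_q - z).detach()   # forward: z_q, backward: identity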
    Wind Power Scenario Generation Using Graph Convolutional Generative Adversarial Network. (arXiv:2212.10454v1 [cs.LG])
    Generating wind power scenarios is very important for studying the impacts of multiple wind farms that are interconnected to the grid. We develop a graph convolutional generative adversarial network (GCGAN) approach by leveraging the GAN's capability to generate a large number of realistic scenarios without using statistical modeling. Unlike existing GAN-based wind power data generation approaches, we design the GAN's hidden layers to match the underlying spatial and temporal characteristics. We advocate using graph filters to embed the spatial correlation among multiple wind farms, and a one-dimensional (1D) convolutional layer to represent the temporal feature filters. The proposed graph and feature filter designs significantly reduce the GAN model complexity, leading to improved training efficiency and reduced computational complexity. Numerical results using real wind power data from Australia demonstrate that the scenarios generated by the proposed GCGAN exhibit more realistic spatial and temporal statistics than other GAN-based outputs.
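    A graph filter of the kind advocated here can be sketched as a polynomial in the (normalized) adjacency matrix, so each wind farm mixes information from its K-hop neighbours; the filter order and weights below are illustrative.

        import numpy as np

        def graph_filter(A, x, theta):
            """y = sum_k theta[k] A^k x: each wind farm aggregates features
            from neighbours up to len(theta)-1 hops away."""
            y, Akx = np.zeros(x.shape), x.copy()
            for t in theta:
                y += t * Akx
                Akx = A @ Akx
            return y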
    Efficient and Sound Differentiable Programming in a Functional Array-Processing Language. (arXiv:2212.10307v1 [cs.PL])
    Automatic differentiation (AD) is a technique for computing the derivative of a function represented by a program. This technique is considered the de facto standard for computing differentiation in many machine learning and optimisation software tools. Despite the practicality of this technique, the performance of the differentiated programs, especially for functional languages and in the presence of vectors, is suboptimal. We present an AD system for a higher-order functional array-processing language. The core functional language underlying this system simultaneously supports both source-to-source forward-mode AD and global optimisations such as loop transformations. In combination, gradient computation with forward-mode AD can be as efficient as reverse mode, and the Jacobian matrices required for numerical algorithms such as Gauss-Newton and Levenberg-Marquardt can be efficiently computed.
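    For readers unfamiliar with forward-mode AD, a minimal dual-number sketch of the underlying idea (independent of the paper's array language):

        class Dual:
            """Forward-mode AD via dual numbers: carry (value, derivative)."""
            def __init__(self, val, dot=0.0):
                self.val, self.dot = val, dot
            def __add__(self, other):
                return Dual(self.val + other.val, self.dot + other.dot)
            def __mul__(self, other):
                return Dual(self.val * other.val,
                            self.val * other.dot + self.dot * other.val)

        def f(x):                      # f(x) = x*x + x
            return x * x + x

        fx = f(Dual(3.0, 1.0))         # seed derivative dx/dx = 1
        print(fx.val, fx.dot)          # 12.0 7.0, since f'(x) = 2x + 1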
    Identifying latent distances with Finslerian geometry. (arXiv:2212.10010v1 [cs.LG])
    Riemannian geometry provides powerful tools to explore the latent space of generative models while preserving the inherent structure of the data manifold. Lengths, energies and volume measures can be derived from a pullback metric, defined through the immersion that maps the latent space to the data space. However, most generative models are stochastic, and so is the pullback metric. Manipulating stochastic objects is strenuous in practice. In order to perform operations such as interpolation, or to measure the distance between data points, we need a deterministic approximation of the pullback metric. In this work, we define a new metric as the expected length derived from the stochastic pullback metric. We show that this metric is Finslerian, and we compare it with the expected pullback metric. In high dimensions, we show that the two metrics converge to each other at a rate of $\mathcal{O}\left(\frac{1}{D}\right)$.
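    In symbols, with $f$ the immersion from latent to data space and $J_f$ its Jacobian (the notation here is ours, as a sketch of the construction, not the paper's exact definitions):

        % curve length under a deterministic immersion f with Jacobian J_f
        \mathrm{len}(\gamma) = \int \sqrt{\dot\gamma(t)^\top J_f^\top J_f\, \dot\gamma(t)}\; dt
        % when f is stochastic, take the expected length of the pushforward,
        % which defines the proposed Finsler metric
        F(z, v) = \mathbb{E}\left[\, \lVert J_f(z)\, v \rVert \,\right]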
    Human-Guided Fair Classification for Natural Language Processing. (arXiv:2212.10154v1 [cs.CL])
    Text classifiers have promising applications in high-stakes tasks such as resume screening and content moderation. These classifiers must be fair and avoid discriminatory decisions by being invariant to perturbations of sensitive attributes such as gender or ethnicity. However, there is a gap between human intuition about these perturbations and the formal similarity specifications capturing them. While existing research has started to address this gap, current methods are based on hardcoded word replacements, resulting in specifications with limited expressivity or ones that fail to fully align with human intuition (e.g., in cases of asymmetric counterfactuals). This work proposes novel methods for bridging this gap by discovering expressive and intuitive individual fairness specifications. We show how to leverage unsupervised style transfer and GPT-3's zero-shot capabilities to automatically generate expressive candidate pairs of semantically similar sentences that differ along sensitive attributes. We then validate the generated pairs via an extensive crowdsourcing study, which confirms that many of these pairs align with human intuition about fairness in the context of toxicity classification. Finally, we show how limited amounts of human feedback can be leveraged to learn a similarity specification that can be used to train downstream fairness-aware models.
    Deep Riemannian Networks for EEG Decoding. (arXiv:2212.10426v1 [cs.LG])
    State-of-the-art performance in electroencephalography (EEG) decoding tasks is currently often achieved with either Deep-Learning or Riemannian-Geometry-based decoders. Recently, there has been growing interest in Deep Riemannian Networks (DRNs), which may combine the advantages of both previous classes of methods. However, there is still a range of topics where additional insight is needed to pave the way for a more widespread application of DRNs in EEG. These include architecture design questions such as network size and end-to-end ability, as well as model training questions. How these factors affect model performance has not been explored. Additionally, it is not clear how the data within these networks is transformed, and whether this would correlate with traditional EEG decoding. Our study aims to lay the groundwork on these topics by analyzing DRNs for EEG across a wide range of hyperparameters. Networks were tested on two public EEG datasets and compared with state-of-the-art ConvNets. Here we propose end-to-end EEG SPDNet (EE(G)-SPDNet), and we show that this wide, end-to-end DRN can outperform the ConvNets, and in doing so use physiologically plausible frequency regions. We also show that the end-to-end approach learns more complex filters than traditional band-pass filters targeting the classical alpha, beta, and gamma frequency bands of the EEG, and that performance can benefit from channel-specific filtering approaches. Additionally, architectural analysis revealed areas for further improvement due to the possible loss of Riemannian-specific information throughout the network. Our study thus shows how to design and train DRNs to infer task-related information from the raw EEG without the need for handcrafted filterbanks, and highlights the potential of end-to-end DRNs such as EE(G)-SPDNet for high-performance EEG decoding.
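    The SPD building blocks behind SPDNet-style models can be sketched as follows; these are the generic BiMap/ReEig/LogEig layers acting on covariance-like matrices, not the exact EE(G)-SPDNet architecture.

        import numpy as np

        def bimap(X, W):
            """BiMap: project an SPD matrix to a lower-dimensional SPD one."""
            return W.T @ X @ W            # W: (n, m) with full column rank

        def reeig(X, eps=1e-4):
            """ReEig: clamp small eigenvalues (SPD analogue of ReLU)."""
            vals, vecs = np.linalg.eigh(X)
            return vecs @ np.diag(np.maximum(vals, eps)) @ vecs.T

        def logeig(X):
            """LogEig: matrix logarithm, flattening the SPD manifold."""
            vals, vecs = np.linalg.eigh(X)
            return vecs @ np.diag(np.log(vals)) @ vecs.T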
    Graph Neural Networks in Computer Vision -- Architectures, Datasets and Common Approaches. (arXiv:2212.10207v1 [cs.LG])
    Graph Neural Networks (GNNs) are a family of graph networks inspired by the mechanisms existing between nodes on a graph. In recent years there has been an increased interest in GNNs and their derivatives, i.e., Graph Attention Networks (GAT), Graph Convolutional Networks (GCN), and Graph Recurrent Networks (GRN). Their usability in computer vision has also grown, and the number of GNN applications in this field continues to expand; it includes video analysis and understanding, action and behavior recognition, computational photography, image and video synthesis from zero or few shots, and many more. This contribution collects papers published on GNN-based approaches to computer vision and describes and summarizes them from three perspectives. Firstly, we investigate the architectures of Graph Neural Networks and their derivatives used in this area to provide accurate and explainable recommendations for the ensuing investigations. Secondly, we present the datasets used in these works. Finally, using graph analysis, we examine relations between GNN-based studies in computer vision and potential sources of inspiration identified outside of this field.
    Steel Phase Kinetics Modeling using Symbolic Regression. (arXiv:2212.10284v1 [cs.LG])
    We describe an approach for empirical modeling of steel phase kinetics based on symbolic regression and genetic programming. The algorithm takes processed data gathered from dilatometer measurements and produces a system of differential equations that models the phase kinetics. Our initial results demonstrate that the proposed approach allows us to identify compact differential equations that fit the data. The model predicts ferrite, pearlite and bainite formation for a single steel type. Martensite is not yet included in the model. Future work shall incorporate martensite and generalize to multiple steel types with different chemical compositions.
    Modeling Human Eye Movements with Neural Networks in a Maze-Solving Task. (arXiv:2212.10367v1 [cs.LG])
    From smoothly pursuing moving objects to rapidly shifting gazes during visual search, humans employ a wide variety of eye movement strategies in different contexts. While eye movements provide a rich window into mental processes, building generative models of eye movements is notoriously difficult, and to date the computational objectives guiding eye movements remain largely a mystery. In this work, we tackled these problems in the context of a canonical spatial planning task, maze-solving. We collected eye movement data from human subjects and built deep generative models of eye movements using a novel differentiable architecture for gaze fixations and gaze shifts. We found that human eye movements are best predicted by a model that is optimized not to perform the task as efficiently as possible but instead to run an internal simulation of an object traversing the maze. This not only provides a generative model of eye movements in this task but also suggests a computational theory for how humans solve the task, namely that humans use mental simulation.
    Interpretable models for extrapolation in scientific machine learning. (arXiv:2212.10283v1 [cond-mat.mtrl-sci])
    Data-driven models are central to scientific discovery. In efforts to achieve state-of-the-art model accuracy, researchers are employing increasingly complex machine learning algorithms that often outperform simple regressions in interpolative settings (e.g. random k-fold cross-validation) but suffer from poor extrapolation performance, portability, and human interpretability, which limits their potential for facilitating novel scientific insight. Here we examine the trade-off between model performance and interpretability across a broad range of science and engineering problems with an emphasis on materials science datasets. We compare the performance of black box random forest and neural network machine learning algorithms to that of single-feature linear regressions which are fitted using interpretable input features discovered by a simple random search algorithm. For interpolation problems, the average prediction errors of linear regressions were twice as high as those of black box models. Remarkably, when prediction tasks required extrapolation, linear models yielded average error only 5% higher than that of black box models, and outperformed black box models in roughly 40% of the tested prediction tasks, which suggests that they may be desirable over complex algorithms in many extrapolation problems because of their superior interpretability, computational overhead, and ease of use. The results challenge the common assumption that extrapolative models for scientific machine learning are constrained by an inherent trade-off between performance and interpretability.
    Optimizing Serially Concatenated Neural Codes with Classical Decoders. (arXiv:2212.10355v1 [cs.IT])
    For improving short-length codes, we demonstrate that classic decoders can also be used with real-valued, neural encoders, i.e., deep-learning based codeword sequence generators. Here, the classical decoder can be a valuable tool to gain insights into these neural codes and shed light on weaknesses. Specifically, the turbo-autoencoder is a recently developed channel coding scheme where both encoder and decoder are replaced by neural networks. We first show that the limited receptive field of convolutional neural network (CNN)-based codes enables the application of the BCJR algorithm to optimally decode them with feasible computational complexity. These maximum a posteriori (MAP) component decoders then are used to form classical (iterative) turbo decoders for parallel or serially concatenated CNN encoders, offering a close-to-maximum likelihood (ML) decoding of the learned codes. To the best of our knowledge, this is the first time that a classical decoding algorithm is applied to a non-trivial, real-valued neural code. Furthermore, as the BCJR algorithm is fully differentiable, it is possible to train, or fine-tune, the neural encoder in an end-to-end fashion.
    Settling the Reward Hypothesis. (arXiv:2212.10420v1 [cs.AI])
    The reward hypothesis posits that, "all of what we mean by goals and purposes can be well thought of as maximization of the expected value of the cumulative sum of a received scalar signal (reward)." We aim to fully settle this hypothesis. This will not conclude with a simple affirmation or refutation, but rather specify completely the implicit requirements on goals and purposes under which the hypothesis holds.
    In and Out-of-Domain Text Adversarial Robustness via Label Smoothing. (arXiv:2212.10258v1 [cs.CL])
    Recently it has been shown that state-of-the-art NLP models are vulnerable to adversarial attacks, where the predictions of a model can be drastically altered by slight modifications to the input (such as synonym substitutions). While several defense techniques have been proposed and adapted to the discrete nature of text adversarial attacks, the benefits of general-purpose regularization methods such as label smoothing for language models have not been studied. In this paper, we study the adversarial robustness provided by various label smoothing strategies in foundational models for diverse NLP tasks, in both in-domain and out-of-domain settings. Our experiments show that label smoothing significantly improves adversarial robustness in pre-trained models like BERT against various popular attacks. We also analyze the relationship between prediction confidence and robustness, showing that label smoothing reduces over-confident errors on adversarial examples.
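    For reference, label smoothing mixes the one-hot target with the uniform distribution before computing cross-entropy; a minimal sketch is below (recent PyTorch versions also expose this directly via the label_smoothing argument of F.cross_entropy).

        import torch
        import torch.nn.functional as F

        def smoothed_cross_entropy(logits, targets, eps=0.1):
            """Cross-entropy against targets mixed with the uniform dist."""
            n = logits.size(-1)
            log_p = F.log_softmax(logits, dim=-1)
            smooth = (1.0 - eps) * F.one_hot(targets, n).float() + eps / n
            return -(smooth * log_p).sum(dim=-1).mean()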
    VoronoiPatches: Evaluating A New Data Augmentation Method. (arXiv:2212.10054v1 [cs.CV])
    Overfitting is a problem in Convolutional Neural Networks (CNN) that causes poor generalization of models on unseen data. To remedy this problem, many new and diverse data augmentation (DA) methods have been proposed to supplement or generate more training data and thereby increase its quality. In this work, we propose a new data augmentation algorithm: VoronoiPatches (VP). We primarily utilize non-linear recombination of information within an image, fragmenting and occluding small information patches. Unlike other DA methods, VP uses small convex polygon-shaped patches in a random layout to transport information around within an image. Sudden transitions created between patches and the original image can, optionally, be smoothed. In our experiments, VP outperformed current DA methods regarding model variance and overfitting tendencies. We demonstrate that data augmentation utilizing non-linear recombination of information within images, and non-orthogonal shapes and structures, improves CNN model robustness on unseen data.
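    A rough sketch of the underlying operation, with pixels assigned to random seed points and each cell refilled from a randomly shifted copy of the image; the paper's actual polygon handling and optional smoothing are more careful than this.

        import numpy as np

        def voronoi_patches(img, n_seeds=20, max_shift=30, rng=None):
            """Assign pixels to nearest random seeds (Voronoi cells), then
            refill each cell from a randomly shifted copy of the image,
            transporting information around within the image."""
            rng = np.random.default_rng() if rng is None else rng
            h, w = img.shape[:2]
            ys, xs = np.mgrid[0:h, 0:w]
            seeds = rng.integers(0, [h, w], size=(n_seeds, 2))
            d = (ys[..., None] - seeds[:, 0]) ** 2 + (xs[..., None] - seeds[:, 1]) ** 2
            cell = d.argmin(axis=-1)                     # (H, W) cell labels
            out = img.copy()
            for k in range(n_seeds):
                dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
                src = np.roll(np.roll(img, dy, axis=0), dx, axis=1)
                out[cell == k] = src[cell == k]
            return out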
    Emotion Selectable End-to-End Text-based Speech Editing. (arXiv:2212.10191v1 [cs.SD])
    Text-based speech editing allows users to edit speech by intuitively cutting, copying, and pasting text to speed up the process of editing speech. In previous work, CampNet (context-aware mask prediction network) was proposed to realize text-based speech editing, significantly improving the quality of edited speech. This paper aims at a new task: adding an emotional effect to the speech during text-based speech editing to make the generated speech more expressive. To achieve this task, we propose Emo-CampNet (emotion CampNet), which can provide the option of emotional attributes for the generated speech in text-based speech editing and has the one-shot ability to edit unseen speakers' speech. Firstly, we propose an end-to-end emotion-selectable text-based speech editing model. The key idea of the model is to control the emotion of generated speech by introducing additional emotion attributes based on the context-aware mask prediction network. Secondly, to prevent the emotion of the generated speech from being interfered with by the emotional components in the original speech, a neutral content generator is proposed to remove the emotion from the original speech, which is optimized by the generative adversarial framework. Thirdly, two data augmentation methods are proposed to enrich the emotional and pronunciation information in the training set, which enables the model to edit the unseen speaker's speech. The experimental results show that 1) Emo-CampNet can effectively control the emotion of the generated speech in the process of text-based speech editing and can edit unseen speakers' speech; 2) detailed ablation experiments further prove the effectiveness of the emotional selectivity and data augmentation methods. The demo page is available at https://hairuo55.github.io/Emo-CampNet/
    Automatically Answering and Generating Machine Learning Final Exams. (arXiv:2206.05442v3 [cs.LG] UPDATED)
    Can a machine learn machine learning? We propose to answer this question using the same criteria we use to answer a similar question: can a human learn machine learning? We automatically answer final exams in MIT's, Harvard's and Cornell's large machine learning courses and generate new questions at a human level. Recently, program synthesis and few-shot learning solved university-level problem set questions in mathematics and STEM courses at a human level. In this work, we solve questions from final exams that differ from problem sets in several ways: the questions are longer, have multiple parts, are more complicated, and span a broader set of topics. We provide a new dataset and benchmark of questions from machine learning final exams and code for automatically answering these questions and generating new questions. To make our dataset a reproducible benchmark, we use automatic checkers for multiple choice questions, questions with numeric answers, and questions with expression answers, and evaluate a large free language model, Meta's OPT, and compare the results with OpenAI's GPT-3, ChatGPT, and Codex. A student survey comparing the quality, appropriateness, and difficulty of machine-generated questions with human-written questions shows that across multiple aspects, machine-generated questions are indistinguishable from human-generated questions and are suitable for final exams. We perform ablation studies comparing zero-shot learning with few-shot learning, chain-of-thought prompting, GPT-3, ChatGPT, and OPT pre-trained on text and Codex fine-tuned on code on a range of machine learning topics and find that few-shot learning methods perform best. We make our data and code publicly available for the machine learning community.
    Constructing Organism Networks from Collaborative Self-Replicators. (arXiv:2212.10078v1 [cs.NE])
    We introduce organism networks, which function like a single neural network but are composed of several neural particle networks; while each particle network fulfils the role of a single weight application within the organism network, it is also trained to self-replicate its own weights. As organism networks feature vastly more parameters than simpler architectures, we perform our initial experiments on an arithmetic task as well as on simplified MNIST-dataset classification as a collective. We observe that individual particle networks tend to specialise in either of the tasks and that the ones fully specialised in the secondary task may be dropped from the network without hindering the computational accuracy of the primary task. This leads to the discovery of a novel pruning strategy for sparse neural networks.
    Continual Mean Estimation Under User-Level Privacy. (arXiv:2212.09980v1 [cs.LG])
    We consider the problem of continually releasing an estimate of the population mean of a stream of samples that is user-level differentially private (DP). At each time instant, a user contributes a sample, and the users can arrive in arbitrary order. Until now, these requirements of continual release and user-level privacy were considered in isolation. But in practice, both requirements come together, as users often contribute data repeatedly and multiple queries are made. We provide an algorithm that outputs a mean estimate at every time instant $t$ such that the overall release is user-level $\varepsilon$-DP and has the following error guarantee: Denoting by $M_t$ the maximum number of samples contributed by a user, as long as $\tilde{\Omega}(1/\varepsilon)$ users have $M_t/2$ samples each, the error at time $t$ is $\tilde{O}(1/\sqrt{t}+\sqrt{M_t}/(t\varepsilon))$. This is a universal error guarantee which is valid for all arrival patterns of the users. Furthermore, it (almost) matches the existing lower bounds for the single-release setting at all time instants when users have contributed an equal number of samples.
    When Not to Trust Language Models: Investigating Effectiveness and Limitations of Parametric and Non-Parametric Memories. (arXiv:2212.10511v1 [cs.CL])
    Despite their impressive performance on diverse tasks, large language models (LMs) still struggle with tasks requiring rich world knowledge, implying the limitations of relying solely on their parameters to encode a wealth of world knowledge. This paper aims to understand LMs' strengths and limitations in memorizing factual knowledge, by conducting large-scale knowledge probing experiments of 10 models and 4 augmentation methods on PopQA, our new open-domain QA dataset with 14k questions. We find that LMs struggle with less popular factual knowledge, and that scaling fails to appreciably improve memorization of factual knowledge in the tail. We then show that retrieval-augmented LMs largely outperform orders of magnitude larger LMs, while unassisted LMs remain competitive in questions about high-popularity entities. Based on those findings, we devise a simple, yet effective, method for powerful and efficient retrieval-augmented LMs, which retrieves non-parametric memories only when necessary. Experimental results show that this significantly improves models' performance while reducing the inference costs.
    Learning efficient backprojections across cortical hierarchies in real time. (arXiv:2212.10249v1 [q-bio.NC])
    Models of sensory processing and learning in the cortex need to efficiently assign credit to synapses in all areas. In deep learning, a known solution is error backpropagation, which however requires biologically implausible weight transport from feed-forward to feedback paths. We introduce Phaseless Alignment Learning (PAL), a bio-plausible method to learn efficient feedback weights in layered cortical hierarchies. This is achieved by exploiting the noise naturally found in biophysical systems as an additional carrier of information. In our dynamical system, all weights are learned simultaneously with always-on plasticity and using only information locally available to the synapses. Our method is completely phase-free (no forward and backward passes or phased learning) and allows for efficient error propagation across multi-layer cortical hierarchies, while maintaining biologically plausible signal transport and learning. Our method is applicable to a wide class of models and improves on previously known biologically plausible ways of credit assignment: compared to random synaptic feedback, it can solve complex tasks with fewer neurons and learn more useful latent representations. We demonstrate this on various classification tasks using a cortical microcircuit model with prospective coding.
    MuseMorphose: Full-Song and Fine-Grained Piano Music Style Transfer with One Transformer VAE. (arXiv:2105.04090v3 [cs.SD] UPDATED)
    Transformers and variational autoencoders (VAE) have been extensively employed for symbolic (e.g., MIDI) domain music generation. While the former boast an impressive capability in modeling long sequences, the latter allow users to willingly exert control over different parts (e.g., bars) of the music to be generated. In this paper, we are interested in bringing the two together to construct a single model that exhibits both strengths. The task is split into two steps. First, we equip Transformer decoders with the ability to accept segment-level, time-varying conditions during sequence generation. Subsequently, we combine the developed and tested in-attention decoder with a Transformer encoder, and train the resulting MuseMorphose model with the VAE objective to achieve style transfer of long pop piano pieces, in which users can specify musical attributes including rhythmic intensity and polyphony (i.e., harmonic fullness) they desire, down to the bar level. Experiments show that MuseMorphose outperforms recurrent neural network (RNN) based baselines on numerous widely-used metrics for style transfer tasks.
    Molformer: Motif-based Transformer on 3D Heterogeneous Molecular Graphs. (arXiv:2110.01191v6 [q-bio.QM] UPDATED)
    Procuring expressive molecular representations underpins AI-driven molecule design and scientific discovery. The research mainly focuses on atom-level homogeneous molecular graphs, ignoring the rich information in subgraphs or motifs. However, it has been widely accepted that substructures play a dominant role in identifying and determining molecular properties. To address such issues, we formulate heterogeneous molecular graphs (HMGs) and introduce a novel architecture to exploit both molecular motifs and 3D geometry. Precisely, we extract functional groups as motifs for small molecules and employ reinforcement learning to adaptively select quaternary amino acids as motif candidates for proteins. Then HMGs are constructed with both atom-level and motif-level nodes. To better accommodate those HMGs, we introduce a variant of the Transformer named Molformer, which adopts a heterogeneous self-attention layer to distinguish the interactions between multi-level nodes. Besides, it is also coupled with a multi-scale mechanism to capture fine-grained local patterns with increasing contextual scales. An attentive farthest-point sampling algorithm is also proposed to obtain the molecular representations. We validate Molformer across a broad range of domains, including quantum chemistry, physiology, and biophysics. Extensive experiments show that Molformer outperforms or achieves performance comparable to several state-of-the-art baselines. Our work provides a promising way to utilize informative motifs from the perspective of multi-level graph construction.
    Learning to Play General-Sum Games Against Multiple Boundedly Rational Agents. (arXiv:2106.05492v3 [cs.LG] UPDATED)
    We study the problem of training a principal in a multi-agent general-sum game using reinforcement learning (RL). Learning a robust principal policy requires anticipating the worst possible strategic responses of other agents, which is generally NP-hard. However, we show that no-regret dynamics can identify these worst-case responses in poly-time in smooth games. We propose a framework that uses this policy evaluation method for efficiently learning a robust principal policy using RL. This framework can be extended to provide robustness to boundedly rational agents too. Our motivating application is automated mechanism design: we empirically demonstrate our framework learns robust mechanisms in both matrix games and complex spatiotemporal games. In particular, we learn a dynamic tax policy that improves the welfare of a simulated trade-and-barter economy by 15%, even when facing previously unseen boundedly rational RL taxpayers.
    Localising In-Domain Adaptation of Transformer-Based Biomedical Language Models. (arXiv:2212.10422v1 [cs.CL])
    In the era of digital healthcare, the huge volumes of textual information generated every day in hospitals constitute an essential but underused asset that could be exploited with task-specific, fine-tuned biomedical language representation models, improving patient care and management. For such specialized domains, previous research has shown that fine-tuning models stemming from broad-coverage checkpoints can largely benefit additional training rounds over large-scale in-domain resources. However, these resources are often unreachable for less-resourced languages like Italian, preventing local medical institutions from employing in-domain adaptation. In order to reduce this gap, our work investigates two accessible approaches to derive biomedical language models in languages other than English, taking Italian as a concrete use-case: one based on neural machine translation of English resources, favoring quantity over quality; the other based on a high-grade, narrow-scoped corpus natively written in Italian, thus preferring quality over quantity. Our study shows that data quantity is a harder constraint than data quality for biomedical adaptation, but the concatenation of high-quality data can improve model performance even when dealing with relatively size-limited corpora. The models published from our investigations have the potential to unlock important research opportunities for Italian hospitals and academia. Finally, the set of lessons learned from the study constitutes valuable insights towards a solution to build biomedical language models that are generalizable to other less-resourced languages and different domain settings.
    Training Trajectories of Language Models Across Scales. (arXiv:2212.09803v1 [cs.CL])
    Scaling up language models has led to unprecedented performance gains, but little is understood about how the training dynamics change as models get larger. How do language models of different sizes learn during pre-training? Why do larger language models demonstrate more desirable behaviors? In this paper, we analyze the intermediate training checkpoints of differently sized OPT models (Zhang et al., 2022) -- from 125M to 175B parameters -- on next-token prediction, sequence-level generation, and downstream tasks. We find that 1) at a given perplexity and independent of model sizes, a similar subset of training tokens see the most significant reduction in loss, with the rest stagnating or showing double-descent behavior; 2) early in training, all models learn to reduce the perplexity of grammatical sequences that contain hallucinations, with small models halting at this suboptimal distribution and larger ones eventually learning to assign these sequences lower probabilities; 3) perplexity is a strong predictor of in-context learning performance on 74 multiple-choice tasks from BIG-Bench, and this holds independent of the model size. Together, these results show that perplexity is more predictive of model behaviors than model size or training computation.
    Pretraining Without Attention. (arXiv:2212.10544v1 [cs.CL])
    Transformers have been essential to pretraining success in NLP. Other architectures have been used, but require attention layers to match benchmark accuracy. This work explores pretraining without attention. We test recently developed routing layers based on state-space models (SSM) and model architectures based on multiplicative gating. Used together these modeling choices have a large impact on pretraining accuracy. Empirically the proposed Bidirectional Gated SSM (BiGS) replicates BERT pretraining results without attention and can be extended to long-form pretraining of 4096 tokens without approximation.
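    Schematically, such a block pairs a simple linear state-space recurrence with multiplicative gating; the toy scan below only illustrates the shape of the computation, not the paper's parameterization or its efficient long-sequence routing.

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class ToyGatedSSM(nn.Module):
            """Gated state-space block: linear recurrence plus gating."""
            def __init__(self, d, n_state=16):
                super().__init__()
                self.in_u = nn.Linear(d, d)
                self.in_v = nn.Linear(d, d)                  # gate branch
                self.A = nn.Parameter(0.9 * torch.eye(n_state))
                self.B = nn.Parameter(0.1 * torch.randn(n_state, d))
                self.C = nn.Parameter(0.1 * torch.randn(d, n_state))
                self.out = nn.Linear(d, d)

            def forward(self, x):                # x: (T, d), one sequence
                u, v = F.gelu(self.in_u(x)), F.gelu(self.in_v(x))
                h, ys = torch.zeros(self.A.shape[0]), []
                for t in range(x.shape[0]):      # h_t = A h_{t-1} + B u_t
                    h = self.A @ h + self.B @ u[t]
                    ys.append(self.C @ h)        # y_t = C h_t
                return self.out(torch.stack(ys) * v)   # multiplicative gate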
    FriendlyCore: Practical Differentially Private Aggregation. (arXiv:2110.10132v4 [cs.LG] UPDATED)
    Differentially private algorithms for common metric aggregation tasks, such as clustering or averaging, often have limited practicality due to their complexity or to the large number of data points that is required for accurate results. We propose a simple and practical tool, $\mathsf{FriendlyCore}$, that takes a set of points ${\cal D}$ from an unrestricted (pseudo) metric space as input. When ${\cal D}$ has effective diameter $r$, $\mathsf{FriendlyCore}$ returns a "stable" subset ${\cal C} \subseteq {\cal D}$ that includes all points, except possibly few outliers, and is {\em certified} to have diameter $r$. $\mathsf{FriendlyCore}$ can be used to preprocess the input before privately aggregating it, potentially simplifying the aggregation or boosting its accuracy. Surprisingly, $\mathsf{FriendlyCore}$ is light-weight with no dependence on the dimension. We empirically demonstrate its advantages in boosting the accuracy of mean estimation and clustering tasks such as $k$-means and $k$-GMM, outperforming tailored methods.
    Model-based Deep Learning Receiver Design for Rate-Splitting Multiple Access. (arXiv:2205.00849v2 [eess.SP] UPDATED)
    Effective and adaptive interference management is required in next generation wireless communication systems. To address this challenge, Rate-Splitting Multiple Access (RSMA), relying on multi-antenna rate-splitting (RS) at the transmitter and successive interference cancellation (SIC) at the receivers, has been intensively studied in recent years, albeit mostly under the assumption of perfect Channel State Information at the Receiver (CSIR) and ideal capacity-achieving modulation and coding schemes. To assess its practical performance, benefits, and limits under more realistic conditions, this work proposes a novel design for a practical RSMA receiver based on model-based deep learning (MBDL) methods, which aims to unite the simple structure of the conventional SIC receiver and the robustness and model agnosticism of deep learning techniques. The MBDL receiver is evaluated in terms of uncoded Symbol Error Rate (SER), throughput performance through Link-Level Simulations (LLS), and average training overhead. Also, a comparison with the SIC receiver, with perfect and imperfect CSIR, is given. Results reveal that the MBDL receiver outperforms by a significant margin the SIC receiver with imperfect CSIR, due to its ability to generate on demand non-linear symbol detection boundaries in a pure data-driven manner.
    Berlin V2X: A Machine Learning Dataset from Multiple Vehicles and Radio Access Technologies. (arXiv:2212.10343v1 [cs.LG])
    The evolution of wireless communications into 6G and beyond is expected to rely on new machine learning (ML)-based capabilities. These can enable proactive decisions and actions from wireless-network components to sustain quality-of-service (QoS) and user experience. Moreover, new use cases in the area of vehicular and industrial communications will emerge. Specifically in the area of vehicle communication, vehicle-to-everything (V2X) schemes will benefit strongly from such advances. With this in mind, we have conducted a detailed measurement campaign with the purpose of enabling a plethora of diverse ML-based studies. The resulting datasets offer GPS-located wireless measurements across diverse urban environments for both cellular (with two different operators) and sidelink radio access technologies, thus enabling a variety of different studies towards V2X. The datasets are labeled and sampled with a high time resolution. Furthermore, we make the data publicly available with all the necessary information to support the on-boarding of new researchers. We provide an initial analysis of the data showing some of the challenges that ML needs to overcome and the features that ML can leverage, as well as some hints at potential research studies.
    GD-VAEs: Geometric Dynamic Variational Autoencoders for Learning Nonlinear Dynamics and Dimension Reductions. (arXiv:2206.05183v2 [cs.LG] UPDATED)
    We develop data-driven methods incorporating geometric and topological information to learn parsimonious representations of nonlinear dynamics from observations. We develop approaches for learning nonlinear state space models of the dynamics for general manifold latent spaces using training strategies related to Variational Autoencoders (VAEs). Our methods are referred to as Geometric Dynamic (GD) Variational Autoencoders (GD-VAEs). We learn encoders and decoders for the system states and evolution based on deep neural network architectures that include general Multilayer Perceptrons (MLPs), Convolutional Neural Networks (CNNs), and Transpose CNNs (T-CNNs). Motivated by problems arising in parameterized PDEs and physics, we investigate the performance of our methods on tasks for learning low dimensional representations of the nonlinear Burgers equations, constrained mechanical systems, and spatial fields of reaction-diffusion systems. GD-VAEs provide methods for obtaining representations for use in diverse learning tasks involving dynamics.
    ASAT: Adaptively Scaled Adversarial Training in Time Series. (arXiv:2108.08976v2 [cs.LG] UPDATED)
    Adversarial training is a method for enhancing neural networks to improve the robustness against adversarial examples. Besides the security concerns of potential adversarial examples, adversarial training can also improve the generalization ability of neural networks, train robust neural networks, and provide interpretability for neural networks. In this work, we introduce adversarial training in time series analysis to enhance the neural networks for better generalization ability by taking the finance field as an example. Rethinking existing research on adversarial training, we propose the adaptively scaled adversarial training (ASAT) in time series analysis, by rescaling data at different time slots with adaptive scales. Experimental results show that the proposed ASAT can improve both the generalization ability and the adversarial robustness of neural networks compared to the baselines. Compared to the traditional adversarial training algorithm, ASAT can achieve better generalization ability and similar adversarial robustness.
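    The general idea can be sketched as FGSM-style adversarial training with a per-time-slot perturbation budget; the adaptive scale vector here is a placeholder for the paper's rescaling rules.

        import torch

        def asat_step(model, loss_fn, x, y, scales, eps=0.01):
            """One adversarial training step on a time series batch.
            x: (batch, T, features); scales: (T,) per-time-slot weights,
            e.g. emphasizing recent slots (illustrative placeholder)."""
            x = x.clone().requires_grad_(True)
            loss_fn(model(x), y).backward()      # gradient w.r.t. the input
            with torch.no_grad():
                delta = eps * scales.view(1, -1, 1) * x.grad.sign()
            x_adv = (x + delta).detach()
            # in practice, zero parameter grads before the real optimizer step
            return loss_fn(model(x_adv), y)      # loss to backprop and step on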
    Pareto Pairwise Ranking for Fairness Enhancement of Recommender Systems. (arXiv:2212.10459v1 [cs.IR])
    Learning to rank has been an effective recommendation approach since its introduction around 2010. Famous algorithms such as Bayesian Personalized Ranking and Collaborative Less is More Filtering have left a deep impact on both academia and industry. However, most learning-to-rank approaches focus on improving technical accuracy metrics such as AUC, MRR and NDCG, while other evaluation criteria for recommender systems, such as fairness, were largely overlooked until recent years. In this paper, we propose a new learning-to-rank algorithm named Pareto Pairwise Ranking, inspired by the idea of Bayesian Personalized Ranking and the power-law distribution. We show that our algorithm is competitive with other algorithms when evaluated on technical accuracy metrics. More importantly, our experiments demonstrate that Pareto Pairwise Ranking is the fairest algorithm in comparison with 9 other contemporary algorithms.
    StyleDomain: Analysis of StyleSpace for Domain Adaptation of StyleGAN. (arXiv:2212.10229v1 [cs.CV])
    Domain adaptation of GANs is the problem of fine-tuning state-of-the-art GAN models (e.g. StyleGAN) pretrained on a large dataset to a specific domain with few samples (e.g. painting faces, sketches, etc.). While a great number of methods tackle this problem in different ways, many important questions remain unanswered. In this paper, we provide a systematic and in-depth analysis of the domain adaptation problem of GANs, focusing on the StyleGAN model. First, we perform a detailed exploration of the most important parts of StyleGAN that are responsible for adapting the generator to a new domain, depending on the similarity between the source and target domains. In particular, we show that affine layers of StyleGAN can be sufficient for fine-tuning to similar domains. Second, inspired by these findings, we investigate StyleSpace and utilize it for domain adaptation. We show that there exist directions in the StyleSpace that can adapt StyleGAN to new domains. Further, we examine these directions and discover many of their surprising properties. Finally, we leverage our analysis and findings to deliver practical improvements and applications in such standard tasks as image-to-image translation and cross-domain morphing.
    The Third International Verification of Neural Networks Competition (VNN-COMP 2022): Summary and Results. (arXiv:2212.10376v1 [cs.LG])
    This report summarizes the 3rd International Verification of Neural Networks Competition (VNN-COMP 2022), held as a part of the 5th Workshop on Formal Methods for ML-Enabled Autonomous Systems (FoMLAS), which was collocated with the 34th International Conference on Computer-Aided Verification (CAV). VNN-COMP is held annually to facilitate the fair and objective comparison of state-of-the-art neural network verification tools, encourage the standardization of tool interfaces, and bring together the neural network verification community. To this end, standardized formats for networks (ONNX) and specifications (VNN-LIB) were defined, tools were evaluated on equal-cost hardware (using an automatic evaluation pipeline based on AWS instances), and tool parameters were chosen by the participants before the final test sets were made public. In the 2022 iteration, 11 teams participated on a diverse set of 12 scored benchmarks. This report summarizes the rules, benchmarks, participating tools, results, and lessons learned from this iteration of the competition.
    A Pattern Discovery Approach to Multivariate Time Series Forecasting. (arXiv:2212.10306v1 [cs.LG])
    Multivariate time series forecasting constitutes important functionality in cyber-physical systems, whose prediction accuracy can be improved significantly by capturing temporal and multivariate correlations among multiple time series. State-of-the-art deep learning methods fail to construct models for full time series because model complexity grows exponentially with time series length. Rather, these methods construct local temporal and multivariate correlations within subsequences, but fail to capture correlations among subsequences, which significantly affects their forecasting accuracy. To capture the temporal and multivariate correlations among subsequences, we design a pattern discovery model that constructs correlations via diverse pattern functions. Whereas traditional pattern discovery methods use shared and fixed pattern functions that ignore the diversity across time series, we propose a novel pattern discovery method that can automatically capture diverse and complex time series patterns. We also propose a learnable correlation matrix that enables the model to capture distinct correlations among multiple time series. Extensive experiments show that our model achieves state-of-the-art prediction accuracy.
    Galaxy Image Classification using Hierarchical Data Learning with Weighted Sampling and Label Smoothing. (arXiv:2212.10081v1 [astro-ph.IM])
    With the development of a series of galaxy sky surveys in recent years, observational data have increased rapidly, making machine learning methods for galaxy image recognition a hot research topic. Existing automatic galaxy image recognition research is plagued by large differences in similarity between categories, the imbalance of data between different classes, and the discrepancy between the discrete representation of galaxy classes and the essentially gradual changes from one morphological class to the adjacent class (DDRGC). These limitations have motivated several astronomers and machine learning experts to design projects with improved galaxy image recognition capabilities. Therefore, this paper proposes a novel learning method, ``Hierarchical Imbalanced data learning with Weighted sampling and Label smoothing" (HIWL). The HIWL consists of three key techniques, respectively dealing with the three problems above: (1) a hierarchical galaxy classification model based on an efficient backbone network; (2) a weighted sampling scheme that deals with the imbalance problem; and (3) a label smoothing technique that alleviates the DDRGC problem. We applied this method to galaxy photometric images from the Galaxy Zoo-The Galaxy Challenge, exploring the recognition of completely round smooth, in-between smooth, cigar-shaped, edge-on and spiral galaxies. The overall classification accuracy is 96.32\%, and the superiority of HIWL is shown in terms of recall, precision, and F1-score in comparison with related works. In addition, we also explore the visualization of the galaxy image features and model attention to understand the foundations of the proposed scheme.
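    The two generic ingredients of the HIWL recipe, class-balanced weighted sampling and label smoothing, can be sketched in a few lines of PyTorch. The snippet below is a minimal illustration under assumed placeholder labels and a smoothing factor of 0.1; the paper's hierarchical backbone is omitted, and train_labels is a hypothetical array, not data from the study.

        # Minimal sketch: weighted sampling + label smoothing (assumptions noted above).
        import numpy as np
        import torch
        from torch.utils.data import WeightedRandomSampler

        train_labels = np.random.randint(0, 5, size=1000)      # placeholder class labels
        class_counts = np.bincount(train_labels)
        sample_weights = 1.0 / class_counts[train_labels]      # rarer class => higher weight
        sampler = WeightedRandomSampler(
            torch.as_tensor(sample_weights, dtype=torch.double),
            num_samples=len(train_labels), replacement=True)   # pass to a DataLoader

        # Label smoothing softens one-hot targets, easing the gradual-morphology issue
        # (requires PyTorch >= 1.10).
        criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1)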
    Dynamic Molecular Graph-based Implementation for Biophysical Properties Prediction. (arXiv:2212.09991v1 [cs.LG])
    Graph Neural Networks (GNNs) have revolutionized molecular discovery, helping to understand patterns and identify unknown features that can aid in predicting biophysical properties and protein-ligand interactions. However, current models typically rely on 2-dimensional molecular representations as input, and while the utilization of 2/3-dimensional structural data has gained deserved traction in recent years, many of these models are still limited to static graph representations. We propose a novel approach based on the transformer model, utilizing GNNs to characterize dynamic features of protein-ligand interactions. Our message-passing transformer pre-trains on molecular dynamics data from physics-based simulations to learn coordinate construction, and makes binding probability and affinity predictions as a downstream task. Through extensive testing against existing models, our MDA-PLI model outperforms the molecular interaction prediction models with an RMSE of 1.2958. The geometric encodings enabled by our transformer architecture and the addition of time-series data add a new dimensionality to this form of research.
    A Survey on Pretrained Language Models for Neural Code Intelligence. (arXiv:2212.10079v1 [cs.SE])
    As the complexity of modern software continues to escalate, software engineering has become an increasingly daunting and error-prone endeavor. In recent years, the field of Neural Code Intelligence (NCI) has emerged as a promising solution, leveraging the power of deep learning techniques to tackle analytical tasks on source code with the goal of improving programming efficiency and minimizing human errors within the software industry. Pretrained language models have become a dominant force in NCI research, consistently delivering state-of-the-art results across a wide range of tasks, including code summarization, generation, and translation. In this paper, we present a comprehensive survey of the NCI domain, including a thorough review of pretraining techniques, tasks, datasets, and model architectures. We hope this paper will serve as a bridge between the natural language and programming language communities, offering insights for future research in this rapidly evolving field.
    Deep Multi-Emitter Spectrum Occupancy Mapping that is Robust to the Number of Sensors, Noise and Threshold. (arXiv:2212.10444v1 [eess.SP])
    One of the primary goals in spectrum occupancy mapping is to create a system that is robust to assumptions about the number of sensors, occupancy threshold (in dBm), sensor noise, number of emitters and the propagation environment. We show that such a system may be designed with neural networks using a process of aggregation to allow a variable number of sensors during training and testing. This process transforms the variable number of measurements into log-likelihood ratios (LLRs), which are fed as a fixed-resolution image into a neural network. The use of LLRs provides robustness to the effects of noise and occupancy threshold. In other words, a system may be trained for a nominal number of sensors, threshold and noise levels, and still operate well at various other levels without retraining. Our system operates without knowledge of the number of emitters and does not explicitly attempt to estimate their number or power. Receiver operating curves with realistic propagation environments using topographic maps with commercial network design tools show how performance of the neural network varies with the environment. The use of low-resolution sensors in this system does not significantly hurt performance.  ( 2 min )
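    The aggregation step described above can be made concrete with a toy sketch: each sensor reading is converted into a log-likelihood ratio and summed onto a fixed grid, so any number of sensors produces an image of constant shape. The two-hypothesis Gaussian model, the +/-10 dB separation, and all names below are illustrative assumptions, not the paper's calibrated sensor model.

        # Toy LLR-image aggregation for a variable number of sensors.
        import numpy as np
        from scipy.stats import norm

        GRID = 32

        def llr_image(xy, rssi_dbm, thresh_dbm=-90.0, noise_db=6.0):
            img = np.zeros((GRID, GRID))
            for (x, y), r in zip(xy, rssi_dbm):                 # x, y in [0, 1)
                llr = (norm.logpdf(r, loc=thresh_dbm + 10, scale=noise_db)
                       - norm.logpdf(r, loc=thresh_dbm - 10, scale=noise_db))
                img[int(y * GRID), int(x * GRID)] += llr        # aggregate per cell
            return img                                          # fixed-shape network input

        img = llr_image(np.random.rand(50, 2), np.random.uniform(-110, -60, 50))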
    On the Applicability of Synthetic Data for Re-Identification. (arXiv:2212.10105v1 [cs.CV])
    This contribution demonstrates the feasibility of applying Generative Adversarial Networks (GANs) to images of EPAL pallet blocks for dataset enhancement in the context of re-identification. For many industrial applications of re-identification methods, datasets of sufficient volume would otherwise be unattainable in non-laboratory settings. Using a state-of-the-art GAN architecture, namely CycleGAN, images of pallet blocks rotated to their left-hand side were generated from images of visually centered pallet blocks, based on images of rotated pallet blocks from a previously published dataset. In this process, the unique chipwood pattern of the pallet block surface structure was retained, only changing the orientation of the pallet block itself. By doing so, synthetic data for re-identification testing and training purposes was generated, in a manner that is distinct from ordinary data augmentation. In total, 1,004 new images of pallet blocks were generated. The quality of the generated images was gauged using a perspective classifier that was trained on the original images and then applied to the synthetic ones, comparing the accuracy between the two sets of images. The classification accuracy was 98% for the original images and 92% for the synthetic images. In addition, the generated images were also used in a re-identification task, in order to re-identify original images based on synthetic ones. The accuracy in this scenario was up to 88% for synthetic images, compared to 96% for original images. Through this evaluation, it is established whether or not a generated pallet block image closely resembles its original counterpart.  ( 2 min )
    Calibrating Deep Neural Networks using Explicit Regularisation and Dynamic Data Pruning. (arXiv:2212.10005v1 [cs.LG])
    Deep neural networks (DNNs) are prone to miscalibrated predictions, often exhibiting a mismatch between the predicted output and the associated confidence scores. Contemporary model calibration techniques mitigate the problem of overconfident predictions by pushing down the confidence of the winning class while increasing the confidence of the remaining classes across all test samples. However, from a deployment perspective, an ideal model is desired to (i) generate well-calibrated predictions for high-confidence samples with predicted probability, say, >0.95, and (ii) generate a higher proportion of legitimate high-confidence samples. To this end, we propose a novel regularization technique that can be used with classification losses, leading to state-of-the-art calibrated predictions at test time. From a deployment standpoint in safety-critical applications, only high-confidence samples from a well-calibrated model are of interest, as the remaining samples have to undergo manual inspection. Predictive confidence reduction of these potentially ``high-confidence samples'' is a downside of existing calibration approaches. We mitigate this by proposing a dynamic train-time data pruning strategy that prunes low-confidence samples every few epochs, providing an increase in "confident yet calibrated samples". We demonstrate state-of-the-art calibration performance across image classification benchmarks, reducing training time without much compromise in accuracy. We provide insights into why our dynamic pruning strategy that prunes low-confidence training samples leads to an increase in high-confidence samples at test time.  ( 2 min )
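    The pruning schedule is simple to emulate: every few epochs, score the remaining training samples by their maximum softmax confidence and drop the bottom fraction. The sketch below uses a toy linear model, a 10% pruning quantile, and random data, all of which are illustrative assumptions rather than the paper's configuration.

        # Dynamic train-time pruning of low-confidence samples (toy setup).
        import torch
        import torch.nn.functional as F

        X, y = torch.randn(1000, 20), torch.randint(0, 4, (1000,))
        model = torch.nn.Linear(20, 4)
        opt = torch.optim.SGD(model.parameters(), lr=0.1)
        keep = torch.arange(len(X))                    # indices still in the train set

        for epoch in range(30):
            loss = F.cross_entropy(model(X[keep]), y[keep])
            opt.zero_grad(); loss.backward(); opt.step()
            if epoch % 5 == 4:                         # prune every 5 epochs
                with torch.no_grad():
                    conf = F.softmax(model(X[keep]), dim=1).max(dim=1).values
                keep = keep[conf >= conf.quantile(0.10)]   # drop the 10% least confident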
    Uncertainty Quantification of MLE for Entity Ranking with Covariates. (arXiv:2212.09961v1 [stat.ME])
    This paper concerns statistical estimation and inference for ranking problems based on pairwise comparisons, with additional covariate information such as the attributes of the compared items. Despite extensive studies, few prior works investigate this problem under the more realistic setting where covariate information exists. To tackle this issue, we propose a novel model, the Covariate-Assisted Ranking Estimation (CARE) model, which extends the well-known Bradley-Terry-Luce (BTL) model by incorporating the covariate information. Specifically, instead of assuming every compared item has a fixed latent score $\{\theta_i^*\}_{i=1}^n$, we assume the underlying scores are given by $\{\alpha_i^*+{x}_i^\top\beta^*\}_{i=1}^n$, where $\alpha_i^*$ and ${x}_i^\top\beta^*$ represent the latent baseline and covariate score of the $i$-th item, respectively. We impose natural identifiability conditions and derive the $\ell_{\infty}$- and $\ell_2$-optimal rates for the maximum likelihood estimator of $\{\alpha_i^*\}_{i=1}^{n}$ and $\beta^*$ under a sparse comparison graph, using a novel `leave-one-out' technique (Chen et al., 2019). To conduct statistical inference, we further derive asymptotic distributions for the MLE of $\{\alpha_i^*\}_{i=1}^n$ and $\beta^*$ with minimal sample complexity. This allows us to answer the question of whether some covariates have any explanatory power for the latent scores and to threshold some sparse parameters to improve the ranking performance. We improve the approximation method used in (Gao et al., 2021) for the BTL model and generalize it to the CARE model. Moreover, we validate our theoretical results through large-scale numerical studies and an application to the mutual fund stock holding dataset.  ( 2 min )
    Plug & Play Directed Evolution of Proteins with Gradient-based Discrete MCMC. (arXiv:2212.09925v1 [cs.LG])
    A long-standing goal of machine-learning-based protein engineering is to accelerate the discovery of novel mutations that improve the function of a known protein. We introduce a sampling framework for evolving proteins in silico that supports mixing and matching a variety of unsupervised models, such as protein language models, and supervised models that predict protein function from sequence. By composing these models, we aim to improve our ability to evaluate unseen mutations and constrain search to regions of sequence space likely to contain functional proteins. Our framework achieves this without any model fine-tuning or re-training by constructing a product of experts distribution directly in discrete protein space. Instead of resorting to brute force search or random sampling, which is typical of classic directed evolution, we introduce a fast MCMC sampler that uses gradients to propose promising mutations. We conduct in silico directed evolution experiments on wide fitness landscapes and across a range of different pre-trained unsupervised models, including a 650M parameter protein language model. Our results demonstrate an ability to efficiently discover variants with high evolutionary likelihood as well as estimated activity multiple mutations away from a wild type protein, suggesting our sampler provides a practical and effective new paradigm for machine-learning-based protein engineering.  ( 2 min )
    Normalized Stochastic Gradient Descent Training of Deep Neural Networks. (arXiv:2212.09921v1 [cs.LG])
    In this paper, we introduce a novel optimization algorithm for machine learning model training called Normalized Stochastic Gradient Descent (NSGD), inspired by Normalized Least Mean Squares (NLMS) from adaptive filtering. When we train a high-complexity model on a large dataset, the learning rate is critically important, as a poor choice of optimizer parameters can lead to divergence. The algorithm updates the network weights using the stochastic gradient but with $\ell_1$- and $\ell_2$-based normalizations of the learning rate parameter, similar to the NLMS algorithm. Our main difference from existing normalization methods is that we do not include the error term in the normalization process; we normalize the update term using the input vector to the neuron. Our experiments show that the model can be trained to a better accuracy level under different initial settings using our optimization algorithm. In this paper, we demonstrate the efficiency of our training algorithm using ResNet-20 and a toy neural network on different benchmark datasets with different initializations. NSGD improves the accuracy of ResNet-20 from 91.96\% to 92.20\% on the CIFAR-10 dataset.  ( 2 min )
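    For a single linear neuron with a squared loss, the NLMS-style update described above reduces to dividing the gradient step by the energy of the layer input, leaving the error term out of the normalizer. The one-neuron setting, constants, and data below are toy assumptions used only to make the update rule concrete.

        # NSGD-style update for one linear neuron: input-normalized SGD step.
        import numpy as np

        rng = np.random.default_rng(0)
        w, mu, eps = rng.normal(size=20), 0.5, 1e-8
        for _ in range(200):
            x = rng.normal(size=20)                    # input vector to the neuron
            y = x @ np.ones(20) + 0.01 * rng.normal()  # noisy target
            grad = -(y - x @ w) * x                    # squared-loss gradient wrt w
            w -= mu * grad / (eps + x @ x)             # l2 input normalization
            # An l1 variant would divide by eps + np.abs(x).sum() instead.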
    Future Sight: Dynamic Story Generation with Large Pretrained Language Models. (arXiv:2212.09947v1 [cs.CL])
    Recent advances in deep learning research, such as transformers, have bolstered the ability for automated agents to generate creative texts similar to those that a human would write. By default, transformer decoders can only generate new text with respect to previously generated text. The output distribution of candidate tokens at any position is conditioned on previously selected tokens using a self-attention mechanism to emulate the property of autoregression. This is inherently limiting for tasks such as controllable story generation where it may be necessary to condition on future plot events when writing a story. In this work, we propose Future Sight, a method for finetuning a pretrained generative transformer on the task of future conditioning. Transformer decoders are typically pretrained on the task of completing a context, one token at a time, by means of self-attention. Future Sight additionally enables a decoder to attend to an encoded future plot event. This motivates the decoder to expand on the context in a way that logically concludes with the provided future. During inference, the future plot event can be written by a human author to steer the narrative being generated in a certain direction. We evaluate the efficacy of our approach on a story generation task with human evaluators.  ( 2 min )
    VSVC: Backdoor attack against Keyword Spotting based on Voiceprint Selection and Voice Conversion. (arXiv:2212.10103v1 [cs.SD])
    Keyword spotting (KWS) based on deep neural networks (DNNs) has achieved massive success in voice control scenarios. However, training such DNN-based KWS systems often requires significant data and hardware resources. Manufacturers often entrust this process to a third-party platform. This makes the training process uncontrollable, as attackers can implant backdoors in the model by manipulating third-party training data. An effective backdoor attack can force the model to make specified judgments under certain conditions, i.e., triggers. In this paper, we design a backdoor attack scheme based on Voiceprint Selection and Voice Conversion, abbreviated as VSVC. Experimental results demonstrate that VSVC can achieve an average attack success rate close to 97% across four victim models while poisoning less than 1% of the training data.  ( 2 min )
    Real-time Health Monitoring of Heat Exchangers using Hypernetworks and PINNs. (arXiv:2212.10032v1 [cs.LG])
    We demonstrate a Physics-informed Neural Network (PINN)-based model for real-time health monitoring of a heat exchanger, which plays a critical role in improving the energy efficiency of thermal power plants. A hypernetwork-based approach is used to enable the domain-decomposed PINN to learn the thermal behavior of the heat exchanger in response to dynamic boundary conditions, eliminating the need to re-train. As a result, we achieve orders-of-magnitude reduction in inference time in comparison to existing PINNs, while maintaining accuracy on par with physics-based simulations. This makes the approach very attractive for predictive maintenance of the heat exchanger in digital twin environments.  ( 2 min )
    Dataless Knowledge Fusion by Merging Weights of Language Models. (arXiv:2212.09849v1 [cs.CL])
    Fine-tuning pre-trained language models has become the prevalent paradigm for building downstream NLP models. Oftentimes fine-tuned models are readily available but their training data is not, due to data privacy or intellectual property concerns. This creates a barrier to fusing knowledge across individual models to yield a better single model. In this paper, we study the problem of merging individual models built on different training data sets to obtain a single model that performs well both across all data set domains and can generalize on out-of-domain data. We propose a dataless knowledge fusion method that merges models in their parameter space, guided by weights that minimize prediction differences between the merged model and the individual models. Over a battery of evaluation settings, we show that the proposed method significantly outperforms baselines such as Fisher-weighted averaging or model ensembling. Further, we find that our method is a promising alternative to multi-task learning that can preserve or sometimes improve over the individual models without access to the training data. Finally, model merging is more efficient than training a multi-task model, thus making it applicable to a wider set of scenarios.  ( 2 min )
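    For a single linear layer, the stated objective, merged weights that minimize prediction differences to each individual model, admits a closed form, which the following toy sketch illustrates. The shapes, data, and the restriction to one linear map are assumptions for exposition; the paper applies a per-layer recipe in this spirit to full language models.

        # Closed-form merge of two linear layers minimizing squared prediction
        # differences to each individual model on that model's inputs.
        import numpy as np

        def merge_linear(Ws, Xs):
            # Ws: list of (d_in, d_out) weights; Xs: list of (n_i, d_in) inputs.
            G = sum(X.T @ X for X in Xs)
            return np.linalg.solve(G, sum(X.T @ X @ W for X, W in zip(Xs, Ws)))

        W1, W2 = np.random.randn(8, 3), np.random.randn(8, 3)
        X1, X2 = np.random.randn(100, 8), np.random.randn(120, 8)
        W_merged = merge_linear([W1, W2], [X1, X2])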
    Rumour detection using graph neural network and oversampling in benchmark Twitter dataset. (arXiv:2212.10080v1 [cs.CL])
    Recently, online social media has become a primary source for new information and misinformation or rumours. In the absence of an automatic rumour detection system, the propagation of rumours has increased manifold, leading to serious societal damage. In this work, we propose a novel method for building an automatic rumour detection system that focuses on oversampling to alleviate the fundamental challenge of class imbalance in the rumour detection task. Our oversampling method relies on contextualised data augmentation to generate synthetic samples for underrepresented classes in the dataset. The key idea exploits the selection of tweets in a thread for augmentation, achieved by introducing a non-random selection criterion that focuses the augmentation process on relevant tweets. Furthermore, we propose two graph neural networks (GNNs) to model non-linear conversations on a thread. To enhance the tweet representations, we employ a custom feature selection technique based on the state-of-the-art BERTweet model. Experiments on three publicly available datasets confirm that 1) our GNN models outperform the current state-of-the-art classifiers by more than 20% (F1-score); 2) our oversampling technique increases model performance by more than 9% (F1-score); 3) focusing on relevant tweets for data augmentation via non-random selection criteria can further improve the results; and 4) our method has superior capabilities to detect rumours at a very early stage.  ( 2 min )
    Policy learning "without'' overlap: Pessimism and generalized empirical Bernstein's inequality. (arXiv:2212.09900v1 [cs.LG])
    This paper studies offline policy learning, which aims at utilizing observations collected a priori (from either fixed or adaptively evolving behavior policies) to learn an optimal individualized decision rule that achieves the best overall outcomes for a given population. Existing policy learning methods rely on a uniform overlap assumption, i.e., the propensities of exploring all actions for all individual characteristics are lower bounded in the offline dataset; put differently, the performance of the existing methods depends on the worst-case propensity in the offline dataset. As one has no control over the data collection process, this assumption can be unrealistic in many situations, especially when the behavior policies are allowed to evolve over time with diminishing propensities for certain actions. In this paper, we propose a new algorithm that optimizes lower confidence bounds (LCBs) -- instead of point estimates -- of the policy values. The LCBs are constructed using knowledge of the behavior policies for collecting the offline data. Without assuming any uniform overlap condition, we establish a data-dependent upper bound for the suboptimality of our algorithm, which only depends on (i) the overlap for the optimal policy, and (ii) the complexity of the policy class we optimize over. As an implication, for adaptively collected data, we ensure efficient policy learning as long as the propensities for optimal actions are lower bounded over time, while those for suboptimal ones are allowed to diminish arbitrarily fast. In our theoretical analysis, we develop a new self-normalized type concentration inequality for inverse-propensity-weighting estimators, generalizing the well-known empirical Bernstein's inequality to unbounded and non-i.i.d. data.  ( 2 min )
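    A miniature version of the pessimistic principle, for a two-policy class with known behavior propensities, is sketched below: each policy's inverse-propensity-weighted value is penalized by an empirical-Bernstein-style width, and the policy with the highest lower confidence bound is selected. The width constants, toy data, and the textbook (bounded, i.i.d.) form of the inequality are assumptions; the paper develops a generalized self-normalized version for unbounded, non-i.i.d. data.

        # Pessimistic (LCB-based) policy selection over a tiny finite policy class.
        import numpy as np

        rng = np.random.default_rng(1)
        n = 2000
        x = rng.normal(size=n)                         # contexts
        a = rng.integers(0, 2, size=n)                 # logged actions
        prop = np.full(n, 0.5)                         # known behavior propensities
        r = (a == (x > 0)).astype(float)               # reward favors action 1[x > 0]

        policies = {"always0": lambda x: np.zeros(len(x), dtype=int),
                    "threshold": lambda x: (x > 0).astype(int)}
        delta, b = 0.1, 2.0                            # failure prob.; range of w * r
        best, best_lcb = None, -np.inf
        for name, pi in policies.items():
            z = (pi(x) == a) / prop * r                # IPW value samples
            width = (np.sqrt(2 * np.var(z) * np.log(1 / delta) / n)
                     + 7 * b * np.log(1 / delta) / (3 * (n - 1)))
            if z.mean() - width > best_lcb:
                best, best_lcb = name, z.mean() - width
        print(best, round(best_lcb, 3))                # "threshold" should win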
    Dexterous Manipulation from Images: Autonomous Real-World RL via Substep Guidance. (arXiv:2212.09902v1 [cs.LG])
    Complex and contact-rich robotic manipulation tasks, particularly those that involve multi-fingered hands and underactuated object manipulation, present a significant challenge to any control method. Methods based on reinforcement learning offer an appealing choice for such settings, as they can enable robots to learn to delicately balance contact forces and dexterously reposition objects without strong modeling assumptions. However, running reinforcement learning on real-world dexterous manipulation systems often requires significant manual engineering. This negates the benefits of autonomous data collection and ease of use that reinforcement learning should in principle provide. In this paper, we describe a system for vision-based dexterous manipulation that provides a "programming-free" approach for users to define new tasks and enable robots with complex multi-fingered hands to learn to perform them through interaction. The core principle underlying our system is that, in a vision-based setting, users should be able to provide high-level intermediate supervision that circumvents the challenges of teleoperation or kinesthetic teaching, allowing a robot not only to learn a task efficiently but also to practice autonomously. Our system includes a framework for users to define a final task and intermediate sub-tasks with image examples, a reinforcement learning procedure that learns the task autonomously without interventions, and experimental results with a four-finger robotic hand learning multi-stage object manipulation tasks directly in the real world, without simulation, manual modeling, or reward engineering.  ( 2 min )
    Using Machine Learning to Determine Morphologies of $z<1$ AGN Host Galaxies in the Hyper Suprime-Cam Wide Survey. (arXiv:2212.09984v1 [astro-ph.GA])
    We present a machine-learning framework to accurately characterize morphologies of Active Galactic Nucleus (AGN) host galaxies within $z<1$. We first use PSFGAN to decouple host galaxy light from the central point source, then we invoke the Galaxy Morphology Network (GaMorNet) to estimate whether the host galaxy is disk-dominated, bulge-dominated, or indeterminate. Using optical images from five bands of the HSC Wide Survey, we build models independently in three redshift bins: low $(0<z<0.25)$, medium $(0.25<z<0.5)$, and high $(0.5<z<1.0)$. By first training on a large number of simulated galaxies, then fine-tuning using far fewer classified real galaxies, our framework predicts the actual morphology for $\sim$ $60\%-70\%$ host galaxies from test sets, with a classification precision of $\sim$ $80\%-95\%$, depending on redshift bin. Specifically, our models achieve disk precision of $96\%/82\%/79\%$ and bulge precision of $90\%/90\%/80\%$ (for the 3 redshift bins), at thresholds corresponding to indeterminate fractions of $30\%/43\%/42\%$. The classification precision of our models has a noticeable dependency on host galaxy radius and magnitude. No strong dependency is observed on contrast ratio. Comparing classifications of real AGNs, our models agree well with traditional 2D fitting with GALFIT. The PSFGAN+GaMorNet framework does not depend on the choice of fitting functions or galaxy-related input parameters, runs orders of magnitude faster than GALFIT, and is easily generalizable via transfer learning, making it an ideal tool for studying AGN host galaxy morphology in forthcoming large imaging surveys.  ( 2 min )
    Insights into undergraduate pathways using course load analytics. (arXiv:2212.09974v1 [cs.CY])
    Course load analytics (CLA) inferred from LMS and enrollment features can offer a more accurate representation of course workload to students than credit hours and potentially aid in their course selection decisions. In this study, we produce and evaluate the first machine-learned predictions of student course load ratings and generalize our model to the full 10,000 course catalog of a large public university. We then retrospectively analyze longitudinal differences in the semester load of student course selections throughout their degree. CLA by semester shows that a student's first semester at the university is among their highest load semesters, as opposed to a credit hour-based analysis, which would indicate it is among their lowest. Investigating what role predicted course load may play in program retention, we find that students who maintain a semester load that is low as measured by credit hours but high as measured by CLA are more likely to leave their program of study. This discrepancy in course load is particularly pertinent in STEM and associated with high prerequisite courses. Our findings have implications for academic advising, institutional handling of the freshman experience, and student-facing analytics to help students better plan, anticipate, and prepare for their selected courses.  ( 2 min )
    Benchmarking person re-identification datasets and approaches for practical real-world implementations. (arXiv:2212.09981v1 [cs.CV])
    Recently, Person Re-Identification (Re-ID) has received a lot of attention. Large datasets containing labeled images of various individuals have been released, allowing researchers to develop and test many successful approaches. However, when such Re-ID models are deployed in new cities or environments, the task of searching for people within a network of security cameras is likely to face an important domain shift, thus resulting in decreased performance. Indeed, while most public datasets were collected in a limited geographic area, images from a new city present different features (e.g., people's ethnicity and clothing style, weather, architecture, etc.). In addition, the whole frames of the video streams must be converted into cropped images of people using pedestrian detection models, which behave differently from the human annotators who created the dataset used for training. To better understand the extent of this issue, this paper introduces a complete methodology to evaluate Re-ID approaches and training datasets with respect to their suitability for unsupervised deployment for live operations. This method is used to benchmark four Re-ID approaches on three datasets, providing insight and guidelines that can help to design better Re-ID pipelines in the future.  ( 2 min )
    Distributional Robustness Bounds Generalization Errors. (arXiv:2212.09962v1 [cs.LG])
    Bayesian methods, distributionally robust optimization methods, and regularization methods are three pillars of trustworthy machine learning hedging against distributional uncertainty, e.g., the uncertainty of an empirical distribution compared to the true underlying distribution. This paper investigates the connections among the three frameworks and, in particular, explores why these frameworks tend to have smaller generalization errors. Specifically, first, we suggest a quantitative definition for "distributional robustness", propose the concept of "robustness measure", and formalize several philosophical concepts in distributionally robust optimization. Second, we show that Bayesian methods are distributionally robust in the probably approximately correct (PAC) sense; In addition, by constructing a Dirichlet-process-like prior in Bayesian nonparametrics, it can be proven that any regularized empirical risk minimization method is equivalent to a Bayesian method. Third, we show that generalization errors of machine learning models can be characterized using the distributional uncertainty of the nominal distribution and the robustness measures of these machine learning models, which is a new perspective to bound generalization errors, and therefore, explain the reason why distributionally robust machine learning models, Bayesian models, and regularization models tend to have smaller generalization errors.  ( 2 min )
    Continuous Semi-Supervised Nonnegative Matrix Factorization. (arXiv:2212.09858v1 [cs.CL])
    Nonnegative matrix factorization can be used to automatically detect topics within a corpus in an unsupervised fashion. The technique amounts to an approximation of a nonnegative matrix as the product of two nonnegative matrices of lower rank. In this paper, we show this factorization can be combined with regression on a continuous response variable. In practice, the method performs better than regression done after topics are identified, and it retains interpretability.  ( 2 min )
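    A minimal reading of the joint objective is a nonnegative factorization X ≈ WH penalized by a regression of the continuous response y on the document-topic weights W, fitted by alternating projected gradient steps. The objective weighting, step sizes, shapes, and the plain least-squares update for the regression coefficients below are toy assumptions, not the paper's algorithm.

        # Joint NMF + regression by alternating projected gradient (toy sketch).
        import numpy as np

        rng = np.random.default_rng(2)
        n, m, k, lam, lr = 100, 50, 5, 1.0, 1e-3
        X, y = rng.random((n, m)), rng.normal(size=n)
        W, H, beta = rng.random((n, k)), rng.random((k, m)), np.zeros(k)

        for _ in range(500):
            # Objective: 0.5*||X - WH||^2 + 0.5*lam*||y - W beta||^2, with W, H >= 0.
            W = np.clip(W - lr * ((W @ H - X) @ H.T
                                  + lam * np.outer(W @ beta - y, beta)), 0, None)
            H = np.clip(H - lr * (W.T @ (W @ H - X)), 0, None)
            beta = np.linalg.solve(W.T @ W + 1e-6 * np.eye(k), W.T @ y)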
  • Open

    Application-Driven Learning: A Closed-Loop Prediction and Optimization Approach Applied to Dynamic Reserves and Demand Forecasting. (arXiv:2102.13273v4 [math.OC] CROSS LISTED)
    Forecasting and decision-making are generally modeled as two sequential steps with no feedback, following an open-loop approach. In this paper, we present application-driven learning, a new closed-loop framework in which the processes of forecasting and decision-making are merged and co-optimized through a bilevel optimization problem. We present our methodology in a general format and prove that the solution converges to the best estimator in terms of the expected cost of the selected application. Then, we propose two solution methods: an exact method based on the KKT conditions of the second-level problem and a scalable heuristic approach suitable for decomposition methods. The proposed methodology is applied to the relevant problem of defining dynamic reserve requirements and conditional load forecasts, offering an alternative approach to current \emph{ad hoc} procedures implemented in industry practices. We benchmark our methodology with the standard sequential least-squares forecast and dispatch planning process. We apply the proposed methodology to an illustrative system and to a wide range of instances, from dozens of buses to large-scale realistic systems with thousands of buses. Our results show that the proposed methodology is scalable and yields consistently better performance than the standard open-loop approach.
    Group Meritocratic Fairness in Linear Contextual Bandits. (arXiv:2206.03150v3 [stat.ML] UPDATED)
    We study the linear contextual bandit problem where an agent has to select one candidate from a pool and each candidate belongs to a sensitive group. In this setting, candidates' rewards may not be directly comparable between groups, for example when the agent is an employer hiring candidates from different ethnic groups and some groups have a lower reward due to discriminatory bias and/or social injustice. We propose a notion of fairness that states that the agent's policy is fair when it selects a candidate with highest relative rank, which measures how good the reward is when compared to candidates from the same group. This is a very strong notion of fairness, since the relative rank is not directly observed by the agent and depends on the underlying reward model and on the distribution of rewards. Thus we study the problem of learning a policy which approximates a fair policy under the condition that the contexts are independent between groups and the distribution of rewards of each group is absolutely continuous. In particular, we design a greedy policy which at each round constructs a ridge regression estimate from the observed context-reward pairs, and then computes an estimate of the relative rank of each candidate using the empirical cumulative distribution function. We prove that, despite its simplicity and the lack of an initial exploration phase, the greedy policy achieves, up to log factors and with high probability, a fair pseudo-regret of order $\sqrt{dT}$ after $T$ rounds, where $d$ is the dimension of the context vectors. The policy also satisfies demographic parity at each round when averaged over all possible information available before the selection. Finally, we use simulated settings and experiments on the US census data to show that our policy achieves sub-linear fair pseudo-regret also in practice.
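    One round of the greedy policy can be sketched as follows: per-group ridge estimates score each candidate, the score is mapped to a within-group relative rank via an empirical CDF of past predicted rewards, and the candidate with the highest rank is selected. The pre-fitted estimates, the stored per-group histories, and the toy candidate pool are all illustrative assumptions.

        # One selection round of a relative-rank greedy policy (toy sketch).
        import numpy as np

        rng = np.random.default_rng(3)
        d, groups = 5, [0, 1]
        theta = {g: rng.normal(size=d) for g in groups}      # assumed ridge estimates
        history = {g: rng.normal(size=200) for g in groups}  # past predicted rewards

        candidates = [(g, rng.normal(size=d)) for g in groups for _ in range(3)]

        def relative_rank(g, x):
            return np.mean(history[g] <= theta[g] @ x)       # empirical CDF at the score

        chosen_group, chosen_x = max(candidates, key=lambda c: relative_rank(*c))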
    Almost Cost-Free Communication in Federated Best Arm Identification. (arXiv:2208.09215v2 [cs.LG] UPDATED)
    We study the problem of best arm identification in a federated learning multi-armed bandit setup with a central server and multiple clients. Each client is associated with a multi-armed bandit in which each arm yields {\em i.i.d.}\ rewards following a Gaussian distribution with an unknown mean and known variance. The set of arms is assumed to be the same at all the clients. We define two notions of best arm -- local and global. The local best arm at a client is the arm with the largest mean among the arms local to the client, whereas the global best arm is the arm with the largest average mean across all the clients. We assume that each client can only observe the rewards from its local arms and thereby estimate its local best arm. The clients communicate with a central server on uplinks that entail a cost of $C\ge0$ units per usage per uplink. The global best arm is estimated at the server. The goal is to identify the local best arms and the global best arm with minimal total cost, defined as the sum of the total number of arm selections at all the clients and the total communication cost, subject to an upper bound on the error probability. We propose a novel algorithm {\sc FedElim} that is based on successive elimination and communicates only in exponential time steps and obtain a high probability instance-dependent upper bound on its total cost. The key takeaway from our paper is that for any $C\geq 0$ and error probabilities sufficiently small, the total number of arm selections (resp.\ the total cost) under {\sc FedElim} is at most~$2$ (resp.~$3$) times the maximum total number of arm selections under its variant that communicates in every time step. Additionally, we show that the latter is optimal in expectation up to a constant factor, thereby demonstrating that communication is almost cost-free in {\sc FedElim}. We numerically validate the efficacy of {\sc FedElim}.
    A general approximation lower bound in $L^p$ norm, with applications to feed-forward neural networks. (arXiv:2206.04360v2 [cs.LG] UPDATED)
    We study the fundamental limits to the expressive power of neural networks. Given two sets $F$, $G$ of real-valued functions, we first prove a general lower bound on how well functions in $F$ can be approximated in $L^p(\mu)$ norm by functions in $G$, for any $p \geq 1$ and any probability measure $\mu$. The lower bound depends on the packing number of $F$, the range of $F$, and the fat-shattering dimension of $G$. We then instantiate this bound to the case where $G$ corresponds to a piecewise-polynomial feed-forward neural network, and describe in detail the application to two sets $F$: H{\"o}lder balls and multivariate monotonic functions. Besides matching (known or new) upper bounds up to log factors, our lower bounds shed some light on the similarities or differences between approximation in $L^p$ norm or in sup norm, solving an open question by DeVore et al. (2021). Our proof strategy differs from the sup norm case and uses a key probability result of Mendelson (2002).
    Local Identifiability of Deep ReLU Neural Networks: the Theory. (arXiv:2206.07424v2 [math.ST] UPDATED)
    Is a sample rich enough to determine, at least locally, the parameters of a neural network? To answer this question, we introduce a new local parameterization of a given deep ReLU neural network by fixing the values of some of its weights. This allows us to define local lifting operators whose inverses are charts of a smooth manifold of a high dimensional space. The function implemented by the deep ReLU neural network composes the local lifting with a linear operator which depends on the sample. We derive from this convenient representation a geometrical necessary and sufficient condition of local identifiability. Looking at tangent spaces, the geometrical condition provides: 1/ a sharp and testable necessary condition of identifiability and 2/ a sharp and testable sufficient condition of local identifiability. The validity of the conditions can be tested numerically using backpropagation and matrix rank computations.
    Probabilistic quantile factor analysis. (arXiv:2212.10301v1 [econ.EM])
    This paper extends quantile factor analysis to a probabilistic variant that incorporates regularization and computationally efficient variational approximations. By means of synthetic and real data experiments it is established that the proposed estimator can achieve, in many cases, better accuracy than a recently proposed loss-based estimator. We contribute to the literature on measuring uncertainty by extracting new indexes of low, medium and high economic policy uncertainty, using the probabilistic quantile factor methodology. Medium and high indexes have clear contractionary effects, while the low index is benign for the economy, showing that not all manifestations of uncertainty are the same.
    AskewSGD : An Annealed interval-constrained Optimisation method to train Quantized Neural Networks. (arXiv:2211.03741v2 [stat.ML] UPDATED)
    In this paper, we develop a new algorithm, Annealed Skewed SGD - AskewSGD - for training deep neural networks (DNNs) with quantized weights. First, we formulate the training of quantized neural networks (QNNs) as a smoothed sequence of interval-constrained optimization problems. Then, we propose a new first-order stochastic method, AskewSGD, to solve each constrained optimization subproblem. Unlike algorithms with active sets and feasible directions, AskewSGD avoids projections or optimization under the entire feasible set and allows iterates that are infeasible. The numerical complexity of AskewSGD is comparable to existing approaches for training QNNs, such as the straight-through gradient estimator used in BinaryConnect, or other state of the art methods (ProxQuant, LUQ). We establish convergence guarantees for AskewSGD (under general assumptions for the objective function). Experimental results show that the AskewSGD algorithm performs better than or on par with state of the art methods in classical benchmarks.
    Machine Learning based Framework for Robust Price-Sensitivity Estimation with Application to Airline Pricing. (arXiv:2205.01875v2 [stat.ML] UPDATED)
    We consider the problem of dynamic pricing of a product in the presence of feature-dependent price sensitivity. Developing practical algorithms that can estimate price elasticities robustly, especially when information about no purchases (losses) is not available, to drive such automated pricing systems is a challenge faced by many industries. Based on the Poisson semi-parametric approach, we construct a flexible yet interpretable demand model where the price-related part is parametric while the remaining (nuisance) part of the model is non-parametric and can be modeled via sophisticated machine learning (ML) techniques. The estimation of the price-sensitivity parameters of this model via direct one-stage regression techniques may lead to biased estimates due to regularization. To address this concern, we propose a two-stage estimation methodology which makes the estimation of the price-sensitivity parameters robust to biases in the estimators of the nuisance parameters of the model. In the first stage, we construct estimators of observed purchases and prices given the feature vector using sophisticated ML estimators such as deep neural networks. Utilizing the estimators from the first stage, in the second stage we leverage a Bayesian dynamic generalized linear model to estimate the price-sensitivity parameters. We test the performance of the proposed estimation schemes on simulated and real sales transaction data from the airline industry. Our numerical studies demonstrate that our proposed two-stage approach reduces the estimation error in price-sensitivity parameters from 25\% to 4\% in realistic simulation settings. The two-stage estimation techniques proposed in this work allow practitioners to leverage modern ML techniques to robustly estimate price-sensitivities while still maintaining interpretability and allowing ease of validation of their various constituent parts.  ( 3 min )
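    The debiasing intuition behind the two-stage scheme resembles double/orthogonalized machine learning, which the partially linear stand-in below illustrates: first-stage ML models predict price and (log-)sales from features, and the price coefficient is recovered from a residual-on-residual regression. The paper's second stage is a Bayesian dynamic GLM rather than this least-squares step, so the snippet is only a simplified analogue under a synthetic data-generating process.

        # Two-stage, double-ML-flavored price-sensitivity sketch (toy DGP).
        import numpy as np
        from sklearn.ensemble import GradientBoostingRegressor

        rng = np.random.default_rng(4)
        n = 2000
        Z = rng.normal(size=(n, 6))                        # features
        price = Z[:, 0] + 0.5 * rng.normal(size=n)
        log_sales = Z.sum(axis=1) - 1.5 * price + 0.3 * rng.normal(size=n)

        m_p = GradientBoostingRegressor().fit(Z, price)    # stage 1: nuisance models
        m_s = GradientBoostingRegressor().fit(Z, log_sales)
        u = price - m_p.predict(Z)
        v = log_sales - m_s.predict(Z)
        beta_hat = (u @ v) / (u @ u)                       # stage 2: price coefficient
        print(round(beta_hat, 2))                          # should be near -1.5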
    Nonparametric plug-in classifier for multiclass classification of S.D.E. paths. (arXiv:2212.10259v1 [math.ST])
    We study the multiclass classification problem where the features come from a mixture of time-homogeneous diffusions. Specifically, the classes are discriminated by their drift functions, while the diffusion coefficient is common to all classes and unknown. In this framework, we build a plug-in classifier that relies on nonparametric estimators of the drift and diffusion functions. We first establish the consistency of our classification procedure under mild assumptions and then provide rates of convergence under different sets of assumptions. Finally, a numerical study supports our theoretical findings.
    HyperBO+: Pre-training a universal prior for Bayesian optimization with hierarchical Gaussian processes. (arXiv:2212.10538v1 [cs.LG])
    Bayesian optimization (BO), while proven highly effective for many black-box function optimization tasks, requires practitioners to carefully select priors that well model their functions of interest. Rather than specifying by hand, researchers have investigated transfer-learning-based methods to automatically learn the priors, e.g. multi-task BO (Swersky et al., 2013), few-shot BO (Wistuba and Grabocka, 2021) and HyperBO (Wang et al., 2022). However, those prior learning methods typically assume that the input domains are the same for all tasks, weakening their ability to use observations on functions with different domains or generalize the learned priors to BO on different search spaces. In this work, we present HyperBO+: a pre-training approach for hierarchical Gaussian processes that enables the same prior to work universally for Bayesian optimization on functions with different domains. We propose a two-step pre-training method and analyze its appealing asymptotic properties and benefits to BO both theoretically and empirically. On real-world hyperparameter tuning tasks that involve multiple search spaces, we demonstrate that HyperBO+ is able to generalize to unseen search spaces and achieves lower regrets than competitive baselines.
    Online Statistical Inference for Stochastic Optimization via Kiefer-Wolfowitz Methods. (arXiv:2102.03389v4 [math.ST] UPDATED)
    This paper investigates the problem of online statistical inference of model parameters in stochastic optimization problems via the Kiefer-Wolfowitz algorithm with random search directions. We first present the asymptotic distribution for the Polyak-Ruppert-averaging type Kiefer-Wolfowitz (AKW) estimators, whose asymptotic covariance matrices depend on the distribution of search directions and the function-value query complexity. The distributional result reflects the trade-off between statistical efficiency and function query complexity. We further analyze the choice of random search directions to minimize certain summary statistics of the asymptotic covariance matrix. Based on the asymptotic distribution, we conduct online statistical inference by providing two construction procedures of valid confidence intervals.  ( 2 min )
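    The core Kiefer-Wolfowitz iteration with random search directions and Polyak-Ruppert averaging is compact enough to sketch; the confidence-interval constructions in the paper sit on top of the averaged iterates. The quadratic objective, gain sequences, spherical direction law, and noise model below are toy assumptions.

        # Kiefer-Wolfowitz with random directions and Polyak-Ruppert averaging.
        import numpy as np

        rng = np.random.default_rng(5)
        d, T = 5, 20000
        f = lambda x: 0.5 * np.sum((x - 1.0) ** 2) + 0.01 * rng.normal()  # noisy query
        x, xbar = np.zeros(d), np.zeros(d)
        for t in range(1, T + 1):
            v = rng.normal(size=d); v /= np.linalg.norm(v)   # uniform spherical direction
            h = t ** -0.25                                   # finite-difference spacing
            # Factor d corrects for E[v v^T] = I / d under spherical directions.
            g = d * (f(x + h * v) - f(x - h * v)) / (2 * h) * v
            x -= (1.0 / t ** 0.75) * g                       # KW step
            xbar += (x - xbar) / t                           # Polyak-Ruppert average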
    Deep Riemannian Networks for EEG Decoding. (arXiv:2212.10426v1 [cs.LG])
    State-of-the-art performance in electroencephalography (EEG) decoding tasks is currently often achieved with either Deep-Learning or Riemannian-Geometry-based decoders. Recently, there has been growing interest in Deep Riemannian Networks (DRNs), which may combine the advantages of both previous classes of methods. However, there is still a range of topics where additional insight is needed to pave the way for a more widespread application of DRNs in EEG. These include architecture design questions such as network size and end-to-end ability, as well as model training questions. How these factors affect model performance has not been explored. Additionally, it is not clear how the data within these networks is transformed, and whether this would correlate with traditional EEG decoding. Our study aims to lay the groundwork in the area of these topics through the analysis of DRNs for EEG with a wide range of hyperparameters. Networks were tested on two public EEG datasets and compared with state-of-the-art ConvNets. Here we propose end-to-end EEG SPDNet (EE(G)-SPDNet), and we show that this wide, end-to-end DRN can outperform the ConvNets, and in doing so use physiologically plausible frequency regions. We also show that the end-to-end approach learns more complex filters than traditional band-pass filters targeting the classical alpha, beta, and gamma frequency bands of the EEG, and that performance can benefit from channel-specific filtering approaches. Additionally, architectural analysis revealed areas for further improvement due to the possible loss of Riemannian-specific information throughout the network. Our study thus shows how to design and train DRNs to infer task-related information from the raw EEG without the need for handcrafted filterbanks and highlights the potential of end-to-end DRNs such as EE(G)-SPDNet for high-performance EEG decoding.  ( 2 min )
    A Meta-Learning Approach for Training Explainable Graph Neural Networks. (arXiv:2109.09426v2 [cs.LG] UPDATED)
    In this paper, we investigate the degree of explainability of graph neural networks (GNNs). Existing explainers work by finding global/local subgraphs to explain a prediction, but they are applied after a GNN has already been trained. Here, we propose a meta-learning framework for improving the level of explainability of a GNN directly at training time, by steering the optimization procedure towards what we call `interpretable minima'. Our framework (called MATE, MetA-Train to Explain) jointly trains a model to solve the original task, e.g., node classification, and to provide easily processable outputs for downstream algorithms that explain the model's decisions in a human-friendly way. In particular, we meta-train the model's parameters to quickly minimize the error of an instance-level GNNExplainer trained on-the-fly on randomly sampled nodes. The final internal representation relies upon a set of features that can be `better' understood by an explanation algorithm, e.g., another instance of GNNExplainer. Our model-agnostic approach can improve the explanations produced for different GNN architectures and use any instance-based explainer to drive this process. Experiments on synthetic and real-world datasets for node and graph classification show that we can produce models that are consistently easier to explain by different algorithms. Furthermore, this increase in explainability comes at no cost for the accuracy of the model.
    Posterior and Computational Uncertainty in Gaussian Processes. (arXiv:2205.15449v3 [cs.LG] UPDATED)
    Gaussian processes scale prohibitively with the size of the dataset. In response, many approximation methods have been developed, which inevitably introduce approximation error. This additional source of uncertainty, due to limited computation, is entirely ignored when using the approximate posterior. Therefore in practice, GP models are often as much about the approximation method as they are about the data. Here, we develop a new class of methods that provides consistent estimation of the combined uncertainty arising from both the finite number of data observed and the finite amount of computation expended. The most common GP approximations map to an instance in this class, such as methods based on the Cholesky factorization, conjugate gradients, and inducing points. For any method in this class, we prove (i) convergence of its posterior mean in the associated RKHS, (ii) decomposability of its combined posterior covariance into mathematical and computational covariances, and (iii) that the combined variance is a tight worst-case bound for the squared error between the method's posterior mean and the latent function. Finally, we empirically demonstrate the consequences of ignoring computational uncertainty and show how implicitly modeling it improves generalization performance on benchmark datasets.
    Roto-translated Local Coordinate Frames For Interacting Dynamical Systems. (arXiv:2110.14961v2 [cs.LG] UPDATED)
    Modelling interactions is critical in learning complex dynamical systems, namely systems of interacting objects with highly non-linear and time-dependent behaviour. A large class of such systems can be formalized as $\textit{geometric graphs}$, $\textit{i.e.}$, graphs with nodes positioned in the Euclidean space given an $\textit{arbitrarily}$ chosen global coordinate system, for instance vehicles in a traffic scene. Notwithstanding the arbitrary global coordinate system, the governing dynamics of the respective dynamical systems are invariant to rotations and translations, also known as $\textit{Galilean invariance}$. As ignoring these invariances leads to worse generalization, in this work we propose local coordinate frames per node-object to induce roto-translation invariance to the geometric graph of the interacting dynamical system. Further, the local coordinate frames allow for a natural definition of anisotropic filtering in graph neural networks. Experiments in traffic scenes, 3D motion capture, and colliding particles demonstrate that the proposed approach comfortably outperforms the recent state-of-the-art.  ( 2 min )
    Fixed and adaptive landmark sets for finite pseudometric spaces. (arXiv:2212.09826v1 [cs.CG])
    Topological data analysis (TDA) is an expanding field that leverages principles and tools from algebraic topology to quantify structural features of data sets or transform them into more manageable forms. As its theoretical foundations have been developed, TDA has shown promise in extracting useful information from high-dimensional, noisy, and complex data such as those used in biomedicine. To operate efficiently, these techniques may employ landmark samplers, either random or heuristic. The heuristic maxmin procedure obtains a roughly even distribution of sample points by implicitly constructing a cover comprising sets of uniform radius. However, issues arise with data that vary in density or include points with multiplicities, as are common in biomedicine. We propose an analogous procedure, "lastfirst" based on ranked distances, which implies a cover comprising sets of uniform cardinality. We first rigorously define the procedure and prove that it obtains landmarks with desired properties. We then perform benchmark tests and compare its performance to that of maxmin, on feature detection and class prediction tasks involving simulated and real-world biomedical data. Lastfirst is more general than maxmin in that it can be applied to any data on which arbitrary (and not necessarily symmetric) pairwise distances can be computed. Lastfirst is more computationally costly, but our implementation scales at the same rate as maxmin. We find that lastfirst achieves comparable performance on prediction tasks and outperforms maxmin on homology detection tasks. Where the numerical values of similarity measures are not meaningful, as in many biomedical contexts, lastfirst sampling may also improve interpretability.  ( 2 min )
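    For concreteness, a sketch of greedy maxmin sampling together with a simplified rank-based analogue in the spirit of lastfirst; the rank-based variant below is a loose reading of the abstract (replace distances by out-ranks, then select greedily), not the authors' exact procedure.

        import numpy as np

        def maxmin(D, k, seed=0):
            """Greedy maxmin landmarks on a pairwise-distance matrix D:
            repeatedly pick the point farthest from the landmarks so far."""
            L = [seed]
            for _ in range(k - 1):
                L.append(int(np.argmax(D[L].min(axis=0))))
            return L

        def lastfirst(D, k, seed=0):
            """Simplified rank-based analogue: replace each distance by its
            rank among the distances out of the same point, so the implied
            cover has roughly uniform cardinality instead of uniform radius.
            A sketch of the idea, not the paper's exact algorithm."""
            ranks = D.argsort(axis=1).argsort(axis=1)
            return maxmin(ranks.astype(float), k, seed)

        rng = np.random.default_rng(0)
        X = rng.normal(size=(200, 3))
        D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
        print(maxmin(D, 5), lastfirst(D, 5))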
    A Generalized Variable Importance Metric and Estimator for Black Box Machine Learning Models. (arXiv:2212.09931v1 [stat.CO])
    The aim of this study is to define the importance of predictors for black box machine learning methods, where the prediction function can be highly non-additive and cannot be represented by statistical parameters. In this paper we define a ``Generalized Variable Importance Metric (GVIM)'' using the true conditional expectation function for a continuous or a binary response variable. We further show that the defined GVIM can be represented as a function of the squared Conditional Average Treatment Effect (CATE) for multinomial and continuous predictors. We then propose how the metric can be estimated using any machine learning model. Finally, we demonstrate the properties of the estimator through multiple simulations.  ( 2 min )
    Generalized Simultaneous Perturbation Stochastic Approximation with Reduced Estimator Bias. (arXiv:2212.10477v1 [cs.LG])
    We present in this paper a family of generalized simultaneous perturbation stochastic approximation (G-SPSA) estimators that estimate the gradient of the objective using noisy function measurements, but where the number of function measurements and the form of the gradient estimator is guided by the desired estimator bias. In particular, estimators with more function measurements are seen to result in lower bias. We provide an analysis of convergence of the generalized SPSA algorithm, and point to possible future directions.  ( 2 min )
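    The basic member of this family is the classic two-measurement SPSA estimator, sketched below on a noiseless toy objective; the generalized estimators trade additional function measurements for lower bias.

        import numpy as np

        def spsa_grad(f, x, c=1e-2, rng=np.random.default_rng(0)):
            """Two-measurement SPSA gradient estimate: perturb all coordinates
            simultaneously with a Rademacher vector and difference the two
            function values."""
            delta = rng.choice([-1.0, 1.0], size=x.shape)
            return (f(x + c * delta) - f(x - c * delta)) / (2 * c * delta)

        f = lambda x: np.sum(x ** 2)    # noisy in practice; noiseless here
        x = np.array([1.0, -2.0, 0.5])
        for t in range(1, 501):         # Robbins-Monro step sizes
            x -= (0.1 / t) * spsa_grad(f, x)
        print(x)                        # approaches the minimizer at 0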
    Policy learning "without'' overlap: Pessimism and generalized empirical Bernstein's inequality. (arXiv:2212.09900v1 [cs.LG])
    This paper studies offline policy learning, which aims at utilizing observations collected a priori (from either fixed or adaptively evolving behavior policies) to learn an optimal individualized decision rule that achieves the best overall outcomes for a given population. Existing policy learning methods rely on a uniform overlap assumption, i.e., the propensities of exploring all actions for all individual characteristics are lower bounded in the offline dataset; put differently, the performance of the existing methods depends on the worst-case propensity in the offline dataset. As one has no control over the data collection process, this assumption can be unrealistic in many situations, especially when the behavior policies are allowed to evolve over time with diminishing propensities for certain actions. In this paper, we propose a new algorithm that optimizes lower confidence bounds (LCBs) -- instead of point estimates -- of the policy values. The LCBs are constructed using knowledge of the behavior policies for collecting the offline data. Without assuming any uniform overlap condition, we establish a data-dependent upper bound for the suboptimality of our algorithm, which only depends on (i) the overlap for the optimal policy, and (ii) the complexity of the policy class we optimize over. As an implication, for adaptively collected data, we ensure efficient policy learning as long as the propensities for optimal actions are lower bounded over time, while those for suboptimal ones are allowed to diminish arbitrarily fast. In our theoretical analysis, we develop a new self-normalized type concentration inequality for inverse-propensity-weighting estimators, generalizing the well-known empirical Bernstein's inequality to unbounded and non-i.i.d. data.  ( 2 min )
    Beyond Surrogate Modeling: Learning the Local Volatility Via Shape Constraints. (arXiv:2212.09957v1 [q-fin.MF])
    We explore the abilities of two machine learning approaches for no-arbitrage interpolation of European vanilla option prices, which jointly yield the corresponding local volatility surface: a finite dimensional Gaussian process (GP) regression approach under no-arbitrage constraints based on prices, and a neural net (NN) approach with penalization of arbitrages based on implied volatilities. We demonstrate the performance of these approaches relative to the SSVI industry standard. The GP approach is proven arbitrage-free, whereas arbitrages are only penalized under the SSVI and NN approaches. The GP approach obtains the best out-of-sample calibration error and provides uncertainty quantification. The NN approach yields a smoother local volatility and a better backtesting performance, as its training criterion incorporates a local volatility regularization term.  ( 2 min )
    Cell-Free Data Power Control Via Scalable Multi-Objective Bayesian Optimisation. (arXiv:2212.10299v1 [eess.SY])
    Cell-free multi-user multiple input multiple output networks are a promising alternative to classical cellular architectures, since they have the potential to provide uniform service quality and high resource utilisation over the entire coverage area of the network. To realise this potential, previous works have developed radio resource management mechanisms using various optimisation engines. In this work, we consider the problem of overall ergodic spectral efficiency maximisation in the context of uplink-downlink data power control in cell-free networks. To solve this problem in large networks, and to address convergence-time limitations, we apply scalable multi-objective Bayesian optimisation. Furthermore, we discuss how an intersection of multi-fidelity emulation and Bayesian optimisation can improve radio resource management in cell-free networks.  ( 2 min )
    Distributional Robustness Bounds Generalization Errors. (arXiv:2212.09962v1 [cs.LG])
    Bayesian methods, distributionally robust optimization methods, and regularization methods are three pillars of trustworthy machine learning hedging against distributional uncertainty, e.g., the uncertainty of an empirical distribution compared to the true underlying distribution. This paper investigates the connections among the three frameworks and, in particular, explores why these frameworks tend to have smaller generalization errors. Specifically, first, we suggest a quantitative definition for "distributional robustness", propose the concept of "robustness measure", and formalize several philosophical concepts in distributionally robust optimization. Second, we show that Bayesian methods are distributionally robust in the probably approximately correct (PAC) sense; in addition, by constructing a Dirichlet-process-like prior in Bayesian nonparametrics, it can be proven that any regularized empirical risk minimization method is equivalent to a Bayesian method. Third, we show that generalization errors of machine learning models can be characterized using the distributional uncertainty of the nominal distribution and the robustness measures of these machine learning models, offering a new perspective for bounding generalization errors and explaining why distributionally robust machine learning models, Bayesian models, and regularization models tend to have smaller generalization errors.  ( 2 min )
    KINet: Keypoint Interaction Networks for Unsupervised Forward Modeling. (arXiv:2202.09006v2 [cs.CV] UPDATED)
    Object-centric representation is an essential abstraction for forward prediction. Most existing forward models learn this representation through extensive supervision (e.g., object class and bounding box) although such ground-truth information is not readily accessible in reality. To address this, we introduce KINet (Keypoint Interaction Network) -- an end-to-end unsupervised framework to reason about object interactions based on a keypoint representation. Using visual observations, our model learns to associate objects with keypoint coordinates and discovers a graph representation of the system as a set of keypoint embeddings and their relations. It then learns an action-conditioned forward model using contrastive estimation to predict future keypoint states. By learning to perform physical reasoning in the keypoint space, our model automatically generalizes to scenarios with a different number of objects, novel backgrounds, and unseen object geometries. Experiments demonstrate the effectiveness of our model in accurately performing forward prediction and learning plannable object-centric representations which can also be used in downstream robotic manipulation tasks.  ( 2 min )
    Uncertainty Quantification of MLE for Entity Ranking with Covariates. (arXiv:2212.09961v1 [stat.ME])
    This paper is concerned with statistical estimation and inference for ranking problems based on pairwise comparisons with additional covariate information, such as the attributes of the compared items. Despite extensive studies, few prior works investigate this problem under the more realistic setting where covariate information exists. To tackle this issue, we propose a novel model, the Covariate-Assisted Ranking Estimation (CARE) model, that extends the well-known Bradley-Terry-Luce (BTL) model by incorporating the covariate information. Specifically, instead of assuming every compared item has a fixed latent score $\{\theta_i^*\}_{i=1}^n$, we assume the underlying scores are given by $\{\alpha_i^*+{x}_i^\top\beta^*\}_{i=1}^n$, where $\alpha_i^*$ and ${x}_i^\top\beta^*$ represent latent baseline and covariate score of the $i$-th item, respectively. We impose natural identifiability conditions and derive the $\ell_{\infty}$- and $\ell_2$-optimal rates for the maximum likelihood estimator of $\{\alpha_i^*\}_{i=1}^{n}$ and $\beta^*$ under a sparse comparison graph, using a novel `leave-one-out' technique (Chen et al., 2019). To conduct statistical inferences, we further derive asymptotic distributions for the MLE of $\{\alpha_i^*\}_{i=1}^n$ and $\beta^*$ with minimal sample complexity. This allows us to answer the question of whether some covariates have any explanatory power for latent scores and to threshold some sparse parameters to improve the ranking performance. We improve the approximation method used in (Gao et al., 2021) for the BTL model and generalize it to the CARE model. Moreover, we validate our theoretical results through large-scale numerical studies and an application to the mutual fund stock holding dataset.  ( 2 min )
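    A minimal sketch of the CARE likelihood: item $i$ beats item $j$ with probability $\sigma\big((\alpha_i + x_i^\top\beta) - (\alpha_j + x_j^\top\beta)\big)$, and the MLE can be obtained by minimizing the negative log-likelihood. This sketch omits the paper's identifiability constraints (e.g., the $\alpha_i$ summing to zero), and the data below are synthetic.

        import numpy as np
        from scipy.optimize import minimize

        def care_negloglik(params, comps, X):
            """Negative log-likelihood of the covariate-assisted BTL model.
            comps is a list of (i, j, w) with w = 1 if i beat j, else 0."""
            n, d = X.shape
            alpha, beta = params[:n], params[n:]
            s = alpha + X @ beta                            # latent scores
            diffs = np.array([s[i] - s[j] for i, j, _ in comps])
            wins = np.array([w for _, _, w in comps], dtype=float)
            # log sigmoid(d) = d - log(1 + e^d); use logaddexp for stability
            return -np.sum(wins * diffs - np.logaddexp(0.0, diffs))

        rng = np.random.default_rng(0)
        n, d = 20, 3
        X = rng.normal(size=(n, d))
        comps = [(i, j, int(rng.integers(0, 2)))
                 for i in range(n) for j in range(i + 1, n)]
        res = minimize(care_negloglik, np.zeros(n + d), args=(comps, X))
        alpha_hat, beta_hat = res.x[:n], res.x[n:]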

  • Open

    Building a list of advanced, GPT-3 tier chatbots
    submitted by /u/gakowalski [link] [comments]  ( 48 min )
    AI Dream 73 - BEST AI ANIMATION 2022 - MASTERPIECE
    submitted by /u/LordPewPew777 [link] [comments]  ( 48 min )
    is there an AI bot for Chat that I can self-host?
    Hey, I've been playing with Stable Diffusion for a while and I was wondering if there is an AI chat bot I could self-host; if so, what would the hardware requirements be for that AI bot? submitted by /u/Gnar8520 [link] [comments]  ( 49 min )
    AI assistants help developers produce buggy code, study shows
    well finally I have something in common with AI, I produce buggy code, it does too!! We are like brothers now. Computer scientists from Stanford University have found that programmers who accept help from AI tools like GitHub Copilot produce less secure code than those who fly solo. In a paper titled "Do Users Write More Insecure Code with AI Assistants?", Stanford boffins Neil Perry, Megha Srivastava, Deepak Kumar, and Dan Boneh answer this question. Worse still, they found that AI help tends to delude developers about the quality of their output. "We found that participants with access to an AI assistant often produced more security vulnerabilities than those without access, with particularly significant results for string encryption and SQL injection," the authors state in their paper. "Surprisingly, we also found that participants provided access to an AI assistant were more likely to believe that they wrote secure code than those without access to the AI assistant." The Stanford user study involved 47 people with varying levels of experience, including undergraduate students, graduate students, and industry professionals. Participants were asked to write code in response to five prompts while being monitored. By the end of the study, the authors concluded that AI assistants should be viewed with caution because they can mislead inexperienced developers and create security vulnerabilities. But they also hope this study will push the developers of these AIs to work on the coding abilities of their systems before deployment. As one study participant is said to have remarked about AI assistance, "I hope this gets deployed. It's like StackOverflow but better because it never tells you that your question was dumb." Comedy gold right there! This is from the AI With Vibes Newsletter, read the full issue here: https://aiwithvibes.beehiiv.com/p/openai-releases-new-3d-generator-ai-study-shows-ai-helps-devs-produce-buggy-code submitted by /u/Mk_Makanaki [link] [comments]  ( 54 min )
    Photos Used to Generate AI Images for Client So Photographer Never Shoots Again
    submitted by /u/magenta_placenta [link] [comments]  ( 55 min )
    I created a complete (audio) book in 10+ languages in a few days using generative AI: Here is what I learned
    submitted by /u/fillsoko [link] [comments]  ( 49 min )
    ChatGPT will Change How We Live, Work & Create Forever
    submitted by /u/deen1802 [link] [comments]  ( 49 min )
    Build this idea: An Intercom like assistant with hundreds of AI + non-AI apps
    The purpose of this post is to encourage people in AI to consider an idea that I unfortunately am not in a position to execute. Idea: An intercom-like assistant with hundreds of AI and non-AI apps like an image generator, a writing generator, avatar generation, background verification, etc. Today there are dozens of AI app platforms like Jasper AI, RunwayML, etc. The challenge these guys face is companies like Notion and Canva building AI directly into their platforms. Canva built an image generation app and Notion built an embedded writing AI. So how do the new AI companies compete? By exploiting the distribution of incumbents. Since every website wants to have these AI apps, why not give them a little JavaScript embed that puts all these AI apps directly on their site? Imagine a small inter…  ( 64 min )
    Is there an AI I can use to fill in a photo that has been cut off?
    I had a beautiful image of me generated by Lensa AI but sadly the very top of my hair was cut off and so it’s not usable as a profile picture or anything really. Shame because it’s a stunning image. Is there an AI I can use to autocomplete or extend an image to fill out more space? Like add the top of my hair and extend the background out a bit past that? submitted by /u/Swftness503 [link] [comments]  ( 50 min )
    OpenAI's Point-E can generate 3D models based on input text in a few minutes
    submitted by /u/qptbook [link] [comments]  ( 53 min )
    Debunking the great AI lie | Noam Chomsky, Gary Marcus, Jeremy Kahn
    submitted by /u/budgie [link] [comments]  ( 85 min )
    A.I.’s impact on yesterday’s creative business models
    submitted by /u/unalivehouseplant [link] [comments]  ( 48 min )
    How I generated a full song with AI in less than 10 minutes
    How I generated a full song with AI in less than 10 minutes: Ask ChatGPT for the lyrics. Turn the text into audio with Uberduck. Find a free-for-profit beat on YouTube. Mix both audios until it sounds good. It's ridiculously easy and entertaining! submitted by /u/TheVellerShow [link] [comments]  ( 69 min )
    The worst technology of 2022
    submitted by /u/UpvoteBeast [link] [comments]  ( 46 min )
    New hardware offers faster computation for artificial intelligence, with much less energy
    submitted by /u/Chipdoc [link] [comments]  ( 49 min )
    The worst technology of 2022
    submitted by /u/atomlib_com [link] [comments]  ( 48 min )
    Best Tasks for ParlAI?
    I feel like ParlAI doesn't get a lot of attention in the machine learning community. It's no r/ChatGPT but it has a good framework to work with. I've been tinkering with it this week by setting up a Linux Guest System on my virtualbox and with the guidance of r/ChatGPT I was able to get a working chatbot. It's not perfect, but at least it's a good toy model that works. I supplied my own dataset by generating synthetic data from an endless conversation between EmersonAI and Blenderbot 3 via Selenium; I collected the text data and am currently expanding it as we speak, but I decided to train it further, allocating more memory, increasing the max training time from 10 minutes to 30 minutes, and increasing the batch size to 32. So what I want to know is: has anyone else worked on ParlAI before? If so, what is the best task you've come across? ParlAI has a lot of tasks available, such as empathetic dialogues, convai2 and personachat. Which one would you prefer? submitted by /u/swagonflyyyy [link] [comments]  ( 51 min )
    Kubernetes for Data Science practice (MLOps workflow)
    submitted by /u/skj8 [link] [comments]  ( 50 min )
    [R][N] What is non-myopia in ML-EDM?
    submitted by /u/ML-EDM [link] [comments]  ( 51 min )
    [R][N] The first 2 introductory videos to ML-EDM :-)
    submitted by /u/ML-EDM [link] [comments]  ( 50 min )
    Open challenges for Machine Learning based Early Decision-Making research
    submitted by /u/ML-EDM [link] [comments]  ( 51 min )
    Vazy, an ultra-intuitive chat to access all your data - Join the waitlist!
    submitted by /u/Miserness [link] [comments]  ( 48 min )
    Yay. Chat history has arrived, including your old history. Been waiting for this
    submitted by /u/sEi_ [link] [comments]  ( 49 min )
    ChatGPT Just Got An Upgrade
    submitted by /u/arnolds112 [link] [comments]  ( 48 min )
  • Open

    [P] Extracting and Structuring Recipes Using GPT3
    I've been experimenting with GPT3 for different use cases over the past few weeks, the latest one was seeing how well it could parse out structured data from recipe free text, and how well it could further enrich this data. The general idea was to have a few different prompts to the model, with output from one prompt inputting into the next prompt: Extract ingredients and instructions from the recipe Given the ingredients, group them together into categories Given the full structured recipe generated above, enrich it further with additional metadata (time to cook, healthiness, etc) This worked out better than I expected - given an input recipe I'm able to consistently (and accurately) extract the constituent parts and group the ingredients together logically (like grains, dairy, etc). I wrote about it here: https://binal.pub/2022/12/extracting-and-structuring-recipes-using-gpt3/ One thing that I was surprised by as well was this turned out to be a decent recipe generator. So instead of using a full recipe I could input "Pumpkin Pie" and the structured response at the end would be the ingredients and instructions to bake a pumpkin pie with quantities/timings that seemed to be about what you'd expect. submitted by /u/caesarten [link] [comments]  ( 67 min )
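    A hedged sketch of the chained-prompt pattern using the 2022-era OpenAI completion API; the model name, prompt wording, and file name below are illustrative assumptions, not necessarily what the linked post used.

        import openai  # pip install openai; assumes the 2022-era Completion API

        def ask(prompt):
            # text-davinci-003 is an illustrative model choice.
            resp = openai.Completion.create(
                model="text-davinci-003", prompt=prompt,
                max_tokens=512, temperature=0,
            )
            return resp["choices"][0]["text"]

        recipe = open("recipe.txt").read()  # hypothetical input file
        # Each prompt feeds on the previous prompt's output.
        parts = ask(f"Extract the ingredients and instructions from this recipe:\n{recipe}")
        grouped = ask(f"Group these ingredients into categories (grains, dairy, ...):\n{parts}")
        enriched = ask(f"Add metadata (time to cook, healthiness) to this recipe:\n{grouped}")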
    [D] Build a home PC to Run Large GPT Models or use AWS
    Hi all, I need to train or fine-tune GPT on my custom data. So, I am looking at whether it is better to build a custom PC with a GPU or use AWS. I will run a lot of experiments (new tech startup), so I am looking for the cheap option. If a home PC is the best option, which GPU should I buy? thanks for your advice. submitted by /u/No-Trifle2470 [link] [comments]  ( 69 min )
    [D] Opera Blobs
    I apologise if this has already been posted, but I came across this and thought you guys would be interested. You can make blobs sing in Opera style singing with the help of machine learning and you can record and share your singing blobs! https://artsandculture.google.com/experiment/blob-opera/AAHWrq360NcGbw?cp=eyJyIjoieUtyM0I3N1hPZ0lOIn0.&hl=en submitted by /u/Kortax [link] [comments]  ( 66 min )
    Reduce parameter count in an NN without sacrificing performance [P]
    Hi, For my final year project for my BSc CompSci and AI course I’m implementing the world models paper to play games. Essentially a variational autoencoder and another network to predict future latent states of the game environment. The emphasis of my project is to reduce the number of parameters, and consequently the training time (making a case for reducing the energy consumption). I’ll use existing models and their size alongside game performance to compare with my own. I’ve had trouble finding existing literature as to how this can be done. Obviously there isn’t a way to find an ‘optimal’ number required to solve a task, but wanted to find techniques to reduce excess bulk in a NN without sacrificing performance. Does anyone have any ideas or know of any resources? TIA submitted by /u/ackbladder_ [link] [comments]  ( 67 min )
    [D] Historical Stock Price datasets
    Is anyone aware of any huge datasets I can use? I've done some Googling and I found a few of them but they all seem to be for 2017 and earlier. submitted by /u/Careful-Temporary388 [link] [comments]  ( 64 min )
    [P] Whisperer: Diarize and Transcribe audios easily for audio-dataset creation
    Whisperer A tool to make audio-text datasets automatically for your ML Projects. Two weeks ago, I shared an early draft of a project based on the newly released OpenAI's Whisper. Today I'm sharing the finished version of Whisperer, which adds diarization, with same-speaker detection across multiple audio files. Key features: automatic speaker diarization; automatic speaker identification (e.g., same speakers across audio files); automatic transcription; forces Gaussian distributions of the dataset (see notebook); modular and configurable. EDIT: Live on twitch if anyone has any questions. submitted by /u/pigmentedink [link] [comments]  ( 68 min )
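    For reference, the plain transcription step that such a tool builds on, using the openai-whisper package; the diarization and cross-file speaker matching are Whisperer's additions and are not part of whisper itself. The audio file name is hypothetical.

        import whisper  # pip install openai-whisper

        model = whisper.load_model("base")
        result = model.transcribe("interview.wav")
        # Each segment carries start/end timestamps and the transcribed text.
        for seg in result["segments"]:
            print(f'[{seg["start"]:7.2f} - {seg["end"]:7.2f}] {seg["text"]}')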
    [D] Cheat sheet for pandas techniques for cleaning/prepping data
    There's a lot of small things like filling NaNs, altering columns, normalizing and splitting data, etc. Does anyone have a small code cheat sheet to refer to when prepping any new data? submitted by /u/_xxx420xblazexitx___ [link] [comments]  ( 72 min )
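    In lieu of a full cheat sheet, a few of the most common prep steps in one runnable snippet; the file and column names are hypothetical.

        import numpy as np
        import pandas as pd
        from sklearn.model_selection import train_test_split

        df = pd.read_csv("data.csv")                         # hypothetical file
        df = df.drop_duplicates()
        df["age"] = df["age"].fillna(df["age"].median())     # impute NaNs
        df["city"] = df["city"].astype("category")           # alter a column
        df = df.rename(columns=str.lower)
        num = df.select_dtypes(np.number).columns
        df[num] = (df[num] - df[num].mean()) / df[num].std() # z-score normalize
        train, test = train_test_split(df, test_size=0.2, random_state=0)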
    [D] Different types of pooling in Neural Nets
    There is a whole lot more to pooling than max and average pooling. A few other pooling methods are: Mixed Pooling, L_p Pooling, Stochastic Pooling, Spatial Pyramid Pooling, Multi-scale Orderless Pooling, Super-Pixel Pooling, Compact Bilinear Pooling, Edge-aware Pyramid Pooling, Spectral Pooling, Per-Pixel Pyramid Pooling, Rank-based Average Pooling, Weighted Pooling, and Genetic-based Pooling. Read the full article here: https://medium.com/aiguys/pooling-layers-in-neural-nets-and-their-variants-f6129fc4628b submitted by /u/Difficult-Race-1188 [link] [comments]  ( 66 min )
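    Several of these are available directly in PyTorch; a quick sketch:

        import torch
        import torch.nn as nn

        x = torch.randn(1, 3, 8, 8)
        print(nn.MaxPool2d(2)(x).shape)              # max pooling
        print(nn.AvgPool2d(2)(x).shape)              # average pooling
        print(nn.LPPool2d(norm_type=2, kernel_size=2)(x).shape)   # L_p pooling
        print(nn.FractionalMaxPool2d(2, output_size=(4, 4))(x).shape)
        print(nn.AdaptiveAvgPool2d(1)(x).shape)      # global pooling, as in
                                                     # SPP's coarsest level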
    [N] Point-E: a new Dalle-like model that generates 3D Point Clouds from Prompts
    It's only been a month since OpenAI released ChatGPT, and yesterday they launched Point-E, a new Dalle-like model that generates 3D Point Clouds from Complex Prompts. As someone who is always interested in the latest advancements in machine learning, I was really excited to dig into this paper and see what it had to offer. One of the key features of Point-E is its use of diffusion models to generate synthetic views and 3D point clouds. These models use text input to generate an image, which is then used as a reference for generating the 3D point cloud. This process takes only 1-2 minutes on a single GPU, making it much faster than previous state-of-the-art methods. While the quality of the samples produced by Point-E may be lower than those produced by other methods, the speed of generation makes it a practical option for certain use cases. If you're interested in learning more about this new model and how it was developed, I highly recommend giving the full paper a read. But if you're more into reading the gist of it, I added a link to an overview blog I published about it. The blog: https://dagshub.com/blog/overview-of-point-e/ The paper: https://arxiv.org/abs/2212.08751 I'm sure I haven't captured all the insights in the blog, and I'd love to get your thoughts about the model and how OpenAI developed it. submitted by /u/RepresentativeCod613 [link] [comments]  ( 72 min )
    [D] How to find conference citation from arxiv
    I believe it is always better to cite the version of a paper that has been published in a conference or journal, if such a version exists. However, I often find myself adding the arxiv version to my bibliography manager when I quickly want to save a paper. I presume many of you do the same. Before submitting my paper, I can manually search for every reference on google or semanticscholar (google scholar does not always find the conference version in my experience) and replace the arxiv reference with the conference reference. Sounds like that could be automated, right? What's your workflow in that regard? Are you aware of a tool to automate the process? Or do you even care which version you cite? submitted by /u/_Arsenie_Boca_ [link] [comments]  ( 68 min )
    [D] BLIP is now available on transformers, what are the cool apps you can build on top of it?
    BLIP from Salesforce is now available on Hugging Face transformers! Here is a list of cool applications you can build on top of it: https://twitter.com/younesbelkada/status/1605489647395540992 With (I think) most interesting application being building image-captioning APIs and Stable Diffusion-related applications (generate image-text datasets to fine-tune Stable Diffusion on it & image to music app) Any other thing you have in mind that can be built using BLIP? submitted by /u/younesbelkada [link] [comments]  ( 69 min )
    [D] Comparisons of finetuning efficiency/effectiveness on recent llms architectures?
    Many recently published llms (gpt, bloom, palm, etc) evaluate on a plethora of tasks in a k-shot setting, and offer evaluations on different model sizes with the same architecture. However, I haven't seen much literature evaluating these models' capabilities for finetuning to individual tasks. Are there resources out there that do this comparison? It would help for deciding what pretrained backbone to use for finetuning in production usecases. submitted by /u/idioticfuse [link] [comments]  ( 69 min )
    [R] Search for an article title
    Hi Guys, I'm looking for a paper, which was published in NIPS around 2016, in which the authors used a neural network to predict the weights of another neural network. I don't have the context in which this was applied :(. Does this sound familiar? submitted by /u/schlodinger [link] [comments]  ( 67 min )
    [D] Running large language models on a home PC?
    I'm trying to figure out how to go about running something like GPT-J, FLAN-T5, etc., on my PC, without using cloud compute services (for privacy and other reasons). However, GPT-J-6B needs either ~14 GB of VRAM or 4x as much plain RAM. Upgrading my PC to 48 GB of RAM is possible, and 16, 24 GB graphics cards are available to the general public (though they cost as much as a car), but anything beyond that is in the realm of HPC, datacenter hardware and "GPU accelerators"... I.e. 128 GB GPUs exist out there somewhere, but the distributors don't even list a price, it's just "get a quote" and "contact us"... meaning it's super expensive and you need to be the CEO of a medium-sized company for them to even talk to you? I'm trying to figure out if it's possible to run the larger models (e.g. 175B GPT-3 equivalents) on consumer hardware, perhaps by doing a very slow emulation using one or several PCs such that their collective RAM (or swap SSD space) matches the VRAM needed for those beasts. So the question is "will it run super slowly" or "will it fail immediately due to completely incompatible software / being impossible to configure for anything other than real datacenter hardware"? submitted by /u/Zondartul [link] [comments]  ( 70 min )
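    One practical middle ground is half precision plus CPU/disk offloading via Hugging Face transformers and accelerate, which tends to run slowly rather than fail outright. A hedged sketch, assuming a CUDA GPU at index 0, enough combined RAM, and `pip install transformers accelerate`:

        import torch
        from transformers import AutoModelForCausalLM, AutoTokenizer

        tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
        model = AutoModelForCausalLM.from_pretrained(
            "EleutherAI/gpt-j-6B",
            torch_dtype=torch.float16,
            device_map="auto",         # spill layers onto CPU RAM automatically
            offload_folder="offload",  # disk swap for layers that fit nowhere
        )
        ids = tok("The cheapest way to run a large model at home is",
                  return_tensors="pt").input_ids
        out = model.generate(ids.to(0), max_new_tokens=40)  # inputs on GPU 0
        print(tok.decode(out[0]))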
    [D] What GPT-esque model/platform returns peer-reviewed sources with outputs?
    The main issue I have with GPT-3 is that the output can be compelling, yet factually incorrect. I remember discovering a platform that generates answers alongside sources, but I can't recall the name. submitted by /u/EntireInflation8663 [link] [comments]  ( 77 min )
  • Open

    EHR-Safe: Generating High-Fidelity and Privacy-Preserving Synthetic Electronic Health Records
    Posted by Jinsung Yoon and Sercan O. Arik, Research Scientists, Google Research, Cloud AI Team Analysis of Electronic Health Records (EHR) has a tremendous potential for enhancing patient care, quantitatively measuring performance of clinical practices, and facilitating clinical research. Statistical estimation and machine learning (ML) models trained on EHR data can be used to predict the probability of various diseases (such as diabetes), track patient wellness, and predict how patients respond to specific drugs. For such models, researchers and practitioners need access to EHR data. However, it can be challenging to leverage EHR data while ensuring data privacy and conforming to patient confidentiality regulations (such as HIPAA). Conventional methods to anonymize data (e.g., de-id…  ( 93 min )
  • Open

    Predictable Noise : A Flaw in Machine Judgement
    Is machine judgement truly noise-free?  ( 14 min )
    Day 3: Advanced SQL for Data Science
    Using Auto-encoder for Fraud detection implemented in Knime
    Auto-encoders are an unsupervised learning technique using neural networks to learn representations.  ( 11 min )
    Implementing Neural Networks in Knime Workflows
    On the well-known iris dataset, we will perform the neural network operation here without writing a single line of Python code. Sounds…  ( 11 min )
    Is Math Really Required for Machine Learning, or Is It Just Drama?
    Introduction:  ( 6 min )
    Why Will an AI tool like ChatGPT be Trending in 2023?
    Artificial intelligence (AI) has been a hot topic in the tech industry for years, and it’s no surprise that AI tools like ChatGPT are…  ( 8 min )
    Introduction to statistical models: Linear Regression
    Linear regression is one of the main algorithms that you must master as a data scientist, you will learn how to build your first model…  ( 11 min )
  • Open

    Announcing the updated Salesforce connector (V2) for Amazon Kendra
    Amazon Kendra is a highly accurate and simple-to-use intelligent search service powered by machine learning (ML). Amazon Kendra offers a suite of data source connectors to simplify the process of ingesting and indexing your content, wherever it resides. Valuable data in organizations is stored in both structured and unstructured repositories. An enterprise search solution should […]  ( 7 min )
    Speed ML development using SageMaker Feature Store and Apache Iceberg offline store compaction
    Today, companies are establishing feature stores to provide a central repository to scale ML development across business units and data science teams. As feature data grows in size and complexity, data scientists need to be able to efficiently query these feature stores to extract datasets for experimentation, model training, and batch scoring. Amazon SageMaker Feature […]  ( 10 min )
    Announcing the updated ServiceNow connector (V2) for Amazon Kendra
    Amazon Kendra is a highly accurate and simple-to-use intelligent search service powered by machine learning (ML). Amazon Kendra offers a suite of data source connectors to simplify the process of ingesting and indexing your content, wherever it resides. Valuable data in organizations is stored in both structured and unstructured repositories. An enterprise search solution should […]  ( 7 min )
  • Open

    Top 5 Robots of 2022: Watch Them Change the World
    Robots have rolled into action for sustainability in farms, lower energy in food delivery, efficiency in retail inventory, improved throughput in warehouses and just about everything in between — what’s not to love? In addition to reshaping industries and helping end users, robots play a vital role in the transition away from fossil fuels. […]  ( 6 min )
    Doing the Best They Can: EverestLabs Ensures Fewer Recyclables Go to Landfills
    All of us recycle. Or, at least, all of us should. Now, AI is joining the effort. On the latest episode of the NVIDIA AI Podcast, host Noah Kravitz spoke with JD Ambadti, founder and CEO of EverestLabs, developer of RecycleOS, the first AI-enabled operating system for recycling. The company reports that an average of […]  ( 4 min )
  • Open

    Simulated Contextual Bandits for Personalization Tasks from Recommendation Datasets. (arXiv:2210.10631v2 [cs.IR] UPDATED)
    We propose a method for generating simulated contextual bandit environments for personalization tasks from recommendation datasets like MovieLens, Netflix, Last.fm, Million Song, etc. This allows for personalization environments to be developed based on real-life data to reflect the nuanced nature of real-world user interactions. The obtained environments can be used to develop methods for solving personalization tasks, algorithm benchmarking, model simulation, and more. We demonstrate our approach with numerical examples on MovieLens and IMDb datasets.
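    A toy version of the construction, assuming a dense user-item ratings matrix; the paper's environments are more elaborate, and all names below are illustrative.

        import numpy as np

        class RatingsBandit:
            """Minimal contextual bandit built from a ratings matrix (rows:
            users, columns: items). Each round presents a random user's
            feature vector as context; pulling an arm (recommending an item)
            pays that user's rating for it."""
            def __init__(self, ratings, contexts, rng=None):
                self.R, self.X = ratings, contexts
                self.rng = rng or np.random.default_rng(0)
            def reset(self):
                self.u = self.rng.integers(len(self.R))
                return self.X[self.u]          # context for this round
            def step(self, arm):
                return self.R[self.u, arm]     # reward

        R = np.random.default_rng(1).integers(1, 6, size=(100, 20))  # fake ratings
        env = RatingsBandit(R, contexts=np.eye(100))
        ctx = env.reset()
        print(env.step(arm=3))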
    NASA: Neural Architecture Search and Acceleration for Hardware Inspired Hybrid Networks. (arXiv:2210.13361v2 [cs.AR] UPDATED)
    Multiplication is arguably the most cost-dominant operation in modern deep neural networks (DNNs), limiting their achievable efficiency and thus more extensive deployment in resource-constrained applications. To tackle this limitation, pioneering works have developed handcrafted multiplication-free DNNs, which require expert knowledge and time-consuming manual iteration, calling for fast development tools. To this end, we propose a Neural Architecture Search and Acceleration framework dubbed NASA, which enables automated multiplication-reduced DNN development and integrates a dedicated multiplication-reduced accelerator for boosting DNNs' achievable efficiency. Specifically, NASA adopts neural architecture search (NAS) spaces that augment the state-of-the-art one with hardware-inspired multiplication-free operators, such as shift and adder, armed with a novel progressive pretrain strategy (PGP) together with customized training recipes to automatically search for optimal multiplication-reduced DNNs; On top of that, NASA further develops a dedicated accelerator, which advocates a chunk-based template and auto-mapper dedicated for NASA-NAS resulting DNNs to better leverage their algorithmic properties for boosting hardware efficiency. Experimental results and ablation studies consistently validate the advantages of NASA's algorithm-hardware co-design framework in terms of achievable accuracy and efficiency tradeoffs. Codes are available at https://github.com/GATECH-EIC/NASA.
    A Detailed Study of Interpretability of Deep Neural Network based Top Taggers. (arXiv:2210.04371v2 [hep-ex] UPDATED)
    Recent developments in explainable AI (XAI) methods allow researchers to explore the inner workings of deep neural networks (DNNs), revealing crucial information about input-output relationships and realizing how data connects with machine learning models. In this paper we explore interpretability of DNN models designed to identify jets coming from top quark decay in high energy proton-proton collisions at the Large Hadron Collider (LHC). We review a subset of existing top tagger models and explore different quantitative methods to identify which features play the most important roles in identifying the top jets. We also investigate how and why feature importance varies across different XAI metrics, how feature correlations impact their explainability, and how latent space representations encode information as well as correlate with physically meaningful quantities. Our studies uncover some major pitfalls of existing XAI methods and illustrate how they can be overcome to obtain consistent and meaningful interpretation of these models. We additionally illustrate the activity of hidden layers as Neural Activation Pattern (NAP) diagrams and demonstrate how they can be used to understand how DNNs relay information across the layers and how this understanding can help to make such models significantly simpler by allowing effective model reoptimization and hyperparameter tuning. By incorporating observations from the interpretability studies, we obtain state-of-the-art top tagging performance from an augmented implementation of the existing networks.
    SalKG: Learning From Knowledge Graph Explanations for Commonsense Reasoning. (arXiv:2104.08793v5 [cs.CL] CROSS LISTED)
    Augmenting pre-trained language models with knowledge graphs (KGs) has achieved success on various commonsense reasoning tasks. However, for a given task instance, the KG, or certain parts of the KG, may not be useful. Although KG-augmented models often use attention to focus on specific KG components, the KG is still always used, and the attention mechanism is never explicitly taught which KG components should be used. Meanwhile, saliency methods can measure how much a KG feature (e.g., graph, node, path) influences the model to make the correct prediction, thus explaining which KG features are useful. This paper explores how saliency explanations can be used to improve KG-augmented models' performance. First, we propose to create coarse (Is the KG useful?) and fine (Which nodes/paths in the KG are useful?) saliency explanations. Second, to motivate saliency-based supervision, we analyze oracle KG-augmented models which directly use saliency explanations as extra inputs for guiding their attention. Third, we propose SalKG, a framework for KG-augmented models to learn from coarse and/or fine saliency explanations. Given saliency explanations created from a task's training set, SalKG jointly trains the model to predict the explanations, then solve the task by attending to KG features highlighted by the predicted explanations. On three commonsense QA benchmarks (CSQA, OBQA, CODAH) and a range of KG-augmented models, we show that SalKG can yield considerable performance gains -- up to 2.76% absolute improvement on CSQA.
    Towards Quantum Advantage on Noisy Quantum Computers. (arXiv:2209.09371v3 [quant-ph] UPDATED)
    Quantum computers offer the potential of achieving significant speedup for certain computational problems. Yet, many existing quantum algorithms with notable asymptotic speedups require a degree of fault tolerance that is currently unavailable. The quantum algorithm for topological data analysis (TDA) by Lloyd et al. is believed to be one such algorithm. TDA is a powerful technique for extracting complex and valuable shape-related summaries of high-dimensional data. However, the computational demands of classical TDA algorithms are exorbitant, and become impractical for high-order characteristics. In this paper, we present NISQ-TDA, the first fully implemented end-to-end quantum machine learning algorithm needing only a short circuit-depth, that is applicable to non-handcrafted high-dimensional classical data, and with provable asymptotic speedup for certain classes of problems. The algorithm neither suffers from the data-loading problem nor does it need to store the input data on the quantum computer explicitly. Our approach includes three key innovations: an efficient realization of the full boundary operator; a quantum rejection sampling and projection approach to restrict a quantum state to the simplices of the desired order in the given complex; and a stochastic rank estimation method to estimate the topological features in the form of approximate Betti numbers. We present theoretical results that establish additive error guarantees, along with computational cost and circuit-depth complexities for normalized output estimates, up to the error tolerance. The algorithm was successfully executed on quantum computing devices, as well as on noisy quantum simulators, applied to small datasets. Preliminary empirical results suggest that the algorithm is robust to noise. Finally, we provide target depths and noise level estimates to realize near-term, non-fault-tolerant quantum advantage.
    Learn to explain yourself, when you can: Equipping Concept Bottleneck Models with the ability to abstain on their concept predictions. (arXiv:2211.11690v2 [cs.LG] UPDATED)
    The Concept Bottleneck Models (CBMs) of Koh et al. [2020] provide a means to ensure that a neural network based classifier bases its predictions solely on human understandable concepts. The concept labels, or rationales as we refer to them, are learned by the concept labeling component of the CBM. Another component learns to predict the target classification label from these predicted concept labels. Unfortunately, these models are heavily reliant on human provided concept labels for each datapoint. To enable CBMs to behave robustly when these labels are not readily available, we show how to equip them with the ability to abstain from predicting concepts when the concept labeling component is uncertain. In other words, our model learns to provide rationales for its predictions, but only whenever it is sure the rationale is correct.
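    The abstention idea can be illustrated with simple confidence thresholding over per-concept probabilities; the rule below is an illustrative stand-in for the paper's mechanism, not its actual method.

        import numpy as np

        def predict_concepts_with_abstention(probs, threshold=0.9):
            """Turn per-concept probabilities into rationales, abstaining
            (-1) on any concept whose confidence falls below the threshold."""
            out = np.full(probs.shape, -1, dtype=int)
            confident = np.maximum(probs, 1 - probs) >= threshold
            out[confident] = (probs[confident] >= 0.5).astype(int)
            return out

        probs = np.array([0.97, 0.55, 0.03, 0.70])
        print(predict_concepts_with_abstention(probs))   # [ 1 -1  0 -1]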
    ColoristaNet for Photorealistic Video Style Transfer. (arXiv:2212.09247v1 [cs.CV])
    Photorealistic style transfer aims to transfer the artistic style of an image onto an input image or video while keeping photorealism. In this paper, we think it's the summary statistics matching scheme in existing algorithms that leads to unrealistic stylization. To avoid employing the popular Gram loss, we propose a self-supervised style transfer framework, which contains a style removal part and a style restoration part. The style removal network removes the original image styles, and the style restoration network recovers image styles in a supervised manner. Meanwhile, to address the problems in current feature transformation methods, we propose decoupled instance normalization to decompose feature transformation into style whitening and restylization. It works quite well in ColoristaNet and can transfer image styles efficiently while keeping photorealism. To ensure temporal coherency, we also incorporate optical flow methods and ConvLSTM to embed contextual information. Experiments demonstrate that ColoristaNet can achieve better stylization effects when compared with state-of-the-art algorithms.
    LMentry: A Language Model Benchmark of Elementary Language Tasks. (arXiv:2211.02069v2 [cs.CL] UPDATED)
    As the performance of large language models rapidly improves, benchmarks are getting larger and more complex as well. We present LMentry, a benchmark that avoids this "arms race" by focusing on a compact set of tasks that are trivial to humans, e.g. writing a sentence containing a specific word, identifying which words in a list belong to a specific category, or choosing which of two words is longer. LMentry is specifically designed to provide quick and interpretable insights into the capabilities and robustness of large language models. Our experiments reveal a wide variety of failure cases that, while immediately obvious to humans, pose a considerable challenge for large language models, including OpenAI's latest 175B-parameter instruction-tuned model, TextDavinci002. LMentry complements contemporary evaluation approaches of large language models, providing a quick, automatic, and easy-to-run "unit test", without resorting to large benchmark suites of complex tasks.
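    The "unit test" flavour of the benchmark can be illustrated with checks like the following, simplified relative to LMentry's actual regex-based scoring:

        def contains_word(output: str, word: str) -> bool:
            """Check for the task 'write a sentence containing the word W';
            trivial for humans, automatic to score."""
            return word.lower() in output.lower().split()

        def longer_word_correct(output: str, w1: str, w2: str) -> bool:
            """Check for 'which of the two words is longer?'."""
            return output.strip().lower() == max(w1, w2, key=len).lower()

        print(contains_word("The cat sat on the mat", "cat"))      # True
        print(longer_word_correct("elephant", "ant", "elephant"))  # True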
    PU GNN: Chargeback Fraud Detection in P2E MMORPGs via Graph Attention Networks with Imbalanced PU Labels. (arXiv:2211.08604v2 [cs.LG] UPDATED)
    The recent advent of play-to-earn (P2E) systems in massively multiplayer online role-playing games (MMORPGs) has made in-game goods interchangeable with real-world values more than ever before. The goods in P2E MMORPGs can be directly exchanged with cryptocurrencies such as Bitcoin, Ethereum, or Klaytn via blockchain networks. Unlike traditional in-game goods, once they have been written to the blockchains, P2E goods cannot be restored by the game operation teams, even in cases of chargeback fraud such as payment fraud, cancellation, or refund. To tackle the problem, we propose a novel chargeback fraud prediction method, PU GNN, which leverages graph attention networks with PU loss to capture both the players' in-game behavior and P2E token transaction patterns. With the adoption of modified GraphSMOTE, the proposed model handles the imbalanced distribution of labels in chargeback fraud datasets. The experiments conducted on two real-world P2E MMORPG datasets demonstrate that PU GNN achieves superior performance over previously suggested methods.
    'Rarely' a problem? Language models exhibit inverse scaling in their predictions following 'few'-type quantifiers. (arXiv:2212.08700v1 [cs.CL])
    Language Models appear to perform poorly on quantification. We ask how badly. 'Few'-type quantifiers, as in 'few children like vegetables', might pose a particular challenge for Language Models, since the sentence components without the quantifier are likely to co-occur, and because 'few'-type quantifiers are rare. We present 960 sentence stimuli from two human neurolinguistic experiments to 22 autoregressive transformer models of differing sizes. Not only do the models perform poorly on 'few'-type quantifiers, but overall the larger the model, the worse its performance. We interpret this inverse scaling as suggesting that larger models increasingly reflect online rather than offline human processing, and argue that decreasing performance of larger models may challenge uses of Language Models as the basis for Natural Language Systems.
    Variational Inference for Model-Free and Model-Based Reinforcement Learning. (arXiv:2209.01693v2 [cs.LG] UPDATED)
    Variational inference (VI) is a specific type of approximate Bayesian inference that approximates an intractable posterior distribution with a tractable one. VI casts the inference problem as an optimization problem, more specifically, the goal is to maximize a lower bound of the logarithm of the marginal likelihood with respect to the parameters of the approximate posterior. Reinforcement learning (RL) on the other hand deals with autonomous agents and how to make them act optimally such as to maximize some notion of expected future cumulative reward. In the non-sequential setting where agents' actions do not have an impact on future states of the environment, RL is covered by contextual bandits and Bayesian optimization. In a proper sequential scenario, however, where agents' actions affect future states, instantaneous rewards need to be carefully traded off against potential long-term rewards. This manuscript shows how the apparently different subjects of VI and RL are linked in two fundamental ways. First, the optimization objective of RL to maximize future cumulative rewards can be recovered via a VI objective under a soft policy constraint in both the non-sequential and the sequential setting. This policy constraint is not just merely artificial but has proven as a useful regularizer in many RL tasks yielding significant improvements in agent performance. And second, in model-based RL where agents aim to learn about the environment they are operating in, the model-learning part can be naturally phrased as an inference problem over the process that governs environment dynamics. We are going to distinguish between two scenarios for the latter: VI when environment states are fully observable by the agent and VI when they are only partially observable through an observation distribution.
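    The lower bound in question is the standard evidence lower bound (ELBO): $\log p(x) \ge \mathbb{E}_{q_\phi(z)}\big[\log p(x,z) - \log q_\phi(z)\big]$, maximized with respect to the variational parameters $\phi$ of the approximate posterior $q_\phi$.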
    Instruction-driven history-aware policies for robotic manipulations. (arXiv:2209.04899v3 [cs.RO] UPDATED)
    In human environments, robots are expected to accomplish a variety of manipulation tasks given simple natural language instructions. Yet, robotic manipulation is extremely challenging as it requires fine-grained motor control, long-term memory as well as generalization to previously unseen tasks and environments. To address these challenges, we propose a unified transformer-based approach that takes into account multiple inputs. In particular, our transformer architecture integrates (i) natural language instructions and (ii) multi-view scene observations while (iii) keeping track of the full history of observations and actions. Such an approach enables learning dependencies between history and instructions and improves manipulation precision using multiple views. We evaluate our method on the challenging RLBench benchmark and on a real-world robot. Notably, our approach scales to 74 diverse RLBench tasks and outperforms the state of the art. We also address instruction-conditioned tasks and demonstrate excellent generalization to previously unseen variations.
    Medical Diagnosis with Large Scale Multimodal Transformers -- Leveraging Diverse Data for More Accurate Diagnosis. (arXiv:2212.09162v1 [cs.LG])
    Multimodal deep learning has been used to predict clinical endpoints and diagnoses from clinical routine data. However, these models suffer from scaling issues: they have to learn pairwise interactions between each piece of information in each data type, thereby escalating model complexity beyond manageable scales. This has so far precluded a widespread use of multimodal deep learning. Here, we present a new technical approach of "learnable synergies", in which the model only selects relevant interactions between data modalities and keeps an "internal memory" of relevant data. Our approach is easily scalable and naturally adapts to multimodal data inputs from clinical routine. We demonstrate this approach on three large multimodal datasets from radiology and ophthalmology and show that it outperforms state-of-the-art models in clinically relevant diagnosis tasks. Our new approach is transferable and will allow the application of multimodal deep learning to a broad set of clinically relevant problems.
    Quantum policy gradient algorithms. (arXiv:2212.09328v1 [quant-ph])
    Understanding the power and limitations of quantum access to data in machine learning tasks is essential for assessing the potential of quantum computing in artificial intelligence. Previous works have already shown that speed-ups in learning are possible when given quantum access to reinforcement learning environments. Yet, the applicability of quantum algorithms in this setting remains very limited, notably in environments with large state and action spaces. In this work, we design quantum algorithms to train state-of-the-art reinforcement learning policies by exploiting quantum interactions with an environment. However, these algorithms only offer full quadratic speed-ups in sample complexity over their classical analogs when the trained policies satisfy some regularity conditions. Interestingly, we find that reinforcement learning policies derived from parametrized quantum circuits are well-behaved with respect to these conditions, which showcases the benefit of a fully-quantum reinforcement learning framework.
    Principal Trade-off Analysis. (arXiv:2206.07520v2 [cs.GT] UPDATED)
    This paper develops Principal Trade-off Analysis (PTA), a decomposition method, analogous to Principal Component Analysis (PCA), which permits the representation of any game as the weighted sum of disc games (continuous R-P-S games). Applying PTA to empirically generated tournament graphs produces a sequence of embeddings into orthogonal 2D feature planes representing independent strategic trade-offs. Each trade-off generates a mode of cyclic competition. Like PCA, PTA provides optimal low-rank estimates of the tournament graphs that can be truncated for approximation. The complexity of cyclic competition can be quantified by computing the number of significant cyclic modes. We illustrate PTA via application to a pair of games (Blotto, Pokemon). The resulting 2D disc game representations are shown to be well suited for visualization and are easily interpretable. In Blotto, PTA identifies game symmetries and specifies strategic trade-offs associated with distinct win conditions. For Pokemon, PTA embeddings produce clusters in the embedding space that naturally correspond to Pokemon types, a design in the game that produces cyclic trade-offs.
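    A sketch of the linear-algebra core: a real antisymmetric advantage matrix has imaginary eigenvalue pairs $\pm i\lambda$, and the real and imaginary parts of each eigenvector span one 2D trade-off plane. This is illustrative of the PCA-like decomposition, not necessarily the paper's exact construction.

        import numpy as np

        def pta(F, k):
            """Extract the k strongest 2D trade-off planes from an
            antisymmetric matrix F (F[i, j] = payoff of strategy i vs j)."""
            vals, vecs = np.linalg.eig(F)
            order = np.argsort(-np.abs(vals.imag))
            planes = []
            for idx in order[:2 * k:2]:        # one of each conjugate pair
                u, v = vecs[:, idx].real, vecs[:, idx].imag
                planes.append((abs(vals[idx].imag), u, v))
            return planes

        rng = np.random.default_rng(0)
        A = rng.normal(size=(6, 6))
        F = A - A.T                            # antisymmetric tournament
        for lam, u, v in pta(F, 2):
            print(lam)                         # strength of each cyclic mode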
    Improving Generalization of Pre-trained Language Models via Stochastic Weight Averaging. (arXiv:2212.05956v2 [cs.CL] UPDATED)
    Knowledge Distillation (KD) is a commonly used technique for improving the generalization of compact Pre-trained Language Models (PLMs) on downstream tasks. However, such methods impose the additional burden of training a separate teacher model for every new dataset. Alternatively, one may directly work on the improvement of the optimization procedure of the compact model toward better generalization. Recent works observe that the flatness of the local minimum correlates well with better generalization. In this work, we adapt Stochastic Weight Averaging (SWA), a method encouraging convergence to a flatter minimum, to fine-tuning PLMs. We conduct extensive experiments on various NLP tasks (text classification, question answering, and generation) and different model architectures and demonstrate that our adaptation improves the generalization without extra computation cost. Moreover, we observe that this simple optimization technique is able to outperform the state-of-the-art KD methods for compact models.
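    A minimal sketch of the adaptation using PyTorch's built-in SWA utilities, with a toy model standing in for the compact PLM; the schedule and hyperparameters are illustrative.

        import torch
        from torch.optim.swa_utils import AveragedModel, SWALR

        model = torch.nn.Linear(10, 2)           # stands in for a compact PLM
        opt = torch.optim.SGD(model.parameters(), lr=1e-2)
        swa_model = AveragedModel(model)         # running average of weights
        swa_sched = SWALR(opt, swa_lr=5e-3)      # SWA learning-rate schedule

        for epoch in range(10):
            x, y = torch.randn(64, 10), torch.randint(0, 2, (64,))
            loss = torch.nn.functional.cross_entropy(model(x), y)
            opt.zero_grad(); loss.backward(); opt.step()
            if epoch >= 5:                       # start averaging after warm-up
                swa_model.update_parameters(model)
                swa_sched.step()
        # swa_model now holds the averaged weights, which tend to sit in a
        # flatter minimum; for BatchNorm models one would also call
        # torch.optim.swa_utils.update_bn(loader, swa_model).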
    Mind the Knowledge Gap: A Survey of Knowledge-enhanced Dialogue Systems. (arXiv:2212.09252v1 [cs.CL])
    Many dialogue systems (DSs) lack characteristics humans have, such as emotion perception, factuality, and informativeness. Enhancing DSs with knowledge alleviates this problem, but, as many ways of doing so exist, keeping track of all proposed methods is difficult. Here, we present the first survey of knowledge-enhanced DSs. We define three categories of systems - internal, external, and hybrid - based on the knowledge they use. We survey the motivation for enhancing DSs with knowledge, used datasets, and methods for knowledge search, knowledge encoding, and knowledge incorporation. Finally, we propose how to improve existing systems based on theories from linguistics and cognitive science.
    A Fine-Grained Dataset and its Efficient Semantic Segmentation for Unstructured Driving Scenarios. (arXiv:2103.13109v1 [cs.CV] CROSS LISTED)
    Research in autonomous driving for unstructured environments suffers from a lack of semantically labeled datasets compared to its urban counterpart. Urban and unstructured outdoor environments are challenging due to the varying lighting and weather conditions during a day and across seasons. In this paper, we introduce TAS500, a novel semantic segmentation dataset for autonomous driving in unstructured environments. TAS500 offers fine-grained vegetation and terrain classes to learn drivable surfaces and natural obstacles in outdoor scenes effectively. We evaluate the performance of modern semantic segmentation models with an additional focus on their efficiency. Our experiments demonstrate the advantages of fine-grained semantic classes to improve the overall prediction accuracy, especially along the class boundaries. The dataset and pretrained model are available at mucar3.de/icpr2020-tas500.
    Support Vector Regression: Risk Quadrangle Framework. (arXiv:2212.09178v1 [stat.ML])
    This paper investigates Support Vector Regression (SVR) in the context of the fundamental risk quadrangle paradigm. It is shown that both formulations of SVR, $\varepsilon$-SVR and $\nu$-SVR, correspond to the minimization of equivalent regular error measures (Vapnik error and superquantile (CVaR) norm, respectively) with a regularization penalty. These error measures, in turn, give rise to corresponding risk quadrangles. Additionally, the technique used for the construction of quadrangles serves as a powerful tool in proving the equivalence between $\varepsilon$-SVR and $\nu$-SVR. By constructing the fundamental risk quadrangle, which corresponds to SVR, we show that SVR is the asymptotically unbiased estimator of the average of two symmetric conditional quantiles. Additionally, SVR is formulated as a regular deviation minimization problem with a regularization penalty by invoking Error Shaping Decomposition of Regression. Finally, the dual formulation of SVR in the risk quadrangle framework is derived.
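    For reference, the $\varepsilon$-insensitive (Vapnik) error at the heart of $\varepsilon$-SVR takes the standard form
        \[ V_\varepsilon(x) = \max(|x| - \varepsilon,\, 0), \qquad \min_{w,b} \; \tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} V_\varepsilon\bigl(y_i - \langle w, x_i \rangle - b\bigr), \]
    so minimizing this regular error measure with a norm regularization penalty, as formalized in the quadrangle framework above, recovers the familiar SVR objective.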
    A Review of Speech-centric Trustworthy Machine Learning: Privacy, Safety, and Fairness. (arXiv:2212.09006v1 [cs.SD])
    Speech-centric machine learning systems have revolutionized many leading domains ranging from transportation and healthcare to education and defense, profoundly changing how people live, work, and interact with each other. However, recent studies have demonstrated that many speech-centric ML systems need to become more trustworthy before broader deployment. Specifically, concerns over privacy breaches, discriminatory performance, and vulnerability to adversarial attacks have all been raised in ML research fields. In order to address the above challenges and risks, a significant number of efforts have been made to ensure these ML systems are trustworthy, especially private, safe, and fair. In this paper, we conduct the first comprehensive survey on speech-centric trustworthy ML topics related to privacy, safety, and fairness. In addition to serving as a summary report for the research community, we point out several promising future research directions to inspire researchers who wish to explore this area further.
    Assign Experiment Variants at Scale in Online Controlled Experiments. (arXiv:2212.08771v1 [stat.AP])
    Online controlled experiments (A/B tests) have become the gold standard for learning the impact of new product features in technology companies. Randomization enables the inference of causality from an A/B test. The randomized assignment maps end users to experiment buckets and balances user characteristics between the groups. Therefore, experiments can attribute any outcome differences between the experiment groups to the product feature under experiment. Technology companies run A/B tests at scale -- hundreds if not thousands of A/B tests concurrently, each with millions of users. The large scale poses unique challenges to randomization. First, the randomized assignment must be fast since the experiment service receives hundreds of thousands of queries per second. Second, the variant assignments must be independent between experiments. Third, the assignment must be consistent when users revisit or an experiment enrolls more users. We present a novel assignment algorithm and statistical tests to validate the randomized assignments. Our results demonstrate that not only is this algorithm computationally fast but it also satisfies the statistical requirements -- unbiased and independent.
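    The paper's specific algorithm is not reproduced here, but the standard salted-hashing baseline it must match illustrates all three requirements at once: hashing is fast, the same (user, experiment) pair always lands in the same bucket, and distinct per-experiment salts decorrelate assignments across experiments. A hypothetical sketch:

        import hashlib

        def assign_variant(user_id: str, experiment_salt: str,
                           weights: list[float]) -> int:
            """Deterministically map a user to a variant index.

            weights: traffic fractions per variant, summing to 1.0.
            """
            digest = hashlib.sha256(
                f"{experiment_salt}:{user_id}".encode()).digest()
            # First 8 bytes -> uniform point in [0, 1); stable across calls.
            u = int.from_bytes(digest[:8], "big") / 2**64
            cumulative = 0.0
            for i, w in enumerate(weights):
                cumulative += w
                if u < cumulative:
                    return i
            return len(weights) - 1   # guard against floating-point round-off

        # Same user, same experiment -> same bucket on every revisit.
        assert assign_variant("user-42", "exp-checkout", [0.5, 0.5]) == \
               assign_variant("user-42", "exp-checkout", [0.5, 0.5])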
    Enhanced word embeddings using multi-semantic representation through lexical chains. (arXiv:2101.09023v2 [cs.CL] UPDATED)
    The relationship between words in a sentence often tells us more about the underlying semantic content of a document than its actual words, individually. In this work, we propose two novel algorithms, called Flexible Lexical Chain II and Fixed Lexical Chain II. These algorithms combine the semantic relations derived from lexical chains, prior knowledge from lexical databases, and the robustness of the distributional hypothesis in word embeddings into a single system. In short, our approach has three main contributions: (i) a set of techniques that fully integrate word embeddings and lexical chains; (ii) a more robust semantic representation that considers the latent relation between words in a document; and (iii) lightweight word embedding models that can be extended to any natural language task. We intend to assess the knowledge of pre-trained models to evaluate their robustness in the document classification task. The proposed techniques are tested against seven word embedding algorithms using five different machine learning classifiers over six scenarios in the document classification task. Our results show that the integration of lexical chains and word embedding representations sustains state-of-the-art results, even against more complex systems.
    Point-E: A System for Generating 3D Point Clouds from Complex Prompts. (arXiv:2212.08751v1 [cs.CV])
    While recent work on text-conditional 3D object generation has shown promising results, the state-of-the-art methods typically require multiple GPU-hours to produce a single sample. This is in stark contrast to state-of-the-art generative image models, which produce samples in a number of seconds or minutes. In this paper, we explore an alternative method for 3D object generation which produces 3D models in only 1-2 minutes on a single GPU. Our method first generates a single synthetic view using a text-to-image diffusion model, and then produces a 3D point cloud using a second diffusion model which conditions on the generated image. While our method still falls short of the state-of-the-art in terms of sample quality, it is one to two orders of magnitude faster to sample from, offering a practical trade-off for some use cases. We release our pre-trained point cloud diffusion models, as well as evaluation code and models, at https://github.com/openai/point-e.
    Two-Scale Gradient Descent Ascent Dynamics Finds Mixed Nash Equilibria of Continuous Games: A Mean-Field Perspective. (arXiv:2212.08791v1 [math.OC])
    Finding the mixed Nash equilibria (MNE) of a two-player zero-sum continuous game is an important and challenging problem in machine learning. A canonical algorithm for finding the MNE is the noisy gradient descent ascent method, which in the infinite-particle limit gives rise to the {\em Mean-Field Gradient Descent Ascent} (GDA) dynamics on the space of probability measures. In this paper, we first study the convergence of a two-scale Mean-Field GDA dynamics for finding the MNE of the entropy-regularized objective. More precisely, we show that for any fixed positive temperature (or regularization parameter), the two-scale Mean-Field GDA with a {\em finite} scale ratio converges exponentially to the unique MNE without assuming the convexity or concavity of the interaction potential. The key ingredient of our proof lies in the construction of new Lyapunov functions that dissipate exponentially along the Mean-Field GDA. We further study the simulated annealing of the Mean-Field GDA dynamics. We show that with a temperature schedule that decays logarithmically in time, the annealed Mean-Field GDA converges to the MNE of the original unregularized objective function.
    Hidden State Approximation in Recurrent Neural Networks Using Continuous Particle Filtering. (arXiv:2212.09008v1 [cs.LG])
    Using historical data to predict future events has many applications in the real world, such as stock price prediction and robot localization. In the past decades, convolutional long short-term memory (LSTM) networks have achieved extraordinary success with sequential data in related fields. However, traditional recurrent neural networks (RNNs) keep the hidden states in a deterministic way. In this paper, we use particles to approximate the distribution of the latent state and show how this can be extended to a more complex form, i.e., the encoder-decoder mechanism. With the proposed continuous differentiable scheme, our model is capable of adaptively extracting valuable information and updating the latent state according to the Bayes rule. Our empirical studies demonstrate the effectiveness of our method in prediction tasks.
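    A schematic of one such particle-based hidden-state update is sketched below. The shapes, the Gaussian transition noise, the toy observation likelihood, and the GRUCell transition are all illustrative assumptions; note also that the classical multinomial resampling shown here is not differentiable, which is precisely what the paper's continuous scheme is designed to avoid.

        import torch

        def particle_step(particles, weights, cell, x_t, obs, obs_log_lik,
                          noise_std=0.1):
            """One particle-filter update of an RNN latent state.

            particles: (K, H) candidate hidden states; weights: (K,) normalized.
            cell: an nn.GRUCell-like transition; x_t: (1, D) input at time t.
            obs_log_lik(h, obs) -> (K,) log-likelihood of obs per particle.
            """
            K = particles.shape[0]
            # Propagate every particle through the stochastic transition.
            h = cell(x_t.expand(K, -1), particles)
            h = h + noise_std * torch.randn_like(h)
            # Reweight by how well each particle explains the observation
            # (Bayes rule), normalizing in log-space for numerical stability.
            log_w = torch.log(weights + 1e-12) + obs_log_lik(h, obs)
            w = torch.softmax(log_w, dim=0)
            # Multinomial resampling guards against weight degeneracy.
            idx = torch.multinomial(w, K, replacement=True)
            return h[idx], torch.full((K,), 1.0 / K)

        cell = torch.nn.GRUCell(4, 16)
        toy_ll = lambda h, obs: -((h[:, 0] - obs) ** 2)   # toy likelihood
        parts, w = torch.randn(64, 16), torch.full((64,), 1.0 / 64)
        parts, w = particle_step(parts, w, cell, torch.randn(1, 4),
                                 obs=torch.tensor(0.5), obs_log_lik=toy_ll)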
    Pre-Trained Image Encoder for Generalizable Visual Reinforcement Learning. (arXiv:2212.08860v1 [cs.LG])
    Learning generalizable policies that can adapt to unseen environments remains challenging in visual Reinforcement Learning (RL). Existing approaches try to acquire a robust representation via diversifying the appearances of in-domain observations for better generalization. Limited by the specific observations of the environment, these methods ignore the possibility of exploring diverse real-world image datasets. In this paper, we investigate how a visual RL agent would benefit from the off-the-shelf visual representations. Surprisingly, we find that the early layers in an ImageNet pre-trained ResNet model could provide rather generalizable representations for visual RL. Hence, we propose Pre-trained Image Encoder for Generalizable visual reinforcement learning (PIE-G), a simple yet effective framework that can generalize to the unseen visual scenarios in a zero-shot manner. Extensive experiments are conducted on DMControl Generalization Benchmark, DMControl Manipulation Tasks, Drawer World, and CARLA to verify the effectiveness of PIE-G. Empirical evidence suggests PIE-G improves sample efficiency and significantly outperforms previous state-of-the-art methods in terms of generalization performance. In particular, PIE-G boasts a 55% generalization performance gain on average in the challenging video background setting. Project Page: https://sites.google.com/view/pie-g/home.
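    The gist is easy to sketch: freeze the early stages of an ImageNet-pretrained ResNet and use them as a zero-shot observation encoder. The cut point after layer2 below is an illustrative choice, not necessarily the one used by PIE-G.

        import torch
        import torchvision

        resnet = torchvision.models.resnet18(
            weights=torchvision.models.ResNet18_Weights.IMAGENET1K_V1)

        # Keep only the early layers; later layers are more ImageNet-specific.
        encoder = torch.nn.Sequential(
            resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool,
            resnet.layer1, resnet.layer2)

        for p in encoder.parameters():
            p.requires_grad_(False)   # frozen: reused zero-shot across scenarios
        encoder.eval()

        obs = torch.randn(1, 3, 84, 84)   # a typical DMControl-sized observation
        features = encoder(obs)           # (1, 128, 11, 11) spatial feature map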
    A Unified Single-loop Alternating Gradient Projection Algorithm for Nonconvex-Concave and Convex-Nonconcave Minimax Problems. (arXiv:2006.02032v4 [math.OC] UPDATED)
    Much recent research effort has been directed to the development of efficient algorithms for solving minimax problems with theoretical convergence guarantees due to the relevance of these problems to a few emergent applications. In this paper, we propose a unified single-loop alternating gradient projection (AGP) algorithm for solving smooth nonconvex-(strongly) concave and (strongly) convex-nonconcave minimax problems. AGP employs simple gradient projection steps for updating the primal and dual variables alternatively at each iteration. We show that it can find an $\varepsilon$-stationary point of the objective function in $\mathcal{O}\left( \varepsilon ^{-2} \right)$ (resp. $\mathcal{O}\left( \varepsilon ^{-4} \right)$) iterations under nonconvex-strongly concave (resp. nonconvex-concave) setting. Moreover, its gradient complexity to obtain an $\varepsilon$-stationary point of the objective function is bounded by $\mathcal{O}\left( \varepsilon ^{-2} \right)$ (resp., $\mathcal{O}\left( \varepsilon ^{-4} \right)$) under the strongly convex-nonconcave (resp., convex-nonconcave) setting. To the best of our knowledge, this is the first time that a simple and unified single-loop algorithm is developed for solving both nonconvex-(strongly) concave and (strongly) convex-nonconcave minimax problems. Moreover, the complexity results for solving the latter (strongly) convex-nonconcave minimax problems have never been obtained before in the literature. Numerical results show the efficiency of the proposed AGP algorithm. Furthermore, we extend the AGP algorithm by presenting a block alternating proximal gradient (BAPG) algorithm for solving more general multi-block nonsmooth nonconvex-(strongly) concave and (strongly) convex-nonconcave minimax problems. We can similarly establish the gradient complexity of the proposed algorithm under these four different settings.
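    The single-loop structure is worth seeing concretely. Below is a toy sketch on a smooth convex-concave box-constrained problem; the objective and step sizes are illustrative, not from the paper.

        import numpy as np

        # Toy smooth minimax: min_x max_y f(x, y) = 0.5*x^2 + x*y - 0.5*y^2
        # over the boxes x in [-1, 1], y in [-1, 1].
        grad_x = lambda x, y: x + y
        grad_y = lambda x, y: x - y
        proj = lambda z: np.clip(z, -1.0, 1.0)

        x, y = 0.9, -0.7
        eta_x, eta_y = 0.1, 0.1
        for t in range(500):
            # Alternating simple gradient projection steps: primal descent,
            # then dual ascent evaluated at the freshly updated primal variable.
            x = proj(x - eta_x * grad_x(x, y))
            y = proj(y + eta_y * grad_y(x, y))

        print(x, y)   # approaches the saddle point (0, 0)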
    Machine Learning Strategies to Improve Generalization in EEG-based Emotion Assessment: A Systematic Review. (arXiv:2212.08744v1 [cs.LG])
    We conducted a systematic review of machine-learning strategies for improving generalizability (cross-subject and cross-session) in EEG-based emotion classification. In this context, the non-stationarity of EEG signals is a critical issue and can lead to the Dataset Shift problem. Several architectures and methods have been proposed to address this issue, mainly based on transfer learning methods. 418 papers were retrieved from the Scopus, IEEE Xplore and PubMed databases through a search query focusing on modern machine learning techniques for generalization in EEG-based emotion assessment. Among these papers, 75 were found eligible based on their relevance to the problem. Studies lacking a specific cross-subject and cross-session validation strategy or relying on other biosignals as support were excluded. On the basis of the selected papers' analysis, a taxonomy of the studies employing Machine Learning (ML) methods was proposed, together with a brief discussion of the different ML approaches involved. The studies with the best results in terms of average classification accuracy were identified, supporting the view that transfer learning methods seem to perform better than other approaches. We discuss the impact of (i) emotion theoretical models and (ii) psychological screening of the experimental sample on classifier performance.  ( 2 min )
    Counterfactual Explanations for Misclassified Images: How Human and Machine Explanations Differ. (arXiv:2212.08733v1 [cs.LG])
    Counterfactual explanations have emerged as a popular solution for the eXplainable AI (XAI) problem of elucidating the predictions of black-box deep-learning systems due to their psychological validity, flexibility across problem domains and proposed legal compliance. While over 100 counterfactual methods exist, claiming to generate plausible explanations akin to those preferred by people, few have actually been tested on users ($\sim7\%$). So, the psychological validity of these counterfactual algorithms for effective XAI for image data is not established. This issue is addressed here using a novel methodology that (i) gathers ground truth human-generated counterfactual explanations for misclassified images, in two user studies and, then, (ii) compares these human-generated ground-truth explanations to computationally-generated explanations for the same misclassifications. Results indicate that humans do not "minimally edit" images when generating counterfactual explanations. Instead, they make larger, "meaningful" edits that better approximate prototypes in the counterfactual class.  ( 2 min )
    Variational Wasserstein Barycenters with c-Cyclical Monotonicity. (arXiv:2110.11707v2 [cs.LG] UPDATED)
    The Wasserstein barycenter, built on the theory of optimal transport, provides a powerful framework to aggregate probability distributions, and it has increasingly attracted great attention within the machine learning community. However, it suffers from a severe computational burden, especially in high-dimensional and continuous settings. To this end, we develop a novel continuous approximation method for the Wasserstein barycenter problem given sample access to the input distributions. The basic idea is to introduce a variational distribution as an approximation of the true continuous barycenter, so as to frame the barycenter computation as an optimization problem in which the parameters of the variational distribution adjust the proxy distribution to be similar to the barycenter. Leveraging the variational distribution, we construct a tractable dual formulation for the regularized Wasserstein barycenter problem with c-cyclical monotonicity, which can be efficiently solved by stochastic optimization. We provide theoretical analysis on convergence and demonstrate the practical effectiveness of our method on real applications of subset posterior aggregation and on synthetic data.  ( 2 min )
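    For context, the underlying optimization problem can be stated compactly: given input distributions $\mu_1,\dots,\mu_m$ and weights $\lambda_i \ge 0$ with $\sum_{i=1}^m \lambda_i = 1$, the 2-Wasserstein barycenter is
        \[ \bar{\nu} \in \operatorname*{arg\,min}_{\nu} \; \sum_{i=1}^{m} \lambda_i \, W_2^2(\mu_i, \nu), \]
    and the method above replaces $\nu$ with a parametrized variational distribution $\nu_\theta$, optimizing $\theta$ through the tractable dual objective.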
    Distribution-aware Goal Prediction and Conformant Model-based Planning for Safe Autonomous Driving. (arXiv:2212.08729v1 [cs.RO])
    The feasibility of collecting a large amount of expert demonstrations has inspired growing research interest in learning-to-drive settings, where models learn by imitating the driving behaviour of experts. However, exclusively relying on imitation can limit agents' generalisability to novel scenarios that are outside the support of the training data. In this paper, we address this challenge by factorising the driving task, based on the intuition that modular architectures are more generalisable and more robust to changes in the environment compared to monolithic, end-to-end frameworks. Specifically, we draw inspiration from the trajectory forecasting community and reformulate the learning-to-drive task as obstacle-aware perception and grounding, distribution-aware goal prediction, and model-based planning. Firstly, we train the obstacle-aware perception module to extract a salient representation of the visual context. Then, we learn a multi-modal goal distribution by performing conditional density estimation using a normalising flow. Finally, we ground candidate trajectory predictions in road geometry and plan actions based on vehicle dynamics. Under the CARLA simulator, we report state-of-the-art results on the CARNOVEL benchmark.
    MultiPL-E: A Scalable and Extensible Approach to Benchmarking Neural Code Generation. (arXiv:2208.08227v4 [cs.LG] UPDATED)
    Large language models have demonstrated the ability to generate both natural language and programming language text. Such models open up the possibility of multi-language code generation: could code generation models generalize knowledge from one language to another? Although contemporary code generation models can generate semantically correct Python code, little is known about their abilities with other languages. We propose MultiPL-E, a system for translating unit test-driven code generation benchmarks to new languages. We create the first massively multilingual code generation benchmark by using MultiPL-E to translate two popular Python code generation benchmarks to 18 additional programming languages. We use MultiPL-E to extend the HumanEval benchmark and MBPP benchmark to 18 languages that encompass a range of programming paradigms and popularity. Using these new parallel benchmarks, we evaluate the multi-language performance of three state-of-the-art code generation models: Codex, CodeGen, and InCoder. We find that Codex matches or even exceeds its performance on Python for several other languages. The range of programming languages represented in MultiPL-E allows us to explore the impact of language frequency and language features on model performance. Finally, the MultiPL-E approach of compiling code generation benchmarks to new programming languages is both scalable and extensible, making it straightforward to evaluate new models, benchmarks, and languages.
    UNIREX: A Unified Learning Framework for Language Model Rationale Extraction. (arXiv:2112.08802v2 [cs.CL] CROSS LISTED)
    An extractive rationale explains a language model's (LM's) prediction on a given task instance by highlighting the text inputs that most influenced the prediction. Ideally, rationale extraction should be faithful (reflective of LM's actual behavior) and plausible (convincing to humans), without compromising the LM's (i.e., task model's) task performance. Although attribution algorithms and select-predict pipelines are commonly used in rationale extraction, they both rely on certain heuristics that hinder them from satisfying all three desiderata. In light of this, we propose UNIREX, a flexible learning framework which generalizes rationale extractor optimization as follows: (1) specify architecture for a learned rationale extractor; (2) select explainability objectives (i.e., faithfulness and plausibility criteria); and (3) jointly train the task model and rationale extractor on the task using selected objectives. UNIREX enables replacing prior works' heuristic design choices with a generic learned rationale extractor in (1) and optimizing it for all three desiderata in (2)-(3). To facilitate comparison between methods w.r.t. multiple desiderata, we introduce the Normalized Relative Gain (NRG) metric. Across five text classification datasets, our best UNIREX configuration outperforms baselines by an average of 32.9% NRG. Plus, we find that UNIREX-trained rationale extractors can even generalize to unseen datasets and tasks.
    Short-term Prediction of Household Electricity Consumption Using Customized LSTM and GRU Models. (arXiv:2212.08757v1 [cs.LG])
    As power systems evolve into more intelligent and interactive systems with increasing flexibility and a larger penetration of renewable energy sources, demand prediction at a short-term resolution will inevitably become more and more crucial for designing and managing the future grid, especially at the individual household level. Projecting the electricity demand of a single energy user, as opposed to the aggregated power consumption of residential load on a wide scale, is difficult because of a considerable number of volatile and uncertain factors. This paper proposes customized GRU (Gated Recurrent Unit) and Long Short-Term Memory (LSTM) architectures to address this challenging problem. LSTM and GRU are comparatively newer and among the most well-adopted deep learning approaches. The electricity consumption datasets were obtained from individual household smart meters. The comparison shows that the LSTM model performs better for home-level forecasting than the alternative prediction technique considered here, the GRU. To contrast the NN-based models with a conventional statistical technique, an ARIMA-based model was also developed and benchmarked against the LSTM and GRU outcomes on the collected time series data.
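    A minimal PyTorch version of such a one-step-ahead household-load forecaster looks as follows; the window length, hidden size, and data are placeholders rather than the paper's configuration, and swapping nn.LSTM for nn.GRU yields the GRU variant.

        import torch
        from torch import nn

        class LoadForecaster(nn.Module):
            """Predict the next consumption value from a window of readings."""
            def __init__(self, hidden=64):
                super().__init__()
                self.lstm = nn.LSTM(input_size=1, hidden_size=hidden,
                                    batch_first=True)
                self.head = nn.Linear(hidden, 1)

            def forward(self, x):             # x: (batch, window, 1)
                out, _ = self.lstm(x)
                return self.head(out[:, -1])  # use the last hidden state

        model = LoadForecaster()
        window = torch.randn(32, 48, 1)   # e.g. 48 half-hourly meter readings
        next_load = model(window)         # (32, 1) one-step-ahead forecast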
    Contextually Enhanced ES-dRNN with Dynamic Attention for Short-Term Load Forecasting. (arXiv:2212.09030v1 [cs.LG])
    In this paper, we propose a new short-term load forecasting (STLF) model based on contextually enhanced hybrid and hierarchical architecture combining exponential smoothing (ES) and a recurrent neural network (RNN). The model is composed of two simultaneously trained tracks: the context track and the main track. The context track introduces additional information to the main track. It is extracted from representative series and dynamically modulated to adjust to the individual series forecasted by the main track. The RNN architecture consists of multiple recurrent layers stacked with hierarchical dilations and equipped with recently proposed attentive dilated recurrent cells. These cells enable the model to capture short-term, long-term and seasonal dependencies across time series as well as to weight dynamically the input information. The model produces both point forecasts and predictive intervals. The experimental part of the work performed on 35 forecasting problems shows that the proposed model outperforms in terms of accuracy its predecessor as well as standard statistical models and state-of-the-art machine learning models.
    Energy-Based Models for Continual Learning. (arXiv:2011.12216v3 [cs.LG] UPDATED)
    We motivate Energy-Based Models (EBMs) as a promising model class for continual learning problems. Instead of tackling continual learning via the use of external memory, growing models, or regularization, EBMs change the underlying training objective to cause less interference with previously learned information. Our proposed version of EBMs for continual learning is simple, efficient, and outperforms baseline methods by a large margin on several benchmarks. Moreover, our proposed contrastive divergence-based training objective can be combined with other continual learning methods, resulting in substantial boosts in their performance. We further show that EBMs are adaptable to a more general continual learning setting where the data distribution changes without the notion of explicitly delineated tasks. These observations point towards EBMs as a useful building block for future continual learning methods.
    Annotation by Clicks: A Point-Supervised Contrastive Variance Method for Medical Semantic Segmentation. (arXiv:2212.08774v1 [cs.CV])
    Medical image segmentation methods typically rely on numerous dense annotated images for model training, which are notoriously expensive and time-consuming to collect. To alleviate this burden, weakly supervised techniques have been exploited to train segmentation models with less expensive annotations. In this paper, we propose a novel point-supervised contrastive variance method (PSCV) for medical image semantic segmentation, which only requires one pixel-point from each organ category to be annotated. The proposed method trains the base segmentation network by using a novel contrastive variance (CV) loss to exploit the unlabeled pixels and a partial cross-entropy loss on the labeled pixels. The CV loss function is designed to exploit the statistical spatial distribution properties of organs in medical images and their variance distribution map representations to enforce discriminative predictions over the unlabeled pixels. Experimental results on two standard medical image datasets demonstrate that the proposed method outperforms the state-of-the-art weakly supervised methods on point-supervised medical image semantic segmentation tasks.  ( 2 min )
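    The partial cross-entropy component -- supervision only at the clicked pixel-points -- is simple to express; the sketch below is a generic rendering (the contrastive variance loss is omitted), with the ignore-index convention as an assumption.

        import torch
        import torch.nn.functional as F

        def partial_cross_entropy(logits, point_labels, ignore_index=-1):
            """Cross-entropy evaluated only at annotated pixel-points.

            logits: (B, C, H, W); point_labels: (B, H, W) filled with
            ignore_index everywhere except the clicked pixel per category.
            """
            return F.cross_entropy(logits, point_labels,
                                   ignore_index=ignore_index)

        logits = torch.randn(2, 4, 64, 64, requires_grad=True)
        labels = torch.full((2, 64, 64), -1, dtype=torch.long)
        labels[0, 10, 12] = 2   # one click for category 2 in the first image
        labels[1, 30, 30] = 0   # one click for category 0 in the second image
        loss = partial_cross_entropy(logits, labels)
        loss.backward()         # gradients flow only from the labeled pixels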
    Language model acceptability judgements are not always robust to context. (arXiv:2212.08979v1 [cs.CL])
    Targeted syntactic evaluations of language models ask whether models show stable preferences for syntactically acceptable content over minimal-pair unacceptable inputs. Most targeted syntactic evaluation datasets ask models to make these judgements with just a single context-free sentence as input. This does not match language models' training regime, in which input sentences are always highly contextualized by the surrounding corpus. This mismatch raises an important question: how robust are models' syntactic judgements in different contexts? In this paper, we investigate the stability of language models' performance on targeted syntactic evaluations as we vary properties of the input context: the length of the context, the types of syntactic phenomena it contains, and whether or not there are violations of grammaticality. We find that model judgements are generally robust when placed in randomly sampled linguistic contexts. However, they are substantially unstable for contexts containing syntactic structures matching those in the critical test content. Among all tested models (GPT-2 and five variants of OPT), we significantly improve models' judgements by providing contexts with matching syntactic structures, and conversely significantly worsen them using unacceptable contexts with matching but violated syntactic structures. This effect is amplified by the length of the context, except for unrelated inputs. We show that these changes in model performance are not explainable by simple features matching the context and the test inputs, such as lexical overlap and dependency overlap. This sensitivity to highly specific syntactic features of the context can only be explained by the models' implicit in-context learning abilities.
    Minimizing Maximum Model Discrepancy for Transferable Black-box Targeted Attacks. (arXiv:2212.09035v1 [cs.CV])
    In this work, we study the black-box targeted attack problem from the model discrepancy perspective. On the theoretical side, we present a generalization error bound for black-box targeted attacks, which gives a rigorous theoretical analysis for guaranteeing the success of the attack. We reveal that the attack error on a target model mainly depends on empirical attack error on the substitute model and the maximum model discrepancy among substitute models. On the algorithmic side, we derive a new algorithm for black-box targeted attacks based on our theoretical analysis, in which we additionally minimize the maximum model discrepancy (M3D) of the substitute models when training the generator to generate adversarial examples. In this way, our model is capable of crafting highly transferable adversarial examples that are robust to the model variation, thus improving the success rate for attacking the black-box model. We conduct extensive experiments on the ImageNet dataset with different classification models, and our proposed approach outperforms existing state-of-the-art methods by a significant margin. Our codes will be released.
    Risk of Bias in Chest X-ray Foundation Models. (arXiv:2209.02965v2 [cs.LG] UPDATED)
    Foundation models are considered a breakthrough in all applications of AI, promising robust and reusable mechanisms for feature extraction, alleviating the need for large amounts of high quality annotated training data for task-specific prediction models. However, foundation models may potentially encode and even reinforce existing biases present in historic datasets. Given the limited ability to scrutinize foundation models, it remains unclear whether the opportunities outweigh the risks in safety critical applications such as clinical decision making. In our statistical bias analysis of a recently published, and publicly accessible chest X-ray foundation model, we found reasons for concern as the model seems to encode protected characteristics including biological sex and racial identity. When used for the downstream application of disease detection, we observed substantial degradation of performance of the foundation model compared to a standard model with specific disparities in protected subgroups. While research into foundation models for healthcare applications is in an early stage, we hope to raise awareness of the risks by highlighting the importance of conducting thorough bias and subgroup performance analyses.
    Boost Event-Driven Tactile Learning with Location Spiking Neurons. (arXiv:2210.04277v3 [cs.NE] UPDATED)
    Tactile sensing is essential for a variety of daily tasks. Recent advances in event-driven tactile sensors and Spiking Neural Networks (SNNs) have spurred research in related fields. However, SNN-enabled event-driven tactile learning is still in its infancy due to the limited representation abilities of existing spiking neurons and high spatio-temporal complexity in the event-driven tactile data. In this paper, to improve the representation capability of existing spiking neurons, we propose a novel neuron model called "location spiking neuron", which enables us to extract features of event-based data in a novel way. Specifically, based on the classical Time Spike Response Model (TSRM), we develop the Location Spike Response Model (LSRM). In addition, based on the most commonly-used Time Leaky Integrate-and-Fire (TLIF) model, we develop the Location Leaky Integrate-and-Fire (LLIF) model. Moreover, to demonstrate the representation effectiveness of our proposed neurons and capture the complex spatio-temporal dependencies in the event-driven tactile data, we exploit the location spiking neurons to propose two hybrid models for event-driven tactile learning. The first hybrid model combines a fully-connected SNN with TSRM neurons and a fully-connected SNN with LSRM neurons. The second hybrid model fuses the spatial spiking graph neural network with TLIF neurons and the temporal spiking graph neural network with LLIF neurons. Extensive experiments demonstrate the significant improvements of our models over the state-of-the-art methods on event-driven tactile learning. Moreover, compared to the counterpart artificial neural networks (ANNs), our SNN models are 10x to 100x more energy-efficient, which shows the superior energy efficiency of our models and may bring new opportunities to the spike-based learning community and neuromorphic engineering.
    APOLLO: A Simple Approach for Adaptive Pretraining of Language Models for Logical Reasoning. (arXiv:2212.09282v1 [cs.CL])
    Logical reasoning of text is an important ability that requires understanding the information present in the text, their interconnections, and then reasoning through them to infer new conclusions. Prior works on improving the logical reasoning ability of language models require complex processing of training data (e.g., aligning symbolic knowledge to text), yielding task-specific data augmentation solutions that restrict the learning of general logical reasoning skills. In this work, we propose APOLLO, an adaptively pretrained language model that has improved logical reasoning abilities. We select a subset of Wikipedia, based on a set of logical inference keywords, for continued pretraining of a language model. We use two self-supervised loss functions: a modified masked language modeling loss where only specific parts-of-speech words, that would likely require more reasoning than basic language understanding, are masked, and a sentence-level classification loss that teaches the model to distinguish between entailment and contradiction types of sentences. The proposed training paradigm is both simple and independent of task formats. We demonstrate the effectiveness of APOLLO by comparing it with prior baselines on two logical reasoning datasets. APOLLO performs comparably on ReClor and outperforms baselines on LogiQA.  ( 2 min )
    Latent Variable Representation for Reinforcement Learning. (arXiv:2212.08765v1 [cs.LG])
    Deep latent variable models have achieved significant empirical successes in model-based reinforcement learning (RL) due to their expressiveness in modeling complex transition dynamics. On the other hand, it remains unclear theoretically and empirically how latent variable models may facilitate learning, planning, and exploration to improve the sample efficiency of RL. In this paper, we provide a representation view of the latent variable models for state-action value functions, which allows both tractable variational learning algorithm and effective implementation of the optimism/pessimism principle in the face of uncertainty for exploration. In particular, we propose a computationally efficient planning algorithm with UCB exploration by incorporating kernel embeddings of latent variable models. Theoretically, we establish the sample complexity of the proposed approach in the online and offline settings. Empirically, we demonstrate superior performance over current state-of-the-art algorithms across various benchmarks.  ( 2 min )
    On the Connection between Invariant Learning and Adversarial Training for Out-of-Distribution Generalization. (arXiv:2212.09082v1 [cs.LG])
    Despite impressive success in many tasks, deep learning models are shown to rely on spurious features, which will catastrophically fail when generalized to out-of-distribution (OOD) data. Invariant Risk Minimization (IRM) is proposed to alleviate this issue by extracting domain-invariant features for OOD generalization. Nevertheless, recent work shows that IRM is only effective for a certain type of distribution shift (e.g., correlation shift) while it fails for other cases (e.g., diversity shift). Meanwhile, another thread of method, Adversarial Training (AT), has shown better domain transfer performance, suggesting that it has the potential to be an effective candidate for extracting domain-invariant features. This paper investigates this possibility by exploring the similarity between the IRM and AT objectives. Inspired by this connection, we propose Domainwise Adversarial Training (DAT), an AT-inspired method for alleviating distribution shift by domain-specific perturbations. Extensive experiments show that our proposed DAT can effectively remove domain-varying features and improve OOD generalization under both correlation shift and diversity shift.
    Convergence Analysis for Training Stochastic Neural Networks via Stochastic Gradient Descent. (arXiv:2212.08924v1 [math.NA])
    In this paper, we carry out numerical analysis to prove convergence of a novel sample-wise back-propagation method for training a class of stochastic neural networks (SNNs). The structure of the SNN is formulated as discretization of a stochastic differential equation (SDE). A stochastic optimal control framework is introduced to model the training procedure, and a sample-wise approximation scheme for the adjoint backward SDE is applied to improve the efficiency of the stochastic optimal control solver, which is equivalent to the back-propagation for training the SNN. The convergence analysis is derived with and without convexity assumption for optimization of the SNN parameters. Especially, our analysis indicates that the number of SNN training steps should be proportional to the square of the number of layers in the convex optimization case. Numerical experiments are carried out to validate the analysis results, and the performance of the sample-wise back-propagation method for training SNNs is examined by benchmark machine learning examples.
    Adapting Triplet Importance of Implicit Feedback for Personalized Recommendation. (arXiv:2208.01709v4 [cs.IR] UPDATED)
    Implicit feedback is frequently used for developing personalized recommendation services due to its ubiquity and accessibility in real-world systems. In order to effectively utilize such information, most research adopts the pairwise ranking method on constructed training triplets (user, positive item, negative item) and aims to distinguish between positive items and negative items for each user. However, most of these methods treat all the training triplets equally, which ignores the subtle difference between different positive or negative items. On the other hand, even though some other works make use of the auxiliary information (e.g., dwell time) of user behaviors to capture this subtle difference, such auxiliary information is hard to obtain. To mitigate the aforementioned problems, we propose a novel training framework named Triplet Importance Learning (TIL), which adaptively learns the importance score of training triplets. We devise two strategies for the importance score generation and formulate the whole procedure as a bilevel optimization, which does not require any rule-based design. We integrate the proposed training procedure with several Matrix Factorization (MF)- and Graph Neural Network (GNN)-based recommendation models, demonstrating the compatibility of our framework. Via a comparison using three real-world datasets with many state-of-the-art methods, we show that our proposed method outperforms the best existing models by 3-21\% in terms of Recall@k for the top-k recommendation.
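    The core idea -- re-weighting a BPR-style pairwise ranking loss by a per-triplet importance score -- can be sketched as follows. The bilevel procedure that actually learns the scores is omitted, and the stand-in scores here are random.

        import torch
        import torch.nn.functional as F

        def weighted_bpr_loss(user_emb, pos_emb, neg_emb, importance):
            """Pairwise ranking loss with per-triplet importance scores.

            importance: (B,) non-negative scores; in the paper these are
            produced by a learned module and refined via bilevel optimization.
            """
            pos_scores = (user_emb * pos_emb).sum(-1)
            neg_scores = (user_emb * neg_emb).sum(-1)
            per_triplet = -F.logsigmoid(pos_scores - neg_scores)  # plain BPR
            return (importance * per_triplet).mean()

        B, d = 256, 32
        u, p, n = (torch.randn(B, d, requires_grad=True) for _ in range(3))
        w = torch.sigmoid(torch.randn(B))   # stand-in importance scores
        loss = weighted_bpr_loss(u, p, n, w)
        loss.backward()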
    Multiple Robust Learning for Recommendation. (arXiv:2207.10796v4 [cs.IR] UPDATED)
    In recommender systems, a common problem is the presence of various biases in the collected data, which deteriorates the generalization ability of the recommendation models and leads to inaccurate predictions. Doubly robust (DR) learning has been studied for many recommendation tasks, with the advantage that unbiased learning can be achieved when either a single imputation or a single propensity model is accurate. In this paper, we propose a multiple robust (MR) estimator that can take advantage of multiple candidate imputation and propensity models to achieve unbiasedness. Specifically, the MR estimator is unbiased when any of the imputation or propensity models, or a linear combination of these models, is accurate. Theoretical analysis shows that the proposed MR is an enhanced version of DR when having only a single imputation and propensity model, and has a smaller bias. Inspired by the generalization error bound of MR, we further propose a novel multiple robust learning approach with stabilization. We conduct extensive experiments on real-world and semi-synthetic datasets, which demonstrate the superiority of the proposed approach over state-of-the-art methods.
    BEATs: Audio Pre-Training with Acoustic Tokenizers. (arXiv:2212.09058v1 [eess.AS])
    The massive growth of self-supervised learning (SSL) has been witnessed in language, vision, speech, and audio domains over the past few years. While discrete label prediction is widely adopted for other modalities, the state-of-the-art audio SSL models still employ reconstruction loss for pre-training. Compared with reconstruction loss, semantic-rich discrete label prediction encourages the SSL model to abstract the high-level audio semantics and discard the redundant details as in human perception. However, a semantic-rich acoustic tokenizer for general audio pre-training is usually not straightforward to obtain, due to the continuous property of audio and unavailable phoneme sequences like speech. To tackle this challenge, we propose BEATs, an iterative audio pre-training framework to learn Bidirectional Encoder representation from Audio Transformers, where an acoustic tokenizer and an audio SSL model are optimized by iterations. In the first iteration, we use random projection as the acoustic tokenizer to train an audio SSL model in a mask and label prediction manner. Then, we train an acoustic tokenizer for the next iteration by distilling the semantic knowledge from the pre-trained or fine-tuned audio SSL model. The iteration is repeated with the hope of mutual promotion of the acoustic tokenizer and audio SSL model. The experimental results demonstrate our acoustic tokenizers can generate discrete labels with rich audio semantics and our audio SSL models achieve state-of-the-art results across various audio classification benchmarks, even outperforming previous models that use more training data and model parameters significantly. Specifically, we set a new state-of-the-art mAP 50.6% on AudioSet-2M for audio-only models without using any external data, and 98.1% accuracy on ESC-50. The code and pre-trained models are available at https://aka.ms/beats.
    GatorTron: A Large Clinical Language Model to Unlock Patient Information from Unstructured Electronic Health Records. (arXiv:2203.03540v3 [cs.CL] UPDATED)
    There is an increasing interest in developing artificial intelligence (AI) systems to process and interpret electronic health records (EHRs). Natural language processing (NLP) powered by pretrained language models is the key technology for medical AI systems utilizing clinical narratives. However, there are few clinical language models, the largest of which trained in the clinical domain is comparatively small at 110 million parameters (compared with billions of parameters in the general domain). It is not clear how large clinical language models with billions of parameters can help medical AI systems utilize unstructured EHRs. In this study, we develop from scratch a large clinical language model - GatorTron - using >90 billion words of text (including >82 billion words of de-identified clinical text) and systematically evaluate it on 5 clinical NLP tasks including clinical concept extraction, medical relation extraction, semantic textual similarity, natural language inference (NLI), and medical question answering (MQA). We examine how (1) scaling up the number of parameters and (2) scaling up the size of the training data could benefit these NLP tasks. GatorTron models scale up the clinical language model from 110 million to 8.9 billion parameters and improve 5 clinical NLP tasks (e.g., 9.6% and 9.5% improvement in accuracy for NLI and MQA), which can be applied to medical AI systems to improve healthcare delivery. The GatorTron models are publicly available at: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/clara/models/gatortron_og.
    More is Better (Mostly): On the Backdoor Attacks in Federated Graph Neural Networks. (arXiv:2202.03195v4 [cs.CR] UPDATED)
    Graph Neural Networks (GNNs) are a class of deep learning-based methods for processing graph domain information. GNNs have recently become a widely used graph analysis method due to their superior ability to learn representations for complex graph data. However, due to privacy concerns and regulation restrictions, centralized GNNs can be difficult to apply to data-sensitive scenarios. Federated learning (FL) is an emerging technology developed for privacy-preserving settings when several parties need to train a shared global model collaboratively. Although several research works have applied FL to train GNNs (Federated GNNs), there is no research on their robustness to backdoor attacks. This paper bridges this gap by conducting two types of backdoor attacks in Federated GNNs: centralized backdoor attacks (CBA) and distributed backdoor attacks (DBA). Our experiments show that the DBA attack success rate is higher than CBA in almost all evaluated cases. For CBA, the attack success rate of all local triggers is similar to the global trigger even if the training set of the adversarial party is embedded with the global trigger. To further explore the properties of two backdoor attacks in Federated GNNs, we evaluate the attack performance for a different number of clients, trigger sizes, poisoning intensities, and trigger densities. Moreover, we explore the robustness of DBA and CBA against two defenses. We find that both attacks are robust against the investigated defenses, necessitating the need to consider backdoor attacks in Federated GNNs as a novel threat that requires custom defenses.
    ID and OOD Performance Are Sometimes Inversely Correlated on Real-world Datasets. (arXiv:2209.00613v3 [cs.LG] UPDATED)
    Several studies have empirically compared in-distribution (ID) and out-of-distribution (OOD) performance of various models. They report frequent positive correlations on benchmarks in computer vision and NLP. Surprisingly, they never observe inverse correlations suggesting necessary trade-offs. This matters to determine whether ID performance can serve as a proxy for OOD generalization. This paper shows that inverse correlations between ID and OOD performance do happen in real-world benchmarks. They could be missed in past studies because of a biased selection of models. We show an example on the WILDS-Camelyon17 dataset, using models from multiple training epochs and random seeds. Our observations are particularly striking with models trained with a regularizer that diversifies the solutions to the ERM objective. We nuance recommendations and conclusions made in past studies. (1) High OOD performance may sometimes require trading off ID performance. (2) Focusing on ID performance alone may not lead to optimal OOD performance: it can lead to diminishing and eventually negative returns in OOD performance. (3) Our example reminds that empirical studies only chart regimes achievable with existing methods: care is warranted in deriving prescriptive recommendations.
    Trusting the Explainers: Teacher Validation of Explainable Artificial Intelligence for Course Design. (arXiv:2212.08955v1 [cs.CY])
    Deep learning models for learning analytics have become increasingly popular over the last few years; however, these approaches are still not widely adopted in real-world settings, likely due to a lack of trust and transparency. In this paper, we tackle this issue by implementing explainable AI methods for black-box neural networks. This work focuses on the context of online and blended learning and the use case of student success prediction models. We use a pairwise study design, enabling us to investigate controlled differences between pairs of courses. Our analyses cover five course pairs that differ in one educationally relevant aspect and two popular instance-based explainable AI methods (LIME and SHAP). We quantitatively compare the distances between the explanations across courses and methods. We then validate the explanations of LIME and SHAP with 26 semi-structured interviews of university-level educators regarding which features they believe contribute most to student success, which explanations they trust most, and how they could transform these insights into actionable course design decisions. Our results show that quantitatively, explainers significantly disagree with each other about what is important, and qualitatively, experts themselves do not agree on which explanations are most trustworthy. All code, extended results, and the interview protocol are provided at https://github.com/epfl-ml4ed/trusting-explainers.
    Enhancing Cyber Resilience of Networked Microgrids using Vertical Federated Reinforcement Learning. (arXiv:2212.08973v1 [cs.LG])
    This paper presents a novel federated reinforcement learning (Fed-RL) methodology to enhance the cyber resiliency of networked microgrids. We formulate a resilient reinforcement learning (RL) training setup which (a) generates episodic trajectories injecting adversarial actions at primary control reference signals of the grid forming (GFM) inverters and (b) trains the RL agents (or controllers) to alleviate the impact of the injected adversaries. To circumvent data-sharing issues and concerns for proprietary privacy in multi-party-owned networked grids, we bring in the aspects of federated machine learning and propose a novel Fed-RL algorithm to train the RL agents. To this end, the conventional horizontal Fed-RL approaches using decoupled independent environments fail to capture the coupled dynamics in a networked microgrid, which leads us to propose a multi-agent vertically federated variation of actor-critic algorithms, namely federated soft actor-critic (FedSAC) algorithm. We created a customized simulation setup encapsulating microgrid dynamics in the GridLAB-D/HELICS co-simulation platform compatible with the OpenAI Gym interface for training RL agents. Finally, the proposed methodology is validated with numerical examples of modified IEEE 123-bus benchmark test systems consisting of three coupled microgrids.
    Machine Learning Assessment: implications to cybersecurity. (arXiv:1907.12851v5 [stat.ML] UPDATED)
    This chapter is dedicated to the assessment and performance estimation of machine learning (ML) algorithms, a topic that is equally important to the construction of these algorithms, in particular in the context of cyberphysical security design. The literature is full of nonparametric methods to estimate a statistic from just one available dataset through resampling techniques, e.g., jackknife, bootstrap and cross validation (CV). Special statistics of great interest are the error rate and the area under the ROC curve (AUC) of a classification rule. The importance of these resampling methods stems from the fact that they require no knowledge about the probability distribution of the data or the construction details of the ML algorithm. This chapter provides a concise review of this literature to establish a coherent theoretical framework for these methods that can estimate both the error rate (a one-sample statistic) and the AUC (a two-sample statistic). The resampling methods are usually computationally expensive, because they rely on repeating the training and testing of a ML algorithm after each resampling iteration. Therefore, the practical applicability of some of these methods may be limited to the traditional ML algorithms rather than the very computationally demanding approaches of the recent deep neural networks (DNN). In the field of cyberphysical security, many applications generate structured (tabular) data, which can be fed to all traditional ML approaches. This is in contrast to the DNN approaches, which favor unstructured data, e.g., images, text, voice, etc.; hence, the relevance of this chapter to this field.
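    As a concrete instance of such resampling, the out-of-bag bootstrap estimate of a classifier's error rate needs nothing beyond the dataset and repeated retraining; the following is a generic sketch with a scikit-learn classifier, not tied to this chapter's notation.

        import numpy as np
        from sklearn.datasets import make_classification
        from sklearn.linear_model import LogisticRegression

        X, y = make_classification(n_samples=300, random_state=0)
        rng = np.random.default_rng(0)
        B, errors = 200, []

        for _ in range(B):
            # Resample the training set with replacement; test on the
            # out-of-bag points that the resample left out.
            idx = rng.integers(0, len(X), len(X))
            oob = np.setdiff1d(np.arange(len(X)), idx)
            clf = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
            errors.append(1.0 - clf.score(X[oob], y[oob]))

        print(f"out-of-bag bootstrap error: {np.mean(errors):.3f}")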
    Asymptotics of $\ell_2$ Regularized Network Embeddings. (arXiv:2201.01689v3 [stat.ML] UPDATED)
    A common approach to solving prediction tasks on large networks, such as node classification or link prediction, begins by learning a Euclidean embedding of the nodes of the network, from which traditional machine learning methods can then be applied. This includes methods such as DeepWalk and node2vec, which learn embeddings by optimizing stochastic losses formed over subsamples of the graph at each iteration of stochastic gradient descent. In this paper, we study the effects of adding an $\ell_2$ penalty on the embedding vectors to the training loss of these types of methods. We prove that, under some exchangeability assumptions on the graph, this asymptotically leads to learning a graphon with a nuclear-norm-type penalty, and we give guarantees for the asymptotic distribution of the learned embedding vectors. In particular, the exact form of the penalty depends on the choice of subsampling method used as part of stochastic gradient descent. We also illustrate empirically that concatenating node covariates to $\ell_2$ regularized node2vec embeddings leads to comparable, if not superior, performance relative to methods which incorporate node covariates and the network structure in a non-linear manner.
    Robust Anomaly Map Assisted Multiple Defect Detection with Supervised Classification Techniques. (arXiv:2212.09352v1 [cs.CV])
    Industry 4.0 aims to optimize the manufacturing environment by leveraging new technological advances, such as new sensing capabilities and artificial intelligence. The DRAEM technique has shown state-of-the-art performance for unsupervised classification. The ability to create anomaly maps highlighting areas where defects probably lie can be leveraged to provide cues to supervised classification models and enhance their performance. Our research shows that the best performance is achieved when training a defect detection model by providing an image and the corresponding anomaly map as input. Furthermore, such a setting provides consistent performance when framing the defect detection as a binary or multiclass classification problem and is not affected by class balancing policies. We performed the experiments on three datasets with real-world data provided by Philips Consumer Lifestyle BV.  ( 2 min )
    Training Robots to Evaluate Robots: Example-Based Interactive Reward Functions for Policy Learning. (arXiv:2212.08961v1 [cs.LG])
    Physical interactions can often help reveal information that is not readily apparent. For example, we may tug at a table leg to evaluate whether it is built well, or turn a water bottle upside down to check that it is watertight. We propose to train robots to acquire such interactive behaviors automatically, for the purpose of evaluating the result of an attempted robotic skill execution. These evaluations in turn serve as "interactive reward functions" (IRFs) for training reinforcement learning policies to perform the target skill, such as screwing the table leg tightly. In addition, even after task policies are fully trained, IRFs can serve as verification mechanisms that improve online task execution. For any given task, our IRFs can be conveniently trained using only examples of successful outcomes, and no further specification is needed to train the task policy thereafter. In our evaluations on door locking and weighted block stacking in simulation, and screw tightening on a real robot, IRFs enable large performance improvements, even outperforming baselines with access to demonstrations or carefully engineered rewards. Project website: https://sites.google.com/view/lirf-corl-2022/  ( 2 min )
    Level-$k$ Meta-Learning for Pedestrian-Aware Self-Driving. (arXiv:2212.08800v1 [cs.RO])
    One challenge for self-driving cars is their interactions not only with other vehicles but also with pedestrians in urban environments. The unpredictability of pedestrian behaviors at intersections can lead to a high rate of accidents. The first pedestrian fatality caused by autonomous vehicles was reported in 2018, when a self-driving Uber vehicle struck a woman crossing an intersection at night in Tempe, Arizona. There is a need for machine intelligence that allows autonomous vehicles to control the car and adapt to different pedestrian behaviors to prevent accidents. In this work, (a) we develop a Level-$k$ Meta Reinforcement Learning (LK-MRL) model for vehicle-human interactions and define its solution concept; (b) we test our LK-MRL structure in a scenario where level-$0$ pedestrians interact with a level-$1$ car, compare the trained policy with multiple baseline methods, and demonstrate its advantage in road safety; (c) furthermore, based on the properties of level-$k$ thinking, we test our LK-MRL structure in a scenario where level-$1$ pedestrians interact with a level-$2$ car, and the experimental results verify that LK-MRL retains its advantage: because reinforcement learning produces agents at each level whose strategies are best responses to lower-level thinkers, the framework makes it possible to construct higher-level scenarios.  ( 2 min )
    Subgraph nomination: Query by Example Subgraph Retrieval in Networks. (arXiv:2101.12430v2 [cs.LG] UPDATED)
    This paper introduces the subgraph nomination inference task, in which example subgraphs of interest are used to query a network for similarly interesting subgraphs. This type of problem arises time and again in real-world problems connected to, for example, user recommendation systems and structural retrieval tasks in social and biological/connectomic networks. We formally define the subgraph nomination framework with an emphasis on the notion of a user-in-the-loop in the subgraph nomination pipeline. In this setting, a user can provide additional post-nomination light supervision that can be incorporated into the retrieval task. After introducing and formalizing the retrieval task, we examine the nuanced effect that user-supervision can have on performance, both analytically and across real and simulated data examples.  ( 2 min )
    Hard Sample Aware Network for Contrastive Deep Graph Clustering. (arXiv:2212.08665v1 [cs.LG])
    Contrastive deep graph clustering, which aims to divide nodes into disjoint groups via contrastive mechanisms, is a challenging research topic. Among recent works, hard sample mining-based algorithms have attracted great attention for their promising performance. However, we find that the existing hard sample mining methods have the following two problems. 1) In the hardness measurement, important structural information is overlooked for similarity calculation, degrading the representativeness of the selected hard negative samples. 2) Previous works merely focus on the hard negative sample pairs while neglecting the hard positive sample pairs. Nevertheless, samples within the same cluster but with low similarity should also be carefully learned. To solve these problems, we propose a novel contrastive deep graph clustering method dubbed Hard Sample Aware Network (HSAN) by introducing a comprehensive similarity measure criterion and a general dynamic sample weighting strategy. Concretely, in our algorithm, the similarities between samples are calculated by considering both the attribute embeddings and the structure embeddings, better revealing sample relationships and assisting hardness measurement. Moreover, under the guidance of the carefully collected high-confidence clustering information, our proposed weight-modulating function first recognizes the positive and negative samples and then dynamically up-weights the hard sample pairs while down-weighting the easy ones. In this way, our method can mine not only the hard negative samples but also the hard positive samples, thus further improving the discriminative capability of the samples. Extensive experiments and analyses demonstrate the superiority and effectiveness of our proposed method.
    Discovering Language Model Behaviors with Model-Written Evaluations. (arXiv:2212.09251v1 [cs.CL])
    As language models (LMs) scale, they develop many novel behaviors, good and bad, exacerbating the need to evaluate how they behave. Prior work creates evaluations with crowdwork (which is time-consuming and expensive) or existing data sources (which are not always available). Here, we automatically generate evaluations with LMs. We explore approaches with varying amounts of human effort, from instructing LMs to write yes/no questions to making complex Winogender schemas with multiple stages of LM-based generation and filtering. Crowdworkers rate the examples as highly relevant and agree with 90-100% of labels, sometimes more so than corresponding human-written datasets. We generate 154 datasets and discover new cases of inverse scaling where LMs get worse with size. Larger LMs repeat back a dialog user's preferred answer ("sycophancy") and express greater desire to pursue concerning goals like resource acquisition and goal preservation. We also find some of the first examples of inverse scaling in RL from Human Feedback (RLHF), where more RLHF makes LMs worse. For example, RLHF makes LMs express stronger political views (on gun rights and immigration) and a greater desire to avoid shut down. Overall, LM-written evaluations are high-quality and let us quickly discover many novel LM behaviors.
    Sequence Models for Drone vs Bird Classification. (arXiv:2207.10409v2 [cs.CV] UPDATED)
    Drone detection has become an essential task in object detection as drone costs have decreased and drone technology has improved. It is, however, difficult to detect distant drones under conditions of weak contrast, long range, and low visibility. In this work, we propose several sequence classification architectures to reduce the false-positive ratio of detected drone tracks. Moreover, we propose a new drone vs. bird sequence classification dataset to train and evaluate the proposed architectures. 3D CNN, LSTM, and Transformer based sequence classification architectures have been trained on the proposed dataset to show the effectiveness of the proposed idea. As the experiments show, using sequence information, bird classification and overall F1 scores can be increased by up to 73% and 35%, respectively. Among all sequence classification models, the R(2+1)D-based fully convolutional model yields the best transfer learning and fine-tuning results.
    Meta-Learning Priors for Safe Bayesian Optimization. (arXiv:2210.00762v2 [cs.LG] UPDATED)
    In robotics, optimizing controller parameters under safety constraints is an important challenge. Safe Bayesian optimization (BO) quantifies uncertainty in the objective and constraints to safely guide exploration in such settings. Hand-designing a suitable probabilistic model can be challenging, however. In the presence of unknown safety constraints, it is crucial to choose reliable model hyper-parameters to avoid safety violations. Here, we propose a data-driven approach to this problem by meta-learning priors for safe BO from offline data. We build on a meta-learning algorithm, F-PACOH, capable of providing reliable uncertainty quantification in settings of data scarcity. As our core contribution, we develop a novel framework for choosing safety-compliant priors in a data-driven manner via empirical uncertainty metrics and a frontier search algorithm. On benchmark functions and a high-precision motion system, we demonstrate that our meta-learned priors accelerate the convergence of safe BO approaches while maintaining safety.
    Unified, User and Task (UUT) Centered Artificial Intelligence for Metaverse Edge Computing. (arXiv:2212.09295v1 [cs.AI])
    The Metaverse can be considered the extension of the present-day web, which integrates the physical and virtual worlds, delivering hyper-realistic user experiences. The inception of the Metaverse brings forth many ecosystem services such as content creation, social entertainment, in-world value transfer, intelligent traffic, and healthcare. These services are compute-intensive and require computation offloading onto a Metaverse edge computing server (MECS). Existing Metaverse edge computing approaches do not efficiently and effectively handle resource allocation to ensure the fluid, seamless and hyper-realistic Metaverse experience required for Metaverse ecosystem services. Therefore, we introduce a new Metaverse-compatible, Unified, User and Task (UUT) centered artificial intelligence (AI)-based mobile edge computing (MEC) paradigm, which serves as a concept upon which future AI control algorithms could be built to develop a more user and task-focused MEC.
    Fast Entropy-Based Methods of Word-Level Confidence Estimation for End-To-End Automatic Speech Recognition. (arXiv:2212.08703v1 [eess.AS])
    This paper presents a class of new fast non-trainable entropy-based confidence estimation methods for automatic speech recognition. We show how per-frame entropy values can be normalized and aggregated to obtain a confidence measure per unit and per word for Connectionist Temporal Classification (CTC) and Recurrent Neural Network Transducer (RNN-T) models. The proposed methods have similar computational complexity to the traditional method based on the maximum per-frame probability, but they are more adjustable, have a wider effective threshold range, and better push apart the confidence distributions of correct and incorrect words. We evaluate the proposed confidence measures on LibriSpeech test sets, and show that they are up to 2 and 4 times better than confidence estimation based on the maximum per-frame probability at detecting incorrect words for Conformer-CTC and Conformer-RNN-T models, respectively.  ( 2 min )
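    The aggregation the abstract describes is easy to prototype. The sketch below is a minimal illustration (not the authors' implementation): per-frame posteriors are turned into confidences via entropy normalized by its maximum, then aggregated over the frames of a word; the mean/min/product reduction choices are our assumptions.

```python
import numpy as np

def frame_confidences(probs: np.ndarray) -> np.ndarray:
    """Per-frame confidence from normalized Shannon entropy.

    probs: (T, V) per-frame posteriors over V units.
    Returns values in [0, 1]; 1 means fully confident (zero entropy).
    """
    eps = 1e-12
    entropy = -np.sum(probs * np.log(probs + eps), axis=-1)  # (T,)
    max_entropy = np.log(probs.shape[-1])                    # uniform posterior
    return 1.0 - entropy / max_entropy

def word_confidence(probs: np.ndarray, reduction: str = "prod") -> float:
    """Aggregate frame confidences over the frames aligned to one word."""
    conf = frame_confidences(probs)
    if reduction == "mean":
        return float(conf.mean())
    if reduction == "min":
        return float(conf.min())
    return float(np.prod(conf))  # strict: one uncertain frame drags the word down

# Toy usage: 3 frames over 5 units with peaky posteriors -> high confidence.
T, V = 3, 5
p = np.full((T, V), 0.025)
p[np.arange(T), 0] = 1.0 - 0.025 * (V - 1)
print(word_confidence(p, "mean"))
```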
    Risk-Sensitive Reinforcement Learning with Exponential Criteria. (arXiv:2212.09010v1 [eess.SY])
    While risk-neutral reinforcement learning has shown experimental success in a number of applications, it is well-known to be non-robust with respect to noise and perturbations in the parameters of the system. For this reason, risk-sensitive reinforcement learning algorithms have been studied to introduce robustness and sample efficiency, leading to better real-life performance. In this work, we introduce new model-free risk-sensitive reinforcement learning algorithms as variations of widely-used Policy Gradient algorithms with similar implementation properties. In particular, we study the effect of exponential criteria on the risk-sensitivity of the policy of a reinforcement learning agent, and develop variants of the Monte Carlo Policy Gradient algorithm and the online (temporal-difference) Actor-Critic algorithm. Analytical results show that the use of exponential criteria generalizes commonly used ad-hoc regularization approaches. The implementation, performance, and robustness properties of the proposed methods are evaluated in simulated experiments.  ( 2 min )
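    To make the role of the exponential criterion concrete, here is a minimal sketch of how it reshapes a Monte Carlo policy-gradient (REINFORCE-style) update: trajectories are reweighted by exp(beta * return), so beta < 0 penalizes high-variance outcomes. The mean-one normalization is an illustrative choice, not necessarily the paper's.

```python
import numpy as np

def exp_criterion_weights(returns: np.ndarray, beta: float) -> np.ndarray:
    """Per-trajectory weights for a risk-sensitive REINFORCE update under the
    exponential criterion J = (1/beta) * log E[exp(beta * G)].

    beta < 0 is risk-averse, beta > 0 risk-seeking. Uses a max-shift for
    numerical stability; weights are normalized to mean 1 so the risk-neutral
    update is recovered as beta -> 0.
    """
    z = beta * returns
    z = z - z.max()              # stabilize exp()
    w = np.exp(z)
    return w / w.mean()

# Toy usage: with beta < 0, low-return trajectories are up-weighted,
# steering the policy away from risky behavior.
G = np.array([1.0, 2.0, 10.0, -5.0])
print(exp_criterion_weights(G, beta=-0.5))
```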
    Leveraging Wastewater Monitoring for COVID-19 Forecasting in the US: a Deep Learning study. (arXiv:2212.08798v1 [cs.LG])
    The outbreak of COVID-19 in late 2019 was the start of a health crisis that shook the world and took millions of lives in the ensuing years. Many governments and health officials failed to arrest the rapid circulation of infection in their communities. The long incubation period and the large proportion of asymptomatic cases made COVID-19 particularly elusive to track. However, wastewater monitoring soon became a promising data source in addition to conventional indicators such as confirmed daily cases, hospitalizations, and deaths. Despite the consensus on the effectiveness of wastewater viral load data, there is a lack of methodological approaches that leverage viral load to improve COVID-19 forecasting. This paper proposes using deep learning to automatically discover the relationship between daily confirmed cases and viral load data. We trained a Deep Temporal Convolutional Network (DeepTCN) and a Temporal Fusion Transformer (TFT) model to build a global forecasting model. We supplement the daily confirmed cases with viral loads and other socio-economic factors as covariates to the models. Our results suggest that TFT outperforms DeepTCN and learns a better association between viral load and daily cases. We demonstrated that equipping the models with viral load data improves their forecasting performance significantly. Moreover, viral load is shown to be the second most predictive input, following the containment and health index. Our results reveal the feasibility of training a location-agnostic deep-learning model to capture the dynamics of infection diffusion when wastewater viral load data is provided.  ( 2 min )
    Analysis and Detectability of Offline Data Poisoning Attacks on Linear Systems. (arXiv:2211.08804v3 [eess.SY] UPDATED)
    In recent years, there has been a growing interest in the effects of data poisoning attacks on data-driven control methods. Poisoning attacks are well-known to the Machine Learning community, which, however, make use of assumptions, such as cross-sample independence, that in general do not hold for linear dynamical systems. Consequently, these systems require different attack and detection methods than those developed for supervised learning problems in the i.i.d.\ setting. Since most data-driven control algorithms make use of the least-squares estimator, we study how poisoning impacts the least-squares estimate through the lens of statistical testing, and question in what way data poisoning attacks can be detected. We establish under which conditions the set of models compatible with the data includes the true model of the system, and we analyze different poisoning strategies for the attacker. On the basis of the arguments hereby presented, we propose a stealthy data poisoning attack on the least-squares estimator that can escape classical statistical tests, and conclude by showing the efficiency of the proposed attack.
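    For readers less familiar with the setting, the least-squares estimator under attack here can be stated in a few lines. The sketch below (system matrices and noise level are illustrative) fits [A | B] from one trajectory; a poisoning attack perturbs the recorded (X, U), and the residuals of this fit are exactly what classical statistical tests inspect.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a stable linear system x_{t+1} = A x_t + B u_t + noise.
A_true = np.array([[0.9, 0.1], [0.0, 0.8]])
B_true = np.array([[0.0], [1.0]])
T = 200
X = np.zeros((2, T + 1))
U = rng.normal(size=(1, T))
for t in range(T):
    X[:, t + 1] = A_true @ X[:, t] + B_true @ U[:, t] + 0.05 * rng.normal(size=2)

# Least-squares estimate [A | B] from regressors Z = [X; U].
Z = np.vstack([X[:, :-1], U])             # (3, T)
Theta = X[:, 1:] @ np.linalg.pinv(Z)      # (2, 3)
A_hat, B_hat = Theta[:, :2], Theta[:, 2:]
print(np.linalg.norm(A_hat - A_true))     # small on clean data

# Residuals X_+ - Theta @ Z are what whiteness/correlation tests examine;
# a stealthy attack must perturb the data without making these look anomalous.
residuals = X[:, 1:] - Theta @ Z
```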
    Learning Performance Graphs from Demonstrations via Task-Based Evaluations. (arXiv:2204.05909v2 [cs.RO] UPDATED)
    In the learning from demonstration (LfD) paradigm, understanding and evaluating the demonstrated behaviors plays a critical role in extracting control policies for robots. Without this knowledge, a robot may infer incorrect reward functions that lead to undesirable or unsafe control policies. Recent work has proposed an LfD framework where a user provides a set of formal task specifications to guide LfD, to address the challenge of reward shaping. However, in this framework, specifications are manually ordered in a performance graph (a partial order that specifies relative importance between the specifications). The main contributions of this paper are an algorithm to learn the performance graph directly from the user-provided demonstrations, and a demonstration that the reward functions generated using the learned performance graph yield policies similar to those from manually specified performance graphs. We perform a user study showing that priorities specified by users on behaviors in a simulated highway driving domain match the automatically inferred performance graph. This establishes that we can accurately evaluate user demonstrations with respect to task specifications without expert criteria.  ( 2 min )
    Two-sample test based on Self-Organizing Maps. (arXiv:2212.08960v1 [cs.LG])
    Machine-learning classifiers can be leveraged as a two-sample statistical test. Suppose each sample is assigned a different label and that a classifier can obtain a better-than-chance result discriminating them. In this case, we can infer that both samples originate from different populations. However, many types of models, such as neural networks, behave as a black box for the user: they can reject that both samples originate from the same population, but they do not offer insight into how the two samples differ. Self-Organizing Maps are a dimensionality reduction technique initially devised as a data visualization tool that displays emergent properties; they are also useful for classification tasks. Since they can be used as classifiers, they can serve as a two-sample statistical test, and because their original purpose is visualization, they can also offer insights into how the samples differ.  ( 2 min )
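    The classifier-based test itself fits in a dozen lines. The sketch below uses logistic regression in place of an SOM purely for brevity: train on labeled halves of the two samples, then check whether held-out accuracy beats chance with a one-sided binomial test. The dataset and classifier are illustrative stand-ins.

```python
import numpy as np
from scipy.stats import binom
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Two samples; under H0 they come from the same population.
X0 = rng.normal(0.0, 1.0, size=(500, 5))
X1 = rng.normal(0.3, 1.0, size=(500, 5))   # mean shift -> H0 is false
X = np.vstack([X0, X1])
y = np.r_[np.zeros(500), np.ones(500)]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)
acc = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te)

# Better-than-chance held-out accuracy rejects H0 (one-sided binomial test).
n = len(y_te)
p_value = binom.sf(int(round(acc * n)) - 1, n, 0.5)  # P(correct >= observed | chance)
print(f"accuracy={acc:.3f}, p={p_value:.2e}")
```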
    Index Tracking via Learning to Predict Market Sensitivities. (arXiv:2209.00780v3 [q-fin.PM] UPDATED)
    Index funds are substantially preferred by investors nowadays, and market sensitivities are instrumental in managing index funds. An index fund is a mutual fund aiming to track the returns of a predefined market index (e.g., the S&P 500). A basic strategy to manage an index fund is replicating the index's constituents and weights identically, which is, however, cost-ineffective and impractical. To address this issue, it is required to replicate the index partially with accurately predicted market sensitivities. Accordingly, we propose a novel partial-replication method via learning to predict market sensitivities. We first examine deep-learning models to predict market sensitivities in a supervised manner with our data-processing methods. Then, we propose a partial-index-tracking optimization model controlling the net predicted market sensitivities of the portfolios and index to be the same. The efficacy of these processes is corroborated by our experiments on the Korea Stock Price Index 200. Our experiments show a significant reduction in prediction errors compared with historical estimations, and competitive tracking errors while replicating the index using fewer than half of its constituents. Therefore, we show that applying deep learning to predict market sensitivities is promising and that our portfolio construction methods are practically effective. Additionally, to our knowledge, this is the first study to address market sensitivities with a focus on deep learning.
    Improving Levenberg-Marquardt Algorithm for Neural Networks. (arXiv:2212.08769v1 [cs.LG])
    We explore the use of the Levenberg-Marquardt (LM) algorithm for regression (non-linear least squares) and classification (generalized Gauss-Newton methods) tasks in neural networks. We compare the performance of the LM method with other popular first-order algorithms such as SGD and Adam, as well as other second-order algorithms such as L-BFGS, Hessian-Free, and KFAC. We further speed up the LM method by using adaptive momentum, learning rate line search, and uphill step acceptance.
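    As context for the improvements above, a bare-bones LM loop looks as follows; the paper's contributions (adaptive momentum, learning-rate line search, uphill step acceptance) replace the naive accept/reject damping shown here. The toy problem and constants are illustrative.

```python
import numpy as np

def lm_fit(residual_fn, jac_fn, w, iters=50, lam=1e-2):
    """Minimal Levenberg-Marquardt loop for non-linear least squares.

    residual_fn(w) -> r (m,), jac_fn(w) -> J (m, n). Damping lam shrinks
    after accepted steps (toward Gauss-Newton) and grows after rejected
    ones (toward gradient descent).
    """
    loss = 0.5 * np.sum(residual_fn(w) ** 2)
    for _ in range(iters):
        r, J = residual_fn(w), jac_fn(w)
        # Damped normal equations: (J^T J + lam I) dw = -J^T r
        dw = np.linalg.solve(J.T @ J + lam * np.eye(len(w)), -J.T @ r)
        new_loss = 0.5 * np.sum(residual_fn(w + dw) ** 2)
        if new_loss < loss:
            w, loss, lam = w + dw, new_loss, lam * 0.5
        else:
            lam *= 4.0
    return w

# Toy usage: fit y = exp(a * x) + b.
x = np.linspace(0, 1, 30)
y = np.exp(1.5 * x) + 0.3
res = lambda w: np.exp(w[0] * x) + w[1] - y
jac = lambda w: np.stack([x * np.exp(w[0] * x), np.ones_like(x)], axis=1)
print(lm_fit(res, jac, np.array([0.5, 0.0])))  # approaches [1.5, 0.3]
```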
    Face Generation and Editing with StyleGAN: A Survey. (arXiv:2212.09102v1 [cs.CV])
    Our goal with this survey is to provide an overview of the state-of-the-art deep learning technologies for face generation and editing. We cover the latest popular architectures and discuss key ideas that make them work, such as inversion, latent representation, loss functions, training procedures, editing methods, and cross-domain style transfer. We particularly focus on GAN-based architectures that have culminated in the StyleGAN approaches, which allow generation of high-quality face images and offer rich interfaces for controllable semantics editing and preserving photo quality. We aim to provide an entry point into the field for readers who have a basic knowledge of deep learning and are looking for an accessible introduction and overview.
    Modeling Global Distribution for Federated Learning with Label Distribution Skew. (arXiv:2212.08883v1 [cs.LG])
    Federated learning achieves joint training of deep models by connecting decentralized data sources, which can significantly mitigate the risk of privacy leakage. However, in a more general case, the distributions of labels among clients are different, called ``label distribution skew''. Directly applying conventional federated learning without consideration of the label distribution skew issue significantly hurts the performance of the global model. To this end, we propose a novel federated learning method, named FedMGD, to alleviate the performance degradation caused by the label distribution skew issue. It introduces a global Generative Adversarial Network to model the global data distribution without access to local datasets, so the global model can be trained using the global information of the data distribution without privacy leakage. The experimental results demonstrate that our proposed method significantly outperforms the state-of-the-art on several public benchmarks. Code is available at \url{https://github.com/Sheng-T/FedMGD}.  ( 2 min )
    Spectral Regularized Kernel Two-Sample Tests. (arXiv:2212.09201v1 [math.ST])
    Over the last decade, an approach that has gained a lot of popularity to tackle non-parametric testing problems on general (i.e., non-Euclidean) domains is based on the notion of reproducing kernel Hilbert space (RKHS) embedding of probability distributions. The main goal of our work is to understand the optimality of two-sample tests constructed based on this approach. First, we show that the popular MMD (maximum mean discrepancy) two-sample test is not optimal in terms of the separation boundary measured in Hellinger distance. Second, we propose a modification to the MMD test based on spectral regularization by taking into account the covariance information (which is not captured by the MMD test) and prove the proposed test to be minimax optimal with a smaller separation boundary than that achieved by the MMD test. Third, we propose an adaptive version of the above test which involves a data-driven strategy to choose the regularization parameter and show the adaptive test to be almost minimax optimal up to a logarithmic factor. Moreover, our results hold for the permutation variant of the test where the test threshold is chosen elegantly through the permutation of the samples. Through numerical experiments on synthetic and real-world data, we demonstrate the superior performance of the proposed test in comparison to the MMD test.
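    For orientation, the baseline MMD permutation test that the paper improves upon can be written compactly; the spectral-regularized variant additionally incorporates covariance information, which this sketch deliberately omits. Kernel bandwidth and sample sizes are illustrative.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

def mmd2(X, Y, sigma=1.0):
    """Biased (V-statistic) estimate of squared MMD."""
    return (rbf_kernel(X, X, sigma).mean()
            + rbf_kernel(Y, Y, sigma).mean()
            - 2 * rbf_kernel(X, Y, sigma).mean())

def permutation_test(X, Y, n_perm=200, sigma=1.0, seed=0):
    """p-value with the threshold chosen by permuting the pooled sample."""
    rng = np.random.default_rng(seed)
    stat = mmd2(X, Y, sigma)
    pooled, n = np.vstack([X, Y]), len(X)
    count = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))
        count += mmd2(pooled[idx[:n]], pooled[idx[n:]], sigma) >= stat
    return (count + 1) / (n_perm + 1)

rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, (100, 2))
Y = rng.normal(0.5, 1.0, (100, 2))
print(permutation_test(X, Y))   # small p-value: the distributions differ
```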
    COVID-19 Detection Based on Self-Supervised Transfer Learning Using Chest X-Ray Images. (arXiv:2212.09276v1 [eess.IV])
    Purpose: Considering the number of patients screened during the COVID-19 pandemic, computer-aided detection has strong potential in assisting clinical workflow efficiency and reducing the incidence of infections among radiologists and healthcare providers. Since many confirmed COVID-19 cases present radiological findings of pneumonia, radiologic examinations can be useful for fast detection. Therefore, chest radiography can be used to quickly screen for COVID-19 during patient triage, thereby determining the priority of patients' care to help saturated medical facilities in a pandemic situation. Methods: In this paper, we propose a new learning scheme called self-supervised transfer learning for detecting COVID-19 from chest X-ray (CXR) images. We compared six self-supervised learning (SSL) methods (Cross, BYOL, SimSiam, SimCLR, PIRL-jigsaw, and PIRL-rotation) with the proposed method. Additionally, we compared six pretrained DCNNs (ResNet18, ResNet50, ResNet101, CheXNet, DenseNet201, and InceptionV3) with the proposed method. We provide a quantitative evaluation on the largest open COVID-19 CXR dataset and qualitative results for visual inspection. Results: Our method achieved a harmonic mean (HM) score of 0.985, an AUC of 0.999, and a four-class accuracy of 0.953. We also used the visualization technique Grad-CAM++ to generate visual explanations of different classes of CXR images with the proposed method to increase the interpretability. Conclusions: Our method shows that the knowledge learned from natural images using transfer learning is beneficial for SSL of the CXR images and boosts the performance of representation learning for COVID-19 detection. Our method promises to reduce the incidence of infections among radiologists and healthcare providers.  ( 3 min )
    Time-reversal equivariant neural network potential and Hamiltonian for magnetic materials. (arXiv:2211.11403v2 [cond-mat.mtrl-sci] UPDATED)
    This work presents the Time-reversal Equivariant Neural Network (TENN) framework. In TENN, time-reversal symmetry is incorporated into the equivariant neural network (ENN), generalizing the ENN to physical quantities related to time-reversal symmetry, such as the spin and velocity of atoms. TENN-e3, the time-reversal extension of the E(3) equivariant neural network, is developed to preserve time-reversal E(3) equivariance, with or without the spin-orbit effect, for both collinear and non-collinear magnetic moments in magnetic materials. TENN-e3 can construct spin neural network potentials and the Hamiltonians of magnetic materials from ab-initio calculations. Time-reversal-E(3)-equivariant convolutions for interactions of spinor and geometric tensors are employed in TENN-e3. Compared with popular ENNs, TENN-e3 can describe complex spin-lattice coupling with high accuracy and preserves time-reversal symmetry, which is not preserved in existing E(3)-equivariant models. The Hamiltonian of a magnetic material with time-reversal symmetry can likewise be built with TENN-e3. TENN paves a new way for spin-lattice dynamics simulations over long time scales and electronic structure calculations of large-scale magnetic materials.
    iCub! Do you recognize what I am doing?: multimodal human action recognition on multisensory-enabled iCub robot. (arXiv:2212.08859v1 [cs.RO])
    This study uses multisensory data (i.e., color and depth) to recognize human actions in the context of multimodal human-robot interaction. Here we employed the iCub robot to observe the predefined actions of human partners using four different tools on 20 objects. We show that the proposed multimodal ensemble learning leverages the complementary characteristics of three color cameras and one depth sensor, improving recognition accuracy in most cases compared to models trained with a single modality. The results indicate that the proposed models can be deployed on the iCub robot in settings that require multimodal action recognition, including social tasks such as partner-specific adaptation and contextual behavior understanding.
    A General Stochastic Optimization Framework for Convergence Bidding. (arXiv:2210.06543v2 [math.OC] UPDATED)
    Convergence (virtual) bidding is an important part of two-settlement electric power markets as it can effectively reduce discrepancies between the day-ahead and real-time markets. Consequently, there is extensive research into the bidding strategies of virtual participants aiming to obtain optimal bids to submit to the day-ahead market. In this paper, we introduce a price-based general stochastic optimization framework to obtain optimal convergence bid curves. Within this framework, we develop a computationally tractable linear programming-based optimization model, which produces bid prices and volumes simultaneously. We also show that different approximations and simplifications in the general model lead naturally to state-of-the-art convergence bidding approaches, such as self-scheduling and opportunistic approaches. Our general framework also provides a straightforward way to compare the performance of these models, which is demonstrated by numerical experiments on the California (CAISO) market.
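    A heavily stylized version of the self-scheduling special case can be posed as a linear program in a few lines; the paper's general model instead produces full bid curves (prices and volumes) across hours, which this single-node toy with fixed price forecasts omits. All numbers are illustrative.

```python
import numpy as np
from scipy.optimize import linprog

# Choose virtual supply (INC) and demand (DEC) volumes at one node to
# maximize the expected day-ahead/real-time spread, within a position limit.
exp_da = 42.0    # expected day-ahead price ($/MWh), illustrative
exp_rt = 38.5    # expected real-time price ($/MWh), illustrative
limit = 50.0     # max MWh per direction

# Variables [inc, dec]: INC earns DA - RT, DEC earns RT - DA.
# linprog minimizes, so negate expected profit.
c = [-(exp_da - exp_rt), -(exp_rt - exp_da)]
res = linprog(c, bounds=[(0, limit), (0, limit)], method="highs")
print(res.x)     # here: full INC position, since exp_da > exp_rt
```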
    2D Pose Estimation based Child Action Recognition. (arXiv:2212.09027v1 [cs.CV])
    We present, for the first time, a graph convolutional network with 2D pose estimation for the child action recognition task, achieving results on par with an RGB-modality-based model on a novel benchmark dataset of videos captured in unconstrained environments.
    Cascaded Compositional Residual Learning for Complex Interactive Behaviors. (arXiv:2212.08954v1 [cs.RO])
    Real-world autonomous missions often require rich interaction with nearby objects, such as doors or switches, along with effective navigation. However, such complex behaviors are difficult to learn because they involve both high-level planning and low-level motor control. We present a novel framework, Cascaded Compositional Residual Learning (CCRL), which learns composite skills by recursively leveraging a library of previously learned control policies. Our framework learns multiplicative policy composition, task-specific residual actions, and synthetic goal information simultaneously while freezing the prerequisite policies. We further explicitly control the style of the motion by regularizing residual actions. We show that our framework learns joint-level control policies for a diverse set of motor skills ranging from basic locomotion to complex interactive navigation, including navigating around obstacles, pushing objects, crawling under a table, pushing a door open with its leg, and holding it open while walking through it. The proposed CCRL framework leads to policies with consistent styles and lower joint torques, which we successfully transfer to a real Unitree A1 robot without any additional fine-tuning.
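    One plausible reading of the multiplicative composition plus residual described above is a precision-weighted product of frozen Gaussian policies with a learned correction added to the fused mean; the sketch below illustrates that reading and is not CCRL's exact operator. Names and numbers are illustrative.

```python
import numpy as np

def compose_gaussian_policies(mus, sigmas, weights, residual):
    """Weighted product of K frozen Gaussian policies (precision-weighted
    fusion) plus a task-specific residual action. Illustrative only."""
    prec = weights / sigmas**2                 # (K, act_dim) precisions
    var = 1.0 / prec.sum(axis=0)
    mean = var * (prec * mus).sum(axis=0)
    return mean + residual, np.sqrt(var)

# Toy usage: fuse a "walk" and a "push" prerequisite policy (3-dim action).
mus = np.array([[0.2, 0.0, -0.1], [0.5, 0.3, 0.0]])
sigmas = np.array([[0.1, 0.2, 0.1], [0.3, 0.1, 0.2]])
weights = np.array([[0.7], [0.3]])             # from the composition network
action, std = compose_gaussian_policies(mus, sigmas, weights,
                                        residual=np.zeros(3))
print(action, std)
```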
    Impact of Sentiment Analysis in Fake Review Detection. (arXiv:2212.08995v1 [cs.CL])
    Fake review identification is an important topic that has gained the interest of experts all around the world. Identifying fake reviews is challenging for researchers, and there are several primary obstacles to fake review detection. We develop an initial study investigating fake reviews using sentiment analysis. Ten research papers on fake reviews are identified; they discuss currently available solutions for predicting or detecting fake reviews, and they show the distribution of fake and truthful reviews through the analysis of sentiment. We summarize and compare previous studies related to fake reviews. We highlight the most significant challenges in the sentiment evaluation process and demonstrate that there is a significant impact on the sentiment scores used to identify fake feedback.
    Disease2Vec: Representing Alzheimer's Progression via Disease Embedding Tree. (arXiv:2102.06847v2 [q-bio.NC] UPDATED)
    For decades, a variety of predictive approaches have been proposed and evaluated in terms of their prediction capability for Alzheimer's Disease (AD) and its precursor, mild cognitive impairment (MCI). Most of them focused on the prediction or identification of statistical differences among different clinical groups or phases (e.g., longitudinal studies). The continuous nature of AD development and the transition states between successive AD-related stages have been overlooked, especially in binary or multi-class classification. Though a few progression models of AD have been studied recently, they were mainly designed to determine and compare the order of specific biomarkers. How to effectively predict the individual patient's status within a wide spectrum of continuous AD progression has been largely overlooked. In this work, we developed a novel learning-based embedding framework to encode the intrinsic relations among AD-related clinical stages by a set of meaningful embedding vectors in the latent space (Disease2Vec). We name this process disease embedding. By disease embedding, the framework generates a disease embedding tree (DETree) which effectively represents different clinical stages as a tree trajectory reflecting AD progression and thus can be used to predict clinical status by projecting individuals onto this continuous trajectory. Through this model, DETree can not only perform efficient and accurate prediction for patients at any stage of AD development (across five clinical groups instead of the typical two), but also provide richer status information by examining the projecting locations within a wide and continuous AD progression process.
    Managing Temporal Resolution in Continuous Value Estimation: A Fundamental Trade-off. (arXiv:2212.08949v1 [cs.LG])
    A default assumption in reinforcement learning and optimal control is that experience arrives at discrete time points on a fixed clock cycle. Many applications, however, involve continuous systems where the time discretization is not fixed but instead can be managed by a learning algorithm. By analyzing Monte-Carlo value estimation for LQR systems in both finite-horizon and infinite-horizon settings, we uncover a fundamental trade-off between approximation and statistical error in value estimation. Importantly, these two errors behave differently with respect to time discretization, which implies that there is an optimal choice for the temporal resolution that depends on the data budget. These findings show how adapting the temporal resolution can provably improve value estimation quality in LQR systems from finite data. Empirically, we demonstrate the trade-off in numerical simulations of LQR instances and several non-linear environments.
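    The trade-off is easy to probe numerically. In the sketch below (dynamics, cost, and horizon are illustrative), a scalar linear system is discretized at step delta and its cost is estimated by Monte Carlo: coarser steps incur more discretization bias in the Riemann-sum cost, while finer steps spend the rollout budget on more, noisier increments.

```python
import numpy as np

def mc_value(delta, horizon=5.0, n_rollouts=200, seed=0):
    """Monte Carlo cost estimate for dx = a*x dt + sigma dW with running
    cost q*x^2, discretized with an Euler-Maruyama step of size delta."""
    rng = np.random.default_rng(seed)
    a, q, sigma, x0 = -1.0, 1.0, 0.5, 1.0
    n_steps = int(horizon / delta)
    totals = np.zeros(n_rollouts)
    for i in range(n_rollouts):
        x, total = x0, 0.0
        for _ in range(n_steps):
            total += q * x * x * delta          # Riemann-sum approximation
            x += a * x * delta + sigma * np.sqrt(delta) * rng.normal()
        totals[i] = total
    return totals.mean(), totals.std() / np.sqrt(n_rollouts)

for d in (0.5, 0.1, 0.02):
    mean, se = mc_value(d)
    print(f"delta={d:>4}: value ~ {mean:.3f} +/- {se:.3f}")
```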
    Probabilistic machine learning based predictive and interpretable digital twin for dynamical systems. (arXiv:2212.09240v1 [stat.ML])
    A framework for creating and updating digital twins for dynamical systems from a library of physics-based functions is proposed. Sparse Bayesian machine learning is used to update and derive an interpretable expression for the digital twin. Two approaches for updating the digital twin are proposed. The first approach makes use of both the input and output information from a dynamical system, whereas the second approach utilizes output-only observations to update the digital twin. Both methods use a library of candidate functions representing certain physics to infer new perturbation terms in the existing digital twin model. In both cases, the resulting expressions of the updated digital twins are identical, and in addition, the epistemic uncertainties are quantified. In the first approach, the regression problem is derived from a state-space model, whereas in the latter case, the output-only information is treated as a stochastic process. The concepts of It\^o calculus and the Kramers-Moyal expansion are utilized to derive the regression equation. The performance of the proposed approaches is demonstrated using highly nonlinear dynamical systems such as the crack-degradation problem. Numerical results demonstrated in this paper almost exactly identify the correct perturbation terms along with their associated parameters in the dynamical system. The probabilistic nature of the proposed approach also helps in quantifying the uncertainties associated with updated models. The proposed approaches provide an exact and explainable description of the perturbations in digital twin models, which can be directly used for better cyber-physical integration, long-term future predictions, degradation monitoring, and model-agnostic control.
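    To give a feel for how perturbation terms are inferred from a library, here is a rough non-Bayesian stand-in: sequentially thresholded least squares selects active candidate functions, whereas the paper's sparse Bayesian treatment additionally quantifies the epistemic uncertainty of the selected terms. The toy system is illustrative.

```python
import numpy as np

def stlsq(Theta, target, threshold=0.1, iters=10):
    """Sequentially thresholded least squares over a candidate library."""
    xi = np.linalg.lstsq(Theta, target, rcond=None)[0]
    for _ in range(iters):
        small = np.abs(xi) < threshold
        xi[small] = 0.0
        big = ~small
        if big.any():
            xi[big] = np.linalg.lstsq(Theta[:, big], target, rcond=None)[0]
    return xi

# Toy usage: recover the perturbation in dx/dt = -x - 0.5*x^3 from data.
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, 400)
dx = -x - 0.5 * x**3 + 0.01 * rng.normal(size=x.size)
Theta = np.stack([x, x**2, x**3, np.sin(x)], axis=1)  # candidate library
print(stlsq(Theta, dx))   # nonzero weights only on x and x^3
```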
    Machine-Learning Compression for Particle Physics Discoveries. (arXiv:2210.11489v2 [hep-ph] UPDATED)
    In collider-based particle and nuclear physics experiments, data are produced at such extreme rates that only a subset can be recorded for later analysis. Typically, algorithms select individual collision events for preservation and store the complete experimental response. A relatively new alternative strategy is to additionally save a partial record for a larger subset of events, allowing for later specific analysis of a larger fraction of events. We propose a strategy that bridges these paradigms by compressing entire events for generic offline analysis but at a lower fidelity. An optimal-transport-based $\beta$ Variational Autoencoder (VAE) is used to automate the compression and the hyperparameter $\beta$ controls the compression fidelity. We introduce a new approach for multi-objective learning functions by simultaneously learning a VAE appropriate for all values of $\beta$ through parameterization. We present an example use case, a di-muon resonance search at the Large Hadron Collider (LHC), where we show that simulated data compressed by our $\beta$-VAE has enough fidelity to distinguish distinct signal morphologies.
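    The parameterization idea, learning one VAE that remains valid for all values of $\beta$, amounts to feeding $\beta$ to the networks and weighting the KL term by it. The sketch below shows that idea only; the architecture, sizes, and conditioning choice are our assumptions, and the paper's optimal-transport reconstruction term is replaced by a plain squared error.

```python
import torch
import torch.nn as nn

class BetaConditionedVAE(nn.Module):
    """VAE whose encoder/decoder take log(beta) as an extra input, so one
    model spans a family of compression fidelities (illustrative sketch)."""

    def __init__(self, x_dim=16, z_dim=4, hidden=64):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim + 1, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * z_dim))
        self.dec = nn.Sequential(nn.Linear(z_dim + 1, hidden), nn.ReLU(),
                                 nn.Linear(hidden, x_dim))

    def loss(self, x, beta):
        b = torch.log(beta).expand(x.shape[0], 1)       # condition on beta
        mu, logvar = self.enc(torch.cat([x, b], -1)).chunk(2, -1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        recon = ((self.dec(torch.cat([z, b], -1)) - x) ** 2).sum(-1)
        kl = 0.5 * (mu**2 + logvar.exp() - 1 - logvar).sum(-1)
        return (recon + beta * kl).mean()               # beta sets the fidelity

# Training would sample beta per batch so a single network learns all fidelities.
model = BetaConditionedVAE()
print(model.loss(torch.randn(8, 16), torch.tensor(0.5)))
```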
    Towards Developing Safety Assurance Cases for Learning-Enabled Medical Cyber-Physical Systems. (arXiv:2211.15413v2 [cs.LG] UPDATED)
    Machine Learning (ML) technologies have been increasingly adopted in Medical Cyber-Physical Systems (MCPS) to enable smart healthcare. Assuring the safety and effectiveness of learning-enabled MCPS is challenging, as such systems must account for diverse patient profiles and physiological dynamics and handle operational uncertainties. In this paper, we develop a safety assurance case for ML controllers in learning-enabled MCPS, with an emphasis on establishing confidence in the ML-based predictions. We present the safety assurance case in detail for Artificial Pancreas Systems (APS) as a representative application of learning-enabled MCPS, and provide a detailed analysis by implementing a deep neural network for the prediction in APS. We check the sufficiency of the ML data and analyze the correctness of the ML-based prediction using formal verification. Finally, we outline open research problems based on our experience in this paper.
    TopoImb: Toward Topology-level Imbalance in Learning from Graphs. (arXiv:2212.08689v1 [cs.LG])
    Graph serves as a powerful tool for modeling data that has an underlying structure in non-Euclidean space, by encoding relations as edges and entities as nodes. Despite developments in learning from graph-structured data over the years, one obstacle persists: graph imbalance. Although several attempts have been made to target this problem, they are limited to considering only class-level imbalance. In this work, we argue that for graphs, the imbalance is likely to exist at the sub-class topology group level. Due to the flexibility of topology structures, graphs could be highly diverse, and learning a generalizable classification boundary would be difficult. Therefore, several majority topology groups may dominate the learning process, rendering others under-represented. To address this problem, we propose a new framework TopoImb and design (1) a topology extractor, which automatically identifies the topology group for each instance with explicit memory cells, and (2) a training modulator, which modulates the learning process of the target GNN model to prevent topology-group-wise under-representation. TopoImb can be used as a key component in GNN models to improve their performance under the data imbalance setting. Analyses of both topology-level imbalance and the proposed TopoImb are provided theoretically, and we empirically verify its effectiveness with both node-level and graph-level classification as the target tasks.
    JFP: Joint Future Prediction with Interactive Multi-Agent Modeling for Autonomous Driving. (arXiv:2212.08710v1 [cs.MA])
    We propose JFP, a Joint Future Prediction model that can learn to generate accurate and consistent multi-agent future trajectories. For this task, many different methods have been proposed to capture social interactions in the encoding part of the model, however, considerably less focus has been placed on representing interactions in the decoder and output stages. As a result, the predicted trajectories are not necessarily consistent with each other, and often result in unrealistic trajectory overlaps. In contrast, we propose an end-to-end trainable model that learns directly the interaction between pairs of agents in a structured, graphical model formulation in order to generate consistent future trajectories. It sets new state-of-the-art results on Waymo Open Motion Dataset (WOMD) for the interactive setting. We also investigate a more complex multi-agent setting for both WOMD and a larger internal dataset, where our approach improves significantly on the trajectory overlap metrics while obtaining on-par or better performance on single-agent trajectory metrics.
    Leveraging Natural Language Processing to Mine Issues on Twitter During the COVID-19 Pandemic. (arXiv:2011.00377v2 [cs.IR] CROSS LISTED)
    The recent global outbreak of the coronavirus disease (COVID-19) has spread to all corners of the globe. The international travel ban, panic buying, and the need for self-quarantine are among the many social challenges brought about in this new era. Twitter platforms have been used in various public health studies to identify public opinion about an event at the local and global scale. To understand the public concerns and responses to the pandemic, a system is needed that can leverage machine learning techniques to filter out irrelevant tweets and identify the important topics of discussion on social media platforms like Twitter. In this study, we constructed a system to identify the relevant tweets related to the COVID-19 pandemic from January 1, 2020 through April 30, 2020, and explored topic modeling to identify the most discussed topics and themes during this period in our data set. Additionally, we analyzed the temporal changes in the topics with respect to the events that occurred during this pandemic. We found that eight topics were sufficient to identify the themes in our corpus. These topics depicted a temporal trend. The dominant topics vary over time and align with the events related to the COVID-19 pandemic.
    Distributed Distributionally Robust Optimization with Non-Convex Objectives. (arXiv:2210.07588v2 [cs.LG] UPDATED)
    Distributionally Robust Optimization (DRO), which aims to find an optimal decision that minimizes the worst-case cost over the ambiguity set of probability distributions, has been widely applied in diverse applications, e.g., network behavior analysis, risk management, etc. However, existing DRO techniques face three key challenges: 1) how to deal with asynchronous updating in a distributed environment; 2) how to leverage the prior distribution effectively; 3) how to properly adjust the degree of robustness according to different scenarios. To this end, we propose an asynchronous distributed algorithm, named Asynchronous Single-looP alternatIve gRadient projEction (ASPIRE) algorithm with the itErative Active SEt method (EASE), to tackle the distributed distributionally robust optimization (DDRO) problem. Furthermore, a new uncertainty set, i.e., the constrained D-norm uncertainty set, is developed to effectively leverage the prior distribution and flexibly control the degree of robustness. Finally, our theoretical analysis elucidates that the proposed algorithm is guaranteed to converge, and the iteration complexity is also analyzed. Extensive empirical studies on real-world datasets demonstrate that the proposed method can not only achieve fast convergence and remain robust against data heterogeneity as well as malicious attacks, but also trade off robustness against performance.
    JEMMA: An Extensible Java Dataset for ML4Code Applications. (arXiv:2212.09132v1 [cs.SE])
    Machine Learning for Source Code (ML4Code) is an active research field in which extensive experimentation is needed to discover how to best use source code's richly structured information. With this in mind, we introduce JEMMA, an Extensible Java Dataset for ML4Code Applications, which is a large-scale, diverse, and high-quality dataset targeted at ML4Code. Our goal with JEMMA is to lower the barrier to entry in ML4Code by providing the building blocks to experiment with source code models and tasks. JEMMA comes with a considerable amount of pre-processed information such as metadata, representations (e.g., code tokens, ASTs, graphs), and several properties (e.g., metrics, static analysis results) for 50,000 Java projects from the 50KC dataset, with over 1.2 million classes and over 8 million methods. JEMMA is also extensible, allowing users to add new properties and representations to the dataset and evaluate tasks on them. Thus, JEMMA becomes a workbench that researchers can use to experiment with novel representations and tasks operating on source code. To demonstrate the utility of the dataset, we also report results from two empirical studies on our data, ultimately showing that significant work lies ahead in the design of context-aware source code models that can reason over a broader network of source code entities in a software project, the very task that JEMMA is designed to help with.
    An unfolding method based on conditional Invertible Neural Networks (cINN) using iterative training. (arXiv:2212.08674v1 [hep-ph])
    The unfolding of detector effects is crucial for the comparison of data to theory predictions. While traditional methods are limited to representing the data in a low number of dimensions, machine learning has enabled new unfolding techniques while retaining the full dimensionality. Generative networks like invertible neural networks (INN) enable a probabilistic unfolding, which maps individual events to their corresponding unfolded probability distribution. The accuracy of such methods is, however, limited by how well the simulated training samples model the actual data that is unfolded. We introduce the iterative conditional INN (IcINN) for unfolding, which adjusts for deviations between simulated training samples and data. The IcINN unfolding is first validated on toy data and then applied to pseudo-data for the $pp \to Z \gamma \gamma$ process.
    Asymptotics of Network Embeddings Learned via Subsampling. (arXiv:2107.02363v3 [stat.ML] UPDATED)
    Network data are ubiquitous in modern machine learning, with tasks of interest including node classification, node clustering and link prediction. A frequent approach begins by learning a Euclidean embedding of the network, to which algorithms developed for vector-valued data are applied. For large networks, embeddings are learned using stochastic gradient methods where the sub-sampling scheme can be freely chosen. Despite the strong empirical performance of such methods, they are not well understood theoretically. Our work encapsulates representation methods using a subsampling approach, such as node2vec, into a single unifying framework. We prove, under the assumption that the graph is exchangeable, that the distribution of the learned embedding vectors asymptotically decouples. Moreover, we characterize the asymptotic distribution and provide rates of convergence in terms of the latent parameters, including the choice of loss function and the embedding dimension. This provides a theoretical foundation to understand what the embedding vectors represent and how well these methods perform on downstream tasks. Notably, we observe that typically used loss functions may lead to shortcomings, such as a lack of Fisher consistency.  ( 2 min )
    Bounding Membership Inference. (arXiv:2202.12232v4 [cs.LG] UPDATED)
    Differential Privacy (DP) is the de facto standard for reasoning about the privacy guarantees of a training algorithm. Despite the empirical observation that DP reduces the vulnerability of models to existing membership inference (MI) attacks, a theoretical underpinning as to why this is the case is largely missing in the literature. In practice, this means that models need to be trained with DP guarantees that greatly decrease their accuracy. In this paper, we provide a tighter bound on the positive accuracy (i.e., attack precision) of any MI adversary when a training algorithm provides $(\varepsilon, \delta)$-DP. Our bound informs the design of a novel privacy amplification scheme: an effective training set is sub-sampled from a larger set prior to the beginning of training. We find this greatly reduces the bound on MI positive accuracy. As a result, our scheme allows the use of looser DP guarantees to limit the success of any MI adversary; this ensures that the model's accuracy is less impacted by the privacy guarantee. While this clearly benefits entities working with far more data than they need to train on, it can also improve the accuracy-privacy trade-off on benchmarks studied in the academic literature. Consequently, we also find that subsampling decreases the effectiveness of a state-of-the-art MI attack (LiRA) much more effectively than training with stronger DP guarantees on MNIST and CIFAR10. We conclude by discussing implications of our MI bound on the field of machine unlearning.  ( 2 min )
    Iso-Dream: Isolating and Leveraging Noncontrollable Visual Dynamics in World Models. (arXiv:2205.13817v3 [cs.LG] UPDATED)
    World models learn the consequences of actions in vision-based interactive systems. However, in practical scenarios such as autonomous driving, there commonly exists noncontrollable dynamics independent of the action signals, making it difficult to learn effective world models. To tackle this problem, we present a novel reinforcement learning approach named Iso-Dream, which improves the Dream-to-Control framework in two aspects. First, by optimizing the inverse dynamics, we encourage the world model to learn controllable and noncontrollable sources of spatiotemporal changes on isolated state transition branches. Second, we optimize the behavior of the agent on the decoupled latent imaginations of the world model. Specifically, to estimate state values, we roll out the noncontrollable states into the future and associate them with the current controllable state. In this way, the isolation of dynamics sources can greatly benefit long-horizon decision-making of the agent, such as a self-driving car that can avoid potential risks by anticipating the movement of other vehicles. Experiments show that Iso-Dream is effective in decoupling the mixed dynamics and remarkably outperforms existing approaches in a wide range of visual control and prediction domains.  ( 2 min )
    The One-Inclusion Graph Algorithm is not Always Optimal. (arXiv:2212.09270v1 [cs.LG])
    The one-inclusion graph algorithm of Haussler, Littlestone, and Warmuth achieves an optimal in-expectation risk bound in the standard PAC classification setup. In one of the first COLT open problems, Warmuth conjectured that this prediction strategy always implies an optimal high probability bound on the risk, and hence is also an optimal PAC algorithm. We refute this conjecture in the strongest sense: for any practically interesting Vapnik-Chervonenkis class, we provide an in-expectation optimal one-inclusion graph algorithm whose high probability risk bound cannot go beyond that implied by Markov's inequality. Our construction of these poorly performing one-inclusion graph algorithms uses Varshamov-Tenengolts error correcting codes. Our negative result has several implications. First, it shows that the same poor high-probability performance is inherited by several recent prediction strategies based on generalizations of the one-inclusion graph algorithm. Second, our analysis shows yet another statistical problem that enjoys an estimator that is provably optimal in expectation via a leave-one-out argument, but fails in the high-probability regime. This discrepancy occurs despite the boundedness of the binary loss for which arguments based on concentration inequalities often provide sharp high probability risk bounds.  ( 2 min )
    Estimating the Adversarial Robustness of Attributions in Text with Transformers. (arXiv:2212.09155v1 [cs.LG])
    Explanations are crucial parts of deep neural network (DNN) classifiers. In high stakes applications, faithful and robust explanations are important to understand and gain trust in DNN classifiers. However, recent work has shown that state-of-the-art attribution methods in text classifiers are susceptible to imperceptible adversarial perturbations that alter explanations significantly while maintaining the correct prediction outcome. If undetected, this can critically mislead the users of DNNs. Thus, it is crucial to understand the influence of such adversarial perturbations on the networks' explanations and their perceptibility. In this work, we establish a novel definition of attribution robustness (AR) in text classification, based on Lipschitz continuity. Crucially, it reflects both attribution change induced by adversarial input alterations and perceptibility of such alterations. Moreover, we introduce a wide set of text similarity measures to effectively capture locality between two text samples and imperceptibility of adversarial perturbations in text. We then propose our novel TransformerExplanationAttack (TEA), a strong adversary that provides a tight estimation for attribution robustness in text classification. TEA uses state-of-the-art language models to extract word substitutions that result in fluent, contextual adversarial samples. Finally, with experiments on several text classification architectures, we show that TEA consistently outperforms current state-of-the-art AR estimators, yielding perturbations that alter explanations to a greater extent while being more fluent and less perceptible.  ( 2 min )
    Censored Quantile Regression Neural Networks for Distribution-Free Survival Analysis. (arXiv:2205.13496v3 [stat.ML] UPDATED)
    This paper considers doing quantile regression on censored data using neural networks (NNs). This adds to the survival analysis toolkit by allowing direct prediction of the target variable, along with a distribution-free characterisation of uncertainty, using a flexible function approximator. We begin by showing how an algorithm popular in linear models can be applied to NNs. However, the resulting procedure is inefficient, requiring sequential optimisation of an individual NN at each desired quantile. Our major contribution is a novel algorithm that simultaneously optimises a grid of quantiles output by a single NN. To offer theoretical insight into our algorithm, we show firstly that it can be interpreted as a form of expectation-maximisation, and secondly that it exhibits a desirable `self-correcting' property. Experimentally, the algorithm produces quantiles that are better calibrated than existing methods on 10 out of 12 real datasets.  ( 2 min )
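    The core of the simultaneous-grid idea is a pinball loss summed over quantile heads of one network. The sketch below shows that shared loss for uncensored observations only; the paper's handling of censoring and its self-correcting analysis are deliberately left out, and the model and data are illustrative.

```python
import torch

def grid_pinball_loss(preds, target, quantiles):
    """Pinball (quantile) loss averaged over a grid of quantiles.

    preds: (batch, Q) outputs of one network, a column per quantile.
    target: (batch,) observed values (censoring is not handled here).
    """
    err = target.unsqueeze(1) - preds          # (batch, Q)
    q = quantiles.unsqueeze(0)                 # (1, Q)
    return torch.maximum(q * err, (q - 1) * err).mean()

# Toy usage: one linear head per quantile, optimized jointly.
torch.manual_seed(0)
x = torch.randn(256, 3)
t = x @ torch.tensor([1.0, -2.0, 0.5]) + torch.randn(256)
quantiles = torch.tensor([0.1, 0.5, 0.9])
model = torch.nn.Linear(3, len(quantiles))
opt = torch.optim.Adam(model.parameters(), lr=0.05)
for _ in range(200):
    opt.zero_grad()
    loss = grid_pinball_loss(model(x), t, quantiles)
    loss.backward()
    opt.step()
print(loss.item())
```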
    A Complete Characterization of Linear Estimators for Offline Policy Evaluation. (arXiv:2203.04236v2 [cs.LG] UPDATED)
    Offline policy evaluation is a fundamental statistical problem in reinforcement learning that involves estimating the value function of some decision-making policy given data collected by a potentially different policy. In order to tackle problems with complex, high-dimensional observations, there has been significant interest from theoreticians and practitioners alike in understanding the possibility of function approximation in reinforcement learning. Despite significant study, a sharp characterization of when we might expect offline policy evaluation to be tractable, even in the simplest setting of linear function approximation, has so far remained elusive, with a surprising number of strong negative results recently appearing in the literature. In this work, we identify simple control-theoretic and linear-algebraic conditions that are necessary and sufficient for classical methods, in particular Fitted Q-iteration (FQI) and least squares temporal difference learning (LSTD), to succeed at offline policy evaluation. Using this characterization, we establish a precise hierarchy of regimes under which these estimators succeed. We prove that LSTD works under strictly weaker conditions than FQI. Furthermore, we establish that if a problem is not solvable via LSTD, then it cannot be solved by a broad class of linear estimators, even in the limit of infinite data. Taken together, our results provide a complete picture of the behavior of linear estimators for offline policy evaluation, unify previously disparate analyses of canonical algorithms, and provide significantly sharper notions of the underlying statistical complexity of offline policy evaluation.  ( 2 min )
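    For concreteness, the LSTD estimator analyzed in the paper solves a single linear system built from features of successive states; the sketch below (a two-state chain with one-hot features) is a minimal illustration.

```python
import numpy as np

def lstd(phi, phi_next, rewards, gamma=0.9, reg=1e-6):
    """LSTD value weights: solve A w = b with A = Phi^T (Phi - gamma Phi'),
    b = Phi^T r -- the fixed point of the projected Bellman equation."""
    A = phi.T @ (phi - gamma * phi_next)
    b = phi.T @ rewards
    return np.linalg.solve(A + reg * np.eye(phi.shape[1]), b)

# Toy usage: 2-state chain that deterministically swaps states; reward 1
# in state 0. The closed form is V = (I - gamma * P)^{-1} r.
rng = np.random.default_rng(0)
s = rng.integers(0, 2, 1000)
s_next = 1 - s
r = (s == 0).astype(float)
I = np.eye(2)
print(lstd(I[s], I[s_next], r))   # ~ [5.26, 4.74] for gamma = 0.9
```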
    Context-sensitive neocortical neurons transform the effectiveness and efficiency of neural information processing. (arXiv:2207.07338v4 [cs.NE] UPDATED)
    Deep learning (DL) can arguably achieve superhuman performance in many real-world domains but at the cost of unsustainably high energy levels. We hypothesise that the fundamental problem lies in its intrinsic dependence on simplified 'point' neurons that inherently maximise the transmission of information irrespective of whether the information is relevant to other neurons or for the long-term benefit of the whole network. This leads to unnecessary neural firing and conflicting messages to higher perceptual layers, which makes DL energy inefficient and hard to train. We can circumvent this limitation of DL by mimicking a context-sensitive two-point neocortical neuron that at one point receives input from diverse neurons as context to amplify and suppress the transmission of coherent and incoherent feedforward (FF) information received at the other point, respectively. We show that a deep network composed of such local processors seeks to maximise agreement between the active neurons, thus restricting the transmission of conflicting information to higher levels and reducing the amount of neural activity required to process large amounts of heterogeneous real-world data. Shown to be far more effective and efficient than current forms of DL, this two-point neuron study offers a step change in transforming the cellular foundations of deep network architectures.  ( 2 min )
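    To make the two-point idea tangible, here is one illustrative modulation rule (our assumption, not the paper's exact transfer function): the contextual field multiplies the feedforward drive up when the two agree in sign (coherent) and down when they conflict.

```python
import numpy as np

def two_point_unit(ff, context):
    """Context-modulated activation: amplify coherent feedforward (FF)
    input, suppress incoherent input. Illustrative rule only."""
    modulation = 1.0 + np.tanh(ff * context)   # >1 if coherent, <1 otherwise
    return ff * modulation

ff = np.array([1.0, 1.0, -1.0])
ctx = np.array([1.0, -1.0, -1.0])
print(two_point_unit(ff, ctx))   # amplified, suppressed, amplified
```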
    Large-Scale Retrieval for Reinforcement Learning. (arXiv:2206.05314v2 [cs.LG] UPDATED)
    Effective decision making involves flexibly relating past experiences and relevant contextual information to a novel situation. In deep reinforcement learning (RL), the dominant paradigm is for an agent to amortise information that helps decision making into its network weights via gradient descent on training losses. Here, we pursue an alternative approach in which agents can utilise large-scale context sensitive database lookups to support their parametric computations. This allows agents to directly learn in an end-to-end manner to utilise relevant information to inform their outputs. In addition, new information can be attended to by the agent, without retraining, by simply augmenting the retrieval dataset. We study this approach for offline RL in 9x9 Go, a challenging game for which the vast combinatorial state space privileges generalisation over direct matching to past experiences. We leverage fast, approximate nearest neighbor techniques in order to retrieve relevant data from a set of tens of millions of expert demonstration states. Attending to this information provides a significant boost to prediction accuracy and game-play performance over simply using these demonstrations as training trajectories, providing a compelling demonstration of the value of large-scale retrieval in offline RL agents.  ( 2 min )
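    The retrieval mechanism can be prototyped with any nearest-neighbor index. The sketch below uses exact search from scikit-learn purely for illustration (at tens of millions of states one would use approximate methods); the embedding sizes and the "value" payload attached to each demonstration state are assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

# Stand-in for embedded expert demonstration states and per-state payloads.
demo_keys = rng.normal(size=(100_000, 32)).astype(np.float32)
demo_values = rng.normal(size=(100_000, 8)).astype(np.float32)

index = NearestNeighbors(n_neighbors=4).fit(demo_keys)

def retrieve_and_augment(state_embedding):
    """Fetch the k nearest demonstration states and concatenate their
    payloads onto the agent's input. Swapping in a new demo set changes
    behavior with no retraining."""
    _, idx = index.kneighbors(state_embedding[None, :])
    neighbors = demo_values[idx[0]].reshape(-1)     # (k * 8,)
    return np.concatenate([state_embedding, neighbors])

print(retrieve_and_augment(rng.normal(size=32).astype(np.float32)).shape)
```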
    Agile Effort Estimation: Have We Solved the Problem Yet? Insights From A Replication Study. (arXiv:2201.05401v2 [cs.SE] UPDATED)
    In the last decade, several studies have explored automated techniques to estimate the effort of agile software development. We perform a close replication and extension of a seminal work proposing the use of Deep Learning for Agile Effort Estimation (namely Deep-SE), which has set the state-of-the-art since. Specifically, we replicate three of the original research questions aiming at investigating the effectiveness of Deep-SE for both within-project and cross-project effort estimation. We benchmark Deep-SE against three baselines (i.e., Random, Mean and Median effort estimators) and a previously proposed method to estimate agile software project development effort (dubbed TF/IDF-SVM), as done in the original study. To this end, we use the data from the original study and an additional dataset of 31,960 issues mined from TAWOS, as using more data allows us to strengthen the confidence in the results and to further mitigate external validity threats. The results of our replication show that Deep-SE outperforms the Median baseline estimator and TF/IDF-SVM in only very few cases with statistical significance (8/42 and 9/32 cases, respectively), thus calling previous findings on the efficacy of Deep-SE into question. The two additional RQs revealed that neither augmenting the training set nor pre-training Deep-SE leads to an improvement in its accuracy and convergence speed. These results suggest that using semantic similarity is not enough to differentiate user stories with respect to their story points; future work should explore new techniques and features to obtain accurate agile software development effort estimates.  ( 2 min )
    SuperTickets: Drawing Task-Agnostic Lottery Tickets from Supernets via Jointly Architecture Searching and Parameter Pruning. (arXiv:2207.03677v4 [cs.CV] UPDATED)
    Neural architecture search (NAS) has demonstrated amazing success in searching for efficient deep neural networks (DNNs) from a given supernet. In parallel, the lottery ticket hypothesis has shown that DNNs contain small subnetworks that can be trained from scratch to achieve a comparable or higher accuracy than the original DNNs. As such, it is currently a common practice to develop efficient DNNs via a pipeline of first search and then prune. Nevertheless, doing so often requires a search-train-prune-retrain process and thus incurs prohibitive computational cost. In this paper, we discover for the first time that both efficient DNNs and their lottery subnetworks (i.e., lottery tickets) can be directly identified from a supernet, which we term SuperTickets, via a two-in-one training scheme with jointly architecture searching and parameter pruning. Moreover, we develop a progressive and unified SuperTickets identification strategy that allows the connectivity of subnetworks to change during supernet training, achieving better accuracy and efficiency trade-offs than conventional sparse training. Finally, we evaluate whether such identified SuperTickets drawn from one task can transfer well to other tasks, validating their potential of handling multiple tasks simultaneously. Extensive experiments and ablation studies on three tasks and four benchmark datasets validate that our proposed SuperTickets achieve better accuracy and efficiency trade-offs than both typical NAS and pruning pipelines, with or without retraining. Codes and pretrained models are available at https://github.com/RICE-EIC/SuperTickets.  ( 2 min )
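    The parameter-pruning half of such a two-in-one scheme can be illustrated with global magnitude pruning applied during training, so that subnetwork connectivity keeps changing between pruning rounds. This is a generic sketch, not the SuperTickets implementation, and the architecture-search half is omitted.

        # Global magnitude pruning sketch: zero out the smallest-magnitude
        # fraction of all weights jointly across layers.
        import torch

        def global_magnitude_mask(params, sparsity):
            """Return binary masks keeping only the largest-magnitude weights."""
            all_weights = torch.cat([p.detach().abs().flatten() for p in params])
            k = int(sparsity * all_weights.numel())
            threshold = all_weights.kthvalue(k).values
            return [(p.detach().abs() > threshold).float() for p in params]

        layers = [torch.nn.Linear(64, 64), torch.nn.Linear(64, 10)]
        weights = [l.weight for l in layers]
        masks = global_magnitude_mask(weights, sparsity=0.8)
        with torch.no_grad():
            for w, m in zip(weights, masks):
                w.mul_(m)   # re-applied periodically, letting connectivity evolve
        print([float(m.mean()) for m in masks])   # fraction of weights kept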
    Machine Learning Construction: implications to cybersecurity. (arXiv:1906.10019v4 [cs.LG] UPDATED)
    Statistical learning is the process of estimating an unknown probabilistic input-output relationship of a system using a limited number of observations. A statistical learning machine (SLM) is the algorithm, function, model, or rule that learns such a process; and machine learning (ML) is the conventional name of this field. ML and its applications are ubiquitous in the modern world. Systems such as automatic target recognition (ATR) in military applications, computer aided diagnosis (CAD) in medical imaging, DNA microarrays in genomics, optical character recognition (OCR), speech recognition (SR), spam email filtering, and stock market prediction are a few examples and applications of ML; diverse fields but one theory. In particular, ML has gained a lot of attention in the field of cyberphysical security, especially in the last decade. It is of great importance to this field to design detection algorithms that have the capability of learning from security data to be able to hunt threats, achieve better monitoring, master the complexity of the threat intelligence feeds, and achieve timely remediation of security incidents. The field of ML can be decomposed into two basic subfields: \textit{construction} and \textit{assessment}. By \textit{construction} we mean designing or inventing an appropriate algorithm that learns from the input data and achieves good performance according to some optimality criterion. By \textit{assessment} we mean attributing some performance measures to the constructed ML algorithm, along with their estimators, to objectively assess it. \textit{Construction} and \textit{assessment} of a ML algorithm require familiarity with several other fields: probability, statistics, matrix theory, optimization, algorithms, and programming, among others.  ( 3 min )
    AlphaMLDigger: A Novel Machine Learning Solution to Explore Excess Return on Investment. (arXiv:2206.11072v2 [q-fin.CP] UPDATED)
    How to quickly and automatically mine effective information and serve investment decisions has attracted more and more attention from academia and industry, and new challenges have arisen with the global pandemic. This paper proposes a two-phase AlphaMLDigger that effectively finds excess returns in a highly volatile market. In phase 1, a deep sequential natural language processing (NLP) model is proposed to map Sina Microblog posts to market sentiment. In phase 2, the predicted market sentiment is combined with social network indicator features and stock market history features to predict stock movements with different machine learning models and optimizers. The results show that the ensemble models achieve an accuracy of 0.984 and significantly outperform the baseline model. In addition, we find that COVID-19 brings data shift to China's stock market.  ( 2 min )
    Differentiable Neural Architecture Search for Extremely Lightweight Image Super-Resolution. (arXiv:2105.03939v2 [eess.IV] UPDATED)
    Single Image Super-Resolution (SISR) tasks have achieved significant performance with deep neural networks. However, the large number of parameters in CNN-based methods for SISR tasks requires heavy computation. Although several efficient SISR models have been recently proposed, most are handcrafted and thus lack flexibility. In this work, we propose a novel differentiable Neural Architecture Search (NAS) approach on both the cell-level and network-level to search for lightweight SISR models. Specifically, the cell-level search space is designed based on an information distillation mechanism, focusing on the combinations of lightweight operations and aiming to build a more lightweight and accurate SR structure. The network-level search space is designed to consider the feature connections among the cells and aims to find which information flow benefits the cell most to boost the performance. Unlike the existing Reinforcement Learning (RL) or Evolutionary Algorithm (EA) based NAS methods for SISR tasks, our search pipeline is fully differentiable, and the lightweight SISR models can be efficiently searched on both the cell-level and network-level jointly on a single GPU. Experiments show that our methods can achieve state-of-the-art performance on the benchmark datasets in terms of PSNR, SSIM, and model complexity with merely 68G Multi-Adds for $\times 2$ and 18G Multi-Adds for $\times 4$ SR tasks.  ( 2 min )
    Causal Structure Learning: a Combinatorial Perspective. (arXiv:2206.01152v2 [stat.ME] UPDATED)
    In this review, we discuss approaches for learning causal structure from data, also called causal discovery. In particular, we focus on approaches for learning directed acyclic graphs (DAGs) and various generalizations which allow for some variables to be unobserved in the available data. We devote special attention to two fundamental combinatorial aspects of causal structure learning. First, we discuss the structure of the search space over causal graphs. Second, we discuss the structure of equivalence classes over causal graphs, i.e., sets of graphs which represent what can be learned from observational data alone, and how these equivalence classes can be refined by adding interventional data.  ( 2 min )
    CASSOCK: Viable Backdoor Attacks against DNN in The Wall of Source-Specific Backdoor Defences. (arXiv:2206.00145v2 [cs.CR] UPDATED)
    As a critical threat to deep neural networks (DNNs), backdoor attacks can be categorized into two types, i.e., source-agnostic backdoor attacks (SABAs) and source-specific backdoor attacks (SSBAs). Compared to traditional SABAs, SSBAs are more advanced in that they are stealthier at bypassing mainstream countermeasures that are effective against SABAs. Nonetheless, existing SSBAs suffer from two major limitations. First, they can hardly achieve a good trade-off between ASR (attack success rate) and FPR (false positive rate). Besides, they can be effectively detected by the state-of-the-art (SOTA) countermeasures (e.g., SCAn). To address the limitations above, we propose a new class of viable source-specific backdoor attacks, coined CASSOCK. Our key insight is that trigger design when creating poisoned data and cover data in SSBAs plays a crucial role in demonstrating a viable source-specific attack, which has not been considered by existing SSBAs. With this insight, we focus on trigger transparency and content when crafting triggers for the poisoned dataset, where a sample has an attacker-targeted label, and the cover dataset, where a sample has a ground-truth label. Specifically, we implement $CASSOCK_{Trans}$ and $CASSOCK_{Cont}$. The two are orthogonal yet complementary, and combining them generates a more powerful attack, called $CASSOCK_{Comp}$, with further improved attack performance and stealthiness. We perform a comprehensive evaluation of the three $CASSOCK$-based attacks on four popular datasets and three SOTA defenses. Compared with a representative SSBA as a baseline ($SSBA_{Base}$), $CASSOCK$-based attacks significantly advance the attack performance, i.e., higher ASR and lower FPR with comparable CDA (clean data accuracy). Besides, $CASSOCK$-based attacks effectively bypass the SOTA defenses, whereas $SSBA_{Base}$ cannot.  ( 2 min )
    The Multimarginal Optimal Transport Formulation of Adversarial Multiclass Classification. (arXiv:2204.12676v2 [cs.LG] UPDATED)
    We study a family of adversarial multiclass classification problems and provide equivalent reformulations in terms of: 1) a family of generalized barycenter problems introduced in the paper and 2) a family of multimarginal optimal transport problems where the number of marginals is equal to the number of classes in the original classification problem. These new theoretical results reveal a rich geometric structure of adversarial learning problems in multiclass classification and extend recent results restricted to the binary classification setting. A direct computational implication of our results is that by solving either the barycenter problem and its dual, or the MOT problem and its dual, we can recover the optimal robust classification rule and the optimal adversarial strategy for the original adversarial problem. Examples with synthetic and real data illustrate our results.  ( 2 min )
    Robust Bayesian Inference for Simulator-based Models via the MMD Posterior Bootstrap. (arXiv:2202.04744v3 [stat.ME] UPDATED)
    Simulator-based models are models for which the likelihood is intractable but simulation of synthetic data is possible. They are often used to describe complex real-world phenomena, and as such can often be misspecified in practice. Unfortunately, existing Bayesian approaches for simulators are known to perform poorly in those cases. In this paper, we propose a novel algorithm based on the posterior bootstrap and maximum mean discrepancy estimators. This leads to a highly-parallelisable Bayesian inference algorithm with strong robustness properties. This is demonstrated through an in-depth theoretical study which includes generalisation bounds and proofs of frequentist consistency and robustness of our posterior. The approach is then assessed on a range of examples including a g-and-k distribution and a toggle-switch model.  ( 2 min )
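    For readers unfamiliar with the discrepancy underlying the method, the following is a minimal unbiased estimator of the squared maximum mean discrepancy (MMD) with an RBF kernel. The fixed bandwidth and synthetic data are assumptions for illustration, not choices from the paper.

        # Unbiased estimator of squared MMD between two samples.
        import numpy as np

        def rbf(a, b, bandwidth):
            d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
            return np.exp(-d2 / (2 * bandwidth ** 2))

        def mmd2_unbiased(x, y, bandwidth=1.0):
            kxx = rbf(x, x, bandwidth)
            kyy = rbf(y, y, bandwidth)
            kxy = rbf(x, y, bandwidth)
            n, m = len(x), len(y)
            # Drop diagonal terms so the within-sample sums are unbiased.
            term_x = (kxx.sum() - np.trace(kxx)) / (n * (n - 1))
            term_y = (kyy.sum() - np.trace(kyy)) / (m * (m - 1))
            return term_x + term_y - 2 * kxy.mean()

        rng = np.random.default_rng(0)
        x = rng.normal(0.0, 1.0, size=(200, 1))   # "observed" data
        y = rng.normal(0.5, 1.0, size=(200, 1))   # "simulated" data
        print(mmd2_unbiased(x, y))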
    Towards Faithful and Consistent Explanations for Graph Neural Networks. (arXiv:2205.13733v2 [cs.LG] UPDATED)
    Uncovering rationales behind predictions of graph neural networks (GNNs) has received increasing attention over recent years. Instance-level GNN explanation aims to discover critical input elements, like nodes or edges, that the target GNN relies upon for making predictions. Though various algorithms have been proposed, most of them formalize this task by searching for the minimal subgraph which can preserve original predictions. However, an inductive bias is deep-rooted in this framework: several subgraphs can result in the same or similar outputs as the original graphs. Consequently, they risk providing spurious explanations and failing to provide consistent explanations. Applying them to explain weakly-performing GNNs would further amplify these issues. To address this problem, we theoretically examine the predictions of GNNs from the causality perspective. Two typical reasons for spurious explanations are identified: the confounding effect of latent variables like distribution shift, and causal factors distinct from the original input. Observing that both confounding effects and diverse causal rationales are encoded in internal representations, we propose a simple yet effective countermeasure by aligning embeddings. Concretely, concerning potential shifts in the high-dimensional space, we design a distribution-aware alignment algorithm based on anchors. This new objective is easy to compute and can be incorporated into existing techniques with no or little effort. Theoretical analysis shows that it is in effect optimizing a more faithful explanation objective by design, which further justifies the proposed approach.  ( 2 min )
    What Do Deep Neural Networks Find in Disordered Structures of Glasses?. (arXiv:2208.00349v3 [cond-mat.dis-nn] UPDATED)
    Glass transitions are widely observed in various types of soft matter systems. However, the physical mechanism of these transitions remains elusive, despite years of ambitious research. In particular, an important unanswered question is whether the glass transition is accompanied by a divergence of the correlation lengths of the characteristic static structures. In this study, we develop a deep-neural-network-based method that is used to extract the characteristic local meso-structures solely from instantaneous particle configurations without any information about the dynamics. We first train a neural network to classify configurations of liquids and glasses correctly. Then, we obtain the characteristic structures by quantifying the grounds for the decisions made by the network using Gradient-weighted Class Activation Mapping (Grad-CAM). We considered two qualitatively different glass-forming binary systems, and through comparisons with several established structural indicators, we demonstrate that our system can be used to identify characteristic structures that depend on the details of the systems. Moreover, the extracted structures are remarkably correlated with the nonequilibrium aging dynamics in thermal fluctuations.  ( 2 min )
    Gaussian Mixture Reduction with Composite Transportation Divergence. (arXiv:2002.08410v3 [stat.ML] UPDATED)
    Gaussian mixtures can approximate almost any smooth density function and are used to simplify downstream inference tasks. As such, they are widely used in applications such as density estimation, belief propagation, and Bayesian filtering. In these applications, a finite Gaussian mixture provides an initial approximation to density functions that are updated recursively. A challenge in these recursions is that the order of the Gaussian mixture increases exponentially, and the inference quickly becomes intractable. To overcome this difficulty, Gaussian mixture reduction, which approximates a high order Gaussian mixture by one with a lower order, can be used. Existing methods such as the clustering-based approaches are renowned for their satisfactory performance and computational efficiency. However, their convergence and optimization targets are unknown. We propose a novel optimization-based Gaussian mixture reduction method. We develop a majorization-minimization algorithm for its numerical computation and establish its theoretical convergence under general conditions. We show that many existing clustering-based methods are special cases of ours, thus bridging the gap between optimization-based and clustering-based methods. The unified framework allows users to choose the most suitable cost function to achieve superior performance in their specific application. We demonstrate the efficiency and effectiveness of the proposed method through extensive empirical experiments.
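    The moment-matching merge at the heart of clustering-based Gaussian mixture reduction can be sketched as follows: components assigned to one cluster are collapsed into the single Gaussian that preserves the cluster's total weight, mean, and covariance. The clustering step and the paper's composite transportation divergence are omitted; this illustrates only the standard merge.

        # Moment-matching merge of Gaussian mixture components.
        import numpy as np

        def merge_components(weights, means, covs):
            """Collapse a set of Gaussian components into one Gaussian."""
            w = weights.sum()
            mu = (weights[:, None] * means).sum(0) / w
            diff = means - mu
            # Covariance = weighted within-component covariance plus spread of means.
            sigma = sum(
                wi * (ci + np.outer(di, di)) for wi, ci, di in zip(weights, covs, diff)
            ) / w
            return w, mu, sigma

        weights = np.array([0.3, 0.7])
        means = np.array([[0.0, 0.0], [1.0, 1.0]])
        covs = np.array([np.eye(2), 2 * np.eye(2)])
        w, mu, sigma = merge_components(weights, means, covs)
        print(w, mu, sigma, sep="\n")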
    Multi-task Joint Strategies of Self-supervised Representation Learning on Biomedical Networks for Drug Discovery. (arXiv:2201.04437v2 [cs.LG] UPDATED)
    Self-supervised representation learning (SSL) on biomedical networks provides new opportunities for drug discovery. However, how to effectively combine multiple SSL models remains challenging and has rarely been explored. We therefore propose multi-task joint strategies of self-supervised representation learning on biomedical networks for drug discovery, named MSSL2drug. We design six basic SSL tasks inspired by various modality features, including structures, semantics, and attributes in heterogeneous biomedical networks. Importantly, fifteen combinations of multiple tasks are evaluated by a graph attention-based multi-task adversarial learning framework in two drug discovery scenarios. The results suggest two important findings. (1) Combinations of multimodal tasks achieve the best performance compared to other multi-task joint models. (2) Local-global combination models yield higher performance than random two-task combinations covering the same number of modalities. We therefore conjecture that the multimodal and local-global combination strategies can be treated as guidelines for multi-task SSL in drug discovery.
    An Upper Bound for the Distribution Overlap Index and Its Applications. (arXiv:2212.08701v1 [cs.LG])
    This paper proposes an easy-to-compute upper bound for the overlap index between two probability distributions without requiring any knowledge of the distribution models. The computation of our bound is time-efficient and memory-efficient and only requires finite samples. The proposed bound shows its value in one-class classification and domain shift analysis. Specifically, in one-class classification, we build a novel one-class classifier by converting the bound into a confidence score function. Unlike most one-class classifiers, the training process is not needed for our classifier. Additionally, the experimental results show that our classifier can be accurate with only a small number of in-class samples and outperforms many state-of-the-art methods on various datasets in different one-class classification scenarios. In domain shift analysis, we propose a theorem based on our bound. The theorem is useful in detecting the existence of domain shift and inferring data information. The detection and inference processes are both computation-efficient and memory-efficient. Our work shows significant promise toward broadening the applications of overlap-based metrics.
    Addressing Data Heterogeneity in Decentralized Learning via Topological Pre-processing. (arXiv:2212.08743v1 [cs.LG])
    Recently, local peer topology has been shown to influence the overall convergence of decentralized learning (DL) graphs in the presence of data heterogeneity. In this paper, we demonstrate the advantages of constructing a proxy-based locally heterogeneous DL topology to enhance convergence and maintain data privacy. In particular, we propose a novel peer clumping strategy to efficiently cluster peers before arranging them in a final training graph. By showing how locally heterogeneous graphs outperform locally homogeneous graphs of similar size and from the same global data distribution, we present a strong case for topological pre-processing. Moreover, we demonstrate the scalability of our approach by showing how the proposed topological pre-processing overhead remains small in large graphs while the performance gains get even more pronounced. Furthermore, we show the robustness of our approach in the presence of network partitions.
    Context-dependent Explainability and Contestability for Trustworthy Medical Artificial Intelligence: Misclassification Identification of Morbidity Recognition Models in Preterm Infants. (arXiv:2212.08821v1 [cs.AI])
    Although machine learning (ML) models of AI achieve high performances in medicine, they are not free of errors. Empowering clinicians to identify incorrect model recommendations is crucial for engendering trust in medical AI. Explainable AI (XAI) aims to address this requirement by clarifying AI reasoning to support the end users. Several studies on biomedical imaging have achieved promising results recently. Nevertheless, solutions for models using tabular data are not yet sufficient to meet the requirements of clinicians. This paper proposes a methodology to support clinicians in identifying failures of ML models trained with tabular data. We built our methodology on three main pillars: decomposing the feature set by leveraging clinical context latent space, assessing the clinical association of global explanations, and Latent Space Similarity (LSS) based local explanations. We demonstrated our methodology on ML-based recognition of preterm infant morbidities caused by infection. The risk of mortality, lifelong disability, and antibiotic resistance due to model failures was an open research question in this domain. Using our approach, we identified misclassification cases of two models. By contextualizing local explanations, our solution provides clinicians with actionable insights to support their autonomy for informed final decisions.
    MeSH Suggester: A Library and System for MeSH Term Suggestion for Systematic Review Boolean Query Construction. (arXiv:2212.09018v1 [cs.IR])
    Boolean query construction is often critical for medical systematic review literature search. To create an effective Boolean query, systematic review researchers typically spend weeks coming up with effective query terms and combinations. One challenge in creating an effective systematic review Boolean query is the selection of effective MeSH Terms to include in the query. In our previous work, we created neural MeSH term suggestion methods and compared them to state-of-the-art MeSH term suggestion methods. We found neural MeSH term suggestion methods to be highly effective. In this demonstration, we build upon our previous work by creating (1) a Web-based MeSH term suggestion prototype system that allows users to obtain suggestions from a number of underlying methods and (2) a Python library that implements our and others' MeSH term suggestion methods and is aimed at researchers who want to further investigate, create or deploy such methods. We describe the architecture of the web-based system and how to use it for the MeSH term suggestion task. For the Python library, we describe how it can be used for advancing further research and experimentation, and we validate the results of the methods contained in the library on standard datasets. Our web-based prototype system is available at this http URL, while our Python library is at https://github.com/ielab/meshsuggestlib.
    AutoSlicer: Scalable Automated Data Slicing for ML Model Analysis. (arXiv:2212.09032v1 [cs.LG])
    Automated slicing aims to identify subsets of evaluation data where a trained model performs anomalously. This is an important problem for machine learning pipelines in production since it plays a key role in model debugging and comparison, as well as the diagnosis of fairness issues. Scalability has become a critical requirement for any automated slicing system due to the large search space of possible slices and the growing scale of data. We present AutoSlicer, a scalable system that searches for problematic slices through distributed metric computation and hypothesis testing. We develop an efficient strategy that reduces the search space through pruning and prioritization. In the experiments, we show that our search strategy finds most of the anomalous slices by inspecting a small portion of the search space.
    Synthesis and Evaluation of a Domain-specific Large Data Set for Dungeons & Dragons. (arXiv:2212.09080v1 [cs.CL])
    This paper introduces the Forgotten Realms Wiki (FRW) data set and domain specific natural language generation using FRW along with related analyses. Forgotten Realms is the de-facto default setting of the popular open ended tabletop fantasy role playing game, Dungeons & Dragons. The data set was extracted from the Forgotten Realms Fandom wiki, which consists of more than 45,200 articles. The FRW data set comprises 11 sub-data sets in a number of formats: raw plain text, plain text annotated by article title, directed link graphs, wiki info-boxes annotated by the wiki article title, a Poincaré embedding of the first link graph, and multiple Word2Vec and Doc2Vec models of the corpus. This is the first data set of this size for the Dungeons & Dragons domain. We then present a pairwise similarity comparison benchmark which utilizes similarity measures. In addition, we perform D&D domain specific natural language generation using the corpus and evaluate the named entity classification with respect to the lore of Forgotten Realms.
    Online Lewis Weight Sampling. (arXiv:2207.08268v3 [cs.DS] UPDATED)
    The seminal work of Cohen and Peng introduced Lewis weight sampling to the theoretical computer science community, yielding fast row sampling algorithms for approximating $d$-dimensional subspaces of $\ell_p$ up to $(1+\epsilon)$ error. Several works have extended this important primitive to other settings, including the online coreset and sliding window models. However, these results are only for $p\in\{1,2\}$, and results for $p=1$ require a suboptimal $\tilde O(d^2/\epsilon^2)$ samples. In this work, we design the first nearly optimal $\ell_p$ subspace embeddings for all $p\in(0,\infty)$ in the online coreset and sliding window models. In both models, our algorithms store $\tilde O(d^{1\lor(p/2)}/\epsilon^2)$ rows. This answers a substantial generalization of the main open question of [BDMMUWZ2020], and gives the first results for all $p\notin\{1,2\}$. Towards our result, we give the first analysis of ``one-shot'' Lewis weight sampling, in which rows are sampled proportionally to their Lewis weights, with sample complexity $\tilde O(d^{p/2}/\epsilon^2)$ for $p>2$. Previously, this scheme was only known to have sample complexity $\tilde O(d^{p/2}/\epsilon^5)$, whereas $\tilde O(d^{p/2}/\epsilon^2)$ is known if a more sophisticated recursive sampling is used. The recursive sampling cannot be implemented online, thus necessitating an analysis of one-shot Lewis weight sampling. Our analysis uses a novel connection to online numerical linear algebra. As an application, we obtain the first one-pass streaming coreset algorithms for $(1+\epsilon)$ approximation of important generalized linear models, such as logistic regression and $p$-probit regression. Our upper bounds are parameterized by a complexity parameter $\mu$ introduced by [MSSW2018], and we show the first lower bounds demonstrating that a linear dependence on $\mu$ is necessary.
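    For the special case $p=2$, Lewis weights coincide with statistical leverage scores, so the one-shot row-sampling scheme analyzed above admits a short sketch; general $p$ requires a fixed-point computation of the Lewis weights, which is omitted here. Data and sample sizes are illustrative.

        # One-shot row sampling proportional to leverage scores (p = 2 case).
        import numpy as np

        def leverage_scores(A):
            # Row i's leverage is the squared norm of row i of Q, where A = QR.
            q, _ = np.linalg.qr(A)
            return (q ** 2).sum(axis=1)

        def one_shot_sample(A, m, rng):
            tau = leverage_scores(A)
            probs = np.minimum(1.0, m * tau / tau.sum())
            keep = rng.random(len(A)) < probs
            # Rescale kept rows so the sampled matrix is unbiased for A^T A.
            return A[keep] / np.sqrt(probs[keep])[:, None]

        rng = np.random.default_rng(0)
        A = rng.normal(size=(5000, 10))
        S = one_shot_sample(A, m=200, rng=rng)
        print(S.shape, np.linalg.norm(S.T @ S - A.T @ A) / np.linalg.norm(A.T @ A))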
    A Permutation-Free Kernel Independence Test. (arXiv:2212.09108v1 [stat.ME])
    In nonparametric independence testing, we observe i.i.d. data $\{(X_i,Y_i)\}_{i=1}^n$, where $X \in \mathcal{X}, Y \in \mathcal{Y}$ lie in any general spaces, and we wish to test the null that $X$ is independent of $Y$. Modern test statistics such as the kernel Hilbert-Schmidt Independence Criterion (HSIC) and Distance Covariance (dCov) have intractable null distributions due to the degeneracy of the underlying U-statistics. Thus, in practice, one often resorts to using permutation testing, which provides a nonasymptotic guarantee at the expense of recalculating the quadratic-time statistics (say) a few hundred times. This paper provides a simple but nontrivial modification of HSIC and dCov (called xHSIC and xdCov, pronounced ``cross'' HSIC/dCov) so that they have a limiting Gaussian distribution under the null, and thus do not require permutations. This requires building on the newly developed theory of cross U-statistics by Kim and Ramdas (2020), and in particular developing several nontrivial extensions of the theory in Shekhar et al. (2022), which developed an analogous permutation-free kernel two-sample test. We show that our new tests, like the originals, are consistent against fixed alternatives, and minimax rate optimal against smooth local alternatives. Numerical simulations demonstrate that compared to the full dCov or HSIC, our variants have the same power up to a $\sqrt 2$ factor, giving practitioners a new option for large problems or data-analysis pipelines where computation, not sample size, could be the bottleneck.
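    As background, the classical (biased) V-statistic estimator of HSIC that xHSIC modifies looks as follows; the cross U-statistic construction itself is not reproduced in this sketch, and the bandwidth is an assumed constant.

        # Classical HSIC estimator: trace(K H L H) / n^2 with centering matrix H.
        import numpy as np

        def rbf_gram(x, bandwidth=1.0):
            d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)
            return np.exp(-d2 / (2 * bandwidth ** 2))

        def hsic(x, y, bandwidth=1.0):
            n = len(x)
            h = np.eye(n) - np.ones((n, n)) / n     # centering matrix
            kx = rbf_gram(x, bandwidth)
            ky = rbf_gram(y, bandwidth)
            return np.trace(kx @ h @ ky @ h) / n ** 2

        rng = np.random.default_rng(0)
        x = rng.normal(size=(300, 1))
        y = x + 0.5 * rng.normal(size=(300, 1))     # dependent pair
        print(hsic(x, y), hsic(x, rng.normal(size=(300, 1))))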
    Faithful Heteroscedastic Regression with Neural Networks. (arXiv:2212.09184v1 [cs.LG])
    Heteroscedastic regression models a Gaussian variable's mean and variance as a function of covariates. Parametric methods that employ neural networks for these parameter maps can capture complex relationships in the data. Yet, optimizing network parameters via log likelihood gradients can yield suboptimal mean and uncalibrated variance estimates. Current solutions side-step this optimization problem with surrogate objectives or Bayesian treatments. Instead, we make two simple modifications to optimization. Notably, their combination produces a heteroscedastic model with mean estimates that are provably as accurate as those from its homoscedastic counterpart (i.e., fitting the mean under squared error loss). For a wide variety of network and task complexities, we find that mean estimates from existing heteroscedastic solutions can be significantly less accurate than those from an equivalently expressive mean-only model. Our approach provably retains the accuracy of an equally flexible mean-only model while also offering best-in-class variance calibration. Lastly, we show how to leverage our method to recover the underlying heteroscedastic noise variance.
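    The standard heteroscedastic setup whose optimization pathologies the paper addresses is a network with separate mean and log-variance heads trained by Gaussian negative log likelihood, as in the sketch below; the paper's two modifications to optimization are not reproduced here, and all sizes are illustrative.

        # Baseline heteroscedastic network trained by Gaussian NLL.
        import torch
        import torch.nn as nn

        class HeteroscedasticNet(nn.Module):
            def __init__(self, d_in, d_hidden=64):
                super().__init__()
                self.trunk = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU())
                self.mean_head = nn.Linear(d_hidden, 1)
                self.logvar_head = nn.Linear(d_hidden, 1)

            def forward(self, x):
                h = self.trunk(x)
                return self.mean_head(h), self.logvar_head(h)

        def gaussian_nll(mean, logvar, y):
            # Negative log likelihood of y under N(mean, exp(logvar)), up to a constant.
            return (0.5 * (logvar + (y - mean) ** 2 / logvar.exp())).mean()

        net = HeteroscedasticNet(d_in=3)
        opt = torch.optim.Adam(net.parameters(), lr=1e-3)
        x, y = torch.randn(128, 3), torch.randn(128, 1)
        opt.zero_grad()
        mean, logvar = net(x)
        loss = gaussian_nll(mean, logvar, y)
        loss.backward()
        opt.step()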
    Wheel Impact Test by Deep Learning: Prediction of Location and Magnitude of Maximum Stress. (arXiv:2210.01126v2 [cs.LG] UPDATED)
    For ensuring vehicle safety, the impact performance of wheels during wheel development must be ensured through a wheel impact test. However, manufacturing and testing a real wheel requires significant time and money because developing an optimal wheel design requires numerous iterative processes to modify the wheel design and verify the safety performance. Accordingly, wheel impact tests have been replaced by computer simulations such as finite element analysis (FEA); however, these still incur high computational costs for modeling and analysis, and require FEA experts. In this study, we present an aluminum road wheel impact performance prediction model based on deep learning that replaces computationally expensive and time-consuming 3D FEA. For this purpose, 2D disk-view wheel image data, 3D wheel voxel data, and the barrier mass values used for the wheel impact test were utilized as the inputs to predict the magnitude of the maximum von Mises stress, the corresponding location, and the stress distribution of the 2D disk-view. The input data were first compressed into a latent space with a 3D convolutional variational autoencoder (cVAE) and a 2D convolutional autoencoder (cAE). Subsequently, fully connected layers were used to predict the impact performance, and a decoder was used to predict the stress distribution heatmap of the 2D disk-view. The proposed model can replace the impact test in the early wheel-development stage by predicting the impact performance in real-time and can be used without domain knowledge. The time required for the wheel development process can be reduced by using this mechanism.
    Effect of Pre-Training Scale on Intra- and Inter-Domain Full and Few-Shot Transfer Learning for Natural and Medical X-Ray Chest Images. (arXiv:2106.00116v4 [cs.LG] UPDATED)
    Increasing model, data and compute budget scale in pre-training has been shown to strongly improve model generalization and transfer learning in a vast line of work done in language modeling and natural image recognition. However, most studies on the positive effect of larger scale were done in an in-domain setting, with source and target data being in close proximity. To study the effect of larger scale for both in-domain and out-of-domain settings when performing full and few-shot transfer, we combine here for the first time large, openly available medical X-Ray chest imaging datasets to reach a scale for the medical imaging domain comparable to ImageNet-1k, routinely used for pre-training in the natural image domain. We then conduct supervised pre-training, while varying network size and source data scale and domain, being either large natural (ImageNet-1k/21k) or large medical chest X-Ray datasets, and transfer pre-trained models to different natural or medical targets. We observe strong improvement due to larger pre-training scale for intra-domain natural-natural and medical-medical transfer. For inter-domain natural-medical transfer, we find improvements due to larger pre-training scale on larger X-Ray targets in the full-shot regime, while for smaller targets and for the few-shot regime the improvement is not visible. Remarkably, large networks pre-trained on the very large natural ImageNet-21k are as good or better than networks pre-trained on the largest available medical X-Ray data when performing transfer to large X-Ray targets. We conclude that substantially increasing model and generic, medical domain-agnostic natural image source data scale in pre-training can enable high quality out-of-domain transfer to medical domain-specific targets, removing the dependency on large medical domain-specific source data that is often not available in practice.
    Nish: A Novel Negative Stimulated Hybrid Activation Function. (arXiv:2210.09083v3 [cs.LG] UPDATED)
    An activation function has a significant impact on the efficiency and robustness of neural networks. As an alternative, we develop a cutting-edge non-monotonic activation function, the Negative Stimulated Hybrid Activation Function (Nish). It behaves like a Rectified Linear Unit (ReLU) function for the positive region and like a sinus-sigmoidal function for the negative region. In other words, it incorporates a sigmoid and a sine function, gaining new dynamics over the classical ReLU. We analyzed the consistency of Nish for different combinations of essential networks and the most common activation functions on several popular benchmarks. From the experimental results, we report that the accuracy rates achieved by Nish are slightly better than those achieved by Mish in classification.
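    The abstract specifies ReLU behaviour on the positive region and a sinus-sigmoidal negative region, but not the exact formula; the sketch below therefore uses an assumed sigmoid-times-sine negative branch purely for illustration, which may differ from the function defined in the paper.

        # Illustrative hybrid activation: ReLU for x > 0, an assumed
        # sigmoid-times-sine form for x <= 0 (not necessarily the paper's Nish).
        import numpy as np

        def nish_like(x):
            positive = np.maximum(x, 0.0)                            # ReLU region
            sigmoid = 1.0 / (1.0 + np.exp(-x))
            negative = np.where(x < 0.0, sigmoid * np.sin(x), 0.0)   # assumed form
            return positive + negative

        x = np.linspace(-5, 5, 11)
        print(nish_like(x))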
    On Noisy Evaluation in Federated Hyperparameter Tuning. (arXiv:2212.08930v1 [cs.LG])
    Hyperparameter tuning is critical to the success of federated learning applications. Unfortunately, appropriately selecting hyperparameters is challenging in federated networks. Issues of scale, privacy, and heterogeneity introduce noise in the tuning process and make it difficult to evaluate the performance of various hyperparameters. In this work, we perform the first systematic study on the effect of noisy evaluation in federated hyperparameter tuning. We first identify and rigorously explore key sources of noise, including client subsampling, data and systems heterogeneity, and data privacy. Surprisingly, our results indicate that even small amounts of noise can significantly impact tuning methods, reducing the performance of state-of-the-art approaches to that of naive baselines. To address noisy evaluation in such scenarios, we propose a simple and effective approach that leverages public proxy data to boost the evaluation signal. Our work establishes general challenges, baselines, and best practices for future work in federated hyperparameter tuning.
    Toward Data Heterogeneity of Federated Learning. (arXiv:2212.08944v1 [cs.LG])
    Federated learning is a popular paradigm for machine learning. Ideally, federated learning works best when all clients share a similar data distribution. However, this is not always the case in the real world. Therefore, the topic of federated learning on heterogeneous data has gained more and more attention from both academia and industry. In this project, we first conduct extensive experiments to show how data skew and quantity skew affect the performance of state-of-the-art federated learning algorithms. Then we propose a new algorithm, FedMix, which adjusts existing federated learning algorithms, and we show its performance. We find that existing state-of-the-art algorithms such as FedProx and FedNova do not show a significant improvement in all testing cases. But by testing the existing and new algorithms, it seems that tweaking the client side is more effective than tweaking the server side.
    Trustworthy Visual Analytics in Clinical Gait Analysis: A Case Study for Patients with Cerebral Palsy. (arXiv:2208.05232v3 [cs.HC] UPDATED)
    Three-dimensional clinical gait analysis is essential for selecting optimal treatment interventions for patients with cerebral palsy (CP), but generates a large amount of time series data. For the automated analysis of these data, machine learning approaches yield promising results. However, due to their black-box nature, such approaches are often mistrusted by clinicians. We propose gaitXplorer, a visual analytics approach for the classification of CP-related gait patterns that integrates Grad-CAM, a well-established explainable artificial intelligence algorithm, for explanations of machine learning classifications. Regions of high relevance for classification are highlighted in the interactive visual interface. The approach is evaluated in a case study with two clinical gait experts. They inspected the explanations for a sample of eight patients using the visual interface and expressed which relevance scores they found trustworthy and which they found suspicious. Overall, the clinicians gave positive feedback on the approach as it allowed them a better understanding of which regions in the data were relevant for the classification.
    Multi-Instance Partial-Label Learning: Towards Exploiting Dual Inexact Supervision. (arXiv:2212.08997v1 [cs.LG])
    Weakly supervised machine learning algorithms are able to learn from ambiguous samples or labels, e.g., multi-instance learning or partial-label learning. However, in some real-world tasks, each training sample is associated with not only multiple instances but also a candidate label set that contains one ground-truth label and some false positive labels. Specifically, at least one instance pertains to the ground-truth label while no instance belongs to the false positive labels. In this paper, we formalize such problems as multi-instance partial-label learning (MIPL). Existing multi-instance learning algorithms and partial-label learning algorithms are suboptimal for solving MIPL problems since the former fail to disambiguate a candidate label set, and the latter cannot handle a multi-instance bag. To address these issues, a tailored algorithm named MIPLGP, i.e., Multi-Instance Partial-Label learning with Gaussian Processes, is proposed. MIPLGP first assigns each instance a candidate label set in an augmented label space, then transforms the candidate label set into a logarithmic space to yield disambiguated and continuous labels via an exclusive disambiguation strategy, and finally induces a model based on Gaussian processes. Experimental results on various datasets validate that MIPLGP is superior to well-established multi-instance learning and partial-label learning algorithms for solving MIPL problems. Our code and datasets will be made publicly available.
    Graph Neural Networks are Inherently Good Generalizers: Insights by Bridging GNNs and MLPs. (arXiv:2212.09034v1 [cs.LG])
    Graph neural networks (GNNs), as the de-facto model class for representation learning on graphs, are built upon the multi-layer perceptron (MLP) architecture with additional message passing layers to allow features to flow across nodes. While conventional wisdom largely attributes the success of GNNs to their advanced expressivity for learning desired functions on nodes' ego-graphs, we conjecture that this is \emph{not} the main cause of GNNs' superiority in node prediction tasks. This paper pinpoints the major source of GNNs' performance gain as their intrinsic generalization capability, by introducing an intermediate model class dubbed P(ropagational)MLP, which is identical to a standard MLP in training but adopts a GNN's architecture in testing. Intriguingly, we observe that PMLPs consistently perform on par with (or even exceed) their GNN counterparts across ten benchmarks and different experimental settings, despite the fact that PMLPs share the same (trained) weights with a poorly-performing MLP. This critical finding opens a door to a brand new perspective for understanding the power of GNNs, and allows bridging GNNs and MLPs for dissecting their generalization behaviors. As an initial step to analyzing PMLP, we show that its essential difference from MLP in the infinite-width limit lies in the NTK feature map in the post-training stage. Moreover, though MLP and PMLP cannot extrapolate non-linear functions for extreme OOD data, PMLP has more freedom to generalize near the training support.
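    The PMLP construction admits a compact sketch: the same weights are evaluated as a plain MLP during training and with message passing interleaved at test time. The random graph, normalization, and layer sizes below are illustrative assumptions, not the paper's configuration.

        # Same weights, two forward passes: plain MLP (training) vs. PMLP (testing).
        import numpy as np

        rng = np.random.default_rng(0)
        n, d, h = 100, 16, 32
        W1, W2 = rng.normal(size=(d, h)) * 0.1, rng.normal(size=(h, 4)) * 0.1
        X = rng.normal(size=(n, d))
        A = (rng.random((n, n)) < 0.05).astype(float)
        A_hat = (A + np.eye(n)) / (A + np.eye(n)).sum(1, keepdims=True)  # row-normalized

        def mlp_forward(X):                    # used in training
            return np.maximum(X @ W1, 0) @ W2

        def pmlp_forward(X):                   # same weights, used in testing
            H = A_hat @ (X @ W1)               # propagate after each linear layer
            return A_hat @ (np.maximum(H, 0) @ W2)

        print(mlp_forward(X).shape, pmlp_forward(X).shape)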
    Physics-informed Neural Networks with Periodic Activation Functions for Solute Transport in Heterogeneous Porous Media. (arXiv:2212.08965v1 [cs.LG])
    Solute transport in porous media is relevant to a wide range of applications in hydrogeology, geothermal energy, underground CO2 storage, and a variety of chemical engineering systems. Due to the complexity of solute transport in heterogeneous porous media, traditional solvers require high resolution meshing and are therefore expensive computationally. This study explores the application of a mesh-free method based on deep learning to accelerate the simulation of solute transport. We employ Physics-informed Neural Networks (PiNN) to solve solute transport problems in homogeneous and heterogeneous porous media governed by the advection-dispersion equation. Unlike traditional neural networks that learn from large training datasets, PiNNs only leverage the strong form mathematical models to simultaneously solve for multiple dependent or independent field variables (e.g., pressure and solute concentration fields). In this study, we construct PiNN using a periodic activation function to better represent the complex physical signals (i.e., pressure) and their derivatives (i.e., velocity). Several case studies are designed with the intention of investigating the proposed PiNN's capability to handle different degrees of complexity. A manual hyperparameter tuning method is used to find the best PiNN architecture for each test case. Point-wise error and mean square error (MSE) measures are employed to assess the performance of PiNNs' predictions against the ground truth solutions obtained analytically or numerically using the finite element method. Our findings show that the predictions of PiNN are in good agreement with the ground truth solutions while reducing computational complexity and cost by, at least, three orders of magnitude.
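    A periodic activation of the kind the study employs can be sketched as a SIREN-style sine layer; the PDE residual loss that makes the network physics-informed is omitted, and the frequency scaling omega and layer sizes are assumed hyperparameters.

        # Sine-activated layers whose outputs stay smooth under differentiation,
        # which is what a PDE residual loss needs.
        import torch
        import torch.nn as nn

        class SineLayer(nn.Module):
            def __init__(self, d_in, d_out, omega=30.0):
                super().__init__()
                self.linear = nn.Linear(d_in, d_out)
                self.omega = omega   # frequency scaling, SIREN-style

            def forward(self, x):
                return torch.sin(self.omega * self.linear(x))

        model = nn.Sequential(
            SineLayer(2, 64),        # inputs: spatial coordinate and time
            SineLayer(64, 64),
            nn.Linear(64, 2),        # outputs: pressure and solute concentration
        )
        xt = torch.rand(256, 2, requires_grad=True)
        out = model(xt)
        # Automatic differentiation gives the derivatives a PDE residual would use.
        grad = torch.autograd.grad(out[:, 0].sum(), xt, create_graph=True)[0]
        print(out.shape, grad.shape)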
    Neural Rankers for Effective Screening Prioritisation in Medical Systematic Review Literature Search. (arXiv:2212.09017v1 [cs.IR])
    Medical systematic reviews typically require assessing all the documents retrieved by a search. The reason is two-fold: the task aims for ``total recall''; and documents retrieved using Boolean search are an unordered set, and thus it is unclear how an assessor could examine only a subset. Screening prioritisation is the process of ranking the (unordered) set of retrieved documents, allowing assessors to begin the downstream processes of the systematic review creation earlier, leading to earlier completion of the review, or even avoiding screening documents ranked least relevant. Screening prioritisation requires highly effective ranking methods. Pre-trained language models are state-of-the-art on many IR tasks but have yet to be applied to systematic review screening prioritisation. In this paper, we apply several pre-trained language models to the systematic review document ranking task, both directly and fine-tuned. An empirical analysis compares how neural methods fare against traditional methods for this task. We also investigate different types of document representations for neural methods and their impact on ranking performance. Our results show that BERT-based rankers outperform the current state-of-the-art screening prioritisation methods. However, BERT rankers and existing methods can actually be complementary, and thus further improvements may be achieved if they are used in conjunction.
    Predicting Citi Bike Demand Evolution Using Dynamic Graphs. (arXiv:2212.09175v1 [cs.LG])
    Bike sharing systems often suffer from poor capacity management as a result of variable demand. These bike sharing systems would benefit from models to predict demand in order to moderate the number of bikes stored at each station. In this paper, we apply a graph neural network model to predict bike demand in the New York City Citi Bike dataset.
    Unrolling SVT to obtain computationally efficient SVT for n-qubit quantum state tomography. (arXiv:2212.08852v1 [quant-ph])
    Quantum state tomography aims to estimate the state of a quantum mechanical system, described by a trace-one, Hermitian, positive semidefinite complex matrix, given a set of measurements of the state. Existing works focus on estimating the density matrix that represents the state using a compressive sensing approach, with fewer measurements than required for a tomographically complete set, under the assumption that the true state has low rank. One very popular method to estimate the state is the Singular Value Thresholding (SVT) algorithm. In this work, we present a machine learning approach to estimate the quantum state of n-qubit systems by unrolling the iterations of SVT, which we call Learned Quantum State Tomography (LQST). As merely unrolling SVT may not ensure that the output of the network meets the constraints required for a quantum state, we design and train a custom neural network whose architecture is inspired by the iterations of SVT, with additional layers to meet the required constraints. We show that our proposed LQST, with very few layers, reconstructs the density matrix with much better fidelity than the SVT algorithm, which takes many hundreds of iterations to converge. We also demonstrate the reconstruction of the quantum Bell state from an informationally incomplete set of noisy measurements.
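    The core SVT step that unrolling turns into network layers is soft-thresholding of singular values, sketched below on a synthetic low-rank matrix; the additional layers enforcing the trace-one, Hermitian, positive semidefinite constraints are not shown, and the threshold value is illustrative.

        # One singular value thresholding (SVT) step: soft-threshold the
        # singular values of the current estimate.
        import numpy as np

        def svt(M, tau):
            u, s, vt = np.linalg.svd(M, full_matrices=False)
            s_thr = np.maximum(s - tau, 0.0)        # soft-thresholding
            return u @ np.diag(s_thr) @ vt

        rng = np.random.default_rng(0)
        low_rank = rng.normal(size=(16, 2)) @ rng.normal(size=(2, 16))
        noisy = low_rank + 0.1 * rng.normal(size=(16, 16))
        estimate = svt(noisy, tau=0.5)
        print(np.linalg.matrix_rank(estimate, tol=1e-6))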
    Learning criteria going beyond the usual risk. (arXiv:2110.04996v2 [stat.ML] UPDATED)
    Virtually all machine learning tasks are characterized using some form of loss function, and "good performance" is typically stated in terms of a sufficiently small average loss, taken over the random draw of test data. While optimizing for performance on average is intuitive, convenient to analyze in theory, and easy to implement in practice, such a choice brings about trade-offs. In this work, we survey and introduce a wide variety of non-traditional criteria used to design and evaluate machine learning algorithms, place the classical paradigm within the proper historical context, and propose a view of learning problems which emphasizes the question of "what makes for a desirable loss distribution?" in place of tacit use of the expected loss.
    SkillFence: A Systems Approach to Practically Mitigating Voice-Based Confusion Attacks. (arXiv:2212.08738v1 [cs.CR])
    Voice assistants are deployed widely and provide useful functionality. However, recent work has shown that commercial systems like Amazon Alexa and Google Home are vulnerable to voice-based confusion attacks that exploit design issues. We propose a systems-oriented defense against this class of attacks and demonstrate its functionality for Amazon Alexa. We ensure that only the skills a user intends are executed in response to voice commands. Our key insight is that we can interpret a user's intentions by analyzing their activity on counterpart systems of the web and smartphones. For example, the Lyft ride-sharing Alexa skill has an Android app and a website. Our work shows how information from counterpart apps can help reduce ambiguities in the skill invocation process. We build SkillFence, a browser extension that existing voice assistant users can install to ensure that only legitimate skills run in response to their commands. Using real user data from MTurk (N = 116) and experimental trials involving synthetic and organic speech, we show that SkillFence provides a balance between usability and security by securing 90.83% of skills that a user will need with a false acceptance rate of 19.83%.
    MoDi: Unconditional Motion Synthesis from Diverse Data. (arXiv:2206.08010v3 [cs.GR] UPDATED)
    The emergence of neural networks has revolutionized the field of motion synthesis. Yet, learning to unconditionally synthesize motions from a given distribution remains challenging, especially when the motions are highly diverse. In this work, we present MoDi -- a generative model trained in an unsupervised setting from an extremely diverse, unstructured and unlabeled dataset. During inference, MoDi can synthesize high-quality, diverse motions. Despite the lack of any structure in the dataset, our model yields a well-behaved and highly structured latent space, which can be semantically clustered, constituting a strong motion prior that facilitates various applications including semantic editing and crowd simulation. In addition, we present an encoder that inverts real motions into MoDi's natural motion manifold, issuing solutions to various ill-posed challenges such as completion from prefix and spatial editing. Our qualitative and quantitative experiments achieve state-of-the-art results that outperform recent SOTA techniques. Code and trained models are available at https://sigal-raab.github.io/MoDi.
    Molecule optimization via multi-objective evolutionary in implicit chemical space. (arXiv:2212.08826v1 [q-bio.BM])
    Machine learning methods have been used to accelerate the molecule optimization process. However, efficient search for optimized molecules satisfying several properties with scarce labeled data remains a challenge for machine learning molecule optimization. In this study, we propose MOMO, a multi-objective molecule optimization framework that addresses the challenge by combining learning of chemical knowledge with Pareto-based multi-objective evolutionary search. To learn chemistry, it employs a self-supervised codec to construct an implicit chemical space and acquire a continuous representation of molecules. To explore the established chemical space, MOMO uses multi-objective evolution to comprehensively and efficiently search for similar molecules with multiple desirable properties. We demonstrate the high performance of MOMO on four multi-objective property and similarity optimization tasks, and illustrate the search capability of MOMO through case studies. Remarkably, our approach significantly outperforms previous approaches in optimizing three objectives simultaneously. The results show the optimization capability of MOMO, suggesting it can improve the success rate of lead molecule optimization.
    Multimodal CNN Networks for Brain Tumor Segmentation in MRI: A BraTS 2022 Challenge Solution. (arXiv:2212.09310v1 [eess.IV])
    Automatic segmentation is essential for the brain tumor diagnosis, disease prognosis, and follow-up therapy of patients with gliomas. Still, accurate detection of gliomas and their sub-regions in multimodal MRI is very challenging due to the variety of scanners and imaging protocols. Over the last years, the BraTS Challenge has provided a large number of multi-institutional MRI scans as a benchmark for glioma segmentation algorithms. This paper describes our contribution to the BraTS 2022 Continuous Evaluation challenge. We propose a new ensemble of multiple deep learning frameworks, namely DeepSeg, nnU-Net, and DeepSCAN, for automatic glioma boundary detection in pre-operative MRI. It is worth noting that our ensemble models took first place in the final evaluation on the BraTS testing dataset, with Dice scores of 0.9294, 0.8788, and 0.8803, and Hausdorff distances of 5.23, 13.54, and 12.05, for the whole tumor, tumor core, and enhancing tumor, respectively. Furthermore, the proposed ensemble method ranked first in the final ranking on another unseen test dataset, namely the Sub-Saharan Africa dataset, achieving mean Dice scores of 0.9737, 0.9593, and 0.9022, and HD95 of 2.66, 1.72, 3.32 for the whole tumor, tumor core, and enhancing tumor, respectively. The docker image for the winning submission is publicly available at (https://hub.docker.com/r/razeineldin/camed22).
    TCFimt: Temporal Counterfactual Forecasting from Individual Multiple Treatment Perspective. (arXiv:2212.08890v1 [cs.LG])
    Determining the causal effects of temporal multi-intervention assists decision-making. Restricted by time-varying bias, selection bias, and interactions of multiple interventions, the disentanglement and estimation of multiple treatment effects from individual temporal data is still rare. To tackle these challenges, we propose a comprehensive framework of temporal counterfactual forecasting from an individual multiple treatment perspective (TCFimt). TCFimt constructs adversarial tasks in a seq2seq framework to alleviate selection and time-varying bias and designs a contrastive learning-based block to decouple a mixed treatment effect into separated main treatment effects and causal interactions, which further improves estimation accuracy. In experiments on two real-world datasets from distinct fields, the proposed method shows satisfactory performance in predicting future outcomes with specific treatments and in choosing the optimal treatment type and timing, outperforming state-of-the-art methods.
    Omni-Training: Bridging Pre-Training and Meta-Training for Few-Shot Learning. (arXiv:2110.07510v3 [cs.LG] UPDATED)
    Few-shot learning aims to fast adapt a deep model from a few examples. While pre-training and meta-training can create deep models powerful for few-shot generalization, we find that pre-training and meta-training focus respectively on cross-domain transferability and cross-task transferability, which restricts their data efficiency in the entangled settings of domain shift and task shift. We thus propose the Omni-Training framework to seamlessly bridge pre-training and meta-training for data-efficient few-shot learning. Our first contribution is a tri-flow Omni-Net architecture. Besides the joint representation flow, Omni-Net introduces two parallel flows for pre-training and meta-training, responsible for improving domain transferability and task transferability respectively. Omni-Net further coordinates the parallel flows by routing their representations via the joint-flow, enabling knowledge transfer across flows. Our second contribution is the Omni-Loss, which introduces a self-distillation strategy separately on the pre-training and meta-training objectives for boosting knowledge transfer throughout different training stages. Omni-Training is a general framework that accommodates many existing algorithms. Evaluations justify that our single framework consistently and clearly outperforms the individual state-of-the-art methods on both cross-task and cross-domain settings in a variety of classification, regression and reinforcement learning problems.
    Disentangling Learnable and Memorizable Data via Contrastive Learning for Semantic Communications. (arXiv:2212.09071v1 [cs.LG])
    Achieving AI-native wireless networks is necessary for the operation of future 6G applications such as the metaverse. Nonetheless, current communication schemes are, at heart, a mere reconstruction process that lacks reasoning. One key solution that enables evolving wireless communication toward a human-like conversation is semantic communications. In this paper, a novel machine reasoning framework is proposed to pre-process and disentangle source data so as to make it semantic-ready. In particular, a novel contrastive learning framework is proposed, whereby instance and cluster discrimination are performed on the data. These two tasks increase the cohesiveness between data points mapping to semantically similar content elements and disentangle data points of semantically different content elements. Subsequently, the semantic deep clusters formed are ranked according to their level of confidence. The deep semantic clusters of highest confidence are considered learnable, semantic-rich data, i.e., data that can be used to build a language in a semantic communications system. The least confident ones are considered random, semantic-poor, memorizable data that must be transmitted classically. Our simulation results showcase the superiority of our contrastive learning approach in terms of semantic impact and minimalism. In fact, the length of the semantic representation achieved is minimized by 57.22% compared to vanilla semantic communication systems, thus achieving minimalist semantic representations.
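    For readers unfamiliar with the instance-discrimination half of such contrastive setups, here is a minimal sketch of the standard InfoNCE objective; the cluster-discrimination task and the confidence ranking described above are not shown, and all tensor shapes are illustrative assumptions.

```python
# Instance discrimination via InfoNCE: two augmented "views" z1/z2 of the same
# batch are pulled together, all other pairs are pushed apart.
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature        # (B, B) similarity matrix
    labels = torch.arange(z1.size(0))         # positives lie on the diagonal
    return F.cross_entropy(logits, labels)

loss = info_nce(torch.randn(32, 128), torch.randn(32, 128))
```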
    The Underlying Correlated Dynamics in Neural Training. (arXiv:2212.09040v1 [cs.LG])
    Training neural networks is a computationally intensive task. The significance of understanding and modeling the training dynamics is growing as increasingly larger networks are being trained. We propose in this work a model based on the correlation of the parameters' dynamics, which dramatically reduces dimensionality. We refer to our algorithm as \emph{correlation mode decomposition} (CMD). It splits the parameter space into groups of parameters (modes) which behave in a highly correlated manner through the epochs. We achieve a remarkable dimensionality reduction with this approach, where networks like ResNet-18, transformers and GANs, containing millions of parameters, can be modeled well using just a few modes. We observe that the typical time profile of each mode is spread throughout the network, across all layers. Moreover, our model induces regularization which yields better generalization on the test set. This representation enhances the understanding of the underlying training dynamics and can pave the way for designing better acceleration techniques.
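    To make the mode-grouping idea concrete, the following sketch clusters logged parameter trajectories by correlation; the logging format and the spherical k-means clustering choice are our assumptions, not necessarily the paper's exact procedure.

```python
# Group parameters whose trajectories across epochs are highly correlated.
import numpy as np

def correlation_modes(trajectories: np.ndarray, n_modes: int = 4, seed: int = 0):
    """trajectories: (P, T) array, one row per parameter across T epochs.
    Returns an integer mode assignment per parameter."""
    # Normalize each trajectory so correlation reduces to a dot product.
    traj = trajectories - trajectories.mean(axis=1, keepdims=True)
    traj = traj / (np.linalg.norm(traj, axis=1, keepdims=True) + 1e-12)
    # Spherical k-means on normalized trajectories groups correlated parameters.
    rng = np.random.default_rng(seed)
    centers = traj[rng.choice(len(traj), n_modes, replace=False)]
    for _ in range(20):
        assign = np.argmax(traj @ centers.T, axis=1)  # cosine similarity
        for k in range(n_modes):
            if np.any(assign == k):
                c = traj[assign == k].mean(axis=0)
                centers[k] = c / (np.linalg.norm(c) + 1e-12)
    return assign
```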
    Influence-Based Mini-Batching for Graph Neural Networks. (arXiv:2212.09083v1 [cs.LG])
    Using graph neural networks for large graphs is challenging, since there is no clear way of constructing mini-batches. To solve this, previous methods have relied on sampling or graph clustering. While these approaches often lead to good training convergence, they introduce significant overhead due to expensive random data accesses and perform poorly during inference. In this work, we instead focus on model behavior during inference. We theoretically model batch construction by maximizing the influence score of nodes on the outputs. This formulation leads to optimal approximation of the output when we do not have knowledge of the trained model. We call the resulting method influence-based mini-batching (IBMB). IBMB accelerates inference by up to 130x compared to previous methods that reach similar accuracy. Remarkably, with adaptive optimization and the right training schedule, IBMB can also substantially accelerate training, thanks to precomputed batches and consecutive memory accesses. This results in up to 18x faster training per epoch and up to 17x faster convergence per runtime compared to previous methods.
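    One way to picture influence-based batch construction: rank auxiliary nodes by personalized PageRank mass from the output nodes and keep the top-k. This is our reading of the idea, not the paper's exact algorithm, and it uses networkx for brevity.

```python
# Build a batch around a set of output nodes using personalized PageRank as a
# stand-in influence score.
import networkx as nx

def influence_batch(G: nx.Graph, output_nodes: list, k: int = 64) -> list:
    targets = set(output_nodes)
    personalization = {n: (1.0 if n in targets else 0.0) for n in G}
    ppr = nx.pagerank(G, alpha=0.85, personalization=personalization)
    ranked = sorted(ppr, key=ppr.get, reverse=True)
    # Keep the output nodes plus the k most influential auxiliary nodes.
    return list(output_nodes) + [n for n in ranked if n not in targets][:k]

G = nx.karate_club_graph()
print(influence_batch(G, output_nodes=[0, 33], k=8))
```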
    Empirical Analysis of AI-based Energy Management in Electric Vehicles: A Case Study on Reinforcement Learning. (arXiv:2212.09154v1 [cs.AI])
    Reinforcement learning-based (RL-based) energy management strategies (EMS) are considered a promising solution for the energy management of electric vehicles with multiple power sources. They have been shown to outperform conventional methods in energy management problems in terms of energy saving and real-time performance. However, previous studies have not systematically examined the essential elements of RL-based EMS. This paper presents an empirical analysis of RL-based EMS in a Plug-in Hybrid Electric Vehicle (PHEV) and a Fuel Cell Electric Vehicle (FCEV). The empirical analysis covers four aspects: algorithm, perception and decision granularity, hyperparameters, and reward function. The results show that off-policy algorithms develop more fuel-efficient solutions over the complete driving cycle than the other algorithms. Improving the perception and decision granularity does not produce a more desirable energy-saving solution but better balances battery power and fuel consumption. The equivalent energy optimization objective based on the instantaneous state of charge (SOC) variation is parameter-sensitive and can help RL-EMSs achieve more efficient energy-cost strategies.
    Text2Struct: A Machine Learning Pipeline for Mining Structured Data from Text. (arXiv:2212.09044v1 [cs.IR])
    Many analysis and prediction tasks require the extraction of structured data from unstructured texts. To address this, this paper presents an end-to-end machine learning pipeline, Text2Struct, comprising a text annotation scheme, training data processing, and a machine learning implementation. We formulate the mining problem as the extraction of metrics and units associated with numerals in the text. Text2Struct was evaluated on an annotated text dataset collected from abstracts of medical publications on thrombectomy. In terms of prediction performance, a Dice coefficient of 0.82 was achieved on the test dataset. By random sampling, most predicted relations between numerals and entities were well matched to the ground-truth annotations. These results show that Text2Struct is viable for mining structured data from text without special templates or patterns. We anticipate further improving the pipeline by expanding the dataset and investigating other machine learning models. A code demonstration can be found at: https://github.com/zcc861007/CourseProject
    Coordinate Descent Methods for DC Minimization: Optimality Conditions and Global Convergence. (arXiv:2109.04228v3 [math.OC] UPDATED)
    Difference-of-Convex (DC) minimization, referring to the problem of minimizing the difference of two convex functions, has found rich applications in statistical learning and has been studied extensively for decades. However, existing methods are primarily based on multi-stage convex relaxation, leading only to weak optimality of critical points. This paper proposes a coordinate descent method for minimizing a class of DC functions based on sequential nonconvex approximation. Our approach solves a nonconvex one-dimensional subproblem globally at each iteration, and it is guaranteed to converge to a coordinate-wise stationary point. We prove that this new optimality condition is always stronger than the standard critical point condition and the directional point condition under a mild \textit{locally bounded nonconvexity assumption}. For comparison, we also include a naive variant of coordinate descent based on sequential convex approximation in our study. When the objective function satisfies a \textit{globally bounded nonconvexity assumption} and a \textit{Luo-Tseng error bound assumption}, the coordinate descent methods achieve a \textit{Q-linear} convergence rate. Also, for many applications of interest, we show that the nonconvex one-dimensional subproblem can be computed exactly and efficiently using a breakpoint searching method. Finally, we conduct extensive experiments on several statistical learning tasks to show the superiority of our approach. Keywords: Coordinate Descent, DC Minimization, DC Programming, Difference-of-Convex Programs, Nonconvex Optimization, Sparse Optimization, Binary Optimization.
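    The breakpoint idea is easy to see on a toy DC objective. Below we minimize $f(x) = 0.5\|Ax - y\|^2 - \lambda \|x\|_1$, a difference of two convex functions; each one-dimensional subproblem is solved exactly by comparing the branch minimizers with the breakpoint at zero. The instance is our illustration, not the paper's general solver.

```python
# Coordinate descent for a toy DC problem with exact 1-D subproblem solves.
import numpy as np

def dc_coordinate_descent(A, y, lam=0.1, n_epochs=50):
    n = A.shape[1]
    x = np.zeros(n)
    col_sq = (A ** 2).sum(axis=0)
    for _ in range(n_epochs):
        for i in range(n):
            r = A @ x - y - A[:, i] * x[i]   # residual without coordinate i
            a, c = col_sq[i], A[:, i] @ r
            # 1-D objective h(t) = 0.5*a*t^2 + c*t - lam*|t|: convex quadratic
            # on each branch, so check both branch minimizers plus the
            # breakpoint t = 0 and keep the global minimizer.
            cands = [max(0.0, (lam - c) / a), min(0.0, -(c + lam) / a), 0.0]
            h = lambda t: 0.5 * a * t * t + c * t - lam * abs(t)
            x[i] = min(cands, key=h)
    return x

rng = np.random.default_rng(0)
A, y = rng.normal(size=(30, 10)), rng.normal(size=30)
print(dc_coordinate_descent(A, y))
```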
    Analyzing the Traffic of MANETs using Graph Neural Networks. (arXiv:2212.08923v1 [cs.LG])
    Graph Neural Networks (GNNs) have been playing a role in many areas, thanks to their expressive power on graph-structured data. On the other hand, Mobile Ad-Hoc Networks (MANETs) are gaining attention as network technologies advance to the 5G level. However, no study has evaluated the efficiency of GNNs on MANETs. In this study, we aim to fill this gap by implementing a MANET dataset in a popular GNN framework, PyTorch Geometric, and show how GNNs can be utilized to analyze the traffic of MANETs. We perform an edge prediction task on the dataset with a GraphSAGE (SAG) model, where the model tries to predict whether there is a link between two nodes. We employ several evaluation metrics to measure the performance and efficiency of GNNs on MANETs. The SAG model showed an average accuracy of 82.1 in the experiments.
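    A minimal PyTorch Geometric setup of the kind described, with a two-layer GraphSAGE encoder and dot-product link scoring; the toy random graph and training details below are our stand-ins for the MANET data.

```python
# GraphSAGE link prediction in PyTorch Geometric with negative sampling.
import torch
import torch.nn.functional as F
from torch_geometric.nn import SAGEConv
from torch_geometric.utils import negative_sampling

class SAGLinkPredictor(torch.nn.Module):
    def __init__(self, in_dim, hid_dim=64):
        super().__init__()
        self.conv1 = SAGEConv(in_dim, hid_dim)
        self.conv2 = SAGEConv(hid_dim, hid_dim)

    def encode(self, x, edge_index):
        return self.conv2(F.relu(self.conv1(x, edge_index)), edge_index)

    def decode(self, z, pairs):                  # dot-product edge score
        return (z[pairs[0]] * z[pairs[1]]).sum(dim=-1)

# Toy graph: 100 nodes with random features and edges.
x = torch.randn(100, 16)
edge_index = torch.randint(0, 100, (2, 400))
model = SAGLinkPredictor(16)
opt = torch.optim.Adam(model.parameters(), lr=0.01)

for _ in range(50):
    opt.zero_grad()
    z = model.encode(x, edge_index)
    neg = negative_sampling(edge_index, num_nodes=100,
                            num_neg_samples=edge_index.size(1))
    logits = torch.cat([model.decode(z, edge_index), model.decode(z, neg)])
    labels = torch.cat([torch.ones(edge_index.size(1)), torch.zeros(neg.size(1))])
    loss = F.binary_cross_entropy_with_logits(logits, labels)
    loss.backward()
    opt.step()
```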
    GAN-based Tabular Data Generator for Constructing Synopsis in Approximate Query Processing: Challenges and Solutions. (arXiv:2212.09015v1 [cs.DB])
    In data-driven systems, data exploration is imperative for making real-time decisions. However, big data is stored in massive databases from which retrieval is difficult. Approximate Query Processing (AQP) is a technique for providing approximate answers to aggregate queries based on a summary of the data (synopsis) that closely replicates the behavior of the actual data; it is useful wherever an approximate answer to a query is acceptable in a fraction of the real execution time. In this paper, we discuss the use of Generative Adversarial Networks (GANs) for generating tabular data that can be employed in AQP for synopsis construction. We first discuss the challenges associated with constructing synopses in relational databases and then introduce solutions to those challenges. Following that, we organize statistical metrics to evaluate the quality of the generated synopses. We conclude that tabular data complexity makes it difficult for algorithms to understand relational database semantics during training, and that improved versions of tabular GANs are capable of constructing synopses to revolutionize data-driven decision-making systems.
    A Simple Baseline for Beam Search Reranking. (arXiv:2212.08926v1 [cs.CL])
    Reranking methods in machine translation aim to close the gap between common evaluation metrics (e.g. BLEU) and maximum likelihood learning and decoding algorithms. Prior works address this challenge by training models to rerank beam search candidates according to their predicted BLEU scores, building upon large models pretrained on massive monolingual corpora -- a privilege that was never made available to the baseline translation model. In this work, we examine a simple approach for training rerankers to predict translation candidates' BLEU scores without introducing additional data or parameters. Our approach can be used as a clean baseline, decoupled from external factors, for future research in this area.
    Fine-Tuning Is All You Need to Mitigate Backdoor Attacks. (arXiv:2212.09067v1 [cs.CR])
    Backdoor attacks represent one of the major threats to machine learning models. Various efforts have been made to mitigate backdoors. However, existing defenses have become increasingly complex and often require high computational resources or may also jeopardize models' utility. In this work, we show that fine-tuning, one of the most common and easy-to-adopt machine learning training operations, can effectively remove backdoors from machine learning models while maintaining high model utility. Extensive experiments over three machine learning paradigms show that fine-tuning and our newly proposed super-fine-tuning achieve strong defense performance. Furthermore, we coin a new term, namely backdoor sequela, to measure the changes in model vulnerabilities to other attacks before and after the backdoor has been removed. Empirical evaluation shows that, compared to other defense methods, super-fine-tuning leaves limited backdoor sequela. We hope our results can help machine learning model owners better protect their models from backdoor threats. Also, it calls for the design of more advanced attacks in order to comprehensively assess machine learning models' backdoor vulnerabilities.
    FiLM-Ensemble: Probabilistic Deep Learning via Feature-wise Linear Modulation. (arXiv:2206.00050v4 [cs.LG] UPDATED)
    The ability to estimate epistemic uncertainty is often crucial when deploying machine learning in the real world, but modern methods often produce overconfident, uncalibrated uncertainty predictions. A common approach to quantify epistemic uncertainty, usable across a wide class of prediction models, is to train a model ensemble. In a naive implementation, the ensemble approach has high computational cost and high memory demand. This challenges in particular modern deep learning, where even a single deep network is already demanding in terms of compute and memory, and has given rise to a number of attempts to emulate the model ensemble without actually instantiating separate ensemble members. We introduce FiLM-Ensemble, a deep, implicit ensemble method based on the concept of Feature-wise Linear Modulation (FiLM). That technique was originally developed for multi-task learning, with the aim of decoupling different tasks. We show that the idea can be extended to uncertainty quantification: by modulating the network activations of a single deep network with FiLM, one obtains a model ensemble with high diversity, and consequently well-calibrated estimates of epistemic uncertainty, with low computational overhead in comparison. Empirically, FiLM-Ensemble outperforms other implicit ensemble methods and comes very close to the upper bound of an explicit ensemble of networks (sometimes even beating it), at a fraction of the memory cost.
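    A hedged sketch of the FiLM mechanism at the heart of the method: each implicit ensemble member owns a per-channel (gamma, beta) pair that modulates shared activations. The wiring below is our simplification, not the paper's full architecture.

```python
# Feature-wise Linear Modulation shared across implicit ensemble members.
import torch
import torch.nn as nn

class FiLMLayer(nn.Module):
    def __init__(self, n_members: int, n_channels: int):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(n_members, n_channels))
        self.beta = nn.Parameter(torch.zeros(n_members, n_channels))

    def forward(self, h: torch.Tensor, member: int) -> torch.Tensor:
        # h: (batch, channels); feature-wise affine modulation per member.
        return self.gamma[member] * h + self.beta[member]

shared = nn.Linear(32, 64)
film = FiLMLayer(n_members=4, n_channels=64)
x = torch.randn(8, 32)
# All implicit members share the base weights but differ via their FiLM params.
outputs = torch.stack([film(shared(x), m) for m in range(4)])
print(outputs.shape)  # (4, 8, 64): one prediction per ensemble member
```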
    Communication Size Reduction of Federated Learning based on Neural ODE Model. (arXiv:2208.09478v2 [cs.LG] UPDATED)
    Federated learning is a machine learning method in which data is not aggregated on a server but instead remains distributed at the edges, in consideration of security and privacy. ResNet is a classic yet representative neural network that succeeds in deepening networks by learning residual functions that add a block's inputs and outputs together. In federated learning, the server and edge devices communicate to exchange weight parameters; because ResNet has deep layers and a large number of parameters, the communication size becomes large. In this paper, we use Neural ODE as a lightweight alternative to ResNet to reduce communication size in federated learning. In addition, we newly introduce flexible federated learning using Neural ODE models with different numbers of iterations, which correspond to ResNets of different depths. The CIFAR-10 dataset is used in the evaluation, and the use of Neural ODE reduces communication size by approximately 90% compared to ResNet. We also show that the proposed flexible federated learning can merge models with different iteration counts.
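    To illustrate the ResNet-to-Neural-ODE substitution, the sketch below defines a single ODE block whose integration horizon plays the role of depth; it assumes the torchdiffeq package and omits the federated averaging loop.

```python
# A Neural ODE block: one small network's parameters are shared across "depth",
# so only ODEFunc's weights need to be communicated in federated learning.
import torch
import torch.nn as nn
from torchdiffeq import odeint

class ODEFunc(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(),
                                 nn.Linear(dim, dim))

    def forward(self, t, y):
        return self.net(y)

func = ODEFunc(dim=16)
y0 = torch.randn(4, 16)
# Integrating over more time steps mimics a deeper ResNet with shared weights.
t = torch.linspace(0.0, 1.0, steps=5)
trajectory = odeint(func, y0, t)   # (5, 4, 16)
print(trajectory[-1].shape)
```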
    Collaborative Algorithms for Online Personalized Mean Estimation. (arXiv:2208.11530v2 [cs.LG] UPDATED)
    We consider an online estimation problem involving a set of agents. Each agent has access to a (personal) process that generates samples from a real-valued distribution and seeks to estimate its mean. We study the case where some of the distributions have the same mean, and the agents are allowed to actively query information from other agents. The goal is to design an algorithm that enables each agent to improve its mean estimate thanks to communication with other agents. The means as well as the number of distributions with same mean are unknown, which makes the task nontrivial. We introduce a novel collaborative strategy to solve this online personalized mean estimation problem. We analyze its time complexity and introduce variants that enjoy good performance in numerical experiments. We also extend our approach to the setting where clusters of agents with similar means seek to estimate the mean of their cluster.
    Bort: Towards Explainable Neural Networks with Bounded Orthogonal Constraint. (arXiv:2212.09062v1 [cs.CV])
    Deep learning has revolutionized human society, yet the black-box nature of deep neural networks hinders further application in reliability-demanding industries. In the attempt to unpack them, many works observe or manipulate internal variables to improve a model's comprehensibility and transparency. However, existing methods rely on intuitive assumptions and lack mathematical guarantees. To bridge this gap, we introduce Bort, an optimizer for improving model explainability with boundedness and orthogonality constraints on model parameters, derived from the sufficient conditions of model comprehensibility and transparency. We perform reconstruction and backtracking on the model representations optimized by Bort and observe an evident improvement in model explainability. Based on Bort, we are able to synthesize explainable adversarial samples without additional parameters and training. Surprisingly, we find Bort consistently improves the classification accuracy of various architectures, including ResNet and DeiT, on MNIST, CIFAR-10, and ImageNet.
    An Extension of Fisher's Criterion: Theoretical Results with a Neural Network Realization. (arXiv:2212.09225v1 [cs.LG])
    Fisher's criterion is a widely used tool in machine learning for feature selection. For large search spaces, Fisher's criterion can provide a scalable solution for selecting features. A challenging limitation of Fisher's criterion, however, is that it performs poorly when the mean values of the class-conditional distributions are close to each other. Motivated by this challenge, we propose an extension of Fisher's criterion to overcome this limitation. The proposed extension utilizes the available heteroscedasticity of the class-conditional distributions to distinguish one class from another. Additionally, we describe how our theoretical results can be cast into a neural network framework, and conduct a proof-of-concept experiment to demonstrate the viability of our approach for solving classification problems.
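    A quick numpy illustration of the failure mode that motivates the extension: for one feature and one class pair, the classical Fisher criterion collapses to near zero when class means coincide, even though the heteroscedasticity clearly separates the classes.

```python
# The 1-D Fisher criterion for a single feature between two classes.
import numpy as np

def fisher_score(x_a: np.ndarray, x_b: np.ndarray) -> float:
    return (x_a.mean() - x_b.mean()) ** 2 / (x_a.var() + x_b.var() + 1e-12)

rng = np.random.default_rng(0)
# Same means, very different variances: the classes are distinguishable by
# spread alone, yet Fisher's criterion is near zero.
a = rng.normal(0.0, 0.1, 1000)
b = rng.normal(0.0, 3.0, 1000)
print(fisher_score(a, b))  # ~0: exactly the limitation discussed above
```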
    Riemannian Optimization for Variance Estimation in Linear Mixed Models. (arXiv:2212.09081v1 [stat.ML])
    Variance parameter estimation in linear mixed models is a challenge for many classical nonlinear optimization algorithms due to the positive-definiteness constraint of the random effects covariance matrix. We take a completely novel view on parameter estimation in linear mixed models by exploiting the intrinsic geometry of the parameter space. We formulate the problem of residual maximum likelihood estimation as an optimization problem on a Riemannian manifold. Based on the introduced formulation, we give geometric higher-order information on the problem via the Riemannian gradient and the Riemannian Hessian. Based on that, we test our approach with Riemannian optimization algorithms numerically. Our approach yields a higher quality of the variance parameter estimates compared to existing approaches.
    Learning Domain Invariant Representations for Generalizable Person Re-Identification. (arXiv:2103.15890v4 [cs.CV] UPDATED)
    Generalizable person Re-Identification (ReID) has recently attracted growing attention in the computer vision community. In this work, we construct a structural causal model (SCM) among identity labels, identity-specific factors (clothes/shoes color, etc.), and domain-specific factors (background, viewpoints, etc.). According to the causal analysis, we propose a novel Domain Invariant Representation Learning for generalizable person Re-Identification (DIR-ReID) framework. Specifically, we first propose to disentangle the identity-specific and domain-specific feature spaces, based on which we propose an effective algorithmic implementation of backdoor adjustment, essentially serving as a causal intervention on the SCM. Extensive experiments have been conducted, showing that DIR-ReID outperforms state-of-the-art methods on large-scale domain generalization ReID benchmarks.  ( 2 min )
    Molecule Generation by Principal Subgraph Mining and Assembling. (arXiv:2106.15098v4 [cs.LG] UPDATED)
    Molecule generation is central to a variety of applications. Recent attention has been paid to approaching the generation task as subgraph prediction and assembly. Nevertheless, these methods usually rely on hand-crafted or external subgraph construction, and the subgraph assembly depends solely on local arrangement. In this paper, we define a novel notion, the principal subgraph, that is closely related to the informative patterns within molecules. Interestingly, our proposed merge-and-update subgraph extraction method can automatically discover frequent principal subgraphs from the dataset, which previous methods are incapable of. Moreover, we develop a two-step subgraph assembly strategy, which first predicts a set of subgraphs in a sequence-wise manner and then assembles all generated subgraphs globally as the final output molecule. Built upon a graph variational auto-encoder, our model is demonstrated to be effective in terms of several evaluation metrics and efficiency, compared with state-of-the-art methods on distribution learning and (constrained) property optimization tasks.  ( 2 min )
    A Neural Network Warm-Start Approach for the Inverse Acoustic Obstacle Scattering Problem. (arXiv:2212.08736v1 [math.NA])
    We consider the inverse acoustic obstacle problem for sound-soft star-shaped obstacles in two dimensions wherein the boundary of the obstacle is determined from measurements of the scattered field at a collection of receivers outside the object. One of the standard approaches for solving this problem is to reformulate it as an optimization problem: finding the boundary of the domain that minimizes the $L^2$ distance between computed values of the scattered field and the given measurement data. The optimization problem is computationally challenging since the local set of convexity shrinks with increasing frequency and results in an increasing number of local minima in the vicinity of the true solution. In many practical experimental settings, low frequency measurements are unavailable due to limitations of the experimental setup or the sensors used for measurement. Thus, obtaining a good initial guess for the optimization problem plays a vital role in this environment. We present a neural network warm-start approach for solving the inverse scattering problem, where an initial guess for the optimization problem is obtained using a trained neural network. We demonstrate the effectiveness of our method with several numerical examples. For high frequency problems, this approach outperforms traditional iterative methods such as Gauss-Newton initialized without any prior (i.e., initialized using a unit circle), or initialized using the solution of a direct method such as the linear sampling method. The algorithm remains robust to noise in the scattered field measurements and also converges to the true solution for limited aperture data. However, the number of training samples required to train the neural network scales exponentially in frequency and the complexity of the obstacles considered. We conclude with a discussion of this phenomenon and potential directions for future research.  ( 2 min )
    Learning from Training Dynamics: Identifying Mislabeled Data Beyond Manually Designed Features. (arXiv:2212.09321v1 [cs.CV])
    While mislabeled or ambiguously-labeled samples in the training set can negatively affect the performance of deep models, diagnosing the dataset and identifying mislabeled samples helps to improve generalization. Training dynamics, i.e., the traces left by iterations of optimization algorithms, have recently been proven effective for localizing mislabeled samples with hand-crafted features. In this paper, going beyond manually designed features, we introduce a novel learning-based solution, leveraging a noise detector, instantiated as an LSTM network, which learns to predict whether a sample was mislabeled using the raw training dynamics as input. Specifically, the proposed method trains the noise detector in a supervised manner using a dataset with synthesized label noise, and can adapt to various datasets (either naturally or synthetically label-noised) without retraining. We conduct extensive experiments to evaluate the proposed method. We train the noise detector on the synthetically label-noised CIFAR dataset and test it on Tiny ImageNet, CUB-200, Caltech-256, WebVision and Clothing1M. Results show that the proposed method precisely detects mislabeled samples on various datasets without further adaptation, and outperforms state-of-the-art methods. Moreover, further experiments demonstrate that the mislabel identification can guide label correction, namely data debugging, providing orthogonal improvements over algorithm-centric state-of-the-art techniques from the data perspective.  ( 2 min )
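    A minimal stand-in for such a detector: an LSTM reads a per-sample sequence of statistics logged across epochs and outputs the probability that the sample is mislabeled. The feature choice (e.g., per-epoch loss and margin) and sizes below are our assumptions.

```python
# An LSTM noise detector trained on raw training dynamics.
import torch
import torch.nn as nn

class NoiseDetector(nn.Module):
    def __init__(self, n_features: int, hidden: int = 32):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, dynamics: torch.Tensor) -> torch.Tensor:
        # dynamics: (batch, epochs, n_features), e.g., per-epoch loss & margin.
        _, (h, _) = self.lstm(dynamics)
        return self.head(h[-1]).squeeze(-1)    # logits: mislabeled vs. clean

# Supervised training on synthetically label-noised data with known ground truth.
detector = NoiseDetector(n_features=2)
dynamics = torch.randn(64, 30, 2)              # 64 samples, 30 epochs of stats
is_noisy = torch.randint(0, 2, (64,)).float()
loss = nn.functional.binary_cross_entropy_with_logits(detector(dynamics), is_noisy)
loss.backward()
```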
    Generative Networks for Precision Enthusiasts. (arXiv:2110.13632v3 [hep-ph] UPDATED)
    Generative networks are opening new avenues in fast event generation for the LHC. We show how generative flow networks can reach percent-level precision for kinematic distributions, how they can be trained jointly with a discriminator, and how this discriminator improves the generation. Our joint training relies on a novel coupling of the two networks which does not require a Nash equilibrium. We then estimate the generation uncertainties through a Bayesian network setup and through conditional data augmentation, while the discriminator ensures that there are no systematic inconsistencies compared to the training data.  ( 2 min )
    Learning Inter-Annual Flood Loss Risk Models From Historical Flood Insurance Claims and Extreme Rainfall Data. (arXiv:2212.08660v1 [cs.LG])
    Flooding is one of the most disastrous natural hazards, responsible for substantial economic losses. A predictive model for flood-induced financial damages is useful for many applications such as climate change adaptation planning and insurance underwriting. This research assesses the predictive capability of regressors constructed on the National Flood Insurance Program (NFIP) dataset using neural networks (Conditional Generative Adversarial Networks), decision trees (Extreme Gradient Boosting), and kernel-based regressors (Gaussian Process). The assessment highlights the most informative predictors for regression. The distribution for claims amount inference is modeled with a Burr distribution permitting the introduction of a bias correction scheme and increasing the regressor's predictive capability. Aiming to study the interaction with physical variables, we incorporate Daymet rainfall estimation to NFIP as an additional predictor. A study on the coastal counties in the eight US South-West states resulted in an $R^2=0.807$. Further analysis of 11 counties with a significant number of claims in the NFIP dataset reveals that Extreme Gradient Boosting provides the best results, that bias correction significantly improves the similarity with the reference distribution, and that the rainfall predictor strengthens the regressor performance.  ( 2 min )
    Plankton-FL: Exploration of Federated Learning for Privacy-Preserving Training of Deep Neural Networks for Phytoplankton Classification. (arXiv:2212.08990v1 [cs.LG])
    Creating high-performance generalizable deep neural networks for phytoplankton monitoring requires utilizing large-scale data coming from diverse global water sources. A major challenge to training such networks lies in data privacy, where data collected at different facilities are often restricted from being transferred to a centralized location. A promising approach to overcome this challenge is federated learning, where training is done at site level on local data, and only the model parameters are exchanged over the network to generate a global model. In this study, we explore the feasibility of leveraging federated learning for privacy-preserving training of deep neural networks for phytoplankton classification. More specifically, we simulate two different federated learning frameworks, federated learning (FL) and mutually exclusive FL (ME-FL), and compare their performance to a traditional centralized learning (CL) framework. Experimental results from this study demonstrate the feasibility and potential of federated learning for phytoplankton monitoring.  ( 2 min )
    Graph Neural Network based Child Activity Recognition. (arXiv:2212.09013v1 [cs.CV])
    This paper presents an implementation of child activity recognition (CAR) with a graph convolution network (GCN) based deep learning model, since prior implementations in this domain have been dominated by CNN, LSTM and other methods despite the superior performance of GCN. To the best of our knowledge, we are the first to use a GCN model in the child activity recognition domain. To overcome the challenges posed by the small size of publicly available child action datasets, several learning methods such as feature extraction, fine-tuning and curriculum learning were implemented to improve model performance. Inspired by the contradictory claims made about the use of transfer learning in CAR, we conducted a detailed implementation and analysis of transfer learning, together with a study of the negative transfer learning effect on CAR, as it hasn't been addressed previously. As the principal contribution, we developed an ST-GCN based CAR model which, despite the small size of the dataset, obtained around 50% accuracy in vanilla implementations. With feature extraction and fine-tuning methods, accuracy improved by 20%-30%, with the highest accuracy being 82.24%. Furthermore, the results on activity datasets empirically demonstrate that careful selection of pre-training datasets through methods such as curriculum learning can enhance accuracy levels. Finally, we provide preliminary evidence of a possible frame-rate effect on the accuracy of CAR models, a direction future research can explore.  ( 2 min )
    Convergence of gradient descent for deep neural networks. (arXiv:2203.16462v4 [cs.LG] UPDATED)
    This article presents a criterion for convergence of gradient descent to a global minimum, which is then used to show that gradient descent with proper initialization converges to a global minimum when training any feedforward neural network with smooth and strictly increasing activation functions, provided that the input dimension is greater than or equal to the number of data points. The main difference with prior work is that the width of the network can be a fixed number instead of growing as some multiple or power of the number of data points.
    Dueling RL: Reinforcement Learning with Trajectory Preferences. (arXiv:2111.04850v2 [cs.LG] UPDATED)
    We consider the problem of preference-based reinforcement learning (PbRL), where, unlike traditional reinforcement learning, an agent receives feedback only in terms of a 1-bit (0/1) preference over a trajectory pair instead of absolute rewards. The success of the traditional RL framework crucially relies on the underlying agent-reward model, which, however, depends on how accurately a system designer can express an appropriate reward function, often a non-trivial task. The main novelty of our framework is the ability to learn from preference-based trajectory feedback, which eliminates the need to hand-craft numeric reward models. This paper sets up a formal framework for the PbRL problem with non-Markovian rewards, where the trajectory preferences are encoded by a generalized linear model of dimension $d$. Assuming the transition model is known, we then propose an algorithm with an almost optimal regret guarantee of $\tilde {\mathcal{O}}\left( SH d \log (T / \delta) \sqrt{T} \right)$. We further extend the above algorithm to the case of unknown transition dynamics, and provide an algorithm with a near-optimal regret guarantee of $\widetilde{\mathcal{O}}((\sqrt{d} + H^2 + |\mathcal{S}|)\sqrt{dT} +\sqrt{|\mathcal{S}||\mathcal{A}|TH} )$. To the best of our knowledge, our work is one of the first to give tight regret guarantees for preference-based RL problems with trajectory preferences.  ( 2 min )
    Confidence-aware Training of Smoothed Classifiers for Certified Robustness. (arXiv:2212.09000v1 [cs.LG])
    Any classifier can be "smoothed out" under Gaussian noise to build a new classifier that is provably robust to $\ell_2$-adversarial perturbations, viz., by averaging its predictions over the noise via randomized smoothing. Under the smoothed classifiers, the fundamental trade-off between accuracy and (adversarial) robustness has been well evidenced in the literature: i.e., increasing the robustness of a classifier for an input can be at the expense of decreased accuracy for some other inputs. In this paper, we propose a simple training method leveraging this trade-off to obtain robust smoothed classifiers, in particular, through a sample-wise control of robustness over the training samples. We make this control feasible by using "accuracy under Gaussian noise" as an easy-to-compute proxy of adversarial robustness for an input. Specifically, we differentiate the training objective depending on this proxy to filter out samples that are unlikely to benefit from the worst-case (adversarial) objective. Our experiments show that the proposed method, despite its simplicity, consistently exhibits improved certified robustness upon state-of-the-art training methods. Somewhat surprisingly, we find these improvements persist even for other notions of robustness, e.g., to various types of common corruptions.  ( 2 min )
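    Our reading of the "accuracy under Gaussian noise" proxy, in code: the fraction of noisy copies of an input that the classifier labels correctly, which is cheap to compute per training sample. The noise level and sample count below are illustrative assumptions.

```python
# Per-sample robustness proxy: accuracy over Gaussian-noised copies of inputs.
import torch

def noise_accuracy(model, x, y, sigma=0.25, n_samples=16):
    """x: (batch, ...) inputs; y: (batch,) labels. Returns per-sample proxy."""
    hits = torch.zeros(x.size(0))
    with torch.no_grad():
        for _ in range(n_samples):
            preds = model(x + sigma * torch.randn_like(x)).argmax(dim=-1)
            hits += (preds == y).float()
    # Near 1: robust sample; near 0: a candidate to filter out of the
    # worst-case (adversarial) objective, as described above.
    return hits / n_samples
```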
    XEngine: Optimal Tensor Rematerialization for Neural Networks in Heterogeneous Environments. (arXiv:2212.09290v1 [cs.LG])
    Memory efficiency is crucial in training deep learning networks on resource-restricted devices. During backpropagation, forward tensors are used to calculate gradients. Despite the option of keeping those dependencies in memory until they are reused in backpropagation, some forward tensors can be discarded and recomputed later from saved tensors, so-called checkpoints. This allows, in particular, resource-constrained heterogeneous environments to make use of all available compute devices. Unfortunately, the definition of these checkpoints is a non-trivial problem and poses a challenge to the programmer: improper or excessive recomputations negate the benefit of checkpointing. In this article, we present XEngine, an approach that schedules network operators to heterogeneous devices in low-memory environments by determining checkpoints and recomputations of tensors. Our approach selects suitable resources per timestep and operator, and optimizes the end-to-end time for neural networks while taking the memory limitation of each device into account. For this, we formulate a mixed-integer quadratic program (MIQP) to schedule operators of deep learning networks on heterogeneous systems. We compare our MIQP solver XEngine against Checkmate, a mixed-integer linear programming (MILP) approach that solves recomputation on a single device. Our solver finds solutions that are up to 22.5% faster than the fastest Checkmate schedule, in which the network is computed exclusively on a single device. We also find valid schedules for networks making use of both central processing units and graphics processing units if memory limitations do not allow scheduling exclusively to the graphics processing unit.  ( 2 min )
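    The checkpoint/recompute trade-off that XEngine optimizes can be seen in miniature with PyTorch's built-in rematerialization utility, which drops a block's intermediate activations in the forward pass and recomputes them during backpropagation; XEngine's cross-device MIQP scheduling is not reproduced here.

```python
# Tensor rematerialization in one line of PyTorch: trade compute for memory.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

block = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))
x = torch.randn(32, 512, requires_grad=True)
y = checkpoint(block, x, use_reentrant=False)  # forward w/o saved activations
y.sum().backward()                             # block is recomputed here
```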
    Uncertainty Estimation for Heatmap-based Landmark Localization. (arXiv:2203.02351v2 [cs.LG] UPDATED)
    Automatic anatomical landmark localization has made great strides by leveraging deep learning methods in recent years. The ability to quantify the uncertainty of these predictions is a vital component needed for these methods to be adopted in clinical settings, where it is imperative that erroneous predictions are caught and corrected. We propose Quantile Binning, a data-driven method to categorize predictions by uncertainty with estimated error bounds. Our framework can be applied to any continuous uncertainty measure, allowing straightforward identification of the best subset of predictions with accompanying estimated error bounds. We facilitate easy comparison between uncertainty measures by constructing two evaluation metrics derived from Quantile Binning. We compare and contrast three epistemic uncertainty measures (two baselines, and a proposed method combining aspects of the two), derived from two heatmap-based landmark localization model paradigms (U-Net and patch-based). We show results across three datasets, including a publicly available Cephalometric dataset. We illustrate how filtering out gross mispredictions caught in our Quantile Bins significantly improves the proportion of predictions under an acceptable error threshold. Finally, we demonstrate that Quantile Binning remains effective on landmarks with high aleatoric uncertainty caused by inherent landmark ambiguity, and offer recommendations on which uncertainty measure to use and how to use it. The code and data are available at https://github.com/schobs/qbin.  ( 2 min )
    SPARF: Large-Scale Learning of 3D Sparse Radiance Fields from Few Input Images. (arXiv:2212.09100v1 [cs.CV])
    Recent advances in Neural Radiance Fields (NeRFs) treat the problem of novel view synthesis as Sparse Radiance Field (SRF) optimization using sparse voxels for efficient and fast rendering (plenoxels, InstantNGP). In order to leverage machine learning and adoption of SRFs as a 3D representation, we present SPARF, a large-scale ShapeNet-based synthetic dataset for novel view synthesis consisting of $\sim$ 17 million images rendered from nearly 40,000 shapes at high resolution ($400 \times 400$ pixels). The dataset is orders of magnitude larger than existing synthetic datasets for novel view synthesis and includes more than one million 3D-optimized radiance fields with multiple voxel resolutions. Furthermore, we propose a novel pipeline (SuRFNet) that learns to generate sparse voxel radiance fields from only few views. This is done by using the densely collected SPARF dataset and 3D sparse convolutions. SuRFNet employs partial SRFs from few/one images and a specialized SRF loss to learn to generate high-quality sparse voxel radiance fields that can be rendered from novel views. Our approach achieves state-of-the-art results in the task of unconstrained novel view synthesis based on few views on ShapeNet as compared to recent baselines. The SPARF dataset will be made public with the code and models on the project website https://abdullahamdi.com/sparf/ .  ( 2 min )
    Enriching Relation Extraction with OpenIE. (arXiv:2212.09376v1 [cs.CL])
    Relation extraction (RE) is a sub-discipline of information extraction (IE) which focuses on the prediction of a relational predicate from a natural-language input unit (such as a sentence, a clause, or even a short paragraph consisting of multiple sentences and/or clauses). Together with named-entity recognition (NER) and disambiguation (NED), RE forms the basis for many advanced IE tasks such as knowledge-base (KB) population and verification. In this work, we explore how recent approaches for open information extraction (OpenIE) may help to improve the task of RE by encoding structured information about the sentences' principal units, such as subjects, objects, verbal phrases, and adverbials, into various forms of vectorized (and hence unstructured) representations of the sentences. Our main conjecture is that the decomposition of long and possibly convoluted sentences into multiple smaller clauses via OpenIE even helps to fine-tune context-sensitive language models such as BERT (and its plethora of variants) for RE. Our experiments over two annotated corpora, KnowledgeNet and FewRel, demonstrate the improved accuracy of our enriched models compared to existing RE approaches. Our best results reach F1 scores of 92% and 71% for KnowledgeNet and FewRel, respectively, proving the effectiveness of our approach on these competitive benchmarks.  ( 2 min )
    Multi-block-Single-probe Variance Reduced Estimator for Coupled Compositional Optimization. (arXiv:2207.08540v3 [cs.LG] UPDATED)
    Variance reduction techniques such as SPIDER/SARAH/STORM have been extensively studied to improve the convergence rates of stochastic non-convex optimization, which usually maintain and update a sequence of estimators for a single function across iterations. What if we need to track multiple functional mappings across iterations but only with access to stochastic samples of $\mathcal{O}(1)$ functional mappings at each iteration? There is an important application in solving an emerging family of coupled compositional optimization problems in the form of $\sum_{i=1}^m f_i(g_i(\mathbf{w}))$, where $g_i$ is accessible through a stochastic oracle. The key issue is to track and estimate a sequence of $\mathbf g(\mathbf{w})=(g_1(\mathbf{w}), \ldots, g_m(\mathbf{w}))$ across iterations, where $\mathbf g(\mathbf{w})$ has $m$ blocks and it is only allowed to probe $\mathcal{O}(1)$ blocks to attain their stochastic values and Jacobians. To improve the complexity for solving these problems, we propose a novel stochastic method named Multi-block-Single-probe Variance Reduced (MSVR) estimator to track the sequence of $\mathbf g(\mathbf{w})$. It is inspired by STORM but introduces a customized error correction term to alleviate the noise not only in stochastic samples for the selected blocks but also in those blocks that are not sampled. With the help of the MSVR estimator, we develop several algorithms for solving the aforementioned compositional problems with improved complexities across a spectrum of settings with non-convex/convex/strongly convex/Polyak-{\L}ojasiewicz (PL) objectives. Our results improve upon prior ones in several aspects, including the order of sample complexities and dependence on the strong convexity parameter. Empirical studies on multi-task deep AUC maximization demonstrate the better performance of using the new estimator.  ( 2 min )
    Very Large Language Model as a Unified Methodology of Text Mining. (arXiv:2212.09271v1 [cs.DB])
    Text data mining is the process of deriving essential information from language text. Typical text mining tasks include text categorization, text clustering, topic modeling, information extraction, and text summarization. Various datasets are collected and various algorithms are designed for the different types of tasks. In this paper, I present a blue-sky idea: the very large language model (VLLM) will become an effective unified methodology of text mining. I discuss at least three advantages of this new methodology over conventional methods. Finally, I discuss the challenges in the design and development of VLLM techniques for text mining.  ( 2 min )
    Clinical Deterioration Prediction in Brazilian Hospitals Based on Artificial Neural Networks and Tree Decision Models. (arXiv:2212.08975v1 [cs.LG])
    Early recognition of clinical deterioration (CD) has vital importance in patients' survival from exacerbation or death. Electronic health records (EHRs) data have been widely employed in Early Warning Scores (EWS) to measure CD risk in hospitalized patients. Recently, EHRs data have been utilized in Machine Learning (ML) models to predict mortality and CD. The ML models have shown superior performance in CD prediction compared to EWS. Since EHRs data are structured and tabular, conventional ML models are generally applied to them, and less effort is put into evaluating the artificial neural network's performance on EHRs data. Thus, in this article, an extremely boosted neural network (XBNet) is used to predict CD, and its performance is compared to eXtreme Gradient Boosting (XGBoost) and random forest (RF) models. For this purpose, 103,105 samples from thirteen Brazilian hospitals are used to generate the models. Moreover, the principal component analysis (PCA) is employed to verify whether it can improve the adopted models' performance. The performance of ML models and Modified Early Warning Score (MEWS), an EWS candidate, are evaluated in CD prediction regarding the accuracy, precision, recall, F1-score, and geometric mean (G-mean) metrics in a 10-fold cross-validation approach. According to the experiments, the XGBoost model obtained the best results in predicting CD among Brazilian hospitals' data.  ( 2 min )
    A Layered Architecture for Universal Causality. (arXiv:2212.08981v1 [cs.AI])
    We propose a layered hierarchical architecture called UCLA (Universal Causality Layered Architecture), which combines multiple levels of categorical abstraction for causal inference. At the top-most level, causal interventions are modeled combinatorially using a simplicial category of ordinal numbers. At the second layer, causal models are defined by a graph-type category. The non-random ``surgical" operations on causal structures, such as edge deletion, are captured using degeneracy and face operators from the simplicial layer above. The third categorical abstraction layer corresponds to the data layer in causal inference. The fourth, homotopy layer comprises additional structure imposed on the instance layer above, such as a topological space, which enables evaluating causal models on datasets. Functors map between every pair of layers in UCLA. Each functor between layers is characterized by a universal arrow, which defines an isomorphism between every pair of categorical layers. These universal arrows define universal elements and representations through the Yoneda Lemma, and in turn lead to a new category of elements based on a construction introduced by Grothendieck. Causal inference between each pair of layers is defined as a lifting problem, a commutative diagram whose objects are categories and whose morphisms are functors characterized as different types of fibrations. We illustrate the UCLA architecture using a range of examples, including integer-valued multisets that represent a non-graphical framework for conditional independence, and causal models based on graphs and string diagrams using symmetric monoidal categories. We define causal effect in terms of the homotopy colimit of the nerve of the category of elements.  ( 2 min )
    The Impact of Socioeconomic Factors on Health Disparities. (arXiv:2212.04285v2 [cs.CY] CROSS LISTED)
    High-quality healthcare in the US can be cost-prohibitive for certain socioeconomic groups. In this paper, we examined data from the US Census and the CDC to determine the degree to which specific socioeconomic factors correlate with both specific and general health metrics. We employed visual analysis to find broad trends and predictive modeling to identify more complex relationships between variables. Our results indicate that certain socioeconomic factors, like income and educational attainment, are highly correlated with aggregate measures of health.  ( 2 min )
    Deep learning applied to computational mechanics: A comprehensive review, state of the art, and the classics. (arXiv:2212.08989v1 [cs.LG])
    Three recent breakthroughs due to AI in arts and science serve as motivation: an award-winning digital image, protein folding, and fast matrix multiplication. Many recent developments in artificial neural networks, particularly deep learning (DL), applied and relevant to computational mechanics (solids, fluids, finite-element technology) are reviewed in detail. Both hybrid and pure machine learning (ML) methods are discussed. Hybrid methods combine traditional PDE discretizations with ML methods either (1) to help model complex nonlinear constitutive relations, (2) to nonlinearly reduce the model order for efficient simulation (turbulence), or (3) to accelerate the simulation by predicting certain components in the traditional integration methods. Here, methods (1) and (2) rely on Long-Short-Term Memory (LSTM) architecture, while method (3) relies on convolutional neural networks. Pure ML methods to solve (nonlinear) PDEs are represented by Physics-Informed Neural Network (PINN) methods, which can be combined with attention mechanisms to address discontinuous solutions. Both LSTM and attention architectures, together with modern and generalized classic optimizers that include stochasticity for DL networks, are extensively reviewed. Kernel machines, including Gaussian processes, are covered in sufficient depth for more advanced works such as shallow networks with infinite width. Readers are assumed to be familiar with computational mechanics, but not with DL, whose concepts and applications are built up from the basics, aiming to bring first-time learners quickly to the forefront of research. The history and limitations of AI are recounted and discussed, with particular attention to pointing out misstatements or misconceptions of the classics, even in well-known references. Positioning and pointing control of a large-deformable beam is given as an example.  ( 2 min )
    EffMulti: Efficiently Modeling Complex Multimodal Interactions for Emotion Analysis. (arXiv:2212.08661v1 [cs.LG])
    Humans are skilled at reading an interlocutor's emotion from multimodal signals, including spoken words, simultaneous speech, and facial expressions. It remains a challenge to effectively decode emotions from the complex interactions of multimodal signals. In this paper, we design three kinds of multimodal latent representations to refine the emotion analysis process and capture complex multimodal interactions from different views, including an intact three-modal integrated representation, a modality-shared representation, and three modality-individual representations. Then, a modality-semantic hierarchical fusion is proposed to reasonably incorporate these representations into a comprehensive interaction representation. The experimental results demonstrate that our EffMulti outperforms the state-of-the-art methods. The compelling performance benefits from its well-designed framework with ease of implementation, lower computational complexity, and fewer trainable parameters.  ( 2 min )
  • Open

    Asymptotics of $\ell_2$ Regularized Network Embeddings. (arXiv:2201.01689v3 [stat.ML] UPDATED)
    A common approach to solving prediction tasks on large networks, such as node classification or link prediction, begins by learning a Euclidean embedding of the nodes of the network, to which traditional machine learning methods can then be applied. This includes methods such as DeepWalk and node2vec, which learn embeddings by optimizing stochastic losses formed over subsamples of the graph at each iteration of stochastic gradient descent. In this paper, we study the effects of adding an $\ell_2$ penalty on the embedding vectors to the training loss of these types of methods. We prove that, under some exchangeability assumptions on the graph, this asymptotically leads to learning a graphon with a nuclear-norm-type penalty, and we give guarantees for the asymptotic distribution of the learned embedding vectors. In particular, the exact form of the penalty depends on the choice of subsampling method used as part of stochastic gradient descent. We also illustrate empirically that concatenating node covariates to $\ell_2$ regularized node2vec embeddings leads to comparable, if not superior, performance to methods which incorporate node covariates and the network structure in a non-linear manner.
    A Complete Characterization of Linear Estimators for Offline Policy Evaluation. (arXiv:2203.04236v2 [cs.LG] UPDATED)
    Offline policy evaluation is a fundamental statistical problem in reinforcement learning that involves estimating the value function of some decision-making policy given data collected by a potentially different policy. In order to tackle problems with complex, high-dimensional observations, there has been significant interest from theoreticians and practitioners alike in understanding the possibility of function approximation in reinforcement learning. Despite significant study, a sharp characterization of when we might expect offline policy evaluation to be tractable, even in the simplest setting of linear function approximation, has so far remained elusive, with a surprising number of strong negative results recently appearing in the literature. In this work, we identify simple control-theoretic and linear-algebraic conditions that are necessary and sufficient for classical methods, in particular Fitted Q-iteration (FQI) and least squares temporal difference learning (LSTD), to succeed at offline policy evaluation. Using this characterization, we establish a precise hierarchy of regimes under which these estimators succeed. We prove that LSTD works under strictly weaker conditions than FQI. Furthermore, we establish that if a problem is not solvable via LSTD, then it cannot be solved by a broad class of linear estimators, even in the limit of infinite data. Taken together, our results provide a complete picture of the behavior of linear estimators for offline policy evaluation, unify previously disparate analyses of canonical algorithms, and provide significantly sharper notions of the underlying statistical complexity of offline policy evaluation.
    Fast and robust Bayesian Inference using Gaussian Processes with GPry. (arXiv:2211.02045v2 [astro-ph.CO] UPDATED)
    We present the GPry algorithm for fast Bayesian inference of general (non-Gaussian) posteriors with a moderate number of parameters. GPry does not need any pre-training, special hardware such as GPUs, and is intended as a drop-in replacement for traditional Monte Carlo methods for Bayesian inference. Our algorithm is based on generating a Gaussian Process surrogate model of the log-posterior, aided by a Support Vector Machine classifier that excludes extreme or non-finite values. An active learning scheme allows us to reduce the number of required posterior evaluations by two orders of magnitude compared to traditional Monte Carlo inference. Our algorithm allows for parallel evaluations of the posterior at optimal locations, further reducing wall-clock times. We significantly improve performance using properties of the posterior in our active learning scheme and for the definition of the GP prior. In particular we account for the expected dynamical range of the posterior in different dimensionalities. We test our model against a number of synthetic and cosmological examples. GPry outperforms traditional Monte Carlo methods when the evaluation time of the likelihood (or the calculation of theoretical observables) is of the order of seconds; for evaluation times of over a minute it can perform inference in days that would take months using traditional methods. GPry is distributed as an open source Python package (pip install gpry) and can also be found at https://github.com/jonaselgammal/GPry.
    Subgraph nomination: Query by Example Subgraph Retrieval in Networks. (arXiv:2101.12430v2 [cs.LG] UPDATED)
    This paper introduces the subgraph nomination inference task, in which example subgraphs of interest are used to query a network for similarly interesting subgraphs. This type of problem appears time and again in real-world problems connected to, for example, user recommendation systems and structural retrieval tasks in social and biological/connectomic networks. We formally define the subgraph nomination framework with an emphasis on the notion of a user-in-the-loop in the subgraph nomination pipeline. In this setting, a user can provide additional post-nomination light supervision that can be incorporated into the retrieval task. After introducing and formalizing the retrieval task, we examine the nuanced effect that user supervision can have on performance, both analytically and across real and simulated data examples.
    Censored Quantile Regression Neural Networks for Distribution-Free Survival Analysis. (arXiv:2205.13496v3 [stat.ML] UPDATED)
    This paper considers doing quantile regression on censored data using neural networks (NNs). This adds to the survival analysis toolkit by allowing direct prediction of the target variable, along with a distribution-free characterisation of uncertainty, using a flexible function approximator. We begin by showing how an algorithm popular in linear models can be applied to NNs. However, the resulting procedure is inefficient, requiring sequential optimisation of an individual NN at each desired quantile. Our major contribution is a novel algorithm that simultaneously optimises a grid of quantiles output by a single NN. To offer theoretical insight into our algorithm, we show firstly that it can be interpreted as a form of expectation-maximisation, and secondly that it exhibits a desirable `self-correcting' property. Experimentally, the algorithm produces quantiles that are better calibrated than existing methods on 10 out of 12 real datasets.  ( 2 min )
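    A minimal sketch of the shared-grid objective described above: one network outputs a whole grid of quantiles, trained with the pinball (quantile) loss; the censoring-specific handling from the paper is omitted, and the data below is a random stand-in.

```python
# Simultaneously fitting a grid of quantiles with one network via pinball loss.
import torch
import torch.nn as nn

quantiles = torch.tensor([0.1, 0.3, 0.5, 0.7, 0.9])
net = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, len(quantiles)))

def pinball_loss(pred, target, q):
    # pred: (batch, Q), target: (batch, 1), q: (Q,); broadcasts over quantiles.
    err = target - pred
    return torch.maximum(q * err, (q - 1) * err).mean()

x, t = torch.randn(128, 8), torch.randn(128, 1)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for _ in range(100):
    opt.zero_grad()
    loss = pinball_loss(net(x), t, quantiles)
    loss.backward()
    opt.step()
```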
    Convergence of gradient descent for deep neural networks. (arXiv:2203.16462v4 [cs.LG] UPDATED)
    This article presents a criterion for convergence of gradient descent to a global minimum, which is then used to show that gradient descent with proper initialization converges to a global minimum when training any feedforward neural network with smooth and strictly increasing activation functions, provided that the input dimension is greater than or equal to the number of data points. The main difference with prior work is that the width of the network can be a fixed number instead of growing as some multiple or power of the number of data points.
    Near-optimal Policy Identification in Active Reinforcement Learning. (arXiv:2212.09510v1 [stat.ML])
    Many real-world reinforcement learning tasks require control of complex dynamical systems that involve both costly data acquisition processes and large state spaces. In cases where the transition dynamics can be readily evaluated at specified states (e.g., via a simulator), agents can operate in what is often referred to as planning with a \emph{generative model}. We propose the AE-LSVI algorithm for best-policy identification, a novel variant of the kernelized least-squares value iteration (LSVI) algorithm that combines optimism with pessimism for active exploration (AE). AE-LSVI provably identifies a near-optimal policy \emph{uniformly} over an entire state space and achieves polynomial sample complexity guarantees that are independent of the number of states. When specialized to the recently introduced offline contextual Bayesian optimization setting, our algorithm achieves improved sample complexity bounds. Experimentally, we demonstrate that AE-LSVI outperforms other RL algorithms in a variety of environments when robustness to the initial state is required.
    Ordinal Causal Discovery. (arXiv:2201.07396v3 [stat.ME] UPDATED)
    Causal discovery for purely observational, categorical data is a long-standing challenging problem. Unlike continuous data, the vast majority of existing methods for categorical data focus on inferring the Markov equivalence class only, which leaves the direction of some causal relationships undetermined. This paper proposes an identifiable ordinal causal discovery method that exploits the ordinal information contained in many real-world applications to uniquely identify the causal structure. The proposed method is applicable beyond ordinal data via data discretization. Through real-world and synthetic experiments, we demonstrate that the proposed ordinal causal discovery method combined with simple score-and-search algorithms has favorable and robust performance compared to state-of-the-art alternative methods in both ordinal categorical and non-categorical data. An accompanying R package, OrdCD, is freely available on CRAN and at https://web.stat.tamu.edu/~yni/files/OrdCD_1.0.0.tar.gz.
    Gradient Descent-Type Methods: Background and Simple Unified Convergence Analysis. (arXiv:2212.09413v1 [math.OC])
    In this book chapter, we briefly describe the main components that constitute the gradient descent method and its accelerated and stochastic variants. We aim to explain these components from a mathematical point of view, covering theoretical and practical aspects, but at an elementary level. We focus on basic variants of the gradient descent method and then extend our view to recent variants, especially variance-reduced stochastic gradient descent (SGD) schemes. Our approach relies on revealing the structures present inside the problem and the assumptions imposed on the objective function. Our convergence analysis unifies several known results and relies on a general, but elementary, recursive expression. We illustrate this analysis on several common schemes.
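    As a toy companion to the chapter, here is plain gradient descent on a smooth convex quadratic with the standard 1/L step size; the objective and stopping rule are illustrative choices:

        # Gradient descent on f(x) = x'Ax/2 with step size 1/L for an L-smooth f.
        import numpy as np

        A = np.array([[3.0, 0.5], [0.5, 1.0]])      # SPD matrix, so f is convex
        L = np.linalg.eigvalsh(A).max()             # smoothness constant

        x = np.array([5.0, -3.0])
        for k in range(200):
            grad = A @ x                            # gradient of the quadratic
            x = x - (1.0 / L) * grad                # GD step
            if np.linalg.norm(grad) < 1e-8:         # simple stopping criterion
                break
        print(k, x)                                 # x converges to the minimizer 0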
    Multiple Robust Learning for Recommendation. (arXiv:2207.10796v4 [cs.IR] UPDATED)
    In recommender systems, a common problem is the presence of various biases in the collected data, which deteriorates the generalization ability of the recommendation models and leads to inaccurate predictions. Doubly robust (DR) learning has been studied in many tasks in RS, with the advantage that unbiased learning can be achieved when either a single imputation or a single propensity model is accurate. In this paper, we propose a multiple robust (MR) estimator that can take advantage of multiple candidate imputation and propensity models to achieve unbiasedness. Specifically, the MR estimator is unbiased when any of the imputation or propensity models, or a linear combination of these models, is accurate. Theoretical analysis shows that the proposed MR is an enhanced version of DR when only a single imputation and propensity model are available, and has a smaller bias. Inspired by the generalization error bound of MR, we further propose a novel multiple robust learning approach with stabilization. We conduct extensive experiments on real-world and semi-synthetic datasets, which demonstrate the superiority of the proposed approach over state-of-the-art methods.  ( 2 min )
    Gaussian Mixture Reduction with Composite Transportation Divergence. (arXiv:2002.08410v3 [stat.ML] UPDATED)
    Gaussian mixtures can approximate almost any smooth density function and are used to simplify downstream inference tasks. As such, they are widely used in applications such as density estimation, belief propagation, and Bayesian filtering. In these applications, a finite Gaussian mixture provides an initial approximation to density functions that are updated recursively. A challenge in these recursions is that the order of the Gaussian mixture increases exponentially, and the inference quickly becomes intractable. To overcome this difficulty, Gaussian mixture reduction, which approximates a high-order Gaussian mixture by one with a lower order, can be used. Existing methods such as the clustering-based approaches are renowned for their satisfactory performance and computational efficiency. However, their convergence properties and optimization targets remain unknown. We propose a novel optimization-based Gaussian mixture reduction method. We develop a majorization-minimization algorithm for its numerical computation and establish its theoretical convergence under general conditions. We show many existing clustering-based methods are special cases of ours, thus bridging the gap between optimization-based and clustering-based methods. The unified framework allows users to choose the most suitable cost function to achieve superior performance in their specific application. We demonstrate the efficiency and effectiveness of the proposed method through extensive empirical experiments.
    Variational Wasserstein Barycenters with c-Cyclical Monotonicity. (arXiv:2110.11707v2 [cs.LG] UPDATED)
    The Wasserstein barycenter, built on the theory of optimal transport, provides a powerful framework to aggregate probability distributions, and it has increasingly attracted great attention within the machine learning community. However, it suffers from a severe computational burden, especially in high-dimensional and continuous settings. To this end, we develop a novel continuous approximation method for the Wasserstein barycenter problem given sample access to the input distributions. The basic idea is to introduce a variational distribution as the approximation of the true continuous barycenter, so as to frame the barycenter computation problem as an optimization problem, where the parameters of the variational distribution adjust the proxy distribution to be similar to the barycenter. Leveraging the variational distribution, we construct a tractable dual formulation for the regularized Wasserstein barycenter problem with c-cyclical monotonicity, which can be efficiently solved by stochastic optimization. We provide theoretical analysis on convergence and demonstrate the practical effectiveness of our method on real applications of subset posterior aggregation and synthetic data.
    Spectral Regularized Kernel Two-Sample Tests. (arXiv:2212.09201v1 [math.ST])
    Over the last decade, an approach that has gained a lot of popularity for tackling non-parametric testing problems on general (i.e., non-Euclidean) domains is based on the notion of reproducing kernel Hilbert space (RKHS) embedding of probability distributions. The main goal of our work is to understand the optimality of two-sample tests constructed based on this approach. First, we show that the popular MMD (maximum mean discrepancy) two-sample test is not optimal in terms of the separation boundary measured in Hellinger distance. Second, we propose a modification to the MMD test based on spectral regularization by taking into account the covariance information (which is not captured by the MMD test) and prove the proposed test to be minimax optimal with a smaller separation boundary than that achieved by the MMD test. Third, we propose an adaptive version of the above test which involves a data-driven strategy to choose the regularization parameter and show the adaptive test to be almost minimax optimal up to a logarithmic factor. Moreover, our results hold for the permutation variant of the test where the test threshold is chosen elegantly through the permutation of the samples. Through numerical experiments on synthetic and real-world data, we demonstrate the superior performance of the proposed test in comparison to the MMD test.
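    For orientation, the baseline statistic being regularized is the MMD; a minimal unbiased estimator of squared MMD with an RBF kernel looks as follows (the spectral regularization itself is not shown):

        # Unbiased estimate of squared MMD with an RBF kernel (illustrative sketch).
        import numpy as np

        def rbf(A, B, gamma=1.0):
            d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
            return np.exp(-gamma * d2)

        def mmd2_unbiased(X, Y, gamma=1.0):
            Kxx, Kyy, Kxy = rbf(X, X, gamma), rbf(Y, Y, gamma), rbf(X, Y, gamma)
            n, m = len(X), len(Y)
            np.fill_diagonal(Kxx, 0.0)              # drop diagonals for unbiasedness
            np.fill_diagonal(Kyy, 0.0)
            return (Kxx.sum() / (n * (n - 1)) + Kyy.sum() / (m * (m - 1))
                    - 2.0 * Kxy.mean())

        rng = np.random.default_rng(0)
        X, Y = rng.normal(0, 1, (100, 2)), rng.normal(0.5, 1, (100, 2))
        print(mmd2_unbiased(X, Y))                  # positive under a mean shift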
    Online Lewis Weight Sampling. (arXiv:2207.08268v3 [cs.DS] UPDATED)
    The seminal work of Cohen and Peng introduced Lewis weight sampling to the theoretical computer science community, yielding fast row sampling algorithms for approximating $d$-dimensional subspaces of $\ell_p$ up to $(1+\epsilon)$ error. Several works have extended this important primitive to other settings, including the online coreset and sliding window models. However, these results are only for $p\in\{1,2\}$, and results for $p=1$ require suboptimal $\tilde O(d^2/\epsilon^2)$ samples. In this work, we design the first nearly optimal $\ell_p$ subspace embeddings for all $p\in(0,\infty)$ in the online coreset and sliding window models. In both models, our algorithms store $\tilde O(d^{1\lor(p/2)}/\epsilon^2)$ rows. This answers a substantial generalization of the main open question of [BDMMUWZ2020], and gives the first results for all $p\notin\{1,2\}$. Towards our result, we give the first analysis of "one-shot" Lewis weight sampling, which samples rows proportionally to their Lewis weights, with sample complexity $\tilde O(d^{p/2}/\epsilon^2)$ for $p>2$. Previously, this scheme was only known to have sample complexity $\tilde O(d^{p/2}/\epsilon^5)$, whereas $\tilde O(d^{p/2}/\epsilon^2)$ is known if a more sophisticated recursive sampling is used. The recursive sampling cannot be implemented online, thus necessitating an analysis of one-shot Lewis weight sampling. Our analysis uses a novel connection to online numerical linear algebra. As an application, we obtain the first one-pass streaming coreset algorithms for $(1+\epsilon)$ approximation of important generalized linear models, such as logistic regression and $p$-probit regression. Our upper bounds are parameterized by a complexity parameter $\mu$ introduced by [MSSW2018], and we show the first lower bounds showing that a linear dependence on $\mu$ is necessary.
    The Multimarginal Optimal Transport Formulation of Adversarial Multiclass Classification. (arXiv:2204.12676v2 [cs.LG] UPDATED)
    We study a family of adversarial multiclass classification problems and provide equivalent reformulations in terms of: 1) a family of generalized barycenter problems introduced in the paper and 2) a family of multimarginal optimal transport problems where the number of marginals is equal to the number of classes in the original classification problem. These new theoretical results reveal a rich geometric structure of adversarial learning problems in multiclass classification and extend recent results restricted to the binary classification setting. A direct computational implication of our results is that by solving either the barycenter problem and its dual, or the MOT problem and its dual, we can recover the optimal robust classification rule and the optimal adversarial strategy for the original adversarial problem. Examples with synthetic and real data illustrate our results.
    Collaborative Algorithms for Online Personalized Mean Estimation. (arXiv:2208.11530v2 [cs.LG] UPDATED)
    We consider an online estimation problem involving a set of agents. Each agent has access to a (personal) process that generates samples from a real-valued distribution and seeks to estimate its mean. We study the case where some of the distributions have the same mean, and the agents are allowed to actively query information from other agents. The goal is to design an algorithm that enables each agent to improve its mean estimate thanks to communication with other agents. The means, as well as the number of distributions with the same mean, are unknown, which makes the task nontrivial. We introduce a novel collaborative strategy to solve this online personalized mean estimation problem. We analyze its time complexity and introduce variants that enjoy good performance in numerical experiments. We also extend our approach to the setting where clusters of agents with similar means seek to estimate the mean of their cluster.  ( 2 min )
    Meta-Learning Priors for Safe Bayesian Optimization. (arXiv:2210.00762v2 [cs.LG] UPDATED)
    In robotics, optimizing controller parameters under safety constraints is an important challenge. Safe Bayesian optimization (BO) quantifies uncertainty in the objective and constraints to safely guide exploration in such settings. Hand-designing a suitable probabilistic model can be challenging, however. In the presence of unknown safety constraints, it is crucial to choose reliable model hyper-parameters to avoid safety violations. Here, we propose a data-driven approach to this problem by meta-learning priors for safe BO from offline data. We build on a meta-learning algorithm, F-PACOH, capable of providing reliable uncertainty quantification in settings of data scarcity. As our core contribution, we develop a novel framework for choosing safety-compliant priors in a data-driven manner via empirical uncertainty metrics and a frontier search algorithm. On benchmark functions and a high-precision motion system, we demonstrate that our meta-learned priors accelerate the convergence of safe BO approaches while maintaining safety.
    Priority to unemployed immigrants? A causal machine learning evaluation of training in Belgium. (arXiv:1912.12864v4 [econ.EM] UPDATED)
    Based on administrative data on the unemployed in Belgium, we estimate the labour market effects of three training programmes at various aggregation levels using Modified Causal Forests, a causal machine learning estimator. While all programmes have positive effects after the lock-in period, we find substantial heterogeneity across programmes and across the unemployed. Simulations show that 'black-box' rules that reassign the unemployed to programmes that maximise estimated individual gains can considerably improve effectiveness: up to 20 percent more (less) time spent in (un)employment within a 30-month window. A shallow policy tree delivers a simple rule that realizes about 70 percent of this gain.
    Automatic quality control framework for more reliable integration of machine learning-based image segmentation into medical workflows. (arXiv:2112.03277v2 [eess.IV] UPDATED)
    Machine learning algorithms underpin modern diagnostic-aiding software, which has proved valuable in clinical practice, particularly in radiology. However, inaccuracies, mainly due to the limited availability of clinical samples for training these algorithms, hamper their wider applicability, acceptance, and recognition amongst clinicians. We present an analysis of state-of-the-art automatic quality control (QC) approaches that can be implemented within these algorithms to estimate the certainty of their outputs. We validated the most promising approaches on a brain image segmentation task identifying white matter hyperintensities (WMH) in magnetic resonance imaging data. WMH are a correlate of small vessel disease common in mid-to-late adulthood and are particularly challenging to segment due to their varied size and distributional patterns. Our results show that the aggregation of uncertainty and Dice prediction were most effective in failure detection for this task. Both methods independently improved mean Dice from 0.82 to 0.84. Our work reveals how QC methods can help to detect failed segmentation cases and therefore make automatic segmentation more reliable and suitable for clinical practice.
    Quasi-parametric rates for Sparse Multivariate Functional Principal Components Analysis. (arXiv:2212.09434v1 [stat.ME])
    This work aims to give non-asymptotic results for estimating the first principal component of a multivariate random process. We first define the covariance function and the covariance operator in the multivariate case. We then define a projection operator. This operator can be seen as a reconstruction step from the raw data in the functional data analysis context. Next, we show that the eigenelements can be expressed as the solution to an optimization problem, and we introduce the LASSO variant of this optimization problem and the associated plugin estimator. Finally, we assess the estimator's accuracy. We establish a minimax lower bound on the mean square reconstruction error of the eigenelement, which proves that the procedure has an optimal variance in the minimax sense.
    Probabilistic machine learning based predictive and interpretable digital twin for dynamical systems. (arXiv:2212.09240v1 [stat.ML])
    A framework for creating and updating digital twins for dynamical systems from a library of physics-based functions is proposed. Sparse Bayesian machine learning is used to update and derive an interpretable expression for the digital twin. Two approaches for updating the digital twin are proposed. The first approach makes use of both the input and output information from a dynamical system, whereas the second approach utilizes output-only observations to update the digital twin. Both methods use a library of candidate functions representing certain physics to infer new perturbation terms in the existing digital twin model. In both cases, the resulting expressions of the updated digital twins are identical, and in addition, the epistemic uncertainties are quantified. In the first approach, the regression problem is derived from a state-space model, whereas in the latter case, the output-only information is treated as a stochastic process. The concepts of It\^o calculus and the Kramers-Moyal expansion are utilized to derive the regression equation. The performance of the proposed approaches is demonstrated using highly nonlinear dynamical systems such as the crack-degradation problem. The numerical results demonstrated in this paper identify, almost exactly, the correct perturbation terms along with their associated parameters in the dynamical system. The probabilistic nature of the proposed approach also helps in quantifying the uncertainties associated with the updated models. The proposed approaches provide an exact and explainable description of the perturbations in digital twin models, which can be directly used for better cyber-physical integration, long-term future predictions, degradation monitoring, and model-agnostic control.
    A Probabilistic Framework for Lifelong Test-Time Adaptation. (arXiv:2212.09713v1 [cs.LG])
    Test-time adaptation is the problem of adapting a source pre-trained model using test inputs from a target domain without access to source domain data. Most of the existing approaches address the setting in which the target domain is stationary. Moreover, these approaches are prone to making erroneous predictions with unreliable uncertainty estimates when distribution shifts occur. Hence, test-time adaptation in the face of non-stationary target domain shift becomes a problem of significant interest. To address these issues, we propose a principled approach, PETAL (Probabilistic lifElong Test-time Adaptation with seLf-training prior), which looks into this problem from a probabilistic perspective using a partly data-dependent prior. A student-teacher framework, where the teacher model is an exponential moving average of the student model, naturally emerges from this probabilistic perspective. In addition, the knowledge from the posterior distribution obtained for the source task acts as a regularizer. To handle catastrophic forgetting in the long term, we also propose a data-driven model parameter resetting mechanism based on the Fisher information matrix (FIM). Moreover, improvements in experimental results suggest that FIM-based data-driven parameter restoration helps reduce error accumulation and maintain knowledge of recent domains by restoring only the irrelevant parameters. In terms of predictive error rate as well as uncertainty-based metrics such as Brier score and negative log-likelihood, our method achieves better results than the current state-of-the-art for online lifelong test-time adaptation across various benchmarks, such as the CIFAR-10C, CIFAR-100C, ImageNetC, and ImageNet3DCC datasets.
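    The EMA teacher update that emerges from this perspective is a one-liner per parameter; the sketch below is a generic PyTorch version (PETAL's data-dependent prior and FIM-based resets are not shown):

        # Exponential-moving-average teacher for a student-teacher framework.
        import copy
        import torch

        def ema_update(teacher, student, decay=0.999):
            with torch.no_grad():
                for p_t, p_s in zip(teacher.parameters(), student.parameters()):
                    p_t.mul_(decay).add_(p_s, alpha=1.0 - decay)  # EMA of weights

        student = torch.nn.Linear(10, 3)
        teacher = copy.deepcopy(student)            # initialise teacher = student
        for p in teacher.parameters():
            p.requires_grad_(False)                 # teacher is never trained directly
        # ... after each adaptation step on the student:
        ema_update(teacher, student)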
    Energy-Based Models for Continual Learning. (arXiv:2011.12216v3 [cs.LG] UPDATED)
    We motivate Energy-Based Models (EBMs) as a promising model class for continual learning problems. Instead of tackling continual learning via the use of external memory, growing models, or regularization, EBMs change the underlying training objective to cause less interference with previously learned information. Our proposed version of EBMs for continual learning is simple, efficient, and outperforms baseline methods by a large margin on several benchmarks. Moreover, our proposed contrastive divergence-based training objective can be combined with other continual learning methods, resulting in substantial boosts in their performance. We further show that EBMs are adaptable to a more general continual learning setting where the data distribution changes without the notion of explicitly delineated tasks. These observations point towards EBMs as a useful building block for future continual learning methods.
    Faithful Heteroscedastic Regression with Neural Networks. (arXiv:2212.09184v1 [cs.LG])
    Heteroscedastic regression models a Gaussian variable's mean and variance as a function of covariates. Parametric methods that employ neural networks for these parameter maps can capture complex relationships in the data. Yet, optimizing network parameters via log-likelihood gradients can yield suboptimal mean and uncalibrated variance estimates. Current solutions side-step this optimization problem with surrogate objectives or Bayesian treatments. Instead, we make two simple modifications to optimization. Notably, their combination produces a heteroscedastic model with mean estimates that are provably as accurate as those from its homoscedastic counterpart (i.e., fitting the mean under squared error loss). For a wide variety of network and task complexities, we find that mean estimates from existing heteroscedastic solutions can be significantly less accurate than those from an equivalently expressive mean-only model. Our approach provably retains the accuracy of an equally flexible mean-only model while also offering best-in-class variance calibration. Lastly, we show how to leverage our method to recover the underlying heteroscedastic noise variance.
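    One simple construction in this spirit (an assumption for illustration; not necessarily the paper's exact pair of modifications) trains the mean head with plain squared error and the variance head with a Gaussian NLL on detached residuals, so the mean fit matches the homoscedastic counterpart by construction:

        # Illustrative loss: mean trained as in homoscedastic regression,
        # variance trained on residuals with no gradient back into the mean.
        import torch

        def faithful_loss(mu, log_var, y):
            mse = ((y - mu) ** 2).mean()            # mean head: plain squared error
            resid2 = (y - mu.detach()) ** 2         # residuals, gradient to mean cut
            nll_var = (log_var + resid2 / log_var.exp()).mean()  # variance head only
            return mse + 0.5 * nll_var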
    Learning criteria going beyond the usual risk. (arXiv:2110.04996v2 [stat.ML] UPDATED)
    Virtually all machine learning tasks are characterized using some form of loss function, and "good performance" is typically stated in terms of a sufficiently small average loss, taken over the random draw of test data. While optimizing for performance on average is intuitive, convenient to analyze in theory, and easy to implement in practice, such a choice brings about trade-offs. In this work, we survey and introduce a wide variety of non-traditional criteria used to design and evaluate machine learning algorithms, place the classical paradigm within the proper historical context, and propose a view of learning problems which emphasizes the question of "what makes for a desirable loss distribution?" in place of tacit use of the expected loss.
    Exploring Optimal Substructure for Out-of-distribution Generalization via Feature-targeted Model Pruning. (arXiv:2212.09458v1 [cs.LG])
    Recent studies show that even highly biased dense networks contain an unbiased substructure that can achieve better out-of-distribution (OOD) generalization than the original model. Existing works usually search for the invariant subnetwork using modular risk minimization (MRM) with out-domain data. Such a paradigm may bring about two potential weaknesses: 1) unfairness, due to the insufficient observation of out-domain data during training; and 2) sub-optimal OOD generalization, due to the feature-untargeted model pruning on the whole data distribution. In this paper, we propose a novel Spurious Feature-targeted model Pruning framework, dubbed SFP, to automatically explore invariant substructures without incurring the above weaknesses. Specifically, SFP identifies in-distribution (ID) features during training using our theoretically verified task loss, upon which SFP can perform ID-targeted model pruning that removes branches with strong dependencies on ID features. Notably, by attenuating the projections of spurious features into model space, SFP can push model learning toward invariant features and away from environmental features, achieving optimal OOD generalization. Moreover, we also conduct a detailed theoretical analysis to provide a rationality guarantee and a proof framework for OOD structures via model sparsity, and, for the first time, reveal how a highly biased data distribution affects the model's OOD generalization. Extensive experiments on various OOD datasets show that SFP can significantly outperform both structure-based and non-structure OOD generalization SOTAs, with accuracy improvements of up to 4.72% and 23.35%, respectively.
    Agile Effort Estimation: Have We Solved the Problem Yet? Insights From A Replication Study. (arXiv:2201.05401v2 [cs.SE] UPDATED)
    In the last decade, several studies have explored automated techniques to estimate the effort of agile software development. We perform a close replication and extension of a seminal work proposing the use of Deep Learning for Agile Effort Estimation (namely Deep-SE), which has set the state-of-the-art since. Specifically, we replicate three of the original research questions aiming at investigating the effectiveness of Deep-SE for both within-project and cross-project effort estimation. We benchmark Deep-SE against three baselines (i.e., Random, Mean and Median effort estimators) and a previously proposed method to estimate agile software project development effort (dubbed TF/IDF-SVM), as done in the original study. To this end, we use the data from the original study and an additional dataset of 31,960 issues mined from TAWOS, as using more data allows us to strengthen the confidence in the results and further mitigate external validity threats. The results of our replication show that Deep-SE outperforms the Median baseline estimator and TF/IDF-SVM in only very few cases with statistical significance (8/42 and 9/32 cases, respectively), thus confounding previous findings on the efficacy of Deep-SE. The two additional RQs revealed that neither augmenting the training set nor pre-training Deep-SE leads to an improvement in its accuracy and convergence speed. These results suggest that using semantic similarity is not enough to differentiate user stories with respect to their story points; thus, future work has yet to explore and find new techniques and features to obtain accurate agile software development estimates.
    Robust Bayesian Inference for Simulator-based Models via the MMD Posterior Bootstrap. (arXiv:2202.04744v3 [stat.ME] UPDATED)
    Simulator-based models are models for which the likelihood is intractable but simulation of synthetic data is possible. They are often used to describe complex real-world phenomena, and as such can often be misspecified in practice. Unfortunately, existing Bayesian approaches for simulators are known to perform poorly in those cases. In this paper, we propose a novel algorithm based on the posterior bootstrap and maximum mean discrepancy estimators. This leads to a highly parallelisable Bayesian inference algorithm with strong robustness properties. This is demonstrated through an in-depth theoretical study which includes generalisation bounds and proofs of frequentist consistency and robustness of our posterior. The approach is then assessed on a range of examples including a g-and-k distribution and a toggle-switch model.
    Influence-Based Mini-Batching for Graph Neural Networks. (arXiv:2212.09083v1 [cs.LG])
    Using graph neural networks for large graphs is challenging since there is no clear way of constructing mini-batches. To solve this, previous methods have relied on sampling or graph clustering. While these approaches often lead to good training convergence, they introduce significant overhead due to expensive random data accesses and perform poorly during inference. In this work we instead focus on model behavior during inference. We theoretically model batch construction via maximizing the influence score of nodes on the outputs. This formulation leads to optimal approximation of the output when we do not have knowledge of the trained model. We call the resulting method influence-based mini-batching (IBMB). IBMB accelerates inference by up to 130x compared to previous methods that reach similar accuracy. Remarkably, with adaptive optimization and the right training schedule, IBMB can also substantially accelerate training, thanks to precomputed batches and consecutive memory accesses. This results in up to 18x faster training per epoch and up to 17x faster convergence per runtime compared to previous methods.  ( 2 min )
    TCFimt: Temporal Counterfactual Forecasting from Individual Multiple Treatment Perspective. (arXiv:2212.08890v1 [cs.LG])
    Determining the causal effects of temporal multi-intervention assists decision-making. Restricted by time-varying bias, selection bias, and interactions of multiple interventions, the disentanglement and estimation of multiple treatment effects from individual temporal data is still rare. To tackle these challenges, we propose a comprehensive framework of temporal counterfactual forecasting from an individual multiple treatment perspective (TCFimt). TCFimt constructs adversarial tasks in a seq2seq framework to alleviate selection and time-varying bias and designs a contrastive learning-based block to decouple a mixed treatment effect into separated main treatment effects and causal interactions, which further improves estimation accuracy. In experiments on two real-world datasets from distinct fields, the proposed method shows better performance than state-of-the-art methods in predicting future outcomes under specific treatments and in choosing the optimal treatment type and timing.  ( 2 min )
    Rank-1 Matrix Completion with Gradient Descent and Small Random Initialization. (arXiv:2212.09396v1 [stat.ML])
    The nonconvex formulation of the matrix completion problem has received significant attention in recent years due to its affordable complexity compared to the convex formulation. Gradient descent (GD) is the simplest, yet efficient, baseline algorithm for solving nonconvex optimization problems. The success of GD has been witnessed in many different problems in both theory and practice when it is combined with random initialization. However, previous works on matrix completion require either careful initialization or regularizers to prove the convergence of GD. In this work, we study rank-1 symmetric matrix completion and prove that GD converges to the ground truth when small random initialization is used. We show that in a logarithmic number of iterations, the trajectory enters the region where local convergence occurs. We provide an upper bound on the initialization size that is sufficient to guarantee convergence, and show that a larger initialization can be used as more samples are available. We observe that the implicit regularization effect of GD plays a critical role in the analysis, and for the entire trajectory, it prevents each entry from becoming much larger than the others.  ( 2 min )
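    The analysed procedure is simple enough to reproduce in a few lines of NumPy; the sampling rate, step size, and iteration count below are illustrative choices:

        # Toy rank-1 symmetric matrix completion via GD with small random init.
        import numpy as np

        rng = np.random.default_rng(0)
        n = 50
        x_star = rng.normal(size=n)
        M = np.outer(x_star, x_star)                # ground-truth rank-1 matrix
        mask = rng.random((n, n)) < 0.3
        mask = np.triu(mask) | np.triu(mask).T      # symmetric observation mask

        x = 1e-3 * rng.normal(size=n)               # small random initialization
        eta = 0.2 / (mask.mean() * n)               # heuristic step size
        for _ in range(3000):
            R = mask * (np.outer(x, x) - M)         # residual on observed entries
            x = x - eta * 2.0 * (R @ x)             # grad of ||P_Omega(xx' - M)||_F^2 / 2
        print(np.abs(np.outer(x, x) - M).max())     # recovery error (sign of x is free)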
    Riemannian Optimization for Variance Estimation in Linear Mixed Models. (arXiv:2212.09081v1 [stat.ML])
    Variance parameter estimation in linear mixed models is a challenge for many classical nonlinear optimization algorithms due to the positive-definiteness constraint of the random effects covariance matrix. We take a completely novel view of parameter estimation in linear mixed models by exploiting the intrinsic geometry of the parameter space. We formulate the problem of residual maximum likelihood estimation as an optimization problem on a Riemannian manifold. Based on the introduced formulation, we provide geometric higher-order information on the problem via the Riemannian gradient and the Riemannian Hessian. We then test our approach numerically with Riemannian optimization algorithms. Our approach yields higher-quality variance parameter estimates than existing approaches.  ( 2 min )
    Quantum policy gradient algorithms. (arXiv:2212.09328v1 [quant-ph])
    Understanding the power and limitations of quantum access to data in machine learning tasks is essential for assessing the potential of quantum computing in artificial intelligence. Previous works have already shown that speed-ups in learning are possible when given quantum access to reinforcement learning environments. Yet, the applicability of quantum algorithms in this setting remains very limited, notably in environments with large state and action spaces. In this work, we design quantum algorithms to train state-of-the-art reinforcement learning policies by exploiting quantum interactions with an environment. However, these algorithms only offer full quadratic speed-ups in sample complexity over their classical analogs when the trained policies satisfy some regularity conditions. Interestingly, we find that reinforcement learning policies derived from parametrized quantum circuits are well-behaved with respect to these conditions, which showcases the benefit of a fully quantum reinforcement learning framework.  ( 2 min )
    A Permutation-Free Kernel Independence Test. (arXiv:2212.09108v1 [stat.ME])
    In nonparametric independence testing, we observe i.i.d. data $\{(X_i,Y_i)\}_{i=1}^n$, where $X \in \mathcal{X}, Y \in \mathcal{Y}$ lie in any general spaces, and we wish to test the null that $X$ is independent of $Y$. Modern test statistics such as the kernel Hilbert-Schmidt Independence Criterion (HSIC) and Distance Covariance (dCov) have intractable null distributions due to the degeneracy of the underlying U-statistics. Thus, in practice, one often resorts to using permutation testing, which provides a nonasymptotic guarantee at the expense of recalculating the quadratic-time statistics (say) a few hundred times. This paper provides a simple but nontrivial modification of HSIC and dCov (called xHSIC and xdCov, pronounced "cross" HSIC/dCov) so that they have a limiting Gaussian distribution under the null, and thus do not require permutations. This requires building on the newly developed theory of cross U-statistics by Kim and Ramdas (2020), and in particular developing several nontrivial extensions of the theory in Shekhar et al. (2022), which developed an analogous permutation-free kernel two-sample test. We show that our new tests, like the originals, are consistent against fixed alternatives, and minimax rate optimal against smooth local alternatives. Numerical simulations demonstrate that compared to the full dCov or HSIC, our variants have the same power up to a $\sqrt 2$ factor, giving practitioners a new option for large problems or data-analysis pipelines where computation, not sample size, could be the bottleneck.  ( 2 min )
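    For reference, the classic (biased) HSIC statistic that the "cross" variant modifies can be computed as follows; the permutation-free xHSIC construction itself is not reproduced here:

        # Classic (biased) HSIC with RBF kernels (sketch for reference only).
        import numpy as np

        def rbf_gram(Z, gamma=1.0):
            d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
            return np.exp(-gamma * d2)

        def hsic(X, Y, gamma=1.0):
            n = len(X)
            H = np.eye(n) - np.ones((n, n)) / n     # centering matrix
            K, L = rbf_gram(X, gamma), rbf_gram(Y, gamma)
            return np.trace(K @ H @ L @ H) / (n - 1) ** 2

        rng = np.random.default_rng(0)
        X = rng.normal(size=(200, 1))
        Y = X ** 2 + 0.1 * rng.normal(size=(200, 1))       # nonlinear dependence
        print(hsic(X, Y), hsic(X, rng.normal(size=(200, 1))))  # large vs. near zero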
    Latent Variable Representation for Reinforcement Learning. (arXiv:2212.08765v1 [cs.LG])
    Deep latent variable models have achieved significant empirical successes in model-based reinforcement learning (RL) due to their expressiveness in modeling complex transition dynamics. On the other hand, it remains unclear theoretically and empirically how latent variable models may facilitate learning, planning, and exploration to improve the sample efficiency of RL. In this paper, we provide a representation view of the latent variable models for state-action value functions, which allows both a tractable variational learning algorithm and an effective implementation of the optimism/pessimism principle in the face of uncertainty for exploration. In particular, we propose a computationally efficient planning algorithm with UCB exploration by incorporating kernel embeddings of latent variable models. Theoretically, we establish the sample complexity of the proposed approach in the online and offline settings. Empirically, we demonstrate superior performance over current state-of-the-art algorithms across various benchmarks.  ( 2 min )
    Managing Temporal Resolution in Continuous Value Estimation: A Fundamental Trade-off. (arXiv:2212.08949v1 [cs.LG])
    A default assumption in reinforcement learning and optimal control is that experience arrives at discrete time points on a fixed clock cycle. Many applications, however, involve continuous systems where the time discretization is not fixed but instead can be managed by a learning algorithm. By analyzing Monte-Carlo value estimation for LQR systems in both finite-horizon and infinite-horizon settings, we uncover a fundamental trade-off between approximation and statistical error in value estimation. Importantly, these two errors behave differently with respect to time discretization, which implies that there is an optimal choice for the temporal resolution that depends on the data budget. These findings show how adapting the temporal resolution can provably improve value estimation quality in LQR systems from finite data. Empirically, we demonstrate the trade-off in numerical simulations of LQR instances and several non-linear environments.  ( 2 min )
    Two-Scale Gradient Descent Ascent Dynamics Finds Mixed Nash Equilibria of Continuous Games: A Mean-Field Perspective. (arXiv:2212.08791v1 [math.OC])
    Finding the mixed Nash equilibria (MNE) of a two-player zero-sum continuous game is an important and challenging problem in machine learning. A canonical algorithm for finding the MNE is the noisy gradient descent ascent method, which in the infinite-particle limit gives rise to the {\em Mean-Field Gradient Descent Ascent} (GDA) dynamics on the space of probability measures. In this paper, we first study the convergence of a two-scale Mean-Field GDA dynamics for finding the MNE of the entropy-regularized objective. More precisely, we show that for any fixed positive temperature (or regularization parameter), the two-scale Mean-Field GDA with a {\em finite} scale ratio converges exponentially to the unique MNE without assuming the convexity or concavity of the interaction potential. The key ingredient of our proof lies in the construction of new Lyapunov functions that dissipate exponentially along the Mean-Field GDA. We further study the simulated annealing of the Mean-Field GDA dynamics. We show that with a temperature schedule that decays logarithmically in time, the annealed Mean-Field GDA converges to the MNE of the original unregularized objective function.  ( 2 min )
    Prediction of Auto Insurance Risk Based on t-SNE Dimensionality Reduction. (arXiv:2212.09385v1 [cs.AI])
    Correct scoring of a driver's risk is of great significance to auto insurance companies. While the current tools used in this field have been proven in practice to be quite efficient and beneficial, we argue that there is still a lot of room for development and improvement in the auto insurance risk estimation process. To this end, we develop a framework based on a combination of a neural network together with a dimensionality reduction technique, t-SNE (t-distributed stochastic neighbour embedding). This enables us to visually represent the complex structure of the risk as a two-dimensional surface, while still preserving the properties of the local region in the feature space. The obtained results, which are based on real insurance data, reveal a clear contrast between the high- and low-risk policyholders, and indeed improve upon the actual risk estimation performed by the insurer. Due to the visual accessibility of the portfolio in this approach, we argue that this framework could be advantageous to the auto insurer, both as a main risk prediction tool and as an additional validation stage in other approaches.  ( 2 min )
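    The two-stage pipeline (features in, 2-D embedding out) is straightforward to sketch with scikit-learn; the features and risk label below are synthetic placeholders, not the insurer's data:

        # Project policy features to 2-D with t-SNE, then color by risk label.
        import numpy as np
        from sklearn.manifold import TSNE

        rng = np.random.default_rng(0)
        features = rng.normal(size=(1000, 12))      # stand-in for policy features
        risk = (features[:, 0] + features[:, 1] ** 2 > 1).astype(float)  # toy label

        emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(features)
        # Plotting emb colored by `risk` yields the 2-D risk surface described above.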
    Asymptotics of Network Embeddings Learned via Subsampling. (arXiv:2107.02363v3 [stat.ML] UPDATED)
    Network data are ubiquitous in modern machine learning, with tasks of interest including node classification, node clustering and link prediction. A frequent approach begins by learning a Euclidean embedding of the network, to which algorithms developed for vector-valued data are applied. For large networks, embeddings are learned using stochastic gradient methods where the sub-sampling scheme can be freely chosen. Despite the strong empirical performance of such methods, they are not well understood theoretically. Our work encapsulates representation methods using a subsampling approach, such as node2vec, into a single unifying framework. We prove, under the assumption that the graph is exchangeable, that the distribution of the learned embedding vectors asymptotically decouples. Moreover, we characterize the asymptotic distribution and provide rates of convergence in terms of the latent parameters, including the choice of loss function and the embedding dimension. This provides a theoretical foundation to understand what the embedding vectors represent and how well these methods perform on downstream tasks. Notably, we observe that typically used loss functions may lead to shortcomings, such as a lack of Fisher consistency.  ( 2 min )
    Optimal Individualized Decision-Making with Proxies. (arXiv:2212.09494v1 [stat.ME])
    A common concern when a policy-maker draws causal inferences and makes decisions from observational data is that the measured covariates are insufficiently rich to account for all sources of confounding, i.e., the standard unconfoundedness assumption fails to hold. The recently proposed proximal causal inference framework shows that proxy variables can be leveraged to identify causal effects and therefore facilitate decision-making. Building upon this line of work, we propose a novel optimal individualized treatment regime based on so-called outcome-inducing and treatment-inducing confounding bridges. We then show that the value function of this new optimal treatment regime is superior to that of existing ones in the literature. Theoretical guarantees, including identification, superiority, and an excess value bound for the estimated regime, are established. Moreover, we demonstrate the proposed optimal regime via numerical experiments and a real data application.  ( 2 min )
    Machine Learning Assessment: implications to cybersecurity. (arXiv:1907.12851v5 [stat.ML] UPDATED)
    This chapter is dedicated to the assessment and performance estimation of machine learning (ML) algorithms, a topic that is equally important to the construction of these algorithms, in particular in the context of cyberphysical security design. The literature is full of nonparametric methods to estimate a statistic from just one available dataset through resampling techniques, e.g., the jackknife, bootstrap, and cross-validation (CV). Special statistics of great interest are the error rate and the area under the ROC curve (AUC) of a classification rule. The importance of these resampling methods stems from the fact that they require no knowledge about the probability distribution of the data or the construction details of the ML algorithm. This chapter provides a concise review of this literature to establish a coherent theoretical framework for these methods that can estimate both the error rate (a one-sample statistic) and the AUC (a two-sample statistic). The resampling methods are usually computationally expensive, because they rely on repeating the training and testing of a ML algorithm after each resampling iteration. Therefore, the practical applicability of some of these methods may be limited to the traditional ML algorithms rather than the very computationally demanding approaches of the recent deep neural networks (DNN). In the field of cyberphysical security, many applications generate structured (tabular) data, which can be fed to all traditional ML approaches. This is in contrast to the DNN approaches, which favor unstructured data, e.g., images, text, voice, etc.; hence, the relevance of this chapter to this field.  ( 2 min )
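    As a concrete instance of the resampling ideas reviewed in the chapter, here is a minimal out-of-bag bootstrap estimate of a classifier's error rate (the data and classifier are placeholders):

        # Out-of-bag bootstrap estimate of a classifier's error rate.
        import numpy as np
        from sklearn.linear_model import LogisticRegression

        rng = np.random.default_rng(0)
        X = rng.normal(size=(200, 5))
        y = (X[:, 0] + X[:, 1] > 0).astype(int)     # toy labels

        errs = []
        for _ in range(100):                        # bootstrap iterations
            idx = rng.integers(0, len(X), len(X))   # resample with replacement
            oob = np.setdiff1d(np.arange(len(X)), idx)   # out-of-bag test points
            clf = LogisticRegression().fit(X[idx], y[idx])
            errs.append((clf.predict(X[oob]) != y[oob]).mean())
        print(np.mean(errs))                        # bootstrap error-rate estimate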
    Machine Learning Construction: implications to cybersecurity. (arXiv:1906.10019v4 [cs.LG] UPDATED)
    Statistical learning is the process of estimating an unknown probabilistic input-output relationship of a system using a limited number of observations. A statistical learning machine (SLM) is the algorithm, function, model, or rule that learns such a process; and machine learning (ML) is the conventional name of this field. ML and its applications are ubiquitous in the modern world. Systems such as automatic target recognition (ATR) in military applications, computer-aided diagnosis (CAD) in medical imaging, DNA microarrays in genomics, optical character recognition (OCR), speech recognition (SR), spam email filtering, stock market prediction, etc., are a few examples and applications of ML; diverse fields but one theory. In particular, ML has gained a lot of attention in the field of cyberphysical security, especially in the last decade. It is of great importance to this field to design detection algorithms that have the capability of learning from security data to be able to hunt threats, achieve better monitoring, master the complexity of threat intelligence feeds, and achieve timely remediation of security incidents. The field of ML can be decomposed into two basic subfields: \textit{construction} and \textit{assessment}. By \textit{construction} we mean designing or inventing an appropriate algorithm that learns from the input data and achieves a good performance according to some optimality criterion. By \textit{assessment} we mean attributing some performance measures to the constructed ML algorithm, along with their estimators, to objectively assess this algorithm. \textit{Construction} and \textit{assessment} of a ML algorithm require familiarity with several other fields: probability, statistics, matrix theory, optimization, algorithms, and programming, among others.  ( 3 min )
    On the Complexity of Representation Learning in Contextual Linear Bandits. (arXiv:2212.09429v1 [cs.LG])
    In contextual linear bandits, the reward function is assumed to be a linear combination of an unknown reward vector and a given embedding of context-arm pairs. In practice, the embedding is often learned at the same time as the reward vector, thus leading to an online representation learning problem. Existing approaches to representation learning in contextual bandits are either very generic (e.g., model-selection techniques or algorithms for learning with arbitrary function classes) or specialized to particular structures (e.g., nested features or representations with certain spectral properties). As a result, the understanding of the cost of representation learning in contextual linear bandits is still limited. In this paper, we take a systematic approach to the problem and provide a comprehensive study through an instance-dependent perspective. We show that representation learning is fundamentally more complex than linear bandits (i.e., learning with a given representation). In particular, learning with a given set of representations is never simpler than learning with the worst realizable representation in the set, while we show cases where it can be arbitrarily harder. We complement this result with an extensive discussion of how it relates to existing literature and we illustrate positive instances where representation learning is as complex as learning with a fixed representation and where sub-logarithmic regret is achievable.  ( 2 min )
    VC dimensions of group convolutional neural networks. (arXiv:2212.09507v1 [cs.LG])
    We study the generalization capacity of group convolutional neural networks. We identify precise estimates for the VC dimensions of simple sets of group convolutional neural networks. In particular, we find that for infinite groups and appropriately chosen convolutional kernels, already two-parameter families of convolutional neural networks have an infinite VC dimension, despite being invariant to the action of an infinite group.  ( 2 min )
    Multi-Task Learning for Sparsity Pattern Heterogeneity: A Discrete Optimization Approach. (arXiv:2212.08697v1 [stat.ME])
    We extend best-subset selection to linear Multi-Task Learning (MTL), where a set of linear models are jointly trained on a collection of datasets ("tasks"). Allowing the regression coefficients of tasks to have different sparsity patterns (i.e., different supports), we propose a modeling framework for MTL that encourages models to share information across tasks, for a given covariate, through separately 1) shrinking the coefficient supports together, and/or 2) shrinking the coefficient values together. This allows models to borrow strength during variable selection even when the coefficient values differ markedly between tasks. We express our modeling framework as a Mixed-Integer Program, and propose efficient and scalable algorithms based on block coordinate descent and combinatorial local search. We show our estimator achieves statistically optimal prediction rates. Importantly, our theory characterizes how our estimator leverages the shared support information across tasks to achieve better variable selection performance. We evaluate the performance of our method in simulations and two biology applications. Our proposed approaches outperform other sparse MTL methods in variable selection and prediction accuracy. Interestingly, penalties that shrink the supports together often outperform penalties that shrink the coefficient values together. We will release an R package implementing our methods.  ( 2 min )
    Learning Inter-Annual Flood Loss Risk Models From Historical Flood Insurance Claims and Extreme Rainfall Data. (arXiv:2212.08660v1 [cs.LG])
    Flooding is one of the most disastrous natural hazards, responsible for substantial economic losses. A predictive model for flood-induced financial damages is useful for many applications such as climate change adaptation planning and insurance underwriting. This research assesses the predictive capability of regressors constructed on the National Flood Insurance Program (NFIP) dataset using neural networks (Conditional Generative Adversarial Networks), decision trees (Extreme Gradient Boosting), and kernel-based regressors (Gaussian Process). The assessment highlights the most informative predictors for regression. The distribution for claims amount inference is modeled with a Burr distribution, permitting the introduction of a bias correction scheme and increasing the regressor's predictive capability. Aiming to study the interaction with physical variables, we incorporate Daymet rainfall estimates into the NFIP data as an additional predictor. A study of the coastal counties in the eight US South-West states resulted in an $R^2=0.807$. Further analysis of 11 counties with a significant number of claims in the NFIP dataset reveals that Extreme Gradient Boosting provides the best results, that bias correction significantly improves the similarity with the reference distribution, and that the rainfall predictor strengthens the regressor performance.  ( 2 min )
    Support Vector Regression: Risk Quadrangle Framework. (arXiv:2212.09178v1 [stat.ML])
    This paper investigates Support Vector Regression (SVR) in the context of the fundamental risk quadrangle paradigm. It is shown that both formulations of SVR, $\varepsilon$-SVR and $\nu$-SVR, correspond to the minimization of equivalent regular error measures (Vapnik error and superquantile (CVaR) norm, respectively) with a regularization penalty. These error measures, in turn, give rise to corresponding risk quadrangles. Additionally, the technique used for the construction of quadrangles serves as a powerful tool in proving the equivalence between $\varepsilon$-SVR and $\nu$-SVR. By constructing the fundamental risk quadrangle, which corresponds to SVR, we show that SVR is the asymptotically unbiased estimator of the average of two symmetric conditional quantiles. Additionally, SVR is formulated as a regular deviation minimization problem with a regularization penalty by invoking Error Shaping Decomposition of Regression. Finally, the dual formulation of SVR in the risk quadrangle framework is derived.  ( 2 min )

  • Open

    DSC Weekly 20 December 2022 – Highlighting Our Contributors Part 2
    Announcements: Highlighting Our Contributors. This week is DSC’s final issue of DSC Weekly in 2022. With the conclusion of the year, we’re highlighting two of our top contributors and their articles from the last year. These articles are chosen for their high-quality, informative nature and attention to detail. Alan Morrison posts articles with critical thought…  ( 20 min )
    Ready, Fire, Aim: the Path to Agility
    Ready, Fire, Aim is often used disparagingly to describe how incompetent operators perpetrate disastrous projects on the dime of the companies for which they work. Such people become attached to ideas and then pursue them to a catastrophic end. Sort of like New Coke, though that turned out better than it should have through a…  ( 22 min )
    Why we need to think differently about AI risks in the medium to long term
    There are many articles that point to the risks of AI. Indeed, these risks are real, but many of these articles are based on scaremongering and sensationalism. If we take a medium to long-term view, we definitely need to think differently about the risks of AI. Here is why: a) We do not take…  ( 19 min )
    Top 5 Key Business Applications of Sentiment Analysis
    Customer satisfaction is crucial for gauging whether your customers are happy with what you’re doing. A high degree of satisfaction is proven to increase customer lifetime value, improve customer retention, and strengthen brand reputation. Alternatively referred to as “opinion mining,” sentiment analysis can help product managers…  ( 22 min )
    Why Should You Care About CPRA’s ‘Do Not Sell’?
    The California Privacy Rights Act (CPRA) of 2020, also known as Proposition 24, was approved by California voters on November 3, 2020. Even though the CPRA has been in place for a while, many businesses and individuals remain unaware of its regulations. The CPRA extends the already enacted California Consumer Privacy Act (CCPA), which…  ( 21 min )
    Benefits of AI to Fight Fraud in the Banking System
Banks, financial institutions, and customers have faced fraud for a very long time, in fact ever since the financial industry was created. Fraud attempts are almost guaranteed wherever money and/or private data are present. As digitization and the use of technology increase, so do the ways and… Read More »Benefits of AI to Fight Fraud in the Banking System The post Benefits of AI to Fight Fraud in the Banking System appeared first on Data Science Central.  ( 20 min )
  • Open

    We are designing a physics environment to learn dynamic soaring control policies. The project is just getting off the ground, but these early results are showing a lot of promise!
    submitted by /u/Alphasoar_AI [link] [comments]  ( 61 min )
    [D] Math in Sutton's Reinforcement Learning: An Introduction
Does anyone else feel that the mathematics (and proofs) in Sutton and Barto's book are not rigorous enough? I sometimes feel that it oversimplifies concepts to the point that they make intuitive sense without sufficient mathematical backing. A good example is: https://preview.redd.it/dxgauaqrt47a1.png?width=661&format=png&auto=webp&s=fef2867911cdd3d2cc737e4496061fa3c92bdb31 I think I understand the book well, but the last line is just nonsensical. I understand that, under a stochastic policy assumption, the agent would transition through all possible states in the limit; therefore, we can go from a trajectory notation (as t -> inf) to a summation over all states and actions. However, I could easily come up with that equation from scratch based on intuition, and it would be just as (un)useful. The worst part is that I can think of many other examples throughout the book that leave my mathematical curiosity unsatisfied. Does anyone else feel like that? Are there any alternatives that are more mathematically rigorous? submitted by /u/nacho_rz [link] [comments]  ( 61 min )
    Solving Mastermind
I'm currently trying several agents on the game Mastermind ( https://www.youtube.com/watch?v=Dn0iqlY5tMU ). I'm using tf-agents for the implementation with DQN + PPO, but I am not able to get it working. The agents never learn. I tried different observation spaces and reward functions, but nothing worked. The action space consists of 6**4 actions (0-1295), where each action reflects a possible color combination (4 colors have to be guessed and the total number of different colors is 6). Currently, my observation consists of a 1x6 array, where the first four values are the guessed code and the last two values are the feedback. I have tried changing the observation to show all guesses + feedback so far, and to show only the feedback, but neither works. The game is supposed to be played in 10 rounds, but the agents lose almost all games and are stuck at a 9.5+ average episode length. I also tried increasing the episode length from 10 to 1295, but got the same result. I tried reward functions where the agent gets penalized for each step, gets a positive reward for a correct guess of color and position, gets a positive/negative reward for a correct guess of color but wrong position, and some more. I am not sure what I am doing wrong. At this point, I think the biggest flaw is the observation space, and I do not know what to include in it. The feedback is not allowed to reveal which slots contain right/wrong colors. submitted by /u/l0sti- [link] [comments]  ( 62 min )
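One frequently suggested fix for exactly this situation is to make the observation the full, fixed-size game history, one-hot encoded, so the state stays Markovian. A minimal sketch of such an encoding; all sizes and names here are illustrative, not the poster's code:

```python
import numpy as np

N_COLORS, N_SLOTS, MAX_TURNS = 6, 4, 10

def encode_history(guesses, feedbacks):
    """One-hot encode the full game history into a fixed-size array.

    guesses:   list of length-4 tuples with color ids in [0, 5]
    feedbacks: list of (n_correct_position, n_correct_color) pairs
    Unused turns stay zero, so the shape is constant across steps.
    """
    obs = np.zeros((MAX_TURNS, N_SLOTS * N_COLORS + 2 * (N_SLOTS + 1)), dtype=np.float32)
    for t, (guess, (black, white)) in enumerate(zip(guesses, feedbacks)):
        for slot, color in enumerate(guess):
            obs[t, slot * N_COLORS + color] = 1.0                 # guessed color per slot
        obs[t, N_SLOTS * N_COLORS + black] = 1.0                  # pegs: right color & place
        obs[t, N_SLOTS * N_COLORS + (N_SLOTS + 1) + white] = 1.0  # pegs: right color only
    return obs.flatten()

# Example: one turn played, guess (0, 3, 2, 5) got 1 black and 2 white pegs.
obs = encode_history([(0, 3, 2, 5)], [(1, 2)])
```

Encoding the whole history this way keeps the state (approximately) Markovian, which is often the missing piece when a fixed-size "last guess only" observation fails to learn.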
    A statistics course for deep reinforcement learning
I've been working for quite some time on DRL algorithms, yet my knowledge of statistics remains too superficial to feel confident about some of its theoretical aspects. A typical example: I get the overall idea of the SAC reparameterization trick, yet I find the details of the derivation very hard to follow. I would definitely like to remedy that. Would you have some recommendations for a statistics course specifically oriented toward the aspects useful for DRL? Thanks in advance submitted by /u/Scrimbibete [link] [comments]  ( 72 min )
    MuZero learns to play Teamfight Tactics
TLDR: Created an AI to play Teamfight Tactics. It is starting to learn but could use some help. Hope to bring it to the research world one day. Hey! I am releasing a new trainable AI that learns how to play TFT at https://github.com/silverlight6/TFTMuZeroAgent. This is the first pure reinforcement learning algorithm (no human rules, game knowledge, or legal action set given) to learn how to play TFT to my knowledge, and may be the first AI of any kind to do so. Feel free to clone the repository and run it yourself. It requires python3, numpy, tensorflow, collections, jit and cuda. There are a number of built-in Python libraries like time and math that are required, but I think the 3 libraries above should be all that needs to be installed. There is a requirements script for this purpose. This AI is b…  ( 59 min )
    Best Library for Multi-Agent with Custom Policies
Hello! I'm doing some work with multi-agent RL. In particular, I'm looking at games where all agents have simultaneous actions and observations (rather than sequential). I'm working with Farama PettingZoo as my multi-agent gym and I'm looking for a good library to train the models. I plan on writing my own custom policies in the future, so ideally I want easily extendable libraries. I am currently looking at Stable Baselines3, CleanRL, RLlib and Tianshou. However, only RLlib and Tianshou currently directly support multi-agent RL, while for Stable Baselines3 and CleanRL I have to convert my environment first using SuperSuit. Has anyone worked with these libraries for multi-agent RL before? Can you please tell me which is the easiest to work with? Thanks! _____________________ I tried using Tianshou but there's no support for simultaneously acting agents yet (it only supports sequentially acting agents). I attempted to write code to add support but found it too confusing. However, Tianshou provides good ways to create custom policies. I haven't used RLlib. Does anyone have any experience with how difficult it is to write custom policies in RLlib? submitted by /u/_anarchronism [link] [comments]  ( 62 min )
How can we model the observation space of an env with different features and sizes?
Hello, my problem: how would you model the observation space of an env if your state has the following features: 1) four different battery statuses (bi) of sensors, between 0 and 0.0032; 2) a throughput (M) with the possible values [10240, 40960, 92160, 163840, 256000]; 3) a number of required sensors (K), which can go from 1 to 8. Just to explain the scenario: at every timestep (every time I call my env's step function), I have, for example, the following state. Let's take K=2 and Mt=256000; then my state should be: 0.0031, 0.0016, 0.0026, 0.0010, 256000, 2 (the first four numbers are the bi, the fifth is M, the sixth is K). What did I try and why it's a problem for me: I tried to use a dict like: Dict({"state": Box(0, 0.0032, shape=(8,)), "Mt": Box(10240, 256000, shape=(1,)), "K": Discrete(9) }) But the problem came later, when I tried to fit it to the algorithm. PS: I am using the DDPG agent from Keras, so in line 37 of the following link https://github.com/keras-rl/keras-rl/blob/master/examples/ddpg_pendulum.py we need the observation space shape via "env.observation_space.shape". That line gave me nothing; only len() worked and gave me 3. Thank you for your help. Best regards submitted by /u/EnvironmentCrazy6381 [link] [comments]  ( 62 min )
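One common workaround for exactly this keras-rl constraint is to flatten everything into a single Box so that env.observation_space.shape is well-defined. A minimal sketch with the bounds from the post; the rescaling of M and K into [0, 1] is an assumption that usually helps when features differ by orders of magnitude, not a requirement:

```python
import numpy as np
import gym

# Flat 6-dimensional Box: four battery levels, throughput M, sensor count K.
low  = np.array([0.0] * 4 + [0.0, 0.0], dtype=np.float32)
high = np.array([0.0032] * 4 + [1.0, 1.0], dtype=np.float32)
observation_space = gym.spaces.Box(low=low, high=high, dtype=np.float32)

def make_obs(b, M, K):
    """b: the 4 battery values; M in [10240, 256000]; K in [1, 8]."""
    return np.array(list(b) + [M / 256000.0, K / 8.0], dtype=np.float32)
```

With a flat Box like this, env.observation_space.shape returns (6,) and the keras-rl example code runs unchanged.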
    Can we use RNN in RL?
Recently I saw an article about ChatGPT. It mentioned that ChatGPT uses reinforcement learning to fine-tune a language model. It got me curious about how researchers train these RNN or LSTM models in reinforcement learning. I may be wrong, but I think recurrent networks require initialising and reusing the internal state of the network. At the same time, techniques like DDPG and TD3 have their own memory buffers to store the transitions. Overall, the idea of using RNNs is appealing, as the network can make better decisions based on many previous states, but won't that violate the Markov decision process? If anyone can explain and/or provide a link to some implementations of DDPG using RNNs, it would be super helpful. submitted by /u/Better-Ad8608 [link] [comments]  ( 73 min )
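For intuition, here is a minimal sketch of how recurrent policies are usually handled: the hidden state is carried across steps within an episode and reset at episode boundaries, and replay-based methods (e.g. R2D2-style variants) store whole sequences plus initial hidden states instead of single transitions. All names below are illustrative.

```python
import torch
import torch.nn as nn

class RecurrentPolicy(nn.Module):
    """Maps an observation (and the previous hidden state) to action logits."""
    def __init__(self, obs_dim, act_dim, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, act_dim)

    def forward(self, obs, state=None):
        # obs: (batch, seq_len, obs_dim); state: (h, c) or None at episode start
        out, state = self.lstm(obs, state)
        return self.head(out), state

policy = RecurrentPolicy(obs_dim=8, act_dim=4)
state = None  # reset at the start of every episode
for _ in range(10):
    obs = torch.randn(1, 1, 8)          # one step at a time during rollout
    logits, state = policy(obs, state)  # the hidden state carries the history
```

Rather than violating the Markov property, the recurrent state augments the observation, so the process is (approximately) Markovian over the pair (observation, hidden state) even when the raw observations alone are not.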
    How do I increase the success rate of my algorithm?
I've built a car simulation and I want the cars to figure out how to finish the race ASAP. I built a neural network with 5 inputs (the distance between the car and the wall in 5 different directions) and 4 outputs (turn right, turn left, speed up, slow down), no hidden layers, and ReLU as the activation function. I run 12 cars per generation and crossover the top 2 cars in every generation. I have "bonus lines" that give the cars bonus points for driving in the right direction. I run the program and go to sleep (I've done it several times) and I don't get the results I wanted; the cars usually do something very weird, like turning 270° to the right instead of 90° to the left, etc. Am I doing something wrong? (I want to use NEAT after I make this model successful first.) submitted by /u/alonmega100 [link] [comments]  ( 60 min )
    Why can't we do supervised learning in Step 3 of RLHF?
    ​ https://preview.redd.it/oljalnjrqz6a1.png?width=1000&format=png&auto=webp&s=0938681646f9f357c432aef1907b7a4e459397d9 In step 3, why can't we combine the PPO model with the reward model into a single differentiable neural net, and then just backpropagate the reward (as a loss) back to the first half of the combined model? Wouldn't that just be regular supervised learning at that point? I feel like I'm missing something fundamental. submitted by /u/wardellinthehouse [link] [comments]  ( 66 min )
    Some confusion regarding the value prefix in EfficientZero
If I get what the paper is trying to get across, rather than predicting the reward, we essentially predict the sum of future rewards up to a limited depth, given the rollout states, to be used directly in the calculation of the Q value, which apparently helps with training stability. What I'm confused about is how they train this: basically, what exactly is the target that it is learning here? submitted by /u/nutpeabutter [link] [comments]  ( 62 min )
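For what it's worth, one reading of the paper is that the value-prefix head at unroll depth $k$ is regressed against the observed cumulative discounted reward from the root of the rollout (notation adapted here, not quoted):

$$z_{t,k} \;=\; \sum_{i=0}^{k-1} \gamma^{\,i}\, u_{t+i},$$

where $u_{t+i}$ are the environment rewards actually observed along the trajectory. The supervision is thus the running sum of real rewards along the unrolled trajectory rather than each per-step reward, which is what makes the target less sensitive to off-by-one state aliasing.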
  • Open

    [R] The case for 4-bit precision: k-bit Inference Scaling Laws - Tim Dettmers and Luke Zettlemoyer - Findings show that 4-bit precision is almost universally optimal for total model bits and zero-shot accuracy!
Paper: https://arxiv.org/abs/2212.09720 Abstract: Quantization methods reduce the number of bits required to represent each parameter in a model, trading accuracy for smaller memory footprints and inference latencies. However, the final model size depends on both the number of parameters of the original model and the rate of compression. For example, a 30B 8-bit model and a 60B 4-bit model have the same number of bits but may have very different zero-shot accuracies. In this work, we study this trade-off by developing inference scaling laws of zero-shot performance in Large Language Models (LLMs) to determine the bit-precision and model size that maximizes zero-shot performance. We run more than 35,000 zero-shot experiments with 16-bit inputs and k-bit parameters to examine which quantization methods improve scaling for 3 to 8-bit precision at scales of 19M to 66B parameters across the LLM families BLOOM, OPT, NeoX/Pythia, and GPT-2. We find that it is challenging to improve the bit-level scaling trade-off, with the only improvements being the use of a small block size -- splitting the parameters into small independently quantized blocks -- and the quantization data type being used (e.g., Int vs Float). Overall, our findings show that 4-bit precision is almost universally optimal for total model bits and zero-shot accuracy. submitted by /u/Singularian2501 [link] [comments]  ( 70 min )
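As a concrete illustration of the block-size knob the abstract highlights, here is a minimal NumPy sketch of symmetric 4-bit block-wise quantization with one scale per block; the paper's exact data types and block sizes differ, so treat this as an assumption-laden toy:

```python
import numpy as np

def quantize_int4_blockwise(w, block_size=64):
    """Symmetric 4-bit quantization with an independent scale per block."""
    flat = w.ravel()
    pad = (-len(flat)) % block_size
    blocks = np.pad(flat, (0, pad)).reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 7.0  # int4 range used: [-7, 7]
    scales[scales == 0] = 1.0
    q = np.clip(np.round(blocks / scales), -7, 7).astype(np.int8)
    return q, scales, w.shape, pad

def dequantize(q, scales, shape, pad):
    flat = (q.astype(np.float32) * scales).ravel()
    return flat[:flat.size - pad].reshape(shape) if pad else flat.reshape(shape)

w = np.random.randn(4, 100).astype(np.float32)
q, s, shape, pad = quantize_int4_blockwise(w)
print(np.abs(w - dequantize(q, s, shape, pad)).max())  # per-block quantization error
```

Smaller blocks mean more scale parameters (more total bits) but lower quantization error, which is exactly the trade-off the scaling-law experiments measure.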
    [R] Nonparametric Masked Language Modeling - MetaAi 2022 - NPM - 500x fewer parameters than GPT-3 while outperforming it on zero-shot tasks
Paper: https://arxiv.org/abs/2212.01349 Abstract: Existing language models (LMs) predict tokens with a softmax over a finite vocabulary, which can make it difficult to predict rare tokens or phrases. We introduce NPM, the first nonparametric masked language model that replaces this softmax with a nonparametric distribution over every phrase in a reference corpus. We show that NPM can be efficiently trained with a contrastive objective and an in-batch approximation to full corpus retrieval. Zero-shot evaluation on 9 closed-set tasks and 7 open-set tasks demonstrates that NPM outperforms significantly larger parametric models, either with or without a retrieve-and-generate approach. It is particularly better at dealing with rare patterns (word senses or facts), and predicting rare or nearly unseen words (e.g., non-Latin script). submitted by /u/Singularian2501 [link] [comments]  ( 62 min )
    [D] Adding a new RL environment to envpool
Envpool provides high parallelization of RL environments. Unfortunately, there are still many environments that are not supported by it. One of them is FrankaKitchen from D4RL, a library for offline RL. Envpool provides tutorials on how to add new environments here. My question is: for anyone with experience adding new environments to envpool, how difficult is it? And while difficulty is in the eye of the beholder, I'd like to know how much time and reading were needed to successfully add a new environment, which is a more objective measure. Full disclaimer: I have not read the tutorial carefully. I first want to have a better idea of how hard it is before I fully commit to it. submitted by /u/carlml [link] [comments]  ( 64 min )
    [D] ICML LaTeX template with section bookmarks
Hi, It seems that the recent (2022) ICML LaTeX template does not properly generate section bookmarks internally. I tried generating a table of contents, but it shows nothing. Same with other "bookmark-based" packages like todonotes; nothing works. I never realized the ICML template was missing such a big feature. Has anyone fixed this? submitted by /u/Red-Portal [link] [comments]  ( 62 min )
    [D] Techniques to optimize a model when the loss over the training dataset has a Power Law type curve.
I have a rather successful model which I have trained to the point that the loss has plateaued. The loss over my training dataset follows a Power Law type curve: https://preview.redd.it/qotu2k09237a1.png?width=825&format=png&auto=webp&s=b16ca887ce8e259f8de4a20609e35ff7f7298df9 That means 80% of the training examples have a loss which is well below my tolerance threshold. 15% have a loss which is slightly above the threshold. 4% have a loss which is significantly above the threshold. And 1% have a very high loss. This results from the inherent complexity of the training examples themselves. Some are simple. Some are complex. And I was wondering: are there any techniques developed to keep optimizing a model in such a situation? Such a situation is surely very common, so maybe people have come up with strategies or algorithms, but my Google-fu has failed me. Please refer me to literature on the topic if it exists. So far I have tried pre-selecting and training on the hard examples only, and I have tried multiplying the loss gradients by a scalar that depends on the loss itself. Neither approach gives me satisfactory results. Maybe the model is just not complex enough, but I am already maxing out my GPU RAM (Nvidia A100s), so I cannot really do much better. Still, I am not sure I have yet reached the limits of complexity with this model. submitted by /u/Dartagnjan [link] [comments]  ( 68 min )
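Both strategies the post mentions are special cases of per-example loss weighting; below is a minimal focal-style sketch of that idea in PyTorch. The weighting scheme and the gamma knob are illustrative assumptions, not a known fix for this particular model.

```python
import torch

def reweighted_loss(per_example_loss, gamma=2.0):
    """Upweight hard examples: the weight grows with the (detached) loss.

    gamma=0 recovers the plain mean; larger gamma focuses the gradient
    on the high-loss tail, generalizing both strategies in the post.
    """
    with torch.no_grad():
        w = (per_example_loss / per_example_loss.mean()).pow(gamma)
        w = w / w.mean()  # keep the overall gradient scale comparable
    return (w * per_example_loss).mean()

losses = torch.rand(8, requires_grad=True)  # stand-in for per-example losses
reweighted_loss(losses).backward()
```

If upweighting the tail doesn't help, the hard 1% may be label noise or genuinely out-of-model examples; in that case downweighting them (negative gamma here) or curriculum-style filtering is the other direction worth trying.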
    [R] PyTorch implementation of Forward-Forward Algorithm by Geoffrey Hinton and analysis of performances over backpropagation
Are you an AI researcher itching to test Hinton's Forward-Forward Algorithm? I was too, but could not find any full implementation, so I decided to code it myself, from scratch. Here's the GitHub repo; don’t forget to leave a star if you enjoy the project. As soon as I read the paper, I started to wonder how AI stands to benefit from Hinton’s FF algorithm (FF = Forward-Forward). I got particularly interested in the following concepts: Local training. Each layer can be trained just by comparing the outputs for positive and negative streams. No need to store the activations. Activations are needed during backpropagation to compute gradients, but often result in nasty O…  ( 68 min )
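For readers who want the gist before opening the repo: each layer has a "goodness" (the sum of squared activations), pushed above a threshold for positive data and below it for negative data. A minimal per-layer sketch of that objective, simplified to be consistent with the paper rather than copied from the repo:

```python
import torch
import torch.nn.functional as F

def ff_layer_loss(layer, x_pos, x_neg, theta=2.0):
    """Local Forward-Forward loss for one layer (no backprop across layers)."""
    g_pos = layer(x_pos).pow(2).sum(dim=1)  # goodness of positive samples
    g_neg = layer(x_neg).pow(2).sum(dim=1)  # goodness of negative samples
    # Logistic loss: want g_pos > theta and g_neg < theta.
    return (F.softplus(theta - g_pos) + F.softplus(g_neg - theta)).mean()

layer = torch.nn.Sequential(torch.nn.Linear(784, 500), torch.nn.ReLU())
x_pos, x_neg = torch.randn(32, 784), torch.randn(32, 784)
ff_layer_loss(layer, x_pos, x_neg).backward()
# The next layer trains on this layer's detached (and, in the paper,
# length-normalized) outputs, so no global backward pass is ever needed.
```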
    [P] flaim - State-of-the-art pre-trained vision backbones for Flax
    Excerpt from GitHub Flax Image Models Introduction flaim is a library of state-of-the-art pre-trained vision models, plus common deep learning modules in computer vision, for Flax. It exposes a host of diverse image models through a straightforward interface with an emphasis on simplicity, leanness, and readability, and offers lower-level modules for designing custom architectures. Installation flaim can be installed through pip install flaim. Beware that pip installs the CPU version of JAX, and you must manually install JAX yourself to run your programs on the GPU or TPU. Usage flaim.get_model is the central function of flaim and manages model retrieval. It accepts a handful of arguments: model_name (str): The name of the model. If it is not recognized, an exception is thrown. …  ( 70 min )
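A hedged usage sketch based only on what the excerpt above documents: the `model_name` argument is from the README, while the chosen name string and the shape of the return value are assumptions.

```python
import flaim

# Hypothetical call: only `model_name` is documented in the excerpt above;
# the model name string and what get_model returns are assumptions.
model = flaim.get_model(model_name='resnet50')
```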
    [D] Why are we stuck with Python for something that require so much speed and parallelism (neural networks)?
Every time I try to implement something, I have to make sure I never use loops and only use PyTorch/TF tensors. If I want efficient code, I must essentially abandon Python and only use the data structures and operations provided by those frameworks. Every time I have a solution in my head, I must think about how to implement it using ONLY the framework, not the programming language (Python). We basically constrain ourselves to the limited set of operations that someone implemented in C++ for us. This makes things harder, not easier. We are not programming in Python at all. We use a language within a language that really constrains us. Why not just move to C++ or something new like Rust/Go? submitted by /u/vprokopev [link] [comments]  ( 81 min )
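For concreteness, a minimal sketch of the kind of rewrite the post is describing: the same computation once as a plain Python loop and once as "framework-speak" tensor operations (the numbers are illustrative).

```python
import torch

x = torch.randn(100_000)

# Plain Python: readable, but every element passes through the interpreter.
total = 0.0
for v in x:
    if v > 0:
        total += v.item()

# Vectorized: the same sum of positive entries as two tensor ops run in C++.
total_fast = x[x > 0].sum().item()

print(total, total_fast)  # equal up to floating-point accumulation order
```

Both lines compute the same number; the debate in the thread is whether having to think in the second style is an acceptable price for the speed.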
    [R] Foresight: Deep Generative Modelling of Patient Timelines using Electronic Health Records
    Hi everyone, my lab has recently made Foresight - in short, it is a GPT-3 like language model that can simulate a patient's future (forecast disorders, medications, procedures, symptoms, ...). It was trained and tested on two large hospitals in UK covering both physical and mental health. Any feedback is much appreciated (Twitter or here). Paper: arxiv Demo: foresight submitted by /u/w_is_h [link] [comments]  ( 68 min )
    [R] Swin transformer while using a rectangular attention window
Hello, does anyone know if there is an implementation of the Swin Transformer that uses a rectangular attention window (width != height)? Thanks 🙏 submitted by /u/Meddhouib10 [link] [comments]  ( 64 min )
    [D] Question: best 'starting' server to train deep ML models
Hi everyone! I want to use a server to continuously train my ML models without keeping my PC on 24/7. I am currently running fairly simple deep learning models that would take a week on my computer. So far, the best solution I have found to start with is the AWS t2.micro instance. I've seen that Google Cloud and Nvidia also have other options. Could you please guide me through it or give me suggestions about which one could be better? I am not an expert and it is the first time I'm doing this. submitted by /u/KlausMich [link] [comments]  ( 75 min )
    [N] JaxLightning: PyTorch Lightning for Jax
Hi, I've been working extensively with PyTorch for the last couple of years and became very accustomed to PyTorch Lightning, which takes all the boilerplate code off your hands while allowing you to access almost any step in the training/validation/test loop. Jax offers cool tools to write elegant code (albeit in a very functional style) with vmap, jit, pmap, etc. The ecosystem of PyTorch is very strong, with a nice OOP style of structuring code and training management packages like Lightning. Packages like Equinox and Treex have introduced very PyTorch-like neural network packages. Having worked extensively with PyTorch Lightning, I realized that you can hijack Lightning with two simple steps to run full Jax models. The trick is to run Lightning in a pure NumPy mode and to turn off automatic optimization via `automatic_optimization=False`. Then you simply run your jit-compiled forward, backward and gradient update inside the train loop. Jax can do device management on its own and without much extra effort, as far as I know. Other than that, you get basically every advantage of PyTorch Lightning and every advantage of Jax in one swoop. Link to more details and examples: https://github.com/ludwigwinkler/JaxLightning submitted by /u/ludixiv [link] [comments]  ( 65 min )
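A minimal sketch of the two steps described above; the Lightning hooks (automatic_optimization, training_step) are real API, while the toy model, update rule, and data handling are assumptions made for illustration - see the linked repo for the actual implementation.

```python
import jax
import jax.numpy as jnp
import pytorch_lightning as pl

@jax.jit
def update(params, x, y, lr=1e-2):
    """Jit-compiled forward + backward + SGD step for a toy linear model."""
    def loss_fn(p):
        return jnp.mean((x @ p['w'] + p['b'] - y) ** 2)
    loss, grads = jax.value_and_grad(loss_fn)(params)
    params = jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)
    return params, loss

class JaxModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.automatic_optimization = False  # step 2: Lightning won't touch grads
        self.params = {'w': jnp.zeros((3,)), 'b': jnp.zeros(())}

    def training_step(self, batch, batch_idx):
        x, y = batch  # step 1: keep batches as plain NumPy arrays, not torch tensors
        self.params, loss = update(self.params, jnp.asarray(x), jnp.asarray(y))
        self.log('loss', float(loss))

    def configure_optimizers(self):
        return None  # all optimization happens inside the jitted update
```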
    [D] Deep Learning based Recommendation Systems
Hey! Currently, I am reading papers on deep learning based recommender systems. After around 20 papers, I realised the base idea of the papers is the same - a recommendation task, either Top-K recommendations or simply predicting the utility (I am not talking about those frameworks that simply model the auxiliary information). The papers differ in their base models (I am reading DNN/MLP-, autoencoder-, and attention-based models), but the methodology is the same - replace the way the matrix is factorized to find the latent feature vectors of users/items/social relations; only some papers introduce a custom loss function with regularisation terms (just to model the social network, I would say). And all these models claim "state-of-the-art" performance. The question is: where is this research field going/developing? All these findings/performance results are simply empirical, with no theoretical evidence. submitted by /u/Awekonti [link] [comments]  ( 69 min )
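For readers outside the area, the matrix-factorization core these papers keep re-parameterizing estimates the utility of item $i$ for user $u$ as an inner product of latent vectors; a standard sketch (the neural variants replace the inner product or the way $p_u, q_i$ are produced):

$$\hat{r}_{ui} = p_u^{\top} q_i, \qquad \min_{P,Q}\; \sum_{(u,i)\in\Omega} \big(r_{ui} - p_u^{\top} q_i\big)^2 + \lambda\big(\|p_u\|^2 + \|q_i\|^2\big).$$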
    [D] Area of research focused on extracting data from RL / AI agents in games to inform strategies used by humans?
    My question is a bit half-baked, so apologies for the weird title. I wasn't sure how to phrase what I'm asking for / don't know if what I'm looking for exists. Basically, in all of my RL / classical AI studies, I've noticed an understandable focus on building agents that are good at all kinds of games (chess, backgammon, go, etc.), to the point of being superhuman. I'm interested if anyone here knows if there has been any successful research into extracting and distilling information contained in these agents to inform higher level strategies about the game (e.g. based on how the agent performs in certain scenarios over a simulated set of games). My assumption is that if an agent is really good at a game, then there are probably things to be learned from its performance to improve our own. I just don't know if there are any papers or articles that dive into these topics, nor do I know how to articulate what I'm looking for well enough to google it. TIA for helping point me in the right direction! submitted by /u/Ligmatologist [link] [comments]  ( 69 min )
    [D] AllenAi Predoctoral Young Investigators Program
    Sorry if this isn't the right place to post. Has anyone heard back from any of the teams for AllenAI PYI program, or does anyone know what the full interview loop is like? I have had a phone screen for one of the teams. submitted by /u/6ottle [link] [comments]  ( 62 min )
    [R] Auxiliary tasks in task-oriented-dialogue systems can highly enhance generalizability and lead to more human-sounding responses. In our recent study, we were able to train an architecture with state-of-the-art benchmarks, but with 3 times less parameters than baselines.
    Abstract: The adoption of pre-trained language models in task-oriented dialogue systems has resulted in significant enhancements of their text generation abilities. However, these architectures are slow to use because of the large number of trainable parameters and can sometimes fail to generate diverse responses. To address these limitations, we propose two models with auxiliary tasks for response selection - (1) distinguishing distractors from ground truth responses and (2) distinguishing synthetic responses from ground truth labels. They achieve state-of-the-art results on the MultiWOZ 2.1 dataset with combined scores of 107.5 and 108.3 and outperform a baseline with three times more parameters. We publish reproducible code and checkpoints and discuss the effects of applying auxiliary tasks to T5-based architectures. Project available on GitHub: https://github.com/radi-cho/RSTOD Our paper was presented at https://www.icnlsp.org/. Publication in process. submitted by /u/radi-cho [link] [comments]  ( 64 min )
    [R] [D] Are Optuna and Boruta Shap acceptable for academic research?
Hi, I am working on a project that uses some gradient boosting machine learning algorithms for a regression task. I am not from a computer science background, and my knowledge of ML comes mostly from Coursera courses and Kaggle. I wanted to use Optuna for hyperparameter optimization and Boruta Shap for feature selection, as they are fairly common on Kaggle, and I learned to use these libraries there. But is it acceptable or standard practice to use these libraries for academic research? Or should I resort to vanilla random search/grid search for hyperparameter optimization and a correlation matrix for feature selection? This is my first research project, and I am not acquainted with the standard practices in ML academia. submitted by /u/franticpizzaeater [link] [comments]  ( 66 min )
  • Open

    Point-E: OpenAI shows DALL-E for 3D models
    submitted by /u/Peaking_AI [link] [comments]  ( 48 min )
    The Surprising Things ChatGPT Can’t Do (Yet)
    submitted by /u/SupPandaHugger [link] [comments]  ( 48 min )
Deleted tweet from Rippling co-founder: Microsoft is all-in on GPT. GPT-4 10x better than 3.5 (ChatGPT), clearing the Turing test and any standard tests.
    submitted by /u/Sebrosen1 [link] [comments]  ( 53 min )
    Google working on AI capable of reading doctors' handwriting on prescriptions
tbh this one is just comedic, like seriously??? doctors’ handwriting is so bad we need a freaking billion-dollar company to fix it? naah I’m on the floor just rolling laughing! During an address at the Google for India event, a senior executive said that the company is currently developing an AI model that will be able to decipher any kind of handwriting. Google has been in contact with pharmacists for research purposes, and the feature will be a part of Google Lens. “This will act as an assistive technology for digitizing handwritten medical documents by augmenting the humans in the loop such as pharmacists, however, no decision will be made solely based on the output provided by this technology,” the company said in a statement according to TechCrunch. The company did not announce any time frame for the launch of the feature, but made it clear that “much work still remains to be done before this system is ready for the real world.” This is from the AI With Vibes Newsletter; read the full issue here: https://aiwithvibes.beehiiv.com/p/user-caught-using-chatgpt-reply-tweets-google-developing-ai-read-doctors-handwriting submitted by /u/Mk_Makanaki [link] [comments]  ( 48 min )
    Interesting response, what's your take on this?
    submitted by /u/prsadr [link] [comments]  ( 47 min )
    interesting idea if you ask me
    submitted by /u/justinlongbranch [link] [comments]  ( 47 min )
Is it possible to create something like a QR code for artists to place on their work, so the algorithm will learn something disruptive instead?
Or so it will simply learn nothing, like the work becoming blank once your art is digitized? submitted by /u/ninja-brc [link] [comments]  ( 52 min )
    Deepfake + AI Voice Changer Joe Biden singing Christmas songs sounds pretty legit
    submitted by /u/cneaky [link] [comments]  ( 48 min )
    AI Dream 132 - When AI meets M.C. Escher
    submitted by /u/LordPewPew777 [link] [comments]  ( 47 min )
    🤖 Artificial Intelligence's Relentless Year of 2022
    submitted by /u/BackgroundResult [link] [comments]  ( 50 min )
AI Content Detector - a tool that detects whether a text was written by a machine or a human.
    submitted by /u/PeteyCruiser123 [link] [comments]  ( 46 min )
    Tell me what is your reaction to this article
Article link: This article is Bipolar: Role of AI in Education. I have heard so many times from the top-level employees (C-suite) of the company about the visionary outlook for the next few years. I always thought of it as just fluff. But I might be wrong, so I wrote an article in the same way my leadership speaks about the role of AI in XYZ. I wanted to check with Redditors on their take. submitted by /u/Opitmus_Prime [link] [comments]  ( 49 min )
    Generative AI Solutions for Real-World Problems
    submitted by /u/modzykirsten [link] [comments]  ( 60 min )
    The Truth About AI and Copyright: is DALL-E stealing art?
    submitted by /u/taranasus [link] [comments]  ( 55 min )
    Searching for work on relevance of answers compared to questions
Do any of you know if there is some sort of research out there about whether an answer actually answers the question or not? I have a rather specific setup, in which the AI is asking questions and the human is answering them. I wanted to check whether it is possible to tell if a given answer actually answers a question or is off-topic. submitted by /u/Fabianslife [link] [comments]  ( 50 min )
peparony (Character AI)
    submitted by /u/No_Vacation3747 [link] [comments]  ( 48 min )
    3 Disadvantages of Using Artificial Intelligence to Reduce Employee Attrition
Can We Predict When an Employee Is About to Quit? Resignations represent one of the most emotional, stressful, and challenging situations leaders face. They undermine confidence in ourselves, our leadership, and our organizations. They threaten the status quo. And they have the potential to compromise team dynamics and business results. The biggest problem with resignations is that they are supposedly unpredictable. There is no way any manager can predict with some degree of accuracy when an employee will quit. Now things will change with AI. IBM has created a new AI system that can predict with 95% accuracy which workers are about to quit their jobs. It is called the "predictive attrition program," which was developed with IBM Watson software to predict employee flight risk and prescribe actions for managers to engage employees. As IBM CEO Ginni Rometty said, “It took time to convince company management it was accurate and, so far, it has saved IBM nearly $300 million in retention costs.” Further, Rometty claimed that the AI system could zero in on an individual's strengths. In turn, this can enable a manager to direct an employee to future opportunities they may not have seen using traditional methods. This will also help employees develop future skills and avoid becoming redundant. All this is wonderful. But a pertinent question remains unanswered. Why do we need to run a sophisticated prediction analysis program on our employees to gauge which ones will leave? We are all humans, after all. Do we need a machine to find out if we are unhappy? The short answer is no. While AI will give the numbers and dates, it cannot be a one-stop solution to the inherent problem of an employee quitting the organization. In fact, it can aggravate the problem if used indiscriminately. Read more... https://discover.hubpages.com/technology/3-Disadvantages-of-Using-Artificial-Intelligence-to-Reduce-Employee-Attrition submitted by /u/IcyCartoonist1955 [link] [comments]  ( 54 min )
    I made a site that tracks jobs at top AI companies/startups
https://thriveml.com Most job boards I've seen only include technical roles (e.g., ML engineer, data scientist), so I wanted to make one that also includes non-technical positions in sales, customer support, etc. I think we'll see a lot of startups (and jobs) in this space in the coming year. Let me know what you think! Happy to add more companies. submitted by /u/thriveml [link] [comments]  ( 49 min )
  • Open

    Unconventional CNN architectures, any thoughts?
Hi, I'm learning about CNNs right now and I'm working on a CNN classifier for binary classes. I came across a situation that I have difficulty explaining/justifying to myself... In tutorials and reference books/websites, I always see a pattern of increasing the number of filters after each pooling. For example, starting with a 100px x 100px image: ConV1(100, 100, 32), MaxPool(50, 50, 32), ConV2(50, 50, 64), MaxPool(25, 25, 64), ConV3(25, 25, 128), MaxPool(12, 12, 128), ConV4(12, 12, 256). It is my understanding that we increase the number of filters as the size of the feature maps is reduced in order to catch patterns (descriptors) that are more and more abstract. I think I understand this well enough. Here's the rub for me: I did some crude hyperparameter tuning on my model for fun, and I obtained much better accuracy on my test dataset with an architecture that alternated between a high and a low number of filters, such as: ConV1(100, 100, 64), MaxPool(50, 50, 64), ConV2(50, 50, 128), MaxPool(25, 25, 128), ConV3(25, 25, 32), MaxPool(12, 12, 32), ConV4(12, 12, 128), MaxPool(6, 6, 128), ConV5(6, 6, 8). This architecture kind of reminds me of the bottlenecking you'd see in an autoencoder? I know it's not the same thing at all, but is there a reason it works, or is it a fluke? In my case, I'm getting a 4% increase in accuracy on my testing dataset with the 2nd model. However, I can't find any papers about this type of architecture or why it actually worked as well as it did. Does anyone know of any explanation? submitted by /u/Limiv0rous [link] [comments]  ( 51 min )
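To make the comparison concrete, here is a minimal Keras sketch of the two filter schedules being discussed; the filter counts follow the post, while kernel sizes, padding, and the classifier head are assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build(filters):
    """Stack Conv-ReLU-MaxPool blocks with the given filter counts."""
    model = keras.Sequential([keras.Input(shape=(100, 100, 3))])
    for f in filters:
        model.add(layers.Conv2D(f, 3, padding='same', activation='relu'))
        model.add(layers.MaxPooling2D())
    model.add(layers.GlobalAveragePooling2D())
    model.add(layers.Dense(1, activation='sigmoid'))  # binary classifier
    return model

monotone   = build([32, 64, 128, 256])     # the textbook "grow after each pool" pattern
bottleneck = build([64, 128, 32, 128, 8])  # the post's wide-narrow-wide variant
```

One caution: on a single train/test split, a 4% gap can sit within the run-to-run variance of random initialization, so re-running both variants with several seeds is the usual way to tell a real effect from a fluke.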
    Peter Norvig on Deep Learning
    submitted by /u/DataHack23 [link] [comments]  ( 54 min )
  • Open

    Differential Privacy Accounting by Connecting the Dots
    Posted by Pritish Kamath and Pasin Manurangsi, Research Scientists, Google Research Differential privacy (DP) is an approach that enables data analytics and machine learning (ML) with a mathematical guarantee on the privacy of user data. DP quantifies the “privacy cost” of an algorithm, i.e., the level of guarantee that the algorithm’s output distribution for a given dataset will not change significantly if a single user’s data is added to or removed from it. The algorithm is characterized by two parameters, ε and δ, where smaller values of both indicate “more private”. There is a natural tension between the privacy budget (ε, δ) and the utility of the algorithm: a smaller privacy budget requires the output to be more “noisy”, often leading to less utility. Thus, a fundamental goal of D…  ( 93 min )
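Stated formally, the $(\varepsilon, \delta)$ guarantee sketched above requires that for any two datasets $D$, $D'$ differing in a single user's data and any set $S$ of possible outputs,

$$\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon}\, \Pr[\mathcal{M}(D') \in S] + \delta,$$

where $\mathcal{M}$ is the randomized algorithm; smaller $\varepsilon$ and $\delta$ mean the two output distributions are harder to tell apart.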
  • Open

    Power recommendations and search using an IMDb knowledge graph – Part 2
    This three-part series demonstrates how to use graph neural networks (GNNs) and Amazon Neptune to generate movie recommendations using the IMDb and Box Office Mojo Movies/TV/OTT licensable data package, which provides a wide range of entertainment metadata, including over 1 billion user ratings; credits for more than 11 million cast and crew members; 9 million […]  ( 8 min )
    Power recommendation and search using an IMDb knowledge graph – Part 1
    The IMDb and Box Office Mojo Movies/TV/OTT licensable data package provides a wide range of entertainment metadata, including over 1 billion user ratings; credits for more than 11 million cast and crew members; 9 million movie, TV, and entertainment titles; and global box office reporting data from more than 60 countries. Many AWS media and […]  ( 9 min )
    Accelerate the investment process with AWS Low Code-No Code services
    The last few years have seen a tremendous paradigm shift in how institutional asset managers source and integrate multiple data sources into their investment process. With frequent shifts in risk correlations, unexpected sources of volatility, and increasing competition from passive strategies, asset managers are employing a broader set of third-party data sources to gain a […]  ( 11 min )
  • Open

    3D Artist Edward McEvenue Animates Holiday Cheer This Week ‘In the NVIDIA Studio’
    3D artist Edward McEvenue shares his imaginative, holiday-themed short film "The Great Candy Inquisition" this week In the NVIDIA Studio. The post 3D Artist Edward McEvenue Animates Holiday Cheer This Week ‘In the NVIDIA Studio’ appeared first on NVIDIA Blog.  ( 8 min )
  • Open

    Top posts of 2022
    These were the most popular posts on my site this year. #10: How is portable radio possible? The length of an antenna is typically 1/2 or 1/4 of the length of the radio wave it’s designed to receive. How does an AM radio not need an antenna as long as a football field? See also […] Top posts of 2022 first appeared on John D. Cook.  ( 5 min )
  • Open

    Conditional Generative Adversarial Network for keystroke presentation attack. (arXiv:2212.08445v1 [cs.CR])
Cybersecurity is a crucial step in data protection to ensure user security and personal data privacy. In this sense, many companies have started to control and restrict access to their data using authentication systems. However, these traditional authentication methods are not enough to ensure data protection, and for this reason behavioral biometrics have gained importance. Despite their promising results and wide range of applications, biometric systems have been shown to be vulnerable to malicious attacks, such as presentation attacks. For this reason, in this work we propose to study a new approach aiming to deploy a presentation attack against a keystroke authentication system. Our idea is to use Conditional Generative Adversarial Networks (cGAN) for generating synthetic keystroke data that can be used for impersonating an authorized user. These synthetic data are generated following two different real use cases, one in which the order of the typed words is known (ordered dynamic) and the other in which this order is unknown (no-ordered dynamic). Finally, both keystroke dynamics (ordered and no-ordered) are validated using an external keystroke authentication system. Results indicate that the cGAN can effectively generate keystroke dynamics patterns that can be used for deceiving keystroke authentication systems.  ( 2 min )
    Detect, Retrieve, Comprehend: A Flexible Framework for Zero-Shot Document-Level Question Answering. (arXiv:2210.01959v2 [cs.CL] UPDATED)
    Researchers produce thousands of scholarly documents containing valuable technical knowledge. The community faces the laborious task of reading these documents to identify, extract, and synthesize information. To automate information gathering, document-level question answering (QA) offers a flexible framework where human-posed questions can be adapted to extract diverse knowledge. Finetuning QA systems requires access to labeled data (tuples of context, question and answer). However, data curation for document QA is uniquely challenging because the context (i.e. answer evidence passage) needs to be retrieved from potentially long, ill-formatted documents. Existing QA datasets sidestep this challenge by providing short, well-defined contexts that are unrealistic in real-world applications. We present a three-stage document QA approach: (1) text extraction from PDF; (2) evidence retrieval from extracted texts to form well-posed contexts; (3) QA to extract knowledge from contexts to return high-quality answers -- extractive, abstractive, or Boolean. Using QASPER for evaluation, our detect-retrieve-comprehend (DRC) system achieves a +7.19 improvement in Answer-F1 over existing baselines while delivering superior context selection. Our results demonstrate that DRC holds tremendous promise as a flexible framework for practical scientific document QA.
    Partially Observable RL with B-Stability: Unified Structural Condition and Sharp Sample-Efficient Algorithms. (arXiv:2209.14990v2 [cs.LG] UPDATED)
    Partial Observability -- where agents can only observe partial information about the true underlying state of the system -- is ubiquitous in real-world applications of Reinforcement Learning (RL). Theoretically, learning a near-optimal policy under partial observability is known to be hard in the worst case due to an exponential sample complexity lower bound. Recent work has identified several tractable subclasses that are learnable with polynomial samples, such as Partially Observable Markov Decision Processes (POMDPs) with certain revealing or decodability conditions. However, this line of research is still in its infancy, where (1) unified structural conditions enabling sample-efficient learning are lacking; (2) existing sample complexities for known tractable subclasses are far from sharp; and (3) fewer sample-efficient algorithms are available than in fully observable RL. This paper advances all three aspects above for Partially Observable RL in the general setting of Predictive State Representations (PSRs). First, we propose a natural and unified structural condition for PSRs called \emph{B-stability}. B-stable PSRs encompasses the vast majority of known tractable subclasses such as weakly revealing POMDPs, low-rank future-sufficient POMDPs, decodable POMDPs, and regular PSRs. Next, we show that any B-stable PSR can be learned with polynomial samples in relevant problem parameters. When instantiated in the aforementioned subclasses, our sample complexities improve substantially over the current best ones. Finally, our results are achieved by three algorithms simultaneously: Optimistic Maximum Likelihood Estimation, Estimation-to-Decisions, and Model-Based Optimistic Posterior Sampling. The latter two algorithms are new for sample-efficient learning of POMDPs/PSRs.
    Variable-Based Calibration for Machine Learning Classifiers. (arXiv:2209.15154v2 [cs.LG] UPDATED)
    The deployment of machine learning classifiers in high-stakes domains requires well-calibrated confidence scores for model predictions. In this paper we introduce the notion of variable-based calibration to characterize calibration properties of a model with respect to a variable of interest, generalizing traditional score-based calibration and metrics such as expected calibration error (ECE). In particular, we find that models with near-perfect ECE can exhibit significant variable-based calibration error as a function of features of the data. We demonstrate this phenomenon both theoretically and in practice on multiple well-known datasets, and show that it can persist after the application of existing recalibration methods. To mitigate this issue, we propose strategies for detection, visualization, and quantification of variable-based calibration error. We then examine the limitations of current score-based recalibration methods and explore potential modifications. Finally, we discuss the implications of these findings, emphasizing that an understanding of calibration beyond simple aggregate measures is crucial for endeavors such as fairness and model interpretability.
    Adversarial Inter-Group Link Injection Degrades the Fairness of Graph Neural Networks. (arXiv:2209.05957v2 [cs.LG] UPDATED)
    We present evidence for the existence and effectiveness of adversarial attacks on graph neural networks (GNNs) that aim to degrade fairness. These attacks can disadvantage a particular subgroup of nodes in GNN-based node classification, where nodes of the underlying network have sensitive attributes, such as race or gender. We conduct qualitative and experimental analyses explaining how adversarial link injection impairs the fairness of GNN predictions. For example, an attacker can compromise the fairness of GNN-based node classification by injecting adversarial links between nodes belonging to opposite subgroups and opposite class labels. Our experiments on empirical datasets demonstrate that adversarial fairness attacks can significantly degrade the fairness of GNN predictions (attacks are effective) with a low perturbation rate (attacks are efficient) and without a significant drop in accuracy (attacks are deceptive). This work demonstrates the vulnerability of GNN models to adversarial fairness attacks. We hope our findings raise awareness about this issue in our community and lay a foundation for the future development of GNN models that are more robust to such attacks.
    Domain Adaptation Principal Component Analysis: base linear method for learning with out-of-distribution data. (arXiv:2208.13290v2 [cs.LG] UPDATED)
Domain adaptation is a popular paradigm in modern machine learning which aims at tackling the problem of divergence (or shift) between the labeled training and validation datasets (source domain) and a potentially large unlabeled dataset (target domain). The task is to embed both datasets into a common space in which the source dataset is informative for training while the divergence between source and target is minimized. The most popular domain adaptation solutions are based on training neural networks that combine classification and adversarial learning modules, frequently making them both data-hungry and difficult to train. We present a method called Domain Adaptation Principal Component Analysis (DAPCA) that identifies a linear reduced data representation useful for solving the domain adaptation task. DAPCA introduces positive and negative weights between pairs of data points, and generalizes the supervised extension of principal component analysis. DAPCA is an iterative algorithm that solves a simple quadratic optimization problem at each iteration. The convergence of the algorithm is guaranteed, and the number of iterations is small in practice. We validate the suggested algorithm on previously proposed benchmarks for solving the domain adaptation task. We also show the benefit of using DAPCA in analyzing single-cell omics datasets in biomedical applications. Overall, DAPCA can serve as a practical preprocessing step in many machine learning applications, leading to reduced dataset representations that take into account possible divergence between source and target domains.
    Variational Graph Generator for Multi-View Graph Clustering. (arXiv:2210.07011v2 [cs.LG] UPDATED)
    Multi-view graph clustering (MGC) methods are increasingly being studied due to the explosion of multi-view data with graph structural information. The critical point of MGC is to better utilize the view-specific and view-common information in features and graphs of multiple views. However, existing works have an inherent limitation that they are unable to concurrently utilize the consensus graph information across multiple graphs and the view-specific feature information. To address this issue, we propose Variational Graph Generator for Multi-View Graph Clustering (VGMGC). Specifically, a novel variational graph generator is proposed to extract common information among multiple graphs. This generator infers a reliable variational consensus graph based on a priori assumption over multiple graphs. Then a simple yet effective graph encoder in conjunction with the multi-view clustering objective is presented to learn the desired graph embeddings for clustering, which embeds the inferred view-common graph and view-specific graphs together with features. Finally, theoretical results illustrate the rationality of VGMGC by analyzing the uncertainty of the inferred consensus graph with information bottleneck principle. Extensive experiments demonstrate the superior performance of our VGMGC over SOTAs.
    Image Amodal Completion: A Survey. (arXiv:2207.02062v2 [cs.CV] UPDATED)
    Existing computer vision systems can compete with humans in understanding the visible parts of objects, but still fall far short of humans when it comes to depicting the invisible parts of partially occluded objects. Image amodal completion aims to equip computers with human-like amodal completion functions to understand an intact object despite it being partially occluded. The main purpose of this survey is to provide an intuitive understanding of the research hotspots, key technologies and future trends in the field of image amodal completion. Firstly, we present a comprehensive review of the latest literature in this emerging field, exploring three key tasks in image amodal completion, including amodal shape completion, amodal appearance completion, and order perception. Then we examine popular datasets related to image amodal completion along with their common data collection methods and evaluation metrics. Finally, we discuss real-world applications and future research directions for image amodal completion, facilitating the reader's understanding of the challenges of existing technologies and upcoming research trends.
    From Play to Policy: Conditional Behavior Generation from Uncurated Robot Data. (arXiv:2210.10047v3 [cs.RO] UPDATED)
    While large-scale sequence modeling from offline data has led to impressive performance gains in natural language and image generation, directly translating such ideas to robotics has been challenging. One critical reason for this is that uncurated robot demonstration data, i.e. play data, collected from non-expert human demonstrators are often noisy, diverse, and distributionally multi-modal. This makes extracting useful, task-centric behaviors from such data a difficult generative modeling problem. In this work, we present Conditional Behavior Transformers (C-BeT), a method that combines the multi-modal generation ability of Behavior Transformer with future-conditioned goal specification. On a suite of simulated benchmark tasks, we find that C-BeT improves upon prior state-of-the-art work in learning from play data by an average of 45.7%. Further, we demonstrate for the first time that useful task-centric behaviors can be learned on a real-world robot purely from play data without any task labels or reward information. Robot videos are best viewed on our project website: https://play-to-policy.github.io
    PointConvFormer: Revenge of the Point-based Convolution. (arXiv:2208.02879v2 [cs.CV] UPDATED)
    We introduce PointConvFormer, a novel building block for point cloud based deep network architectures. Inspired by generalization theory, PointConvFormer combines ideas from point convolution, where filter weights are only based on relative position, and Transformers which utilize feature-based attention. In PointConvFormer, attention computed from feature difference between neighboring points is used to modify the convolutional weights at each point. Hence, invariances from point convolution are preserved, whereas attention helps to select relevant points in the neighborhood. PointConvFormer is suitable for multiple tasks that require details at the point level, such as segmentation and scene flow estimation tasks. We experiment on both tasks with multiple datasets including ScanNet, SemanticKitti, FlyingThings3D and KITTI. Our results show that PointConvFormer substantially outperforms classic convolutions, regular transformers, and voxelized sparse convolution approaches with much smaller and faster networks. Visualizations show that PointConvFormer performs similarly to convolution on flat areas, whereas the neighborhood selection effect is stronger on object boundaries, showing that it has got the best of both worlds.
    Jujutsu: A Two-stage Defense against Adversarial Patch Attacks on Deep Neural Networks. (arXiv:2108.05075v4 [cs.CR] UPDATED)
    Adversarial patch attacks create adversarial examples by injecting arbitrary distortions within a bounded region of the input to fool deep neural networks (DNNs). These attacks are robust (i.e., physically-realizable) and universally malicious, and hence represent a severe security threat to real-world DNN-based systems. We propose Jujutsu, a two-stage technique to detect and mitigate robust and universal adversarial patch attacks. We first observe that adversarial patches are crafted as localized features that yield large influence on the prediction output, and continue to dominate the prediction on any input. Jujutsu leverages this observation for accurate attack detection with low false positives. Patch attacks corrupt only a localized region of the input, while the majority of the input remains unperturbed. Therefore, Jujutsu leverages generative adversarial networks (GAN) to perform localized attack recovery by synthesizing the semantic contents of the input that are corrupted by the attacks, and reconstructs a ``clean'' input for correct prediction. We evaluate Jujutsu on four diverse datasets spanning 8 different DNN models, and find that it achieves superior performance and significantly outperforms four existing defenses. We further evaluate Jujutsu against physical-world attacks, as well as adaptive attacks.
    Backdoor Attacks on Time Series: A Generative Approach. (arXiv:2211.07915v4 [cs.LG] UPDATED)
    Backdoor attacks have emerged as one of the major security threats to deep learning models as they can easily control the model's test-time predictions by pre-injecting a backdoor trigger into the model at training time. While backdoor attacks have been extensively studied on images, few works have investigated the threat of backdoor attacks on time series data. To fill this gap, in this paper we present a novel generative approach for time series backdoor attacks against deep learning based time series classifiers. Backdoor attacks have two main goals: high stealthiness and high attack success rate. We find that, compared to images, it can be more challenging to achieve the two goals on time series. This is because time series have fewer input dimensions and lower degrees of freedom, making it hard to achieve a high attack success rate without compromising stealthiness. Our generative approach addresses this challenge by generating trigger patterns that are as realistic as real-time series patterns while achieving a high attack success rate without causing a significant drop in clean accuracy. We also show that our proposed attack is resistant to potential backdoor defenses. Furthermore, we propose a novel universal generator that can poison any type of time series with a single generator that allows universal attacks without the need to fine-tune the generative model for new time series datasets.
    DAGMA: Learning DAGs via M-matrices and a Log-Determinant Acyclicity Characterization. (arXiv:2209.08037v2 [cs.LG] UPDATED)
    The combinatorial problem of learning directed acyclic graphs (DAGs) from data was recently framed as a purely continuous optimization problem by leveraging a differentiable acyclicity characterization of DAGs based on the trace of a matrix exponential function. Existing acyclicity characterizations are based on the idea that powers of an adjacency matrix contain information about walks and cycles. In this work, we propose a new acyclicity characterization based on the log-determinant (log-det) function, which leverages the nilpotency property of DAGs. To deal with the inherent asymmetries of a DAG, we relate the domain of our log-det characterization to the set of $\textit{M-matrices}$, which is a key difference from the classical log-det function defined over the cone of positive definite matrices. Similar to previously proposed acyclicity functions, our characterization is also exact and differentiable. However, when compared to existing characterizations, our log-det function: (1) is better at detecting large cycles; (2) has better-behaved gradients; and (3) runs in practice about an order of magnitude faster. On the optimization side, we drop the typically used augmented Lagrangian scheme and propose DAGMA ($\textit{DAGs via M-matrices for Acyclicity}$), a method that resembles the central path for barrier methods. Each point in the central path of DAGMA is a solution to an unconstrained problem regularized by our log-det function, and we show that in the limit of the central path the solution is guaranteed to be a DAG. Finally, we provide extensive experiments for $\textit{linear}$ and $\textit{nonlinear}$ SEMs and show that our approach can reach large speed-ups and smaller structural Hamming distances against state-of-the-art methods. Code implementing the proposed method is open-source and publicly available at https://github.com/kevinsbello/dagma.
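    The log-det acyclicity function is simple to compute. A hedged NumPy sketch, based on our reading of the characterization h(W) = -log det(sI - W∘W) + d log s (with ∘ the elementwise product), is shown below:

        import numpy as np

        def h_logdet(W, s=1.0):
            """Log-det acyclicity function (our reading of the DAGMA paper):
            zero iff W is a DAG, provided sI - W*W stays in the M-matrix domain."""
            d = W.shape[0]
            sign, logabsdet = np.linalg.slogdet(s * np.eye(d) - W * W)  # W*W: elementwise square
            assert sign > 0, "sI - W*W left the M-matrix domain"
            return -logabsdet + d * np.log(s)

        # A DAG (strictly upper-triangular weights) gives h = 0, since the
        # determinant of the triangular matrix sI - W*W is s^d; adding a
        # directed cycle makes h strictly positive.
        W_dag = np.triu(np.random.randn(4, 4), k=1)
        print(h_logdet(W_dag))   # ~0.0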
    How Robust is Unsupervised Representation Learning to Distribution Shift?. (arXiv:2206.08871v2 [cs.LG] UPDATED)
    The robustness of machine learning algorithms to distribution shift is primarily discussed in the context of supervised learning (SL). As such, there is a lack of insight into the robustness to distribution shift of the representations learned from unsupervised methods, such as self-supervised learning (SSL) and auto-encoder based algorithms (AE). We posit that the input-driven objectives of unsupervised algorithms lead to representations that are more robust to distribution shift than the target-driven objective of SL. We verify this by extensively evaluating the performance of SSL and AE on both synthetic and realistic distribution shift datasets. Following observations that the linear layer used for classification itself can be susceptible to spurious correlations, we evaluate the representations using a linear head trained on a small amount of out-of-distribution (OOD) data, to isolate the robustness of the learned representations from that of the linear head. We also develop "controllable" versions of existing realistic domain generalisation datasets with adjustable degrees of distribution shift. This allows us to study the robustness of different learning algorithms under versatile yet realistic distribution shift conditions. Our experiments show that representations learned from unsupervised learning algorithms generalise better than SL under a wide variety of extreme as well as realistic distribution shifts.
    Estimation Contracts for Outlier-Robust Geometric Perception. (arXiv:2208.10521v2 [stat.ML] UPDATED)
    Outlier-robust estimation is a fundamental problem and has been extensively investigated by statisticians and practitioners. The last few years have seen a convergence across research fields towards "algorithmic robust statistics", which focuses on developing tractable outlier-robust techniques for high-dimensional estimation problems. Despite this convergence, research efforts across fields have been mostly disconnected from one another. This monograph bridges recent work on certifiable outlier-robust estimation for geometric perception in robotics and computer vision with parallel work in robust statistics. In particular, we adapt and extend recent results on robust linear regression (applicable to the low-outlier regime with fewer than 50% outliers) to the setup commonly found in robotics and vision, where (i) variables (e.g., rotations, poses) belong to a non-convex domain, (ii) measurements are vector-valued, and (iii) the number of outliers is not known a priori. The emphasis here is on performance guarantees: rather than proposing radically new algorithms, we provide conditions on the input measurements under which modern estimation algorithms (possibly after small modifications) are guaranteed to recover an estimate close to the ground truth in the presence of outliers. These conditions are what we call an "estimation contract". Besides the proposed extensions of existing results, we believe the main contributions of this monograph are (i) to unify parallel research lines by pointing out commonalities and differences, (ii) to introduce advanced material (e.g., sum-of-squares proofs) in an accessible and self-contained presentation for the practitioner, and (iii) to point out a few immediate opportunities and open questions in outlier-robust geometric perception.
    Dataset Inference for Self-Supervised Models. (arXiv:2209.09024v2 [cs.LG] UPDATED)
    Self-supervised models are increasingly prevalent in machine learning (ML) since they reduce the need for expensively labeled data. Because of their versatility in downstream applications, they are increasingly used as a service exposed via public APIs. At the same time, these encoder models are particularly vulnerable to model stealing attacks due to the high dimensionality of vector representations they output. Yet, encoders remain undefended: existing mitigation strategies for stealing attacks focus on supervised learning. We introduce a new dataset inference defense, which uses the private training set of the victim encoder model to attribute its ownership in the event of stealing. The intuition is that the log-likelihood of an encoder's output representations is higher on the victim's training data than on test data if it is stolen from the victim, but not if it is independently trained. We compute this log-likelihood using density estimation models. As part of our evaluation, we also propose measuring the fidelity of stolen encoders and quantifying the effectiveness of the theft detection without involving downstream tasks; instead, we leverage mutual information and distance measurements. Our extensive empirical results in the vision domain demonstrate that dataset inference is a promising direction for defending self-supervised models against model stealing.
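    As a rough sketch of the intuition (not the authors' pipeline), one could fit a density model to the suspect encoder's representations and test whether the victim's training split is assigned significantly higher likelihood than held-out data:

        import numpy as np
        from sklearn.mixture import GaussianMixture
        from scipy.stats import ttest_ind

        def dataset_inference_score(reps_train, reps_test, n_components=10, seed=0):
            """Hedged sketch of the paper's intuition: a stolen encoder's
            representations "remember" the victim's private training set,
            so their log-likelihood is higher on train than on test data."""
            gmm = GaussianMixture(n_components=n_components, random_state=seed)
            gmm.fit(np.concatenate([reps_train, reps_test]))
            ll_train = gmm.score_samples(reps_train)   # per-sample log-likelihood
            ll_test = gmm.score_samples(reps_test)
            t, p = ttest_ind(ll_train, ll_test, alternative="greater")
            return t, p   # a small p-value is evidence of stealing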
    Modeling Volatility and Dependence of European Carbon and Energy Prices. (arXiv:2208.14311v3 [q-fin.ST] UPDATED)
    We study the prices of European Emission Allowances (EUA), analyzing their uncertainty and their dependence on related energy prices (natural gas, coal, and oil). We propose a probabilistic multivariate conditional time series model with a VECM-Copula-GARCH structure which exploits key characteristics of the data. Data are normalized with respect to inflation and carbon emissions to allow for proper cross-series evaluation. The forecasting performance is evaluated in an extensive rolling-window forecasting study, covering eight years out-of-sample. We discuss our findings for both levels- and log-transformed data, focusing on time-varying correlations, and in view of the Russian invasion of Ukraine.
    Deep Learning and Its Applications to WiFi Human Sensing: A Benchmark and A Tutorial. (arXiv:2207.07859v2 [cs.LG] UPDATED)
    WiFi sensing has been evolving rapidly in recent years. Empowered by propagation models and deep learning methods, many challenging applications have been realized, such as WiFi-based human activity recognition and gesture recognition. However, in contrast to deep learning for visual recognition and natural language processing, no sufficiently comprehensive public benchmark exists. In this paper, we highlight the recent progress on deep learning enabled WiFi sensing, and then propose a benchmark, SenseFi, to study the effectiveness of various deep learning models for WiFi sensing. These advanced models are compared in terms of distinct sensing tasks, WiFi platforms, recognition accuracy, model size, computational complexity, feature transferability, and adaptability to unsupervised learning. The paper also serves as a tutorial on deep learning based WiFi sensing, from CSI hardware platforms to sensing algorithms. The extensive experiments provide practical guidance on deep model design, learning strategies, and training techniques for real-world applications. To the best of our knowledge, this is the first benchmark with an open-source library for deep learning in WiFi sensing research. The benchmark codes are available at https://github.com/CHENXINYAN-sg/WiFi-CSI-Sensing-Benchmark.
    Neural Implicit k-Space for Binning-free Non-Cartesian Cardiac MR Imaging. (arXiv:2212.08479v1 [eess.IV])
    In this work, we propose a novel image reconstruction framework that directly learns a neural implicit representation in k-space for ECG-triggered non-Cartesian Cardiac Magnetic Resonance Imaging (CMR). While existing methods bin acquired data from neighboring time points to reconstruct one phase of the cardiac motion, our framework allows for a continuous, binning-free, and subject-specific k-space representation. We assign a unique coordinate, consisting of time, coil index, and frequency domain location, to each sampled k-space point. We then learn the subject-specific mapping from these unique coordinates to k-space intensities using a multi-layer perceptron with frequency domain regularization. During inference, we obtain a complete k-space on Cartesian coordinates at an arbitrary temporal resolution. A simple inverse Fourier transform recovers the image, eliminating the need for density compensation and costly non-uniform Fourier transforms for non-Cartesian data. This novel imaging framework was tested on 42 radially sampled datasets from 6 subjects. The proposed method outperforms other techniques qualitatively and quantitatively, using data from four heartbeats and from a single heartbeat, with 30 cardiac phases. Our results for one-heartbeat reconstruction of 50 cardiac phases show improved artifact removal and spatio-temporal resolution, demonstrating the potential for real-time CMR.
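    A minimal sketch of such a coordinate-to-intensity network, with layer sizes and interfaces chosen purely for illustration, might look as follows:

        import torch
        import torch.nn as nn

        class ImplicitKSpace(nn.Module):
            """Sketch (our construction, not the authors' code): an MLP mapping a
            (time, coil, kx, ky) coordinate to a complex k-space intensity,
            represented as a (real, imag) pair."""
            def __init__(self, hidden=256, layers=4):
                super().__init__()
                dims = [4] + [hidden] * layers
                blocks = []
                for i in range(layers):
                    blocks += [nn.Linear(dims[i], dims[i + 1]), nn.ReLU()]
                self.net = nn.Sequential(*blocks, nn.Linear(hidden, 2))

            def forward(self, coords):          # coords: (B, 4) = (t, coil, kx, ky)
                out = self.net(coords)
                return torch.complex(out[:, 0], out[:, 1])

        # Training regresses the sampled k-space values; at inference, query the
        # MLP on a Cartesian grid and apply an inverse FFT to recover the image.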
    Successor Feature Representations. (arXiv:2110.15701v3 [cs.LG] UPDATED)
    Transfer in Reinforcement Learning aims to improve learning performance on target tasks using knowledge from experienced source tasks. Successor Representations (SR) and their extension Successor Features (SF) are prominent transfer mechanisms in domains where reward functions change between tasks. They reevaluate the expected return of previously learned policies in a new target task to transfer their knowledge. The SF framework extended SR by linearly decomposing rewards into successor features and a reward weight vector, allowing their application in high-dimensional tasks. However, this came at the cost of assuming a linear relationship between reward functions and successor features, limiting the framework's application to tasks with such structure. We propose a novel formulation of SR based on learning the cumulative discounted probability of successor features, called Successor Feature Representations (SFR). Crucially, SFR allows reevaluating the expected return of policies for general reward functions. We introduce different SFR variations, prove its convergence, and provide a guarantee on its transfer performance. Experimental evaluations based on SFR with function approximation demonstrate its advantage over SF not only for general reward functions, but also in the case of linearly decomposable reward functions.
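    For readers unfamiliar with the successor-feature machinery SFR builds on, a tabular sketch of the classic SF temporal-difference recursion (shapes and environment interface are our assumptions) is:

        import numpy as np

        # Classic Successor Feature recursion that SFR generalizes:
        # psi(s, a) <- phi(s, a, s') + gamma * psi(s', a'), a' from the current policy.
        n_states, n_actions, feat_dim, gamma, lr = 10, 4, 8, 0.95, 0.1
        psi = np.zeros((n_states, n_actions, feat_dim))

        def sf_td_update(s, a, phi_sas, s_next, a_next):
            """One TD step toward the successor-feature fixed point."""
            target = phi_sas + gamma * psi[s_next, a_next]
            psi[s, a] += lr * (target - psi[s, a])

        # With a linear reward r = phi . w, the policy's return is psi(s, a) . w,
        # which is what lets SF transfer across reward weight vectors w; SFR's
        # contribution is removing this linearity assumption.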
    GeneFormer: Learned Gene Compression using Transformer-based Context Modeling. (arXiv:2212.08379v1 [cs.LG])
    With the development of gene sequencing technology, an explosive growth of gene data has been witnessed, and the storage of gene data has become an important issue. Traditional gene data compression methods rely on general-purpose software such as gzip, which fails to exploit the interrelations within nucleotide sequences. Recently, many researchers have begun to investigate deep learning based gene data compression methods. In this paper, we propose a transformer-based gene compression method named GeneFormer. Specifically, we first introduce a modified transformer structure to fully exploit the nucleotide sequence dependency. Then, we propose fixed-length parallel grouping to accelerate the decoding speed of our autoregressive model. Experimental results on real-world datasets show that our method saves 29.7% in bit rate compared with the state-of-the-art method, and the decoding speed is significantly faster than that of all existing learning-based gene compression methods.
    Audio-based AI classifiers show no evidence of improved COVID-19 screening over simple symptoms checkers. (arXiv:2212.08570v1 [cs.SD])
    Recent work has reported that AI classifiers trained on audio recordings can accurately predict severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection status. Here, we undertake a large-scale study of audio-based deep learning classifiers, as part of the UK government's pandemic response. We collect and analyse a dataset of audio recordings from 67,842 individuals with linked metadata, including reverse transcription polymerase chain reaction (PCR) test outcomes, of whom 23,514 tested positive for SARS-CoV-2. Subjects were recruited via the UK government's National Health Service Test-and-Trace programme and the REal-time Assessment of Community Transmission (REACT) randomised surveillance survey. In an unadjusted analysis of our dataset, AI classifiers predict SARS-CoV-2 infection status with high accuracy (Receiver Operating Characteristic Area Under the Curve (ROC-AUC) 0.846 [0.838, 0.854]), consistent with the findings of previous studies. However, after matching on measured confounders, such as age, gender, and self-reported symptoms, our classifiers' performance is much weaker (ROC-AUC 0.619 [0.594, 0.644]). Upon quantifying the utility of audio-based classifiers in practical settings, we find them to be outperformed by simple predictive scores based on user-reported symptoms.
    A Rigorous Information-Theoretic Definition of Redundancy and Relevancy in Feature Selection Based on (Partial) Information Decomposition. (arXiv:2105.04187v3 [cs.IT] UPDATED)
    Selecting a minimal feature set that is maximally informative about a target variable is a central task in machine learning and statistics. Information theory provides a powerful framework for formulating feature selection algorithms -- yet, a rigorous, information-theoretic definition of feature relevancy, which accounts for feature interactions such as redundant and synergistic contributions, is still missing. We argue that this lack is inherent to classical information theory, which does not provide measures to decompose the information a set of variables provides about a target into unique, redundant, and synergistic contributions. Such a decomposition has been introduced only recently by the partial information decomposition (PID) framework. Using PID, we clarify why feature selection is a conceptually difficult problem when approached using information theory, and provide a novel definition of feature relevancy and redundancy in PID terms. From this definition, we show that the conditional mutual information (CMI) maximizes relevancy while minimizing redundancy, and propose an iterative, CMI-based algorithm for practical feature selection. We demonstrate the power of our CMI-based algorithm in comparison to the unconditional mutual information on benchmark examples, and provide corresponding PID estimates to highlight how PID allows quantifying the information contributions of features and their interactions in feature-selection problems.
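    A hedged sketch of the resulting iterative CMI-based selection, using a deliberately crude histogram CMI estimator and a 1-D conditioning summary in place of the estimators a real implementation would need:

        import numpy as np

        def cmi_binned(x, y, z, bins=8):
            """Crude histogram estimate of I(X; Y | Z) for 1-D variables."""
            xb = np.digitize(x, np.histogram_bin_edges(x, bins))
            yb = np.digitize(y, np.histogram_bin_edges(y, bins))
            zb = (np.digitize(z, np.histogram_bin_edges(z, bins))
                  if z is not None else np.zeros_like(xb))
            cmi = 0.0
            for zv in np.unique(zb):                      # I(X;Y|Z) = sum_z p(z) I(X;Y|Z=z)
                m = zb == zv
                joint, _, _ = np.histogram2d(xb[m], yb[m], bins=bins)
                pxy = joint / joint.sum()
                px, py = pxy.sum(1, keepdims=True), pxy.sum(0, keepdims=True)
                nz = pxy > 0
                cmi += m.mean() * (pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum()
            return cmi

        def greedy_cmi_selection(X, y, k):
            """Add, at each step, the feature with the largest CMI with the
            target conditioned on the features selected so far."""
            selected = []
            for _ in range(k):
                z = X[:, selected].sum(1) if selected else None   # crude 1-D summary
                best = max((i for i in range(X.shape[1]) if i not in selected),
                           key=lambda i: cmi_binned(X[:, i], y, z))
                selected.append(best)
            return selected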
    MURMUR: Modular Multi-Step Reasoning for Semi-Structured Data-to-Text Generation. (arXiv:2212.08607v1 [cs.CL])
    Prompting large language models has enabled significant recent progress in multi-step reasoning over text. However, when applied to text generation from semi-structured data (e.g., graphs or tables), these methods typically suffer from low semantic coverage, hallucination, and logical inconsistency. We propose MURMUR, a neuro-symbolic modular approach to text generation from semi-structured data with multi-step reasoning. MURMUR is a best-first search method that generates reasoning paths using: (1) neural and symbolic modules with specific linguistic and logical skills, (2) a grammar whose production rules define valid compositions of modules, and (3) value functions that assess the quality of each reasoning step. We conduct experiments on two diverse data-to-text generation tasks, WebNLG and LogicNLG. These tasks differ in their data representations (graphs and tables) and span multiple linguistic and logical skills. MURMUR obtains significant improvements over recent few-shot baselines like direct prompting and chain-of-thought prompting, while also achieving comparable performance to fine-tuned GPT-2 on out-of-domain data. Moreover, human evaluation shows that MURMUR generates highly faithful and correct reasoning paths that lead to 26% more logically consistent summaries on LogicNLG, compared to direct prompting.
    Quantifying the Preferential Direction of the Model Gradient in Adversarial Training With Projected Gradient Descent. (arXiv:2009.04709v4 [stat.ML] UPDATED)
    Adversarial training, especially projected gradient descent (PGD), has proven to be a successful approach for improving robustness against adversarial attacks. After adversarial training, gradients of models with respect to their inputs have a preferential direction. However, the direction of alignment is not mathematically well established, making it difficult to evaluate quantitatively. We propose a novel definition of this direction as the direction of the vector pointing toward the closest point of the support of the closest inaccurate class in decision space. To evaluate the alignment with this direction after adversarial training, we apply a metric that uses generative adversarial networks to produce the smallest residual needed to change the class present in the image. We show that PGD-trained models have a higher alignment than the baseline according to our definition, that our metric presents higher alignment values than a competing metric formulation, and that enforcing this alignment increases the robustness of models.
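    For context, a generic sketch of the PGD adversarial-training setup the paper analyzes (standard formulation, not the authors' code; a [0,1] input range is assumed):

        import torch
        import torch.nn.functional as F

        def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
            """Standard L-infinity PGD: iterated signed-gradient ascent on the
            loss, projected back into the eps-ball around x."""
            delta = torch.empty_like(x).uniform_(-eps, eps).requires_grad_(True)
            for _ in range(steps):
                loss = F.cross_entropy(model(x + delta), y)
                grad, = torch.autograd.grad(loss, delta)
                delta = (delta + alpha * grad.sign()).clamp(-eps, eps)
                delta = delta.detach().requires_grad_(True)
            return (x + delta).clamp(0, 1).detach()

        def adv_train_step(model, optimizer, x, y):
            model.train()
            x_adv = pgd_attack(model, x, y)             # inner maximization
            loss = F.cross_entropy(model(x_adv), y)     # outer minimization
            optimizer.zero_grad(); loss.backward(); optimizer.step()
            return loss.item()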
    Planning Visual-Tactile Precision Grasps via Complementary Use of Vision and Touch. (arXiv:2212.08604v1 [cs.RO])
    Reliably planning fingertip grasps for multi-fingered hands lies as a key challenge for many tasks including tool use, insertion, and dexterous in-hand manipulation. This task becomes even more difficult when the robot lacks an accurate model of the object to be grasped. Tactile sensing offers a promising approach to account for uncertainties in object shape. However, current robotic hands tend to lack full tactile coverage. As such, a problem arises of how to plan and execute grasps for multi-fingered hands such that contact is made with the area covered by the tactile sensors. To address this issue, we propose an approach to grasp planning that explicitly reasons about where the fingertips should contact the estimated object surface while maximizing the probability of grasp success. Key to our method's success is the use of visual surface estimation for initial planning to encode the contact constraint. The robot then executes this plan using a tactile-feedback controller that enables the robot to adapt to online estimates of the object's surface to correct for errors in the initial plan. Importantly, the robot never explicitly integrates object pose or surface estimates between visual and tactile sensing, instead it uses the two modalities in complementary ways. Vision guides the robots motion prior to contact; touch updates the plan when contact occurs differently than predicted from vision. We show that our method successfully synthesises and executes precision grasps for previously unseen objects using surface estimates from a single camera view. Further, our approach outperforms a state of the art multi-fingered grasp planner, while also beating several baselines we propose.
    Flexible Diffusion Modeling of Long Videos. (arXiv:2205.11495v3 [cs.CV] UPDATED)
    We present a framework for video modeling based on denoising diffusion probabilistic models that produces long-duration video completions in a variety of realistic environments. We introduce a generative model that can, at test time, sample an arbitrary subset of video frames conditioned on any other subset, and present an architecture adapted for this purpose. Doing so allows us to efficiently compare and optimize a variety of schedules for the order in which frames in a long video are sampled, and to use selective sparse and long-range conditioning on previously sampled frames. We demonstrate improved video modeling over prior work on a number of datasets and sample temporally coherent videos over 25 minutes in length. We additionally release a new video modeling dataset and semantically meaningful metrics based on videos generated in the CARLA autonomous driving simulator.
    SplitGP: Achieving Both Generalization and Personalization in Federated Learning. (arXiv:2212.08343v1 [cs.LG])
    A fundamental challenge in providing edge-AI services is the need for a machine learning (ML) model that achieves personalization (i.e., to individual clients) and generalization (i.e., to unseen data) properties concurrently. Existing techniques in federated learning (FL) have encountered a steep tradeoff between these objectives and impose large computational requirements on edge devices during training and inference. In this paper, we propose SplitGP, a new split learning solution that can simultaneously capture generalization and personalization capabilities for efficient inference across resource-constrained clients (e.g., mobile/IoT devices). Our key idea is to split the full ML model into client-side and server-side components, and to assign different roles to them: the client-side model is trained to have strong personalization capability optimized for each client's main task, while the server-side model is trained to have strong generalization capability for handling all clients' out-of-distribution tasks. We analytically characterize the convergence behavior of SplitGP, revealing that all client models approach stationary points asymptotically. Further, we analyze the inference time in SplitGP and provide bounds for determining model split ratios. Experimental results show that SplitGP outperforms existing baselines by wide margins in inference time and test accuracy for varying amounts of out-of-distribution samples.
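    The split itself can be illustrated with a toy PyTorch layout (our construction; the paper's actual architecture and split point will differ):

        import torch.nn as nn

        # Client-side layers are personalized per client; server-side layers are
        # shared and trained for generalization to out-of-distribution inputs.
        full_model = nn.Sequential(
            nn.Flatten(),
            nn.Linear(784, 256), nn.ReLU(),   # client-side: personalized locally
            nn.Linear(256, 128), nn.ReLU(),   # server-side: kept general
            nn.Linear(128, 10),
        )
        split_at = 3  # layers [0, split_at) stay on the client
        client_side = full_model[:split_at]
        server_side = full_model[split_at:]
        # At inference, the client-side model (paired with a small local head,
        # omitted here) can handle in-distribution inputs alone, while harder
        # out-of-distribution inputs are forwarded to the server side.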
    Best-Answer Prediction in Q&A Sites Using User Information. (arXiv:2212.08475v1 [cs.CL])
    Community Question Answering (CQA) sites have spread and multiplied significantly in recent years. Sites like Reddit, Quora, and Stack Exchange are becoming popular amongst people interested in finding answers to diverse questions. One practical way of finding such answers is automatically predicting the best candidate given existing answers and comments. Many studies have been conducted on answer prediction in CQA, but with limited focus on using the background information of the questioners. We address this limitation using a novel method for predicting the best answers using the questioner's background information and other features, such as the textual content or the relationships with other participants. Our answer classification model was trained using the Stack Exchange dataset and validated using the Area Under the Curve (AUC) metric. The experimental results show that the proposed method complements previous methods by pointing out the importance of the relationships between users, particularly through the level of involvement in different communities on Stack Exchange. Furthermore, we point out that there is little overlap between user-relation information and the information represented by the shallow text features and the meta-features, such as time differences.
    Biomedical image analysis competitions: The state of current participation practice. (arXiv:2212.08568v1 [cs.CV])
    The number of international benchmarking competitions is steadily increasing in various fields of machine learning (ML) research and practice. So far, however, little is known about the common practice as well as the bottlenecks faced by the community in tackling the research questions posed. To shed light on the status quo of algorithm development in the specific field of biomedical image analysis, we designed an international survey that was issued to all participants of challenges conducted in conjunction with the IEEE ISBI 2021 and MICCAI 2021 conferences (80 competitions in total). The survey covered participants' expertise and working environments, their chosen strategies, as well as algorithm characteristics. A median of 72% of challenge participants took part in the survey. According to our results, knowledge exchange was the primary incentive (70%) for participation, while the reception of prize money played only a minor role (16%). While a median of 80 working hours was spent on method development, a large portion of participants stated that they did not have enough time for method development (32%). 25% perceived the infrastructure to be a bottleneck. Overall, 94% of all solutions were deep learning-based. Of these, 84% were based on standard architectures. 43% of the respondents reported that the data samples (e.g., images) were too large to be processed at once. This was most commonly addressed by patch-based training (69%), downsampling (37%), and solving 3D analysis tasks as a series of 2D tasks. K-fold cross-validation on the training set was performed by only 37% of the participants, and only 50% of the participants performed ensembling, based on multiple identical models (61%) or heterogeneous models (39%). 48% of the respondents applied postprocessing steps.
    Efficient Conditionally Invariant Representation Learning. (arXiv:2212.08645v1 [cs.LG])
    We introduce the Conditional Independence Regression CovariancE (CIRCE), a measure of conditional independence for multivariate continuous-valued variables. CIRCE applies as a regularizer in settings where we wish to learn neural features $\varphi(X)$ of data $X$ to estimate a target $Y$, while being conditionally independent of a distractor $Z$ given $Y$. Both $Z$ and $Y$ are assumed to be continuous-valued but relatively low dimensional, whereas $X$ and its features may be complex and high dimensional. Relevant settings include domain-invariant learning, fairness, and causal learning. The procedure requires just a single ridge regression from $Y$ to kernelized features of $Z$, which can be done in advance. It is then only necessary to enforce independence of $\varphi(X)$ from residuals of this regression, which is possible with attractive estimation properties and consistency guarantees. By contrast, earlier measures of conditional feature dependence require multiple regressions for each step of feature learning, resulting in more severe bias and variance, and greater computational cost. When sufficiently rich features are used, we establish that CIRCE is zero if and only if $\varphi(X) \perp \!\!\! \perp Z \mid Y$. In experiments, we show superior performance to previous methods on challenging benchmarks, including learning conditionally invariant image features.
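    A compact sketch of the recipe as we read it, using raw Z features in place of a full kernel expansion and a squared cross-covariance as the penalty:

        import numpy as np
        from sklearn.kernel_ridge import KernelRidge

        def circe_penalty(phi_x, z, y, alpha=1e-2):
            """Hedged sketch of the CIRCE idea: (1) one ridge regression from Y
            to features of Z, done once in advance; (2) penalize the
            cross-covariance between learned features phi(X) and the residuals.
            phi_x, z, y are 2-D arrays with one row per sample."""
            reg = KernelRidge(alpha=alpha, kernel="rbf").fit(y, z)   # step 1: Y -> Z
            resid = z - reg.predict(y)                               # residuals of Z given Y
            phi_c = phi_x - phi_x.mean(0)
            res_c = resid - resid.mean(0)
            cov = phi_c.T @ res_c / len(z)                           # cross-covariance matrix
            return np.sum(cov ** 2)   # driven toward zero during feature learning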
    Fast Rule-Based Decoding: Revisiting Syntactic Rules in Neural Constituency Parsing. (arXiv:2212.08458v1 [cs.CL])
    Most recent studies on neural constituency parsing focus on encoder structures, while few developments are devoted to decoders. Previous research has demonstrated that probabilistic statistical methods based on syntactic rules are particularly effective in constituency parsing, whereas syntactic rules were not used during the training of neural models in prior work, probably due to their enormous computational requirements. In this paper, we first implement a fast CKY decoding procedure harnessing GPU acceleration, based on which we further derive a syntactic rule-based (rule-constrained) CKY decoding. In the experiments, our method obtains 95.89 and 92.52 F1 on the PTB and CTB datasets respectively, showing significant improvements compared with previous approaches. Besides, our parser achieves strong and competitive cross-domain performance in zero-shot settings.
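    For reference, the plain CKY dynamic program that the paper accelerates and rule-constrains looks roughly like this (a generic textbook sketch):

        import numpy as np

        def cky_decode(span_scores, n):
            """Plain CKY over span scores; span_scores[i][j][l] is the score of
            label l for span (i, j). A rule-constrained variant would mask
            label combinations not licensed by the grammar."""
            best = np.full((n + 1, n + 1), -np.inf)
            back = {}
            for i in range(n):                       # width-1 spans
                l = int(np.argmax(span_scores[i][i + 1]))
                best[i][i + 1], back[(i, i + 1)] = span_scores[i][i + 1][l], (l, None)
            for width in range(2, n + 1):            # longer spans, bottom-up
                for i in range(n - width + 1):
                    j = i + width
                    k = max(range(i + 1, j), key=lambda s: best[i][s] + best[s][j])
                    l = int(np.argmax(span_scores[i][j]))
                    best[i][j] = span_scores[i][j][l] + best[i][k] + best[k][j]
                    back[(i, j)] = (l, k)
            return best[0][n], back   # backpointers encode the parse tree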
    Federated Learning with Flexible Control. (arXiv:2212.08496v1 [cs.LG])
    Federated learning (FL) enables distributed model training from local data collected by users. In distributed systems with constrained resources and potentially high dynamics, e.g., mobile edge networks, the efficiency of FL is an important problem. Existing works have separately considered different configurations to make FL more efficient, such as infrequent transmission of model updates, client subsampling, and compression of update vectors. However, an important open problem is how to jointly apply and tune these control knobs in a single FL algorithm, to achieve the best performance by allowing a high degree of freedom in control decisions. In this paper, we address this problem and propose FlexFL - an FL algorithm with multiple options that can be adjusted flexibly. Our FlexFL algorithm allows both arbitrary rates of local computation at clients and arbitrary amounts of communication between clients and the server, making both the computation and communication resource consumption adjustable. We prove a convergence upper bound of this algorithm. Based on this result, we further propose a stochastic optimization formulation and algorithm to determine the control decisions that (approximately) minimize the convergence bound, while conforming to constraints related to resource consumption. The advantage of our approach is also verified using experiments.
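    The three control knobs can be pictured in a toy round of FL (our illustrative loop, not the FlexFL algorithm; the client gradient interface is assumed):

        import numpy as np

        def topk_sparsify(v, k):
            """Compression knob: keep only the k largest-magnitude entries."""
            out = np.zeros_like(v)
            idx = np.argsort(np.abs(v))[-k:]
            out[idx] = v[idx]
            return out

        def fl_round(global_w, clients, sample_frac=0.5, local_steps=5, lr=0.1, k=100):
            """One round with the three knobs tuned jointly: client subsampling,
            a variable number of local steps, and update compression."""
            chosen = np.random.choice(len(clients),
                                      max(1, int(sample_frac * len(clients))),
                                      replace=False)
            updates = []
            for c in chosen:
                w = global_w.copy()
                for _ in range(local_steps):      # infrequent transmission: many local steps
                    w -= lr * clients[c].grad(w)  # clients[c].grad is an assumed interface
                updates.append(topk_sparsify(w - global_w, k))
            return global_w + np.mean(updates, axis=0)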
    Convolution-enhanced Evolving Attention Networks. (arXiv:2212.08330v1 [cs.LG])
    Attention-based neural networks, such as Transformers, have become ubiquitous in numerous applications, including computer vision, natural language processing, and time-series analysis. In all kinds of attention networks, the attention maps are crucial as they encode semantic dependencies between input tokens. However, most existing attention networks perform modeling or reasoning based on representations, wherein the attention maps of different layers are learned separately without explicit interactions. In this paper, we propose a novel and generic evolving attention mechanism, which directly models the evolution of inter-token relationships through a chain of residual convolutional modules. The motivations are twofold. On the one hand, the attention maps in different layers share transferable knowledge, so adding a residual connection can facilitate the information flow of inter-token relationships across layers. On the other hand, there is naturally an evolutionary trend among attention maps at different abstraction levels, so it is beneficial to exploit a dedicated convolution-based module to capture this process. Equipped with the proposed mechanism, the convolution-enhanced evolving attention networks achieve superior performance in various applications, including time-series representation, natural language understanding, machine translation, and image classification. Especially on time-series representation tasks, the Evolving Attention-enhanced Dilated Convolutional (EA-DC-) Transformer outperforms state-of-the-art models significantly, achieving an average improvement of 17% over the best SOTA. To the best of our knowledge, this is the first work that explicitly models the layer-wise evolution of attention maps. Our implementation is available at https://github.com/pkuyym/EvolvingAttention.
    POTATO: The Portable Text Annotation Tool. (arXiv:2212.08620v1 [cs.CL])
    We present POTATO, the Portable text annotation tool, a free, fully open-sourced annotation system that 1) supports labeling many types of text and multimodal data; 2) offers easy-to-configure features to maximize the productivity of both deployers and annotators (convenient templates for common ML/NLP tasks, active learning, keypress shortcuts, keyword highlights, tooltips); and 3) supports a high degree of customization (editable UI, inserting pre-screening questions, attention and qualification tests). Experiments over two annotation tasks suggest that POTATO improves labeling speed through its specially-designed productivity features, especially for long documents and complex tasks. POTATO is available at https://github.com/davidjurgens/potato and will continue to be updated.
    Development of A Real-time POCUS Image Quality Assessment and Acquisition Guidance System. (arXiv:2212.08624v1 [eess.IV])
    Point-of-care ultrasound (POCUS) is one of the most commonly applied tools for cardiac function imaging in the clinical routine of the emergency department and the pediatric intensive care unit. Prior studies demonstrate that AI-assisted software can guide nurses or novices without prior sonography experience to acquire POCUS by recognizing the region of interest, assessing image quality, and providing instructions. However, these AI algorithms cannot simply replace the role of skilled sonographers in acquiring diagnostic-quality POCUS. Unlike chest X-ray, CT, and MRI, which have standardized imaging protocols, POCUS can be acquired with high inter-observer variability. Despite this variability, the acquired images are usually clinically acceptable and interpretable. In challenging clinical environments, sonographers employ novel heuristics to acquire POCUS in complex scenarios. To help novice learners expedite the training process while reducing the dependency on experienced sonographers in curriculum implementation, we will develop a framework that performs real-time AI-assisted quality assessment and probe position guidance, supporting the training of novice learners with less manual intervention.  ( 2 min )
    Convolutional Filtering in Simplicial Complexes. (arXiv:2201.12584v2 [eess.SP] UPDATED)
    This paper proposes convolutional filtering for data whose structure can be modeled by a simplicial complex (SC). SCs are mathematical tools that not only capture pairwise relationships, as graphs do, but also account for higher-order network structures. These filters are built by following the shift-and-sum principle of the convolution operation and rely on the Hodge Laplacians to shift the signal within the simplex. Since SCs also exhibit inter-simplex coupling, we use the incidence matrices to transfer the signal to adjacent simplices and build a filter bank to jointly filter signals from different levels. We prove several interesting properties of the proposed filter bank, including permutation and orientation equivariance, a computational complexity that is linear in the SC dimension, and a spectral interpretation using the simplicial Fourier transform. We illustrate the proposed approach with numerical experiments.  ( 2 min )
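    A small NumPy sketch of such a shift-and-sum simplicial filter on edge signals, with filter orders and coefficients as free parameters:

        import numpy as np

        def simplicial_filter(x, L_down, L_up, h_down, h_up, h0):
            """Shift-and-sum simplicial convolution on an edge signal x:
            H(L) x = h0 * x + sum_k hd_k L_down^k x + sum_k hu_k L_up^k x."""
            y = h0 * x
            xd, xu = x.copy(), x.copy()
            for hd, hu in zip(h_down, h_up):
                xd, xu = L_down @ xd, L_up @ xu   # shift within the simplex
                y += hd * xd + hu * xu
            return y

        # For edge signals, L_down = B1.T @ B1 and L_up = B2 @ B2.T, where B1
        # and B2 are the node-edge and edge-triangle incidence matrices; the
        # incidence matrices also transfer signals to adjacent simplex levels.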
    Generalization Bounds for Inductive Matrix Completion in Low-noise Settings. (arXiv:2212.08339v1 [cs.LG])
    We study inductive matrix completion (matrix completion with side information) under an i.i.d. subgaussian noise assumption at a low noise regime, with uniform sampling of the entries. We obtain for the first time generalization bounds with the following three properties: (1) they scale like the standard deviation of the noise and, in particular, approach zero in the exact recovery case; (2) even in the presence of noise, they converge to zero as the sample size approaches infinity; and (3) for a fixed dimension of the side information, they have only a logarithmic dependence on the size of the matrix. Unlike many works on approximate recovery, we present results both for bounded Lipschitz losses and for the absolute loss, with the latter relying on Talagrand-type inequalities. The proofs create a bridge between two approaches to the theoretical analysis of matrix completion, since they consist of a combination of techniques from both the exact recovery literature and the approximate recovery literature.  ( 2 min )
    RaLiBEV: Radar and LiDAR BEV Fusion Learning for Anchor Box Free Object Detection System. (arXiv:2211.06108v2 [cs.CV] UPDATED)
    In autonomous driving systems, LiDAR and radar play important roles in the perception of the surrounding environment. LiDAR provides accurate 3D spatial sensing information but cannot work in adverse weather like fog. On the other hand, the radar signal can be diffracted when encountering raindrops or mist particles owing to its wavelength, but it suffers from large noise. Recent state-of-the-art works reveal that the fusion of radar and LiDAR can lead to robust detection in adverse weather. Existing works adopt convolutional neural network architectures to extract features from each sensor data stream, then align and aggregate the two branch features to predict object detection results. However, these methods have low accuracy of bounding box estimation due to a simple design of the label assignment and fusion strategies. In this paper, we propose a bird's-eye view fusion learning-based anchor box-free object detection system, which fuses the features derived from the radar range-azimuth heatmap and the LiDAR point cloud to estimate possible objects. Different label assignment strategies have been designed to facilitate consistency between the classification of foreground or background anchor points and the corresponding bounding box regressions. In addition, the performance of the proposed object detector is further enhanced by employing a novel interactive transformer module. The superior performance of the proposed methods has been demonstrated using the recently published Oxford Radar RobotCar dataset, showing that the average precision of our system significantly outperforms the best state-of-the-art method by 14.4% and 20.5% at an IoU of 0.8 in clear and foggy weather testing, respectively.  ( 2 min )
    When to Update Your Model: Constrained Model-based Reinforcement Learning. (arXiv:2210.08349v2 [cs.LG] UPDATED)
    Designing and analyzing model-based RL (MBRL) algorithms with guaranteed monotonic improvement has been challenging, mainly due to the interdependence between policy optimization and model learning. Existing discrepancy bounds generally ignore the impact of model shifts, and their corresponding algorithms are prone to degraded performance from drastic model updating. In this work, we first propose a novel and general theoretical scheme for a non-decreasing performance guarantee of MBRL. Our derived bounds reveal the relationship between model shifts and performance improvement. These discoveries encourage us to formulate a constrained lower-bound optimization problem to permit the monotonicity of MBRL. A further example demonstrates that learning models from a dynamically-varying number of explorations benefits the eventual returns. Motivated by these analyses, we design a simple but effective algorithm, CMLO (Constrained Model-shift Lower-bound Optimization), by introducing an event-triggered mechanism that flexibly determines when to update the model. Experiments show that CMLO surpasses other state-of-the-art methods and produces a boost when various policy optimization methods are employed.  ( 2 min )
    Revisiting Neuron Coverage for DNN Testing: A Layer-Wise and Distribution-Aware Criterion. (arXiv:2112.01955v2 [cs.LG] UPDATED)
    Various deep neural network (DNN) coverage criteria have been proposed to assess DNN test inputs and steer input mutations. The coverage is characterized via neurons having certain outputs, or via the discrepancy between neuron outputs. Nevertheless, recent research indicates that neuron coverage criteria show little correlation with test suite quality. In general, DNNs approximate distributions, by incorporating hierarchical layers, to make predictions for inputs. Thus, we advocate deducing DNN behaviors from the approximated distributions at the layer level. A test suite should be assessed using its induced layer output distributions. Accordingly, to fully examine DNN behaviors, input mutation should be directed toward diversifying the approximated distributions. This paper summarizes eight design requirements for DNN coverage criteria, taking into account distribution properties and practical concerns. We then propose a new criterion, NeuraL Coverage (NLC), that satisfies all design requirements. NLC treats a single DNN layer as the basic computational unit (rather than a single neuron) and captures four critical properties of neuron output distributions. Thus, NLC accurately describes how DNNs comprehend inputs via approximated distributions. We demonstrate that NLC is significantly correlated with the diversity of a test suite across a number of tasks (classification and generation) and data formats (image and text). Its capacity to discover DNN prediction errors is promising. Test input mutation guided by NLC results in greater quality and diversity of exposed erroneous behaviors.  ( 2 min )
    Decoder Tuning: Efficient Language Understanding as Decoding. (arXiv:2212.08408v1 [cs.CL])
    With the ever-growing sizes of pre-trained models (PTMs), it has become an emerging practice to only provide the inference APIs for users, namely the model-as-a-service (MaaS) setting. To adapt PTMs with model parameters frozen, most current approaches focus on the input side, seeking powerful prompts to stimulate models for correct answers. However, we argue that input-side adaptation can be arduous due to the lack of gradient signals, and it usually requires thousands of API queries, resulting in high computation and time costs. In light of this, we present Decoder Tuning (DecT), which in contrast optimizes task-specific decoder networks on the output side. Specifically, DecT first extracts prompt-stimulated output scores for initial predictions. On top of that, we train an additional decoder network on the output representations to incorporate posterior data knowledge. By gradient-based optimization, DecT can be trained within several seconds and requires only one PTM query per sample. Empirically, we conduct extensive natural language understanding experiments and show that DecT significantly outperforms state-of-the-art algorithms with a $10^3\times$ speed-up.  ( 2 min )
    Statistical Design and Analysis for Robust Machine Learning: A Case Study from COVID-19. (arXiv:2212.08571v1 [cs.SD])
    Since early in the coronavirus disease 2019 (COVID-19) pandemic, there has been interest in using artificial intelligence methods to predict COVID-19 infection status based on vocal audio signals, for example cough recordings. However, existing studies have limitations in terms of data collection and of the assessment of the performances of the proposed predictive models. This paper rigorously assesses state-of-the-art machine learning techniques used to predict COVID-19 infection status based on vocal audio signals, using a dataset collected by the UK Health Security Agency. This dataset includes acoustic recordings and extensive study participant meta-data. We provide guidelines on testing the performance of methods to classify COVID-19 infection status based on acoustic features and we discuss how these can be extended more generally to the development and assessment of predictive methods based on public health datasets.  ( 2 min )
    A Simple Decentralized Cross-Entropy Method. (arXiv:2212.08235v1 [cs.LG])
    The Cross-Entropy Method (CEM) is commonly used for planning in model-based reinforcement learning (MBRL), where a centralized approach is typically utilized to update the sampling distribution based only on the results of a top-$k$ operation on samples. In this paper, we show that such a centralized approach makes CEM vulnerable to local optima, thus impairing its sample efficiency. To tackle this issue, we propose Decentralized CEM (DecentCEM), a simple but effective improvement over classical CEM, which uses an ensemble of CEM instances running independently from one another, each performing a local improvement of its own sampling distribution. We provide both theoretical and empirical analysis to demonstrate the effectiveness of this simple decentralized approach. We empirically show that, compared to the classical centralized approach using either a single Gaussian distribution or even a mixture of Gaussians, our DecentCEM finds the global optimum much more consistently and thus improves sample efficiency. Furthermore, we plug our DecentCEM into the planning problem of MBRL and evaluate our approach in several continuous control environments, with comparison to state-of-the-art CEM-based MBRL approaches (PETS and POPLIN). Results show a sample efficiency improvement from simply replacing the classical CEM module with our DecentCEM module, while only sacrificing a reasonable amount of computational cost. Lastly, we conduct ablation studies for more in-depth analysis. Code is available at https://github.com/vincentzhang/decentCEM.  ( 2 min )
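    To make the contrast concrete, here is a hedged NumPy sketch of classical CEM and of our reading of the decentralized idea, i.e., independent instances each refining its own Gaussian (f is an objective to maximize):

        import numpy as np

        def cem(f, mu, sigma, iters=50, pop=64, elite_frac=0.1):
            """Classic centralized CEM: sample, keep top-k elites, refit the Gaussian."""
            k = max(1, int(pop * elite_frac))
            for _ in range(iters):
                samples = np.random.randn(pop, mu.size) * sigma + mu
                elites = samples[np.argsort([f(s) for s in samples])[-k:]]
                mu, sigma = elites.mean(0), elites.std(0) + 1e-6
            return mu

        def decent_cem(f, dim, n_instances=5):
            """Decentralized sketch: independent CEM instances from different
            starting points; return the best local solution found."""
            solutions = [cem(f, np.random.randn(dim), np.ones(dim))
                         for _ in range(n_instances)]
            return max(solutions, key=f)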
    Werewolf Among Us: A Multimodal Dataset for Modeling Persuasion Behaviors in Social Deduction Games. (arXiv:2212.08279v1 [cs.LG])
    Persuasion modeling is a key building block for conversational agents. Existing works in this direction are limited to analyzing textual dialogue corpus. We argue that visual signals also play an important role in understanding human persuasive behaviors. In this paper, we introduce the first multimodal dataset for modeling persuasion behaviors. Our dataset includes 199 dialogue transcriptions and videos captured in a multi-player social deduction game setting, 26,647 utterance level annotations of persuasion strategy, and game level annotations of deduction game outcomes. We provide extensive experiments to show how dialogue context and visual signals benefit persuasion strategy prediction. We also explore the generalization ability of language models for persuasion modeling and the role of persuasion strategies in predicting social deduction game outcomes. Our dataset, code, and models can be found at https://persuasion-deductiongame.socialai-data.org.  ( 2 min )
    Adversarial Attack on Attackers: Post-Process to Mitigate Black-Box Score-Based Query Attacks. (arXiv:2205.12134v3 [cs.LG] UPDATED)
    The score-based query attacks (SQAs) pose practical threats to deep neural networks by crafting adversarial perturbations within dozens of queries, only using the model's output scores. Nonetheless, we note that if the loss trend of the outputs is slightly perturbed, SQAs could be easily misled and thereby become much less effective. Following this idea, we propose a novel defense, namely Adversarial Attack on Attackers (AAA), to confound SQAs towards incorrect attack directions by slightly modifying the output logits. In this way, (1) SQAs are prevented regardless of the model's worst-case robustness; (2) the original model predictions are hardly changed, i.e., no degradation on clean accuracy; (3) the calibration of confidence scores can be improved simultaneously. Extensive experiments are provided to verify the above advantages. For example, by setting $\ell_\infty=8/255$ on CIFAR-10, our proposed AAA helps WideResNet-28 secure 80.59% accuracy under Square attack (2500 queries), while the best prior defense (i.e., adversarial training) only attains 67.44%. Since AAA attacks SQA's general greedy strategy, such advantages of AAA over 8 defenses can be consistently observed on 8 CIFAR-10/ImageNet models under 6 SQAs, using different attack targets, bounds, norms, losses, and strategies. Moreover, AAA calibrates better without hurting the accuracy. Our code is available at https://github.com/Sizhe-Chen/AAA.  ( 2 min )
    A Multi-Modal Machine Learning Approach to Detect Extreme Rainfall Events in Sicily. (arXiv:2212.08102v1 [physics.ao-ph])
    In 2021, 300 mm of rain, nearly half the average annual rainfall, fell near Catania (Sicily, Italy). The event took place in just a few hours, with dramatic consequences for the environmental, social, economic, and health systems of the region. This is why detecting extreme rainfall events is a crucial prerequisite for planning actions able to reverse possibly intensified dramatic future scenarios. In this paper, the Affinity Propagation algorithm, a machine learning clustering algorithm, was applied, to the best of our knowledge for the first time, to identify extreme rainfall events in Sicily. This was made possible by a high-frequency, large dataset we collected, ranging from 2009 to 2021, which we named RSE (the Rainfall Sicily Extreme dataset). Weather indicators were then employed to validate the results, confirming the presence of recent anomalous rainfall events in eastern Sicily. We believe that easy-to-use and multi-modal data science techniques, such as the one proposed in this study, could lead to significant improvements in policy-making for successfully countering climate change.  ( 2 min )
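    The clustering step itself maps directly onto scikit-learn's AffinityPropagation; the feature layout below is illustrative, not the RSE dataset's actual schema:

        import numpy as np
        from sklearn.cluster import AffinityPropagation

        # Cluster rainfall records, then inspect the clusters whose exemplars
        # have extreme intensities (feature names are our own illustration).
        X = np.random.rand(500, 3)       # e.g., [rainfall_mm, duration_h, elevation]
        ap = AffinityPropagation(damping=0.9, random_state=0).fit(X)
        labels, exemplars = ap.labels_, ap.cluster_centers_
        extreme_cluster = int(np.argmax(exemplars[:, 0]))   # highest-rainfall exemplar
        extreme_events = X[labels == extreme_cluster]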
    Mobile Augmented Reality with Federated Learning in the Metaverse. (arXiv:2212.08324v1 [cs.LG])
    The Metaverse is deemed the next evolution of the Internet and has received much attention recently. Metaverse applications via mobile augmented reality (MAR) require rapid and accurate object detection to mix digital data with the real world. As mobile devices evolve, they become more computationally powerful; hence, their computational resources can be leveraged to train machine learning models. In light of increasing concerns about user privacy and data security, federated learning (FL) has become a promising distributed learning framework for privacy-preserving analytics. In this article, FL and MAR are brought together in the Metaverse. We discuss the necessity and rationale of combining FL and MAR. The prospective technologies that power FL and MAR in the Metaverse are also identified. In addition, existing challenges that prevent the fulfilment of FL and MAR in the Metaverse, as well as several application scenarios, are presented. Finally, two case studies of Metaverse FL-MAR systems are demonstrated.  ( 2 min )
    An Efficient Framework for Monitoring Subgroup Performance of Machine Learning Systems. (arXiv:2212.08312v1 [cs.LG])
    Monitoring machine learning systems after deployment is critical to ensure their reliability. Of particular importance is the problem of monitoring the performance of machine learning systems across all data subgroups (subpopulations). In practice, this process can be prohibitively expensive, as the number of data subgroups grows exponentially with the number of input features, and labelling data to evaluate each subgroup's performance is costly. In this paper, we propose an efficient framework for monitoring the subgroup performance of machine learning systems. Specifically, we aim to find the data subgroup with the worst performance using a limited amount of labeled data. We mathematically formulate this problem as an optimization problem with an expensive black-box objective function, and then suggest using Bayesian optimization to solve it. Our experimental results on various real-world datasets and machine learning systems show that our proposed framework can retrieve the worst-performing data subgroup effectively and efficiently.  ( 2 min )
    Offline Robot Reinforcement Learning with Uncertainty-Guided Human Expert Sampling. (arXiv:2212.08232v1 [cs.LG])
    Recent advances in batch (offline) reinforcement learning have shown promising results in learning from available offline data and have established offline reinforcement learning as an essential toolkit for learning control policies in a model-free setting. An offline reinforcement learning algorithm applied to a dataset collected by a suboptimal, non-learning-based algorithm can result in a policy that outperforms the behavior agent used to collect the data. Such a scenario is frequent in robotics, where existing automation is collecting operational data. Although offline learning techniques can learn from data generated by a sub-optimal behavior agent, there is still an opportunity to improve the sample complexity of existing offline reinforcement learning algorithms by strategically introducing human demonstration data into the training process. To this end, we propose a novel approach that uses uncertainty estimation to trigger the injection of human demonstration data and guide policy training towards optimal behavior while reducing overall sample complexity. Our experiments show that this approach is more sample efficient than naively combining expert data with data collected from a sub-optimal agent. We augmented the existing offline reinforcement learning algorithm Conservative Q-Learning with our approach and performed experiments on data collected from the MuJoCo and OffWorld Gym learning environments.  ( 2 min )
    Metaheuristic for Hub-Spoke Facility Location Problem: Application to Indian E-commerce Industry. (arXiv:2212.08299v1 [cs.LG])
    The Indian e-commerce industry has evolved over the last decade and is expected to grow further over the next few years. The focus has now shifted to turnaround time (TAT) due to the emergence of many third-party logistics providers and higher customer expectations. The key consideration for delivery providers is to balance their overall operating costs while meeting the promised TAT to their customers. E-commerce delivery partners operate through a network of facilities whose strategic locations help to run the operations efficiently. In this work, we identify the locations of hubs throughout the country and their corresponding mapping to distribution centers. The objective is to minimize the total network cost while adhering to TAT. We use a Genetic Algorithm and leverage business constraints to reduce the solution search space and hence the solution time. The results indicate an improvement of 9.73% in TAT compliance compared with the current scenario.  ( 2 min )
    Learning on Persistence Diagrams as Radon Measures. (arXiv:2212.08295v1 [cs.CG])
    Persistence diagrams are common descriptors of the topological structure of data appearing in various classification and regression tasks. They can be generalized to Radon measures supported on the birth-death plane and endowed with an optimal transport distance. Examples of such measures are expectations of probability distributions on the space of persistence diagrams. In this paper, we develop methods for approximating continuous functions on the space of Radon measures supported on the birth-death plane, as well as their utilization in supervised learning tasks. Indeed, we show that any continuous function defined on a compact subset of the space of such measures (e.g., a classifier or regressor) can be approximated arbitrarily well by polynomial combinations of features computed using a continuous compactly supported function on the birth-death plane (a template). We provide insights into the structure of relatively compact subsets of the space of Radon measures, and test our approximation methodology on various data sets and supervised learning tasks.  ( 2 min )
    Robust Learning Protocol for Federated Tumor Segmentation Challenge. (arXiv:2212.08290v1 [cs.LG])
    In this work, we devise robust and efficient learning protocols for orchestrating a Federated Learning (FL) process for the Federated Tumor Segmentation Challenge (FeTS 2022). Enabling FL for the FeTS setup is challenging mainly due to data heterogeneity among collaborators and the communication cost of training. To tackle these challenges, we propose the Robust Learning Protocol (RoLePRO), which is a combination of server-side adaptive optimisation (e.g., server-side Adam) and judicious parameter (weights) aggregation schemes (e.g., adaptive weighted aggregation). RoLePRO takes a two-phase approach, where the first phase consists of vanilla Federated Averaging, while the second phase consists of a judicious aggregation scheme that uses a sophisticated reweighting, all in the presence of an adaptive optimisation algorithm at the server. We draw insights from extensive experimentation to tune learning rates for the two phases.  ( 2 min )
    SADM: Sequence-Aware Diffusion Model for Longitudinal Medical Image Generation. (arXiv:2212.08228v1 [cs.CV])
    Human organs constantly undergo anatomical changes due to a complex mix of short-term (e.g., heartbeat) and long-term (e.g., aging) factors. Evidently, prior knowledge of these factors will be beneficial when modeling their future state, i.e., via image generation. However, most medical image generation tasks only rely on the input from a single image, thus ignoring the sequential dependency even when longitudinal data is available. Sequence-aware deep generative models, where the model input is a sequence of ordered and timestamped images, are still underexplored in the medical imaging domain, which features several unique challenges: 1) sequences of varying lengths; 2) missing data or frames; and 3) high dimensionality. To this end, we propose a sequence-aware diffusion model (SADM) for the generation of longitudinal medical images. Recently, diffusion models have shown promising results on high-fidelity image generation. Our method extends this new technique by introducing a sequence-aware transformer as the conditional module in a diffusion model. This novel design enables learning longitudinal dependencies even with missing data during training and allows autoregressive generation of a sequence of images during inference. Our extensive experiments on 3D longitudinal medical images demonstrate the effectiveness of SADM compared with baselines and alternative methods.  ( 2 min )
    Learning for Vehicle-to-Vehicle Cooperative Perception under Lossy Communication. (arXiv:2212.08273v1 [cs.CV])
    Deep learning has been widely used in the perception (e.g., 3D object detection) of intelligent vehicle driving. Thanks to beneficial Vehicle-to-Vehicle (V2V) communication, deep-learning-based features from other agents can be shared with the ego vehicle to improve its perception. This is known as cooperative perception in V2V research, and its algorithms have advanced dramatically in recent years. However, all existing cooperative perception algorithms assume ideal V2V communication and do not consider the lossy shared features caused by the Lossy Communication (LC) that is common in complex real-world driving scenarios. In this paper, we first study the side effects (e.g., detection performance drop) caused by lossy communication in V2V cooperative perception, and then we propose a novel intermediate LC-aware feature fusion method that relieves these side effects with an LC-aware Repair Network (LCRN) and enhances the interaction between the ego vehicle and other vehicles with a specially designed V2V Attention Module (V2VAM), which includes intra-vehicle attention for the ego vehicle and uncertainty-aware inter-vehicle attention. Extensive experiments on the public cooperative perception dataset OPV2V (based on the digital-twin CARLA simulator) demonstrate that the proposed method is quite effective for cooperative point-cloud-based 3D object detection under lossy V2V communication.
    Bridging the Gap Between Offline and Online Reinforcement Learning Evaluation Methodologies. (arXiv:2212.08131v1 [cs.LG])
    Reinforcement learning (RL) has shown great promise with algorithms learning in environments with large state and action spaces purely from scalar reward signals. A crucial challenge for current deep RL algorithms is that they require a tremendous amount of environment interactions for learning. This can be infeasible in situations where such interactions are expensive, such as in robotics. Offline RL algorithms try to address this issue by bootstrapping the learning process from existing logged data without needing to interact with the environment from the very beginning. While online RL algorithms are typically evaluated as a function of the number of environment interactions, there exists no single established protocol for evaluating offline RL methods. In this paper, we propose a sequential approach to evaluate offline RL algorithms as a function of the training set size and thus by their data efficiency. Sequential evaluation provides valuable insights into the data efficiency of the learning process and the robustness of algorithms to distribution changes in the dataset while also harmonizing the visualization of the offline and online learning phases. Our approach is generally applicable and easy to implement. We compare several existing offline RL algorithms using this approach and present insights from a variety of tasks and offline datasets.
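    The protocol reduces to a short loop: train on growing prefixes of the logged data and record performance against data consumed. The `train_offline` and `evaluate_policy` hooks below are hypothetical stand-ins for any offline RL algorithm and evaluation routine:

```python
# Sketch of the proposed sequential evaluation protocol: performance as a
# function of the amount of logged data used, mirroring online-RL curves.
def sequential_evaluation(dataset, train_offline, evaluate_policy,
                          fractions=(0.1, 0.25, 0.5, 0.75, 1.0)):
    curve = []
    for frac in fractions:
        n = int(frac * len(dataset))
        policy = train_offline(dataset[:n])          # data-efficiency axis
        curve.append((n, evaluate_policy(policy)))   # one point per prefix
    return curve
```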
    Teaching Small Language Models to Reason. (arXiv:2212.08410v1 [cs.CL])
    Chain of thought prompting successfully improves the reasoning capabilities of large language models, achieving state-of-the-art results on a range of datasets. However, these reasoning capabilities only appear to emerge in models with over 100 billion parameters. In this paper, we explore the transfer of such reasoning capabilities to models with less than 100 billion parameters via knowledge distillation. Specifically, we finetune a student model on the chain of thought outputs generated by a larger teacher model. Our experiments show that the proposed method improves task performance across arithmetic, commonsense and symbolic reasoning datasets. For example, the accuracy of T5 XXL on GSM8K improves from 8.11% to 21.99% when finetuned on PaLM-540B generated chains of thought.
    Parameter estimation of the homodyned K distribution based on neural networks and trainable fractional-order moments. (arXiv:2210.05833v2 [cs.LG] UPDATED)
    The homodyned K (HK) distribution has been widely used to describe the scattering phenomena arising in various research fields, such as ultrasound imaging or optics. In this work, we propose a machine-learning-based approach to the estimation of the HK distribution parameters. We develop neural networks that can estimate the HK distribution parameters based on the signal-to-noise ratio, skewness and kurtosis calculated using fractional-order moments. Compared to previous approaches, we consider the orders of the moments as trainable variables that can be optimized along with the network weights using the back-propagation algorithm. Networks are trained on samples generated from the HK distribution. The obtained results demonstrate that the proposed method can be used to accurately estimate the HK distribution parameters.
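    A sketch of the core idea, with the moment orders held as parameters that receive gradients alongside the network weights; the statistics follow standard central-moment identities, while the estimator head and initial orders are illustrative assumptions:

```python
# Sketch of a trainable fractional-order moment layer: the orders `nu`
# are optimized jointly with the downstream network via backpropagation.
import torch
import torch.nn as nn

class FractionalMoments(nn.Module):
    def __init__(self, init_orders=(0.5, 1.0, 1.5)):
        super().__init__()
        self.nu = nn.Parameter(torch.tensor(init_orders))

    def forward(self, envelope):                     # envelope: (B, N), > 0
        a = envelope.unsqueeze(-1) ** self.nu        # (B, N, K)
        m1, m2 = a.mean(1), (a ** 2).mean(1)
        m3, m4 = (a ** 3).mean(1), (a ** 4).mean(1)
        std = (m2 - m1 ** 2).clamp_min(1e-12).sqrt()
        snr = m1 / std                               # central-moment identities
        skew = (m3 - 3 * m1 * m2 + 2 * m1 ** 3) / std ** 3
        kurt = (m4 - 4 * m1 * m3 + 6 * m1 ** 2 * m2 - 3 * m1 ** 4) / std ** 4
        return torch.cat([snr, skew, kurt], dim=-1)  # (B, 3K) features

# Illustrative estimator head mapping the features to HK parameters.
estimator = nn.Sequential(FractionalMoments(), nn.Linear(9, 64),
                          nn.ReLU(), nn.Linear(64, 2))
```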
    Asymptotic Analysis of Deep Residual Networks. (arXiv:2212.08199v1 [cs.LG])
    We investigate the asymptotic properties of deep Residual networks (ResNets) as the number of layers increases. We first show the existence of scaling regimes for trained weights markedly different from those implicitly assumed in the neural ODE literature. We study the convergence of the hidden state dynamics in these scaling regimes, showing that one may obtain an ODE, a stochastic differential equation (SDE) or neither of these. In particular, our findings point to the existence of a diffusive regime in which the deep network limit is described by a class of stochastic differential equations (SDEs). Finally, we derive the corresponding scaling limits for the backpropagation dynamics.
    Connecting Permutation Equivariant Neural Networks and Partition Diagrams. (arXiv:2212.08648v1 [cs.LG])
    We show how the Schur-Weyl duality that exists between the partition algebra and the symmetric group results in a stronger theoretical foundation for characterising all of the possible permutation equivariant neural networks whose layers are some tensor power of the permutation representation $M_n$ of the symmetric group $S_n$. In doing so, we unify two separate bodies of literature, and we correct some of the major results that are now widely quoted by the machine learning community. In particular, we find a basis of matrices for the learnable, linear, permutation equivariant layer functions between such tensor power spaces in the standard basis of $M_n$ by using an elegant graphical representation of a basis of set partitions for the partition algebra and its related vector spaces. Also, we show how we can calculate the number of weights that must appear in these layer functions by looking at certain paths through the McKay quiver for $M_n$. Finally, we describe how our approach generalises to the construction of neural networks that are equivariant to local symmetries.
    Feature Dropout: Revisiting the Role of Augmentations in Contrastive Learning. (arXiv:2212.08378v1 [cs.LG])
    What role do augmentations play in contrastive learning? Recent work suggests that good augmentations are label-preserving with respect to a specific downstream task. We complicate this picture by showing that label-destroying augmentations can be useful in the foundation model setting, where the goal is to learn diverse, general-purpose representations for multiple downstream tasks. We perform contrastive learning experiments on a range of image and audio datasets with multiple downstream tasks (e.g. for digits superimposed on photographs, predicting the class of one vs. the other). We find that Viewmaker Networks, a recently proposed model for learning augmentations for contrastive learning, produce label-destroying augmentations that stochastically destroy features needed for different downstream tasks. These augmentations are interpretable (e.g. altering shapes, digits, or letters added to images) and surprisingly often result in better performance compared to expert-designed augmentations, despite not preserving label information. To support our empirical results, we theoretically analyze a simple contrastive learning setting with a linear model. In this setting, label-destroying augmentations are crucial for preventing one set of features from suppressing the learning of features useful for another downstream task. Our results highlight the need for analyzing the interaction between multiple downstream tasks when trying to explain the success of foundation models.
    Context Label Learning: Improving Background Class Representations in Semantic Segmentation. (arXiv:2212.08423v1 [cs.CV])
    Background samples provide key contextual information for segmenting regions of interest (ROIs). However, they always cover a diverse set of structures, causing difficulties for the segmentation model to learn good decision boundaries with high sensitivity and precision. The issue concerns the highly heterogeneous nature of the background class, resulting in multi-modal distributions. Empirically, we find that neural networks trained with heterogeneous background struggle to map the corresponding contextual samples to compact clusters in feature space. As a result, the distribution over background logit activations may shift across the decision boundary, leading to systematic over-segmentation across different datasets and tasks. In this study, we propose context label learning (CoLab) to improve the context representations by decomposing the background class into several subclasses. Specifically, we train an auxiliary network as a task generator, along with the primary segmentation model, to automatically generate context labels that positively affect the ROI segmentation accuracy. Extensive experiments are conducted on several challenging segmentation tasks and datasets. The results demonstrate that CoLab can guide the segmentation model to map the logits of background samples away from the decision boundary, resulting in significantly improved segmentation accuracy. Code is available.
    Safe Evaluation For Offline Learning: Are We Ready To Deploy?. (arXiv:2212.08302v1 [cs.LG])
    The world currently offers an abundance of data in multiple domains, from which we can learn reinforcement learning (RL) policies without further interaction with the environment. RL agents learning offline from such data is possible but deploying them while learning might be dangerous in domains where safety is critical. Therefore, it is essential to find a way to estimate how a newly-learned agent will perform if deployed in the target environment before actually deploying it and without the risk of overestimating its true performance. To achieve this, we introduce a framework for safe evaluation of offline learning using approximate high-confidence off-policy evaluation (HCOPE) to estimate the performance of offline policies during learning. In our setting, we assume a source of data, which we split into a train-set, to learn an offline policy, and a test-set, to estimate a lower bound on the offline policy's performance using off-policy evaluation with bootstrapping. A lower-bound estimate tells us how well a newly-learned target policy would perform before it is deployed in the real environment, and therefore allows us to decide when to deploy our learned policy.
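    A simplified sketch of the test-set side of this pipeline: per-trajectory importance-sampled returns followed by a percentile-bootstrap lower bound. The 5% percentile bootstrap is a simplification of HCOPE-style concentration bounds, and the policy interfaces are hypothetical:

```python
# Sketch of a bootstrapped lower bound on a target policy's value from
# held-out logged trajectories, via ordinary importance sampling.
import numpy as np

def is_returns(test_set, target_policy, behavior_policy, gamma=0.99):
    """Per-trajectory importance-sampled returns; traj = list of (s, a, r)."""
    out = []
    for traj in test_set:
        rho, g = 1.0, 0.0
        for t, (s, a, r) in enumerate(traj):
            rho *= target_policy(a, s) / behavior_policy(a, s)
            g += (gamma ** t) * r
        out.append(rho * g)
    return np.asarray(out)

def lower_bound(returns, delta=0.05, n_boot=2000, seed=0):
    rng = np.random.default_rng(seed)
    means = [rng.choice(returns, size=len(returns)).mean()
             for _ in range(n_boot)]
    return np.quantile(means, delta)  # deploy only if this clears a bar
```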
    Estimating truncation effects of quantum bosonic systems using sampling algorithms. (arXiv:2212.08546v1 [quant-ph])
    To simulate bosons on a qubit- or qudit-based quantum computer, one has to regularize the theory by truncating infinite-dimensional local Hilbert spaces to finite dimensions. In the search for practical quantum applications, it is important to know how big the truncation errors can be. In general, it is not easy to estimate errors unless we have a good quantum computer. In this paper we show that traditional sampling methods on classical devices, specifically Markov Chain Monte Carlo, can address this issue with a reasonable amount of computational resources available today. As a demonstration, we apply this idea to the scalar field theory on a two-dimensional lattice, with a size that goes beyond what is achievable using exact diagonalization methods. This method can be used to estimate the resources needed for realistic quantum simulations of bosonic theories, and also, to check the validity of the results of the corresponding quantum simulations.
    An automated parameter domain decomposition approach for gravitational wave surrogates using hp-greedy refinement. (arXiv:2212.08554v1 [gr-qc])
    We introduce hp-greedy, a refinement approach for building gravitational wave surrogates as an extension of the standard reduced basis framework. Our proposal is data-driven, with a domain decomposition of the parameter space, local reduced basis, and a binary tree as the resulting structure, which are obtained in an automated way. When compared to the standard global reduced basis approach, the numerical simulations of our proposal show three salient features: i) representations of lower dimension with no loss of accuracy, ii) a significantly higher accuracy for a fixed maximum dimensionality of the basis, in some cases by orders of magnitude, and iii) results that depend on the reduced basis seed choice used by the refinement algorithm. We first illustrate the key parts of our approach with a toy model and then present a more realistic use case of gravitational waves emitted by the collision of two spinning, non-precessing black holes. We discuss performance aspects of hp-greedy, such as overfitting with respect to the depth of the tree structure, and other hyperparameter dependences. As two direct applications of the proposed hp-greedy refinement, we envision: i) a further acceleration of statistical inference, which might be complementary to focused reduced-order quadratures, and ii) the search of gravitational waves through clustering and nearest neighbors.
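    The greedy reduced-basis step that hp-greedy extends can be sketched as follows; the hp part (splitting the parameter domain and recursing when the basis saturates) is only signalled by the returned flag, and the tolerance and maximum dimension are illustrative:

```python
# Sketch of greedy reduced-basis construction over a training set of
# normalized waveforms; hp-greedy would split the parameter domain and
# recurse whenever the flag returned here is True.
import numpy as np

def greedy_basis(waveforms, tol=1e-6, max_dim=20):
    """waveforms: (n_params, n_samples) array of unit-norm waveforms."""
    basis = []
    errors = np.ones(len(waveforms))
    while errors.max() > tol and len(basis) < max_dim:
        idx = errors.argmax()                     # worst-represented point
        e = waveforms[idx].copy()
        for b in basis:                           # Gram-Schmidt step
            e -= (b @ e) * b
        basis.append(e / np.linalg.norm(e))
        proj = waveforms @ np.stack(basis).T      # (n_params, dim)
        errors = 1.0 - (proj ** 2).sum(axis=1)    # squared residual norm
    return np.stack(basis), errors.max() > tol    # True -> hp-split here
```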
    Fairness Transferability Subject to Bounded Distribution Shift. (arXiv:2206.00129v3 [cs.LG] UPDATED)
    Given an algorithmic predictor that is "fair" on some source distribution, will it still be fair on an unknown target distribution that differs from the source within some bound? In this paper, we study the transferability of statistical group fairness for machine learning predictors (i.e., classifiers or regressors) subject to bounded distribution shifts. Such shifts may be introduced by initial training data uncertainties, user adaptation to a deployed predictor, dynamic environments, or the use of pre-trained models in new settings. Herein, we develop a bound that characterizes such transferability, flagging potentially inappropriate deployments of machine learning for socially consequential tasks. We first develop a framework for bounding violations of statistical fairness subject to distribution shift, formulating a generic upper bound for transferred fairness violations as our primary result. We then develop bounds for specific worked examples, focusing on two commonly used fairness definitions (i.e., demographic parity and equalized odds) and two classes of distribution shift (i.e., covariate shift and label shift). Finally, we compare our theoretical bounds to deterministic models of distribution shift and against real-world data, finding that we are able to estimate fairness violation bounds in practice, even when simplifying assumptions are only approximately satisfied.
    Brauer's Group Equivariant Neural Networks. (arXiv:2212.08630v1 [cs.LG])
    We provide a full characterisation of all of the possible group equivariant neural networks whose layers are some tensor power of $\mathbb{R}^{n}$ for three symmetry groups that are missing from the machine learning literature: $O(n)$, the orthogonal group; $SO(n)$, the special orthogonal group; and $Sp(n)$, the symplectic group. In particular, we find a spanning set of matrices for the learnable, linear, equivariant layer functions between such tensor power spaces in the standard basis of $\mathbb{R}^{n}$ when the group is $O(n)$ or $SO(n)$, and in the symplectic basis of $\mathbb{R}^{n}$ when the group is $Sp(n)$. The neural networks that we characterise are simple to implement since our method circumvents the typical requirement when building group equivariant neural networks of having to decompose the tensor power spaces of $\mathbb{R}^{n}$ into irreducible representations. We also describe how our approach generalises to the construction of neural networks that are equivariant to local symmetries. The theoretical background for our results comes from the Schur-Weyl dualities that were established by Brauer in his 1937 paper "On Algebras Which are Connected with the Semisimple Continuous Groups" for each of the three groups in question. We suggest that Schur-Weyl duality is a powerful mathematical concept that could be used to understand the structure of neural networks that are equivariant to groups beyond those considered in this paper.
    Multi-Agent Patrolling with Battery Constraints through Deep Reinforcement Learning. (arXiv:2212.08230v1 [cs.AI])
    Autonomous vehicles are suited for continuous area patrolling problems. However, finding an optimal patrolling strategy can be challenging for many reasons. Firstly, patrolling environments are often complex and can include unknown and evolving environmental factors. Secondly, autonomous vehicles can have failures or hardware constraints such as limited battery lives. Importantly, patrolling large areas often requires multiple agents that need to collectively coordinate their actions. In this work, we consider these limitations and propose an approach based on a distributed, model-free deep reinforcement learning based multi-agent patrolling strategy. In this approach, agents make decisions locally based on their own environmental observations and on shared information. In addition, agents are trained to automatically recharge themselves when required to support continuous collective patrolling. A homogeneous multi-agent architecture is proposed, where all patrolling agents have an identical policy. This architecture provides a robust patrolling system that can tolerate agent failures and allow supplementary agents to be added to replace failed agents or to increase the overall patrol performance. This performance is validated through experiments from multiple perspectives, including the overall patrol performance, the efficiency of the battery recharging strategy, the overall robustness of the system, and the agents' ability to adapt to environment dynamics.
    RepQ-ViT: Scale Reparameterization for Post-Training Quantization of Vision Transformers. (arXiv:2212.08254v1 [cs.CV])
    Post-training quantization (PTQ), which only requires a tiny dataset for calibration without end-to-end retraining, is a light and practical model compression technique. Recently, several PTQ schemes for vision transformers (ViTs) have been presented; unfortunately, they typically suffer from non-trivial accuracy degradation, especially in low-bit cases. In this paper, we propose RepQ-ViT, a novel PTQ framework for ViTs based on quantization scale reparameterization, to address the above issues. RepQ-ViT decouples the quantization and inference processes, where the former employs complex quantizers and the latter employs scale-reparameterized simplified quantizers. This ensures both accurate quantization and efficient inference, which distinguishes it from existing approaches that sacrifice quantization performance to meet the target hardware. More specifically, we focus on two components with extreme distributions: post-LayerNorm activations with severe inter-channel variation and post-Softmax activations with power-law features, and initially apply channel-wise quantization and log$\sqrt{2}$ quantization, respectively. Then, we reparameterize the scales to hardware-friendly layer-wise quantization and log2 quantization for inference, at only a slight cost in accuracy or computation. Extensive experiments are conducted on multiple vision tasks with different model variants, proving that RepQ-ViT, without hyperparameters and expensive reconstruction procedures, can outperform existing strong baselines and encouragingly improve the accuracy of 4-bit PTQ of ViTs to a usable level.
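    For the post-LayerNorm case, the reparameterization can be sketched as folding the ratio between channel-wise and layer-wise scales into the LayerNorm affine parameters and the following linear layer; taking the mean channel scale as the shared scale is an illustrative simplification of the paper's choice:

```python
# Sketch of scale reparameterization for post-LayerNorm activations:
# absorb per-channel scale ratios into LayerNorm's affine parameters and
# the next linear layer, leaving a single layer-wise scale for inference.
import torch

def reparam_layernorm(ln, next_linear, channel_scales):
    s_layer = channel_scales.mean()          # shared, hardware-friendly scale
    r = channel_scales / s_layer             # per-channel ratio (C,)
    with torch.no_grad():
        ln.weight.div_(r)                    # LN output is divided by r ...
        ln.bias.div_(r)
        next_linear.weight.mul_(r)           # ... and r re-absorbed downstream
    return s_layer
```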
    Do Not Trust a Model Because It is Confident: Uncovering and Characterizing Unknown Unknowns to Student Success Predictors in Online-Based Learning. (arXiv:2212.08532v1 [cs.LG])
    Student success models might be prone to develop weak spots, i.e., examples hard to accurately classify due to insufficient representation during model creation. This weakness is one of the main factors undermining users' trust, since model predictions could, for instance, lead an instructor not to intervene with a student in need. In this paper, we unveil the need to detect and characterize unknown unknowns in student success prediction in order to better understand when models may fail. Unknown unknowns include the students for whom the model is highly confident in its predictions but is actually wrong. Therefore, we cannot solely rely on the model's confidence when evaluating prediction quality. We first introduce a framework for the identification and characterization of unknown unknowns. We then assess its informativeness on log data collected from flipped courses and online courses using quantitative analyses and interviews with instructors. Our results show that unknown unknowns are a critical issue in this domain and that our framework can be applied to support their detection. The source code is available at https://github.com/epfl-ml4ed/unknown-unknowns.
    Adversarial Example Defense via Perturbation Grading Strategy. (arXiv:2212.08341v1 [cs.CV])
    Deep Neural Networks have been widely used in many fields. However, studies have shown that DNNs are easily attacked by adversarial examples, which carry tiny perturbations yet greatly mislead the judgment of DNNs. Furthermore, even if malicious attackers cannot obtain all the underlying model parameters, they can use adversarial examples to attack various DNN-based task systems. Researchers have proposed various defense methods to protect DNNs, such as reducing the aggressiveness of adversarial examples by preprocessing or improving the robustness of the model by adding modules. However, some defense methods are only effective for small-scale examples or small perturbations and have limited defense effects for adversarial examples with large perturbations. This paper assigns different defense strategies to adversarial perturbations of different strengths by grading the perturbations on the input examples. Experimental results show that the proposed method effectively improves defense performance. In addition, the proposed method does not modify any task model and can therefore be used as a preprocessing module, significantly reducing the deployment cost in practical applications.
    Learnable Commutative Monoids for Graph Neural Networks. (arXiv:2212.08541v1 [cs.LG])
    Graph neural networks (GNNs) have been shown to be highly sensitive to the choice of aggregation function. While summing over a node's neighbours can approximate any permutation-invariant function over discrete inputs, Cohen-Karlik et al. [2020] proved there are set-aggregation problems for which summing cannot generalise to unbounded inputs, proposing recurrent neural networks regularised towards permutation-invariance as a more expressive aggregator. We show that these results carry over to the graph domain: GNNs equipped with recurrent aggregators are competitive with state-of-the-art permutation-invariant aggregators, on both synthetic benchmarks and real-world problems. However, despite the benefits of recurrent aggregators, their $O(V)$ depth makes them both difficult to parallelise and hard to train on large graphs. Inspired by the observation that a well-behaved aggregator for a GNN is a commutative monoid over its latent space, we propose a framework for constructing learnable, commutative, associative binary operators. With this, we construct an aggregator of $O(\log V)$ depth, yielding exponential improvements in both parallelism and dependency length while achieving performance competitive with recurrent aggregators. Based on our empirical observations, our proposed learnable commutative monoid (LCM) aggregator represents a favourable tradeoff between efficient and expressive aggregators.
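    A sketch of a log-depth learnable aggregator: a balanced-tree reduction whose binary combiner is an MLP over symmetrized pairs, so commutativity holds by construction while associativity is only encouraged by training. The combiner architecture and identity padding are illustrative choices, not the paper's exact design:

```python
# Sketch of an O(log V) balanced-tree reduction with a learnable binary op.
import torch
import torch.nn as nn

class LearnableMonoid(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.identity = nn.Parameter(torch.zeros(dim))  # learnable pad element
        self.op = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                nn.Linear(dim, dim))

    def combine(self, a, b):
        # Symmetric features (a+b, a*b) make the op commutative by design.
        return self.op(torch.cat([a + b, a * b], dim=-1))

    def forward(self, x):                       # x: (V, dim) node features
        while x.shape[0] > 1:
            if x.shape[0] % 2:                  # pad odd sets with identity
                x = torch.cat([x, self.identity.unsqueeze(0)], dim=0)
            x = self.combine(x[0::2], x[1::2])  # halve the set: log-depth tree
        return x[0]
```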
    Neural Network Augmented Compartmental Pandemic Models. (arXiv:2212.08481v1 [cs.LG])
    Compartmental models are a tool commonly used in epidemiology for the mathematical modelling of the spread of infectious diseases, with their most popular representative being the Susceptible-Infected-Removed (SIR) model and its derivatives. However, current SIR models are limited in their capability to model government policies in the form of non-pharmaceutical interventions (NPIs) and weather effects, and they offer limited predictive power. More capable alternatives such as agent-based models (ABMs) are computationally expensive and require specialized hardware. We introduce a neural network augmented SIR model that can be run on commodity hardware, takes NPIs and weather effects into account, and offers improved predictive power as well as counterfactual analysis capabilities. We demonstrate our model's improvement over the state of the art in modeling COVID-19 in Austria during the period from March 2020 to March 2021, and provide an outlook up to January 2024.
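    One plausible way to wire such a model, sketched below: a small network maps NPI and weather covariates to a multiplicative modulation of the transmission rate inside a discretized SIR step. The network shape, Euler integration, and coupling are assumptions, not the paper's exact design:

```python
# Sketch of a neural-augmented SIR step with covariate-dependent beta.
import torch
import torch.nn as nn

class NeuralSIR(nn.Module):
    def __init__(self, n_covariates, beta0=0.3, gamma=0.1):
        super().__init__()
        self.beta_net = nn.Sequential(nn.Linear(n_covariates, 16), nn.Tanh(),
                                      nn.Linear(16, 1), nn.Softplus())
        self.beta0, self.gamma = beta0, gamma

    def step(self, S, I, R, covariates, dt=1.0):
        # NPIs and weather modulate the base transmission rate.
        beta = self.beta0 * self.beta_net(covariates).squeeze(-1)
        N = S + I + R
        dS = -beta * S * I / N
        dI = beta * S * I / N - self.gamma * I
        return S + dt * dS, I + dt * dI, R + dt * self.gamma * I
```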
    Neural Enhanced Belief Propagation for Multiobject Tracking. (arXiv:2212.08340v1 [cs.CV])
    Algorithmic solutions for multi-object tracking (MOT) are a key enabler for applications in autonomous navigation and applied ocean sciences. State-of-the-art MOT methods fully rely on a statistical model and typically use preprocessed sensor data as measurements. In particular, measurements are produced by a detector that extracts potential object locations from the raw sensor data collected for a discrete time step. This preparatory processing step reduces data flow and computational complexity but may result in a loss of information. State-of-the-art Bayesian MOT methods that are based on belief propagation (BP) systematically exploit graph structures of the statistical model to reduce computational complexity and improve scalability. However, as a fully model-based approach, BP can only provide suboptimal estimates when there is a mismatch between the statistical model and the true data-generating process. Existing BP-based MOT methods can further only make use of preprocessed measurements. In this paper, we introduce a variant of BP that combines model-based with data-driven MOT. The proposed neural enhanced belief propagation (NEBP) method complements the statistical model of BP by information learned from raw sensor data. This approach conjectures that the learned information can reduce model mismatch and thus improve data association and false alarm rejection. Our NEBP method improves tracking performance compared to model-based methods. At the same time, it inherits the advantages of BP-based MOT, i.e., it scales only quadratically in the number of objects, and it can thus generate and maintain a large number of object tracks. We evaluate the performance of our NEBP approach for MOT on the nuScenes autonomous driving dataset and demonstrate that it has state-of-the-art performance.
    Optimized Symbolic Interval Propagation for Neural Network Verification. (arXiv:2212.08567v1 [cs.LG])
    Neural networks are increasingly applied in safety-critical domains, so their verification is gaining importance. A large class of recent algorithms for proving input-output relations of feed-forward neural networks are based on linear relaxations and symbolic interval propagation. However, due to variable dependencies, the approximations deteriorate with increasing depth of the network. In this paper we present DPNeurifyFV, a novel branch-and-bound solver for ReLU networks with low-dimensional input spaces that is based on symbolic interval propagation with fresh variables and input-splitting. A new heuristic for choosing the fresh variables allows us to ameliorate the dependency problem, while our novel splitting heuristic, in combination with several other improvements, speeds up the branch-and-bound procedure. We evaluate our approach on the airborne collision avoidance networks ACAS Xu and demonstrate runtime improvements compared to state-of-the-art tools.
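    The primitive that symbolic interval propagation refines is plain interval arithmetic through affine and ReLU layers, sketched below; the bounds are sound but loosen with depth because dependencies between neurons are lost, which is exactly what fresh variables and input-splitting are meant to counter:

```python
# Minimal (non-symbolic) interval propagation through affine + ReLU layers.
import numpy as np

def affine_bounds(lo, hi, W, b):
    Wp, Wn = np.maximum(W, 0), np.minimum(W, 0)
    return Wp @ lo + Wn @ hi + b, Wp @ hi + Wn @ lo + b

def relu_bounds(lo, hi):
    return np.maximum(lo, 0), np.maximum(hi, 0)

def propagate(lo, hi, layers):          # layers: list of (W, b) pairs
    for W, b in layers[:-1]:
        lo, hi = relu_bounds(*affine_bounds(lo, hi, W, b))
    return affine_bounds(lo, hi, *layers[-1])  # sound output bounds
```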
    Geometry-aware Autoregressive Models for Calorimeter Shower Simulations. (arXiv:2212.08233v1 [physics.ins-det])
    Calorimeter shower simulations are often the bottleneck in simulation time for particle physics detectors. A lot of effort is currently spent on optimizing generative architectures for specific detector geometries, which generalize poorly. We develop a geometry-aware autoregressive model on a range of calorimeter geometries such that the model learns to adapt its energy deposition depending on the size and position of the cells. This is a key proof-of-concept step towards building a model that can generalize to new unseen calorimeter geometries with little to no additional training. Such a model can replace the hundreds of generative models used for calorimeter simulation in a Large Hadron Collider experiment. For the study of future detectors, such a model will dramatically reduce the large upfront investment usually needed to generate simulations.  ( 2 min )
    Multi-Resolution Online Deterministic Annealing: A Hierarchical and Progressive Learning Architecture. (arXiv:2212.08189v1 [cs.LG])
    Hierarchical learning algorithms that gradually approximate a solution to a data-driven optimization problem are essential to decision-making systems, especially under limitations on time and computational resources. In this study, we introduce a general-purpose hierarchical learning architecture that is based on the progressive partitioning of a possibly multi-resolution data space. The optimal partition is gradually approximated by solving a sequence of optimization sub-problems that yield a sequence of partitions with increasing number of subsets. We show that the solution of each optimization problem can be estimated online using gradient-free stochastic approximation updates. As a consequence, a function approximation problem can be defined within each subset of the partition and solved using the theory of two-timescale stochastic approximation algorithms. This simulates an annealing process and defines a robust and interpretable heuristic method to gradually increase the complexity of the learning architecture in a task-agnostic manner, giving emphasis to regions of the data space that are considered more important according to a predefined criterion. Finally, by imposing a tree structure in the progression of the partitions, we provide a means to incorporate potential multi-resolution structure of the data space into this approach, significantly reducing its complexity, while introducing hierarchical feature extraction properties similar to certain classes of deep learning architectures. Asymptotic convergence analysis and experimental results are provided for clustering, classification, and regression problems.
    Graphon Pooling for Reducing Dimensionality of Signals and Convolutional Operators on Graphs. (arXiv:2212.08171v1 [cs.LG])
    In this paper we propose a pooling approach for convolutional information processing on graphs relying on the theory of graphons and limits of dense graph sequences. We present three methods that exploit the induced graphon representation of graphs and graph signals on partitions of $[0, 1]^2$ in the graphon space. As a result we derive low dimensional representations of the convolutional operators, while a dimensionality reduction of the signals is achieved by simple local interpolation of functions in $L^2([0, 1])$. We prove that those low dimensional representations constitute a convergent sequence of graphs and graph signals, respectively. The methods proposed and the theoretical guarantees that we provide show that the reduced graphs and signals inherit spectral-structural properties of the original quantities. We evaluate our approach with a set of numerical experiments performed on graph neural networks (GNNs) that rely on graphon pooling. We observe that graphon pooling performs significantly better than other approaches proposed in the literature when dimensionality reduction ratios between layers are large. We also observe that when graphon pooling is used we have, in general, less overfitting and lower computational cost.  ( 2 min )
    DeepDFA: Dataflow Analysis-Guided Efficient Graph Learning for Vulnerability Detection. (arXiv:2212.08108v1 [cs.SE])
    Deep learning-based vulnerability detection models have recently been shown to be effective and, in some cases, outperform static analysis tools. However, the highest-performing approaches use token-based transformer models, which do not leverage domain knowledge. Classical program analysis techniques such as dataflow analysis can detect many types of bugs and are the most commonly used methods in practice. Motivated by the causal relationship between bugs and dataflow analysis, we present DeepDFA, a dataflow analysis-guided graph learning framework and embedding that uses program semantic features for vulnerability detection. We show that DeepDFA is performant and efficient. DeepDFA ranked first in recall, first in generalizing over unseen projects, and second in F1 among all the state-of-the-art models we experimented with. It is also the smallest model in terms of the number of parameters, and was trained in 9 minutes, 69x faster than the highest-performing baseline. DeepDFA can be used with other models. By integrating LineVul and DeepDFA, we achieved the best vulnerability detection performance of 96.4 F1 score, 98.69 precision, and 94.22 recall.  ( 2 min )
    Fake it till you make it: Learning(s) from a synthetic ImageNet clone. (arXiv:2212.08420v1 [cs.CV])
    Recent large-scale image generation models such as Stable Diffusion have exhibited an impressive ability to generate fairly realistic images starting from a very simple text prompt. Could such models render real images obsolete for training image prediction models? In this paper, we answer part of this provocative question by questioning the need for real images when training models for ImageNet classification. More precisely, provided only with the class names that have been used to build the dataset, we explore the ability of Stable Diffusion to generate synthetic clones of ImageNet and measure how useful they are for training classification models from scratch. We show that with minimal and class-agnostic prompt engineering, these ImageNet clones, which we denote ImageNet-SD, are able to close a large part of the gap between models produced by synthetic images and models trained with real images for the several standard classification benchmarks that we consider in this study. More importantly, we show that models trained on synthetic images exhibit strong generalization properties and perform on par with models trained on real data.  ( 2 min )
    BNSynth: Bounded Boolean Functional Synthesis. (arXiv:2212.08170v1 [cs.AI])
    The automated synthesis of correct-by-construction Boolean functions from logical specifications is known as the Boolean Functional Synthesis (BFS) problem. BFS has many application areas that range from software engineering to circuit design. In this paper, we introduce a tool BNSynth, that is the first to solve the BFS problem under a given bound on the solution space. Bounding the solution space induces the synthesis of smaller functions that benefit resource constrained areas such as circuit design. BNSynth uses a counter-example guided, neural approach to solve the bounded BFS problem. Initial results show promise in synthesizing smaller solutions; we observe at least 3.2X (and up to 24X) improvement in the reduction of solution size on average, as compared to state of the art tools on our benchmarks. BNSynth is available on GitHub under an open source license.  ( 2 min )
    Swing Distillation: A Privacy-Preserving Knowledge Distillation Framework. (arXiv:2212.08349v1 [cs.LG])
    Knowledge distillation (KD) has been widely used for model compression and knowledge transfer. Typically, a big teacher model trained on sufficient data transfers knowledge to a small student model. However, despite the success of KD, little effort has been made to study whether KD leaks the training data of the teacher model. In this paper, we experimentally reveal that KD suffers from the risk of privacy leakage. To alleviate this issue, we propose a novel knowledge distillation method, swing distillation, which can effectively protect the private information of the teacher model from flowing to the student model. In our framework, the temperature coefficient is dynamically and adaptively adjusted according to the degree of private information contained in the data, rather than a predefined constant hyperparameter. It assigns different temperatures to tokens according to the likelihood that a token in a position contains private information. In addition, we inject noise into soft targets provided to the student model, in order to avoid unshielded knowledge transfer. Experiments on multiple datasets and tasks demonstrate that the proposed swing distillation can significantly reduce (by over 80% in terms of canary exposure) the risk of privacy leakage in comparison to KD with competitive or better performance. Furthermore, swing distillation is robust against the increasing privacy budget.  ( 2 min )
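    A sketch of a per-token-temperature distillation loss in this spirit: an estimated privacy score per token sets the softening temperature, and noise is added to the soft targets before they supervise the student. The linear score-to-temperature map, its range, and the noise scale are illustrative assumptions, not the paper's calibrated scheme:

```python
# Sketch of a swing-distillation-style loss with per-token temperatures.
import torch
import torch.nn.functional as F

def swing_kd_loss(student_logits, teacher_logits, privacy_scores,
                  t_lo=1.0, t_hi=4.0, noise_std=0.05):
    # privacy_scores in [0, 1], shape (B, L): higher -> likely private token.
    T = t_lo + (t_hi - t_lo) * privacy_scores            # (B, L)
    T = T.unsqueeze(-1)                                  # (B, L, 1)
    soft = F.softmax(teacher_logits / T, dim=-1)
    soft = soft + noise_std * torch.randn_like(soft)     # noisy soft targets
    soft = soft.clamp_min(0)
    soft = soft / soft.sum(-1, keepdim=True)             # renormalize
    logp = F.log_softmax(student_logits / T, dim=-1)
    return -(soft * logp).sum(-1).mean()                 # cross-entropy
```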
    Azimuth: Systematic Error Analysis for Text Classification. (arXiv:2212.08216v1 [cs.LG])
    We present Azimuth, an open-source and easy-to-use tool to perform error analysis for text classification. Compared to other stages of the ML development cycle, such as model training and hyper-parameter tuning, the process and tooling for the error analysis stage are less mature. However, this stage is critical for the development of reliable and trustworthy AI systems. To make error analysis more systematic, we propose an approach comprising dataset analysis and model quality assessment, which Azimuth facilitates. We aim to help AI practitioners discover and address areas where the model does not generalize by leveraging and integrating a range of ML techniques, such as saliency maps, similarity, uncertainty, and behavioral analyses, all in one tool. Our code and documentation are available at github.com/servicenow/azimuth.
    Softmax Policy Gradient Methods Can Take Exponential Time to Converge. (arXiv:2102.11270v3 [cs.LG] UPDATED)
    The softmax policy gradient (PG) method, which performs gradient ascent under softmax policy parameterization, is arguably one of the de facto implementations of policy optimization in modern reinforcement learning. For $\gamma$-discounted infinite-horizon tabular Markov decision processes (MDPs), remarkable progress has recently been achieved towards establishing global convergence of softmax PG methods in finding a near-optimal policy. However, prior results fall short of delineating clear dependencies of convergence rates on salient parameters such as the cardinality of the state space $\mathcal{S}$ and the effective horizon $\frac{1}{1-\gamma}$, both of which could be excessively large. In this paper, we deliver a pessimistic message regarding the iteration complexity of softmax PG methods, despite assuming access to exact gradient computation. Specifically, we demonstrate that the softmax PG method with stepsize $\eta$ can take \[ \frac{1}{\eta} |\mathcal{S}|^{2^{\Omega\big(\frac{1}{1-\gamma}\big)}} ~\text{iterations} \] to converge, even in the presence of a benign policy initialization and an initial state distribution amenable to exploration (so that the distribution mismatch coefficient is not exceedingly large). This is accomplished by characterizing the algorithmic dynamics over a carefully-constructed MDP containing only three actions. Our exponential lower bound hints at the necessity of carefully adjusting update rules or enforcing proper regularization in accelerating PG methods.  ( 2 min )
    On Evaluating Adversarial Robustness of Chest X-ray Classification: Pitfalls and Best Practices. (arXiv:2212.08130v1 [eess.IV])
    Vulnerability to adversarial attacks is a well-known weakness of Deep Neural Networks. While most of the studies focus on natural images with standardized benchmarks like ImageNet and CIFAR, little research has considered real world applications, in particular in the medical domain. Our research shows that, contrary to previous claims, robustness of chest x-ray classification is much harder to evaluate and leads to very different assessments based on the dataset, the architecture and robustness metric. We argue that previous studies did not take into account the peculiarity of medical diagnosis, like the co-occurrence of diseases, the disagreement of labellers (domain experts), the threat model of the attacks and the risk implications for each successful attack. In this paper, we discuss the methodological foundations, review the pitfalls and best practices, and suggest new methodological considerations for evaluating the robustness of chest x-ray classification models. Our evaluation on 3 datasets, 7 models, and 18 diseases is the largest evaluation of robustness of chest x-ray classification models.  ( 2 min )
    COVID-19 Monitoring System using Social Distancing and Face Mask Detection on Surveillance video datasets. (arXiv:2110.03905v3 [cs.CV] UPDATED)
    The fear and danger of the COVID-19 virus still loom large. Manual monitoring of social distancing norms is impractical with a large population moving about and with an insufficient task force and resources to administer them. There is a need for a lightweight, robust and 24X7 video-monitoring system that automates this process. This paper proposes a comprehensive and effective solution to perform person detection, social distancing violation detection, face detection and face mask classification using object detection, clustering and a Convolution Neural Network (CNN) based binary classifier. For this, YOLOv3, Density-based spatial clustering of applications with noise (DBSCAN), Dual Shot Face Detector (DSFD) and a MobileNetV2 based binary classifier have been employed on surveillance video datasets. This paper also provides a comparative study of different face detection and face mask classification models. Finally, a video dataset labelling method is proposed along with the labelled video dataset to compensate for the lack of datasets in the community and is used for evaluation of the system. The system performance is evaluated in terms of accuracy, F1 score as well as the prediction time, which has to be low for practical applicability. The system performs with an accuracy of 91.2% and an F1 score of 90.79% on the labelled video dataset and has an average prediction time of 7.12 seconds for 78 frames of a video.  ( 3 min )
    Numerical Optimizations for Weighted Low-rank Estimation on Language Model. (arXiv:2211.09718v2 [cs.CL] UPDATED)
    Singular value decomposition (SVD) is one of the most popular compression methods that approximate a target matrix with smaller matrices. However, standard SVD treats the parameters within the matrix with equal importance, which is a simple but unrealistic assumption. The parameters of a trained neural network model may affect task performance unevenly, which suggests non-equal importance among the parameters. Compared to SVD, a decomposition method that is aware of parameter importance is the more practical choice in real cases. Unlike standard SVD, weighted value decomposition is a non-convex optimization problem that lacks a closed-form solution. We systematically investigated multiple optimization strategies to tackle the problem and examined our method by compressing Transformer-based language models. Further, we designed a metric to predict when the SVD may introduce a significant performance drop, for which our method can be a rescue strategy. The extensive evaluations demonstrate that our method can perform better than current SOTA methods in compressing Transformer-based language models.
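    One of the simpler optimization strategies in this family is direct gradient descent on the factors under the weighted objective, sketched below; the use of Adam, the initialization, and the step count are illustrative choices rather than the paper's tuned recipe:

```python
# Sketch of weighted low-rank factorization by gradient descent:
# approximate W with U @ V under a per-entry importance matrix
# (e.g., Fisher information estimates).
import torch

def weighted_low_rank(W, importance, rank, steps=2000, lr=1e-2):
    m, n = W.shape
    U = torch.randn(m, rank, requires_grad=True)
    V = torch.randn(rank, n, requires_grad=True)
    opt = torch.optim.Adam([U, V], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = (importance * (W - U @ V) ** 2).sum()  # weighted Frobenius
        loss.backward()
        opt.step()
    return U.detach(), V.detach()
```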
    Flatten the Curve: Efficiently Training Low-Curvature Neural Networks. (arXiv:2206.07144v2 [cs.LG] UPDATED)
    The highly non-linear nature of deep neural networks causes them to be susceptible to adversarial examples and have unstable gradients, which hinders interpretability. However, existing methods to solve these issues, such as adversarial training, are expensive and often sacrifice predictive accuracy. In this work, we consider curvature, a mathematical quantity that encodes the degree of non-linearity. Using this, we demonstrate low-curvature neural networks (LCNNs) that obtain drastically lower curvature than standard models while exhibiting similar predictive performance, which leads to improved robustness and stable gradients, with only a marginally increased training time. To achieve this, we minimize a data-independent upper bound on the curvature of a neural network, which decomposes overall curvature in terms of curvatures and slopes of its constituent layers. To efficiently minimize this bound, we introduce two novel architectural components: first, a non-linearity called centered-softplus that is a stable variant of the softplus non-linearity, and second, a Lipschitz-constrained batch normalization layer. Our experiments show that LCNNs have lower curvature, more stable gradients and increased off-the-shelf adversarial robustness when compared to their standard high-curvature counterparts, all without affecting predictive performance. Our approach is easy to use and can be readily incorporated into existing neural network models.
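    A minimal guess at the centered-softplus component: softplus shifted so it passes through the origin (since softplus(0) = log 2), keeping activations zero-centered at zero input; the beta parameterization is an assumption about how the sharpness is exposed:

```python
# Sketch of a centered softplus: softplus_beta(x) - log(2)/beta, so that
# the activation is zero at zero input. The exact form in the paper
# may differ.
import math
import torch.nn as nn
import torch.nn.functional as F

class CenteredSoftplus(nn.Module):
    def __init__(self, beta=1.0):
        super().__init__()
        self.beta = beta

    def forward(self, x):
        return F.softplus(x, beta=self.beta) - math.log(2.0) / self.beta
```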
    Offline Reinforcement Learning for Visual Navigation. (arXiv:2212.08244v1 [cs.RO])
    Reinforcement learning can enable robots to navigate to distant goals while optimizing user-specified reward functions, including preferences for following lanes, staying on paved paths, or avoiding freshly mowed grass. However, online learning from trial-and-error for real-world robots is logistically challenging, and methods that instead can utilize existing datasets of robotic navigation data could be significantly more scalable and enable broader generalization. In this paper, we present ReViND, the first offline RL system for robotic navigation that can leverage previously collected data to optimize user-specified reward functions in the real-world. We evaluate our system for off-road navigation without any additional data collection or fine-tuning, and show that it can navigate to distant goals using only offline training from this dataset, and exhibit behaviors that qualitatively differ based on the user-specified reward function.  ( 2 min )
    CgAT: Center-Guided Adversarial Training for Deep Hashing-Based Retrieval. (arXiv:2204.10779v4 [cs.CV] UPDATED)
    Deep hashing has been extensively utilized in massive image retrieval because of its efficiency and effectiveness. However, deep hashing models are vulnerable to adversarial examples, making it essential to develop adversarial defense methods for image retrieval. Existing solutions achieved limited defense performance because of using weak adversarial samples for training and lacking discriminative optimization objectives to learn robust features. In this paper, we present a min-max based Center-guided Adversarial Training, namely CgAT, to improve the robustness of deep hashing networks through worst adversarial examples. Specifically, we first formulate the center code as a semantically-discriminative representative of the input image content, which preserves the semantic similarity with positive samples and dissimilarity with negative examples. We prove that the center code can be obtained directly from a mathematical formula. After obtaining the center codes in each optimization iteration of the deep hashing network, they are adopted to guide the adversarial training process. On the one hand, CgAT generates the worst adversarial examples as augmented data by maximizing the Hamming distance between the hash codes of the adversarial examples and the center codes. On the other hand, CgAT learns to mitigate the effects of adversarial samples by minimizing the Hamming distance to the center codes. Extensive experiments on the benchmark datasets demonstrate the effectiveness of our adversarial training algorithm in defending against adversarial attacks for deep hashing-based retrieval. Compared with the current state-of-the-art defense method, we significantly improve the defense performance by an average of 18.61%, 12.35%, and 11.56% on FLICKR-25K, NUS-WIDE, and MS-COCO, respectively.  ( 2 min )
    Efficient Long Sequence Modeling via State Space Augmented Transformer. (arXiv:2212.08136v1 [cs.CL])
    Transformer models have achieved superior performance in various natural language processing tasks. However, the quadratic computational cost of the attention mechanism limits its practicality for long sequences. There are existing attention variants that improve the computational efficiency, but they have limited ability to effectively compute global information. In parallel to Transformer models, state space models (SSMs) are tailored for long sequences, but they are not flexible enough to capture complicated local information. We propose SPADE, short for State sPace AugmenteD TransformEr. Specifically, we augment an SSM into the bottom layer of SPADE, and we employ efficient local attention methods for the other layers. The SSM augments global information, which compensates for the lack of long-range dependencies in local attention methods. Experimental results on the Long Range Arena benchmark and language modeling tasks demonstrate the effectiveness of the proposed method. To further demonstrate the scalability of SPADE, we pre-train large encoder-decoder models and present fine-tuning results on natural language understanding and natural language generation tasks.  ( 2 min )
    Dimensional criterion for forecasting nonlinear systems by reservoir computing. (arXiv:2202.05159v3 [cs.LG] UPDATED)
    Reservoir computers (RC) have proven useful as surrogate models for forecasting and replicating systems of chaotic dynamics. The quality of surrogate models based on RCs crucially depends on their optimal implementation, which involves selecting an optimal reservoir topology and hyperparameters. By systematically applying Bayesian hyperparameter optimization and using ensembles of reservoirs of various topology, we show that the connectedness of reservoirs is of significance only in forecasting and replicating chaotic systems of sufficient complexity. By applying RCs of different topology to forecasting and replicating the Lorenz system, a coupled Wilson-Cowan system, and the Kuramoto-Sivashinsky system, we show that simple reservoirs of unconnected nodes (RUN) outperform reservoirs of connected nodes for target systems whose estimated fractal dimension is $d \lesssim 5.5$ and that linked reservoirs are better for systems with $d > 5.5$. This finding is highly important for evaluating reservoir computing methods and for selecting a method for the prediction of signals measured on nonlinear systems.
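    A sketch of a reservoir of unconnected nodes: the recurrent weight matrix is diagonal, so nodes evolve independently, and only a ridge-regression readout is trained. The leak rate, weight ranges, and reservoir size below are illustrative hyperparameters:

```python
# Sketch of a RUN (reservoir of unconnected nodes) for one-step forecasting.
import numpy as np

def run_reservoir(u, n_nodes=300, leak=0.5, seed=0, ridge=1e-6):
    """u: (T, d_in) input/target time series."""
    rng = np.random.default_rng(seed)
    w_diag = rng.uniform(-0.9, 0.9, n_nodes)      # unconnected: diagonal W
    w_in = rng.uniform(-1, 1, (n_nodes, u.shape[1]))
    x, states = np.zeros(n_nodes), []
    for u_t in u:
        pre = w_diag * x + w_in @ u_t             # no cross-node coupling
        x = (1 - leak) * x + leak * np.tanh(pre)
        states.append(x.copy())
    X = np.asarray(states)[:-1]                   # states up to T-1
    Y = u[1:]                                     # next-step targets
    # Ridge-regression readout, the only trained component.
    W_out = np.linalg.solve(X.T @ X + ridge * np.eye(n_nodes), X.T @ Y)
    return W_out, X @ W_out
```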
    Capturing Label Characteristics in VAEs. (arXiv:2006.10102v3 [cs.LG] UPDATED)
    We present a principled approach to incorporating labels in VAEs that captures the rich characteristic information associated with those labels. While prior work has typically conflated these by learning latent variables that directly correspond to label values, we argue this is contrary to the intended effect of supervision in VAEs: capturing rich label characteristics with the latents. For example, we may want to capture the characteristics of a face that make it look young, rather than just the age of the person. To this end, we develop the CCVAE, a novel VAE model and concomitant variational objective which captures label characteristics explicitly in the latent space, eschewing direct correspondences between label values and latents. Through judicious structuring of mappings between such characteristic latents and labels, we show that the CCVAE can effectively learn meaningful representations of the characteristics of interest across a variety of supervision schemes. In particular, we show that the CCVAE allows for more effective and more general interventions to be performed, such as smooth traversals within the characteristics for a given label, diverse conditional generation, and transferring characteristics across datapoints.  ( 2 min )
    Robust Explanation Constraints for Neural Networks. (arXiv:2212.08507v1 [cs.LG])
    Post-hoc explanation methods are used with the intent of providing insights about neural networks and are sometimes said to help engender trust in their outputs. However, popular explanation methods have been found to be fragile to minor perturbations of input features or model parameters. Relying on constraint relaxation techniques from non-convex optimization, we develop a method that upper-bounds the largest change an adversary can make to a gradient-based explanation via bounded manipulation of either the input features or model parameters. By propagating a compact input or parameter set as symbolic intervals through the forwards and backwards computations of the neural network we can formally certify the robustness of gradient-based explanations. Our bounds are differentiable, hence we can incorporate provable explanation robustness into neural network training. Empirically, our method surpasses the robustness provided by previous heuristic approaches. We find that our training method is the only method able to learn neural networks with certificates of explanation robustness across all six datasets tested.  ( 2 min )
    Can We Find Strong Lottery Tickets in Generative Models?. (arXiv:2212.08311v1 [cs.CV])
    Yes. In this paper, we investigate strong lottery tickets in generative models, the subnetworks that achieve good generative performance without any weight update. Neural network pruning is considered the main cornerstone of model compression for reducing the costs of computation and memory. Unfortunately, pruning a generative model has not been extensively explored, and all existing pruning algorithms suffer from excessive weight-training costs, performance degradation, limited generalizability, or complicated training. To address these problems, we propose to find a strong lottery ticket via moment-matching scores. Our experimental results show that the discovered subnetwork can perform similarly or better than the trained dense model even when only 10% of the weights remain. To the best of our knowledge, we are the first to show the existence of strong lottery tickets in generative models and to provide an algorithm to find them stably. Our code and supplementary materials are publicly available.  ( 2 min )
    An Empirical Study of Deep Learning Models for Vulnerability Detection. (arXiv:2212.08109v1 [cs.SE])
    Deep learning (DL) models of code have recently reported great progress for vulnerability detection. In some cases, DL-based models have outperformed static analysis tools. Although many great models have been proposed, we do not yet have a good understanding of these models. This limits the further advancement of model robustness, debugging, and deployment for vulnerability detection. In this paper, we surveyed and reproduced 9 state-of-the-art (SOTA) deep learning models on 2 widely used vulnerability detection datasets: Devign and MSR. We investigated 6 research questions in three areas, namely model capabilities, training data, and model interpretation. We experimentally demonstrated the variability between different runs of a model and the low agreement among different models' outputs. We investigated models trained for specific types of vulnerabilities compared to a model that is trained on all the vulnerabilities at once. We explored the types of programs DL may consider "hard" to handle. We investigated the relations of training data sizes and training data composition with model performance. Finally, we studied model interpretations and analyzed important features that the models used to make predictions. We believe that our findings can help better understand model results, provide guidance on preparing training data, and improve the robustness of the models. All of our datasets, code, and results are available at https://figshare.com/s/284abfba67dba448fdc2.  ( 2 min )
    Learning to Drop Out: An Adversarial Approach to Training Sequence VAEs. (arXiv:2209.12590v2 [cs.LG] UPDATED)
    In principle, applying variational autoencoders (VAEs) to sequential data offers a method for controlled sequence generation, manipulation, and structured representation learning. However, training sequence VAEs is challenging: autoregressive decoders can often explain the data without utilizing the latent space, known as posterior collapse. To mitigate this, state-of-the-art models weaken the powerful decoder by applying uniformly random dropout to the decoder input. We show theoretically that this removes pointwise mutual information provided by the decoder input, which is compensated for by utilizing the latent space. We then propose an adversarial training strategy to achieve information-based stochastic dropout. Compared to uniform dropout on standard text benchmark datasets, our targeted approach increases both sequence modeling performance and the information captured in the latent space.  ( 2 min )
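    For reference, the uniform-dropout baseline that this abstract refers to fits in a few lines; the token ids, UNK index and rate below are placeholders, and the paper's contribution is to replace this uniform mask with an adversarially learned, information-based one.

        import torch

        def uniform_word_dropout(tokens, rate=0.3, unk_id=1):
            # Independently replace each decoder-input token with UNK, forcing
            # the decoder to lean on the latent code rather than the prefix.
            mask = torch.rand(tokens.shape) < rate
            return torch.where(mask, torch.full_like(tokens, unk_id), tokens)

        batch = torch.randint(2, 100, (4, 12))  # toy token ids
        print(uniform_word_dropout(batch))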
    Foresight -- Deep Generative Modelling of Patient Timelines using Electronic Health Records. (arXiv:2212.08072v1 [cs.CL])
    Electronic Health Records (EHRs) hold detailed longitudinal information about each patient's health status and general clinical history, a large portion of which is stored within the unstructured text. Temporal modelling of this medical history, which considers the sequence of events, can be used to forecast and simulate future events, estimate risk, suggest alternative diagnoses or forecast complications. While most prediction approaches use mainly structured data or a subset of single-domain forecasts and outcomes, we processed the entire free-text portion of EHRs for longitudinal modelling. We present Foresight, a novel GPT3-based pipeline that uses NER+L tools (i.e. MedCAT) to convert document text into structured, coded concepts, followed by providing probabilistic forecasts for future medical events such as disorders, medications, symptoms and interventions. Since large portions of EHR data are in text form, such an approach benefits from a granular and detailed view of a patient while introducing modest additional noise. In tests on two large UK hospitals (King's College Hospital, South London and Maudsley) and the US MIMIC-III dataset, precision@10 scores of 0.80, 0.81 and 0.91 were achieved for forecasting the next biomedical concept. Foresight was also validated on 34 synthetic patient timelines by 5 clinicians and achieved a relevancy of 97% for the top forecasted candidate disorder. Foresight can be easily trained and deployed locally as it only requires free-text data (as a minimum). As a generative model, it can simulate follow-on disorders, medications and interventions for as many steps as required. Foresight is a general-purpose model for biomedical concept modelling that can be used for real-world risk estimation, virtual trials and clinical research to study the progression of diseases, simulate interventions and counterfactuals, and for educational purposes.  ( 2 min )
    Dual Moving Average Pseudo-Labeling for Source-Free Inductive Domain Adaptation. (arXiv:2212.08187v1 [cs.LG])
    Unsupervised domain adaptation reduces the reliance on data annotation in deep learning by adapting knowledge from a source to a target domain. For privacy and efficiency concerns, source-free domain adaptation extends unsupervised domain adaptation by adapting a pre-trained source model to an unlabeled target domain without accessing the source data. However, most existing source-free domain adaptation methods to date focus on the transductive setting, where the target training set is also the testing set. In this paper, we address source-free domain adaptation in the more realistic inductive setting, where the target training and testing sets are mutually exclusive. We propose a new semi-supervised fine-tuning method named Dual Moving Average Pseudo-Labeling (DMAPL) for source-free inductive domain adaptation. We first split the unlabeled training set in the target domain into a pseudo-labeled confident subset and an unlabeled less-confident subset according to the prediction confidence scores from the pre-trained source model. Then we propose a soft-label moving-average updating strategy for the unlabeled subset based on a moving-average prototypical classifier, which gradually adapts the source model towards the target domain. Experiments show that our proposed method achieves state-of-the-art performance and outperforms previous methods by large margins.  ( 2 min )
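    A schematic of the two ingredients, assuming a frozen source model, an arbitrary confidence threshold and an arbitrary momentum; the authors' prototypical classifier and exact update rules differ in detail.

        import torch
        import torch.nn.functional as F

        def split_by_confidence(logits, tau=0.9):
            # Confident target samples get hard pseudo-labels; the rest
            # form the less-confident, soft-labeled subset.
            probs = F.softmax(logits, dim=1)
            conf, pseudo = probs.max(dim=1)
            return conf >= tau, pseudo

        def ema_update(old_soft, new_soft, momentum=0.99):
            # Moving-average soft labels for the less-confident subset.
            return momentum * old_soft + (1 - momentum) * new_soft

        logits = torch.randn(16, 5)            # frozen source model outputs
        keep, pseudo = split_by_confidence(logits)
        soft = torch.full((16, 5), 1 / 5)      # init: uniform soft labels
        soft = ema_update(soft, F.softmax(logits, dim=1))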
    Uniform Sequence Better: Time Interval Aware Data Augmentation for Sequential Recommendation. (arXiv:2212.08262v1 [cs.IR])
    Sequential recommendation is an important task to predict the next item to access based on a sequence of interacted items. Most existing works learn user preference as the transition pattern from the previous item to the next one, ignoring the time interval between these two items. However, we observe that the time intervals in a sequence may vary significantly, which results in ineffective user modeling due to the issue of \emph{preference drift}. In fact, we conducted an empirical study to validate this observation, and found that a sequence with uniformly distributed time intervals (denoted as a uniform sequence) is more beneficial for performance improvement than one with greatly varying time intervals. Therefore, we propose to augment sequence data from the perspective of time intervals, which has not been studied in the literature. Specifically, we design five operators (Ti-Crop, Ti-Reorder, Ti-Mask, Ti-Substitute, Ti-Insert) to transform the original non-uniform sequence into a uniform one, taking the variance of time intervals into account. Then, we devise a control strategy to execute data augmentation on item sequences of different lengths. Finally, we implement these improvements on a state-of-the-art model, CoSeRec, and validate our approach on four real datasets. The experimental results show that our approach reaches significantly better performance than the other 11 competing methods. Our implementation is available: https://github.com/KingGugu/TiCoSeRec.  ( 2 min )
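    A plausible reading of one operator, assuming Ti-Crop keeps the contiguous window whose successive time intervals have the smallest variance; window length and data are illustrative.

        import numpy as np

        def ti_crop(items, times, win=5):
            # Keep the length-`win` window whose gaps are most uniform.
            gaps = np.diff(times)
            best = min(range(len(items) - win + 1),
                       key=lambda i: gaps[i:i + win - 1].var())
            return items[best:best + win], times[best:best + win]

        rng = np.random.default_rng(0)
        items = np.arange(10)
        times = np.cumsum(rng.exponential(1.0, size=10))
        print(ti_crop(items, times))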
    Reducing Sequence Length Learning Impacts on Transformer Models. (arXiv:2212.08399v1 [cs.LG])
    Classification algorithms using Transformer architectures can be affected by the sequence length learning problem whenever observations from different classes have different length distributions. This problem leads models to use sequence length as a predictive feature instead of relying on important textual information. Even if most public datasets are not affected by this problem, private corpora for fields such as medicine and insurance may carry this data bias. This poses challenges throughout the value chain given their usage in a machine learning application. In this paper, we empirically expose this problem and present approaches to minimize its impacts.  ( 2 min )
    Evaluation of Synthetic Datasets for Conversational Recommender Systems. (arXiv:2212.08167v1 [cs.CL])
    For researchers leveraging Large-Language Models (LLMs) in the generation of training datasets, especially for conversational recommender systems, the absence of robust evaluation frameworks has been a long-standing problem. The efficiency brought about by LLMs in the data generation phase is impeded during the process of evaluation of the generated data, since it generally requires human raters to ensure that the data generated is of high quality and has sufficient diversity. Since the quality of training data is critical for downstream applications, it is important to develop metrics that evaluate the quality holistically and identify biases. In this paper, we present a framework that takes a multi-faceted approach towards evaluating datasets produced by generative models and discuss the advantages and limitations of various evaluation methods.  ( 2 min )
    Toward Improved Generalization: Meta Transfer of Self-supervised Knowledge on Graphs. (arXiv:2212.08217v1 [cs.LG])
    Despite the remarkable success achieved by graph convolutional networks for functional brain activity analysis, the heterogeneity of functional patterns and the scarcity of imaging data still pose challenges in many tasks. Transferring knowledge from a source domain with abundant training data to a target domain is effective for improving representation learning on scarce training data. However, traditional transfer learning methods often fail to generalize the pre-trained knowledge to the target task due to domain discrepancy. Self-supervised learning on graphs can increase the generalizability of graph features since self-supervision concentrates on inherent graph properties that are not limited to a particular supervised task. We propose a novel knowledge transfer strategy by integrating meta-learning with self-supervised learning to deal with the heterogeneity and scarcity of fMRI data. Specifically, we perform a self-supervised task on the source domain and apply meta-learning, which strongly improves the generalizability of the model using the bi-level optimization, to transfer the self-supervised knowledge to the target domain. Through experiments on a neurological disorder classification task, we demonstrate that the proposed strategy significantly improves target task performance by increasing the generalizability and transferability of graph-based knowledge.
    FiDO: Fusion-in-Decoder optimized for stronger performance and faster inference. (arXiv:2212.08153v1 [cs.CL])
    Fusion-in-Decoder (FiD) is a powerful retrieval-augmented language model that sets the state-of-the-art on many knowledge-intensive NLP tasks. However, FiD suffers from very expensive inference. We show that the majority of inference time results from memory bandwidth constraints in the decoder, and propose two simple changes to the FiD architecture to speed up inference by 7x. The faster decoder inference then allows for a much larger decoder. We denote FiD with the above modifications as FiDO, and show that it strongly improves performance over existing FiD models for a wide range of inference budgets. For example, FiDO-Large-XXL performs faster inference than FiD-Base and achieves better performance than FiD-Large.  ( 2 min )
    Provable Fairness for Neural Network Models using Formal Verification. (arXiv:2212.08578v1 [cs.LG])
    Machine learning models are increasingly deployed for critical decision-making tasks, making it important to verify that they do not contain gender or racial biases picked up from training data. Typical approaches to achieve fairness revolve around efforts to clean or curate training data, with post-hoc statistical evaluation of the fairness of the model on evaluation data. In contrast, we propose techniques to \emph{prove} fairness using recently developed formal methods that verify properties of neural network models. Beyond the strength of guarantee implied by a formal proof, our methods have the advantage that we do not need explicit training or evaluation data (which is often proprietary) in order to analyze a given trained model. In experiments on two familiar datasets in the fairness literature (COMPAS and ADULTS), we show that through proper training, we can reduce unfairness by an average of 65.4\% at a cost of less than 1\% in AUC score.  ( 2 min )
    Person Detection Using an Ultra Low-resolution Thermal Imager on a Low-cost MCU. (arXiv:2212.08415v1 [cs.CV])
    Detecting persons in images or video with neural networks is a well-studied subject in the literature. However, such works usually assume the availability of a camera of decent resolution and a high-performance processor or GPU to run the detection algorithm, which significantly increases the cost of a complete detection system. Yet many applications require low-cost solutions, composed of cheap sensors and simple microcontrollers. In this paper, we demonstrate that even on such hardware we are not condemned to simple classic image processing techniques. We propose a novel ultra-lightweight CNN-based person detector that processes thermal video from a low-cost 32x24 pixel static imager. Trained and compressed on our own recorded dataset, our model achieves up to 91.62% accuracy (F1-score), has less than 10k parameters, and runs in 87 ms and 46 ms on the low-cost microcontrollers STM32F407 and STM32F746, respectively.  ( 2 min )
    Learning Multimodal VAEs through Mutual Supervision. (arXiv:2106.12570v3 [cs.LG] UPDATED)
    Multimodal VAEs seek to model the joint distribution over heterogeneous data (e.g.\ vision, language), whilst also capturing a shared representation across such modalities. Prior work has typically combined information from the modalities by reconciling idiosyncratic representations directly in the recognition model through explicit products, mixtures, or other such factorisations. Here we introduce a novel alternative, the MEME, that avoids such explicit combinations by repurposing semi-supervised VAEs to combine information between modalities implicitly through mutual supervision. This formulation naturally allows learning from partially-observed data where some modalities can be entirely missing -- something that most existing approaches either cannot handle, or do so to a limited extent. We demonstrate that MEME outperforms baselines on standard metrics across both partial and complete observation schemes on the MNIST-SVHN (image-image) and CUB (image-text) datasets. We also contrast the quality of the representations learnt by mutual supervision against standard approaches and observe interesting trends in its ability to capture relatedness between data.  ( 2 min )
    Learning to repeatedly solve routing problems. (arXiv:2212.08101v1 [math.OC])
    In recent years, there has been great interest in machine-learning-based heuristics for solving NP-hard combinatorial optimization problems. The developed methods have shown potential on many optimization problems. In this paper, we present a learned heuristic for the reoptimization of a problem after a minor change in its data. We focus on the case of the capacitated vehicle routing problem with static clients (i.e., same client locations) and changed demands. Given the edges of an original solution, the goal is to predict and fix the ones that have a high chance of remaining in an optimal solution after a change of client demands. This partial prediction of the solution reduces the complexity of the problem and speeds up its resolution, while yielding a good quality solution. The proposed approach resulted in solutions with an optimality gap ranging from 0\% to 1.7\% on different benchmark instances within a reasonable computing time.  ( 2 min )
    The KITMUS Test: Evaluating Knowledge Integration from Multiple Sources in Natural Language Understanding Systems. (arXiv:2212.08192v1 [cs.CL])
    Many state-of-the-art natural language understanding (NLU) models are based on pretrained neural language models. These models often make inferences using information from multiple sources. An important class of such inferences are those that require both background knowledge, presumably contained in a model's pretrained parameters, and instance-specific information that is supplied at inference time. However, the integration and reasoning abilities of NLU models in the presence of multiple knowledge sources have been largely understudied. In this work, we propose a test suite of coreference resolution tasks that require reasoning over multiple facts. Our dataset is organized into subtasks that differ in terms of which knowledge sources contain relevant facts. We evaluate state-of-the-art coreference resolution models on our dataset. Our results indicate that several models struggle to reason on-the-fly over knowledge observed both at pretrain time and at inference time. However, with task-specific training, a subset of models demonstrates the ability to integrate certain knowledge types from multiple sources.  ( 2 min )
    Bayesian posterior approximation with stochastic ensembles. (arXiv:2212.08123v1 [cs.LG])
    We introduce ensembles of stochastic neural networks to approximate the Bayesian posterior, combining stochastic methods such as dropout with deep ensembles. The stochastic ensembles are formulated as families of distributions and trained to approximate the Bayesian posterior with variational inference. We implement stochastic ensembles based on Monte Carlo dropout, DropConnect and a novel non-parametric version of dropout and evaluate them on a toy problem and CIFAR image classification. For CIFAR, the stochastic ensembles are quantitatively compared to published Hamiltonian Monte Carlo results for a ResNet-20 architecture. We also test the quality of the posteriors directly against Hamiltonian Monte Carlo simulations in a simplified toy model. Our results show that in a number of settings, stochastic ensembles provide more accurate posterior estimates than regular deep ensembles.  ( 2 min )
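    A minimal sketch of one such stochastic ensemble, using Monte Carlo dropout members: several independently initialized dropout networks, each sampled a few times at prediction, with all predictive distributions averaged. Sizes and the sampling budget are arbitrary.

        import torch
        import torch.nn as nn

        def make_net(p=0.2):
            return nn.Sequential(nn.Linear(32, 64), nn.ReLU(),
                                 nn.Dropout(p), nn.Linear(64, 10))

        ensemble = [make_net() for _ in range(4)]  # train each member separately

        def predict(x, samples=8):
            probs = []
            for net in ensemble:
                net.train()  # keep dropout stochastic at test time (MC dropout)
                with torch.no_grad():
                    probs += [net(x).softmax(-1) for _ in range(samples)]
            return torch.stack(probs).mean(0)

        print(predict(torch.randn(5, 32)).shape)  # torch.Size([5, 10])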
    Huber-energy measure quantization. (arXiv:2212.08162v1 [stat.ML])
    We describe a measure quantization procedure, i.e., an algorithm which finds the best approximation of a target probability law (and more generally a signed finite variation measure) by a sum of Q Dirac masses (Q being the quantization parameter). The procedure is implemented by minimizing the statistical distance between the original measure and its quantized version; the distance is built from a negative definite kernel and, if necessary, can be computed on the fly and fed to a stochastic optimization algorithm (such as SGD, Adam, ...). We theoretically investigate the fundamental question of the existence of an optimal measure quantizer and identify the kernel properties required to guarantee suitable behavior. We test the procedure, called HEMQ, on several databases: multi-dimensional Gaussian mixtures, Wiener space cubature, Italian wine cultivars and the MNIST image database. The results indicate that the HEMQ algorithm is robust and versatile and, for the class of Huber-energy kernels, it matches the expected intuitive behavior.  ( 2 min )
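    A sketch of the optimization loop under the energy kernel k(x, y) = -|x - y| (a negative definite kernel), with the Dirac locations trained by Adam; the smoothing epsilon, sample sizes and learning rate are our own choices.

        import torch

        def mean_dist(u, v, eps=1e-9):
            # Smoothed mean pairwise distance; eps avoids the
            # non-differentiable point at zero distance.
            d2 = ((u[:, None, :] - v[None, :, :]) ** 2).sum(-1)
            return torch.sqrt(d2 + eps).mean()

        target = torch.randn(1000, 2)               # sample of the target law
        c_tt = mean_dist(target, target)            # constant term, computed once
        y = torch.randn(16, 2, requires_grad=True)  # Q = 16 Dirac locations
        opt = torch.optim.Adam([y], lr=0.05)

        for step in range(500):
            opt.zero_grad()
            # Squared energy distance: 2 E|X-Y| - E|X-X'| - E|Y-Y'|.
            loss = 2 * mean_dist(target, y) - c_tt - mean_dist(y, y)
            loss.backward()
            opt.step()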
    NBC-Softmax : Darkweb Author fingerprinting and migration tracking. (arXiv:2212.08184v1 [cs.LG])
    Metric learning aims to learn distances from the data, which enhances the performance of similarity-based algorithms. An author style detection task is a metric learning problem, where learning style features with small intra-class variations and larger inter-class differences is of great importance to achieve better performance. Recently, metric learning based on softmax loss has been used successfully for style detection. While softmax loss can produce separable representations, its discriminative power is relatively poor. In this work, we propose NBC-Softmax, a contrastive-loss-based clustering technique for softmax loss, which is more intuitive and able to achieve superior performance. Our technique meets the criterion for a larger number of samples, thus achieving block contrastiveness, which is proven to outperform pair-wise losses. It uses mini-batch sampling effectively and is scalable. Experiments on 4 darkweb social forums, with NBCSAuthor that uses the proposed NBC-Softmax for author and sybil detection, show that our negative block contrastive approach consistently outperforms state-of-the-art methods using the same network architecture. Our code is publicly available at: https://github.com/gayanku/NBC-Softmax  ( 2 min )
    Reliable Measures of Spread in High Dimensional Latent Spaces. (arXiv:2212.08172v1 [cs.LG])
    Understanding geometric properties of natural language processing models' latent spaces allows the manipulation of these properties for improved performance on downstream tasks. One such property is the amount of data spread in a model's latent space, or how fully the available latent space is being used. In this work, we define data spread and demonstrate that the commonly used measures of data spread, Average Cosine Similarity and a partition function min/max ratio I(V), do not provide reliable metrics to compare the use of latent space across models. We propose and examine eight alternative measures of data spread, all but one of which improve over these current metrics when applied to seven synthetic data distributions. Of our proposed measures, we recommend one principal component-based measure and one entropy-based measure that provide reliable, relative measures of spread and can be used to compare models of different sizes and dimensionalities.  ( 2 min )
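    For concreteness, here is the commonly used Average Cosine Similarity next to one entropy-based candidate in the spirit of the paper's proposals (entropy of normalized covariance eigenvalues, maximal when variance is spread evenly over all latent directions); the authors' exact definitions may differ.

        import numpy as np

        def avg_cosine_similarity(X):
            Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
            S, n = Xn @ Xn.T, len(X)
            return (S.sum() - n) / (n * (n - 1))   # mean over off-diagonal pairs

        def eigen_entropy(X):
            lam = np.linalg.eigvalsh(np.cov(X, rowvar=False))
            p = lam / lam.sum()
            p = p[p > 0]                           # drop numerically zero modes
            return -(p * np.log(p)).sum()

        Z = np.random.default_rng(0).normal(size=(500, 64))
        print(avg_cosine_similarity(Z), eigen_entropy(Z))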
    Non-IID Transfer Learning on Graphs. (arXiv:2212.08174v1 [cs.LG])
    Transfer learning refers to the transfer of knowledge or information from a relevant source domain to a target domain. However, most existing transfer learning theories and algorithms focus on IID tasks, where the source/target samples are assumed to be independent and identically distributed. Very little effort is devoted to theoretically studying the knowledge transferability on non-IID tasks, e.g., cross-network mining. To bridge the gap, in this paper, we propose rigorous generalization bounds and algorithms for cross-network transfer learning from a source graph to a target graph. The crucial idea is to characterize the cross-network knowledge transferability from the perspective of the Weisfeiler-Lehman graph isomorphism test. To this end, we propose a novel Graph Subtree Discrepancy to measure the graph distribution shift between source and target graphs. Then the generalization error bounds on cross-network transfer learning, including both cross-network node classification and link prediction tasks, can be derived in terms of the source knowledge and the Graph Subtree Discrepancy across domains. This thereby motivates us to propose a generic graph adaptive network (GRADE) to minimize the distribution shift between source and target graphs for cross-network transfer learning. Experimental results verify the effectiveness and efficiency of our GRADE framework on both cross-network node classification and cross-domain recommendation tasks.  ( 2 min )
    Improving self-supervised representation learning via sequential adversarial masking. (arXiv:2212.08277v1 [cs.CV])
    Recent methods in self-supervised learning have demonstrated that masking-based pretext tasks extend beyond NLP, serving as useful pretraining objectives in computer vision. However, existing approaches apply random or ad hoc masking strategies that limit the difficulty of the reconstruction task and, consequently, the strength of the learnt representations. We improve upon current state-of-the-art work in learning adversarial masks by proposing a new framework that generates masks in a sequential fashion with different constraints on the adversary. This leads to improvements in performance on various downstream tasks, such as classification on ImageNet100, STL10, and CIFAR10/100 and segmentation on Pascal VOC. Our results further demonstrate the promising capabilities of masking-based approaches for SSL in computer vision.  ( 2 min )
    Learning Sparsity and Randomness for Data-driven Low Rank Approximation. (arXiv:2212.08186v1 [cs.LG])
    Learning-based low rank approximation algorithms can significantly improve the performance of randomized low rank approximation with a sketch matrix. With the learned values and fixed non-zero positions for sketch matrices from learning-based algorithms, these matrices can reduce the test error of low rank approximation significantly. However, there is still no good method to learn the non-zero positions or to overcome the out-of-distribution performance loss. In this work, we introduce two new methods, Learning Sparsity and Learning Randomness, which learn better sparsity patterns and add randomness to the values of the sketch matrix. These two methods can be applied to any learning-based algorithm that uses the sketch matrix directly. Our experiments show that these two methods can improve the performance of previous learning-based algorithms on both test error and out-of-distribution test error without adding too much complexity.  ( 2 min )
    LiFe-net: Data-driven Modelling of Time-dependent Temperatures and Charging Statistics Of Tesla's LiFePo4 EV Battery. (arXiv:2212.08403v1 [cs.LG])
    Modelling the temperature of Electric Vehicle (EV) batteries is a fundamental task of EV manufacturing. Extreme temperatures in the battery packs can affect their longevity and power output. Although theoretical models exist for describing heat transfer in battery packs, they are computationally expensive to simulate. Furthermore, it is difficult to acquire data measurements from within the battery cell. In this work, we propose a data-driven surrogate model (LiFe-net) that uses readily accessible driving diagnostics for battery temperature estimation to overcome these limitations. This model incorporates Neural Operators with a traditional numerical integration scheme to estimate the temperature evolution. Moreover, we propose two further variations of the baseline model: LiFe-net trained with a regulariser and LiFe-net trained with time stability loss. We compared these models in terms of generalization error on test data. The results showed that LiFe-net trained with time stability loss outperforms the other two models and can estimate the temperature evolution on unseen data with a relative error of 2.77 % on average.  ( 2 min )
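    A minimal sketch of the modelling pattern, assuming explicit Euler as the traditional integration scheme and made-up input dimensions; the actual LiFe-net operator, diagnostics and losses differ.

        import torch
        import torch.nn as nn

        class LiFeNetLike(nn.Module):
            # A network predicts dT/dt from the current temperature and the
            # driving diagnostics u; temperature is rolled out by Euler steps.
            def __init__(self, n_diag=6, hidden=64):
                super().__init__()
                self.f = nn.Sequential(nn.Linear(1 + n_diag, hidden), nn.Tanh(),
                                       nn.Linear(hidden, 1))

            def rollout(self, T0, u_seq, dt=1.0):
                T, out = T0, []
                for u in u_seq:                      # u: (batch, n_diag)
                    T = T + dt * self.f(torch.cat([T, u], dim=-1))
                    out.append(T)
                return torch.stack(out)

        traj = LiFeNetLike().rollout(torch.zeros(8, 1), torch.randn(20, 8, 6))
        print(traj.shape)  # torch.Size([20, 8, 1])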
    An ensemble neural network approach to forecast Dengue outbreak based on climatic condition. (arXiv:2212.08323v1 [q-bio.PE])
    Dengue fever is a virulent disease spreading over 100 tropical and subtropical countries in Africa, the Americas, and Asia. This arboviral disease affects around 400 million people globally, severely distressing healthcare systems. The unavailability of a specific drug and a ready-to-use vaccine makes the situation worse. Hence, policymakers must rely on early warning systems to control intervention-related decisions. Forecasts routinely provide critical information for dangerous epidemic events. However, the available forecasting models (e.g., weather-driven mechanistic, statistical time series, and machine learning models) lack a clear understanding of different components to improve prediction accuracy and often provide unstable and unreliable forecasts. This study proposes an ensemble wavelet neural network with exogenous factor(s) (XEWNet) model that can produce reliable estimates for dengue outbreak prediction for three geographical regions, namely San Juan, Iquitos, and Ahmedabad. The proposed XEWNet model is flexible and can easily incorporate exogenous climate variable(s) confirmed by statistical causality tests in its scalable framework. The proposed model is an integrated approach that incorporates wavelet transformation into an ensemble neural network framework, which helps in generating more reliable long-term forecasts. The proposed XEWNet allows complex non-linear relationships between dengue incidence cases and rainfall, while remaining mathematically interpretable, fast in execution, and easily comprehensible. The proposal's competitiveness is measured using computational experiments based on various statistical metrics and several statistical comparison tests. In comparison with statistical, machine learning, and deep learning methods, our proposed XEWNet performs better in 75% of the cases for short-term and long-term forecasting of dengue incidence.  ( 2 min )
    First De-Trend then Attend: Rethinking Attention for Time-Series Forecasting. (arXiv:2212.08151v1 [cs.LG])
    Transformer-based models have gained large popularity and demonstrated promising results in long-term time-series forecasting in recent years. In addition to learning attention in time domain, recent works also explore learning attention in frequency domains (e.g., Fourier domain, wavelet domain), given that seasonal patterns can be better captured in these domains. In this work, we seek to understand the relationships between attention models in different time and frequency domains. Theoretically, we show that attention models in different domains are equivalent under linear conditions (i.e., linear kernel to attention scores). Empirically, we analyze how attention models of different domains show different behaviors through various synthetic experiments with seasonality, trend and noise, with emphasis on the role of softmax operation therein. Both these theoretical and empirical analyses motivate us to propose a new method: TDformer (Trend Decomposition Transformer), that first applies seasonal-trend decomposition, and then additively combines an MLP which predicts the trend component with Fourier attention which predicts the seasonal component to obtain the final prediction. Extensive experiments on benchmark time-series forecasting datasets demonstrate that TDformer achieves state-of-the-art performance against existing attention-based models.  ( 2 min )
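    The overall structure can be sketched as below, with a moving-average seasonal-trend split and a plain linear head standing in for the Fourier attention branch; shapes and the kernel size are illustrative.

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        def decompose(x, k=25):
            # Trend = moving average; seasonal = residual. x: (batch, 1, L).
            pad = k // 2
            trend = F.avg_pool1d(F.pad(x, (pad, pad), mode='replicate'),
                                 kernel_size=k, stride=1)
            return x - trend, trend

        class TDLike(nn.Module):
            def __init__(self, L_in=96, L_out=24):
                super().__init__()
                self.trend_head = nn.Linear(L_in, L_out)  # MLP branch (trend)
                self.seas_head = nn.Linear(L_in, L_out)   # stand-in for Fourier attention

            def forward(self, x):
                seasonal, trend = decompose(x)
                return self.trend_head(trend) + self.seas_head(seasonal)

        print(TDLike()(torch.randn(4, 1, 96)).shape)  # torch.Size([4, 1, 24])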
    Materials Discovery using Max K-Armed Bandit. (arXiv:2212.08225v1 [stat.ML])
    Search algorithms for bandit problems are applicable in materials discovery. However, the objectives of the conventional bandit problem differ from those of materials discovery. The conventional bandit problem aims to maximize the total rewards, whereas materials discovery aims to achieve breakthroughs in material properties. The max K-armed bandit (MKB) problem, which aims to acquire the single best reward, matches discovery tasks better than the conventional bandit. Thus, here, we propose a search algorithm for materials discovery based on the MKB problem using a pseudo-value of the upper confidence bound of the expected improvement of the best reward. This approach is pseudo-guaranteed to be an asymptotic oracle that does not depend on the time horizon. In addition, compared with other MKB algorithms, the proposed algorithm has only one hyperparameter, which is advantageous in materials discovery. We applied the proposed algorithm to synthetic problems and molecular-design demonstrations using a Monte Carlo tree search. According to the results, the proposed algorithm stably outperformed other bandit algorithms in the late stage of the search process, when the optimal arm of the MKB could not be determined based on its expected reward.  ( 2 min )
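    One plausible instantiation of such a selection rule, assuming Gaussian per-arm rewards and an optimistic (UCB-style) mean plugged into the expected-improvement formula; the constants and the exact pseudo-value are our guesses, not the paper's.

        import numpy as np
        from scipy.stats import norm

        def select_arm(means, stds, counts, best, beta=1.0):
            mu = means + beta * stds / np.sqrt(counts)      # optimistic mean
            z = (mu - best) / stds
            ei = (mu - best) * norm.cdf(z) + stds * norm.pdf(z)
            return int(np.argmax(ei))           # maximize EI of the single best reward

        rng = np.random.default_rng(0)
        true_mu = np.array([0.0, 0.3, 0.5])
        rewards = [[rng.normal(m)] for m in true_mu]        # one pull per arm to start
        for t in range(200):
            means = np.array([np.mean(r) for r in rewards])
            stds = np.array([np.std(r) + 1e-6 for r in rewards])
            counts = np.array([len(r) for r in rewards])
            a = select_arm(means, stds, counts, max(map(max, rewards)))
            rewards[a].append(rng.normal(true_mu[a]))
        print("best reward found:", max(map(max, rewards)))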
    Fundamental limits to learning closed-form mathematical models from data. (arXiv:2204.02704v2 [cs.LG] UPDATED)
    Given a finite and noisy dataset generated with a closed-form mathematical model, when is it possible to learn the true generating model from the data alone? This is the question we investigate here. We show that this model-learning problem displays a transition from a low-noise phase in which the true model can be learned, to a phase in which the observation noise is too high for the true model to be learned by any method. Both in the low-noise phase and in the high-noise phase, probabilistic model selection leads to optimal generalization to unseen data. This is in contrast to standard machine learning approaches, including artificial neural networks, which in this particular problem are limited, in the low-noise phase, by their ability to interpolate. In the transition region between the learnable and unlearnable phases, generalization is hard for all approaches including probabilistic model selection.  ( 2 min )
    Preventing RNN from Using Sequence Length as a Feature. (arXiv:2212.08276v1 [cs.LG])
    Recurrent neural networks are deep learning topologies that can be trained to classify long documents. However, in our recent work, we found a critical problem with these cells: they can use the length differences between texts of different classes as a prominent classification feature. This has the effect of producing models that are brittle and fragile to concept drift, can report misleading performance, and are trivially explainable regardless of text content. This paper illustrates the problem using synthetic and real-world data and provides a simple solution using weight decay regularization.  ( 2 min )
    Shapley variable importance cloud for machine learning models. (arXiv:2212.08370v1 [cs.LG])
    Current practice in interpretable machine learning often focuses on explaining the final model trained from data, e.g., by using the Shapley additive explanations (SHAP) method. The recently developed Shapley variable importance cloud (ShapleyVIC) extends the current practice to a group of "nearly optimal models" to provide comprehensive and robust variable importance assessments, with estimated uncertainty intervals for a more complete understanding of variable contributions to predictions. ShapleyVIC was initially developed for applications with traditional regression models, and the benefits of ShapleyVIC inference have been demonstrated in real-life prediction tasks using the logistic regression model. However, as a model-agnostic approach, ShapleyVIC application is not limited to such scenarios. In this work, we extend ShapleyVIC implementation for machine learning models to enable wider applications, and propose it as a useful complement to the current SHAP analysis to enable more trustworthy applications of these black-box models.  ( 2 min )
  • Open

    Connecting Permutation Equivariant Neural Networks and Partition Diagrams. (arXiv:2212.08648v1 [cs.LG])
    We show how the Schur-Weyl duality that exists between the partition algebra and the symmetric group results in a stronger theoretical foundation for characterising all of the possible permutation equivariant neural networks whose layers are some tensor power of the permutation representation $M_n$ of the symmetric group $S_n$. In doing so, we unify two separate bodies of literature, and we correct some of the major results that are now widely quoted by the machine learning community. In particular, we find a basis of matrices for the learnable, linear, permutation equivariant layer functions between such tensor power spaces in the standard basis of $M_n$ by using an elegant graphical representation of a basis of set partitions for the partition algebra and its related vector spaces. Also, we show how we can calculate the number of weights that must appear in these layer functions by looking at certain paths through the McKay quiver for $M_n$. Finally, we describe how our approach generalises to the construction of neural networks that are equivariant to local symmetries.
    Statistical Inference for Maximin Effects: Identifying Stable Associations across Multiple Studies. (arXiv:2011.07568v4 [stat.ME] UPDATED)
    Integrative analysis of data from multiple sources is critical to making generalizable discoveries. Associations that are consistently observed across multiple source populations are more likely to be generalized to target populations with possible distributional shifts. In this paper, we model the heterogeneous multi-source data with multiple high-dimensional regressions and make inferences for the maximin effect (Meinshausen, B{\"u}hlmann, AoS, 43(4), 1801--1830). The maximin effect provides a measure of stable associations across multi-source data. A significant maximin effect indicates that a variable has commonly shared effects across multiple source populations, and these shared effects may be generalized to a broader set of target populations. There are challenges associated with inferring maximin effects because its point estimator can have a non-standard limiting distribution. We devise a novel sampling method to construct valid confidence intervals for maximin effects. The proposed confidence interval attains a parametric length. This sampling procedure and the related theoretical analysis are of independent interest for solving other non-standard inference problems. Using genetic data on yeast growth in multiple environments, we demonstrate that the genetic variants with significant maximin effects have generalizable effects under new environments.
    Generalization Bounds for Inductive Matrix Completion in Low-noise Settings. (arXiv:2212.08339v1 [cs.LG])
    We study inductive matrix completion (matrix completion with side information) under an i.i.d. subgaussian noise assumption at a low noise regime, with uniform sampling of the entries. We obtain for the first time generalization bounds with the following three properties: (1) they scale like the standard deviation of the noise and in particular approach zero in the exact recovery case; (2) even in the presence of noise, they converge to zero when the sample size approaches infinity; and (3) for a fixed dimension of the side information, they only have a logarithmic dependence on the size of the matrix. Differently from many works in approximate recovery, we present results both for bounded Lipschitz losses and for the absolute loss, with the latter relying on Talagrand-type inequalities. The proofs create a bridge between two approaches to the theoretical analysis of matrix completion, since they consist in a combination of techniques from both the exact recovery literature and the approximate recovery literature.  ( 2 min )
    How Robust is Unsupervised Representation Learning to Distribution Shift?. (arXiv:2206.08871v2 [cs.LG] UPDATED)
    The robustness of machine learning algorithms to distribution shift is primarily discussed in the context of supervised learning (SL). As such, there is a lack of insight on the robustness of the representations learned from unsupervised methods, such as self-supervised learning (SSL) and auto-encoder based algorithms (AE), to distribution shift. We posit that the input-driven objectives of unsupervised algorithms lead to representations that are more robust to distribution shift than the target-driven objective of SL. We verify this by extensively evaluating the performance of SSL and AE on both synthetic and realistic distribution shift datasets. Following observations that the linear layer used for classification itself can be susceptible to spurious correlations, we evaluate the representations using a linear head trained on a small amount of out-of-distribution (OOD) data, to isolate the robustness of the learned representations from that of the linear head. We also develop "controllable" versions of existing realistic domain generalisation datasets with adjustable degrees of distribution shifts. This allows us to study the robustness of different learning algorithms under versatile yet realistic distribution shift conditions. Our experiments show that representations learned from unsupervised learning algorithms generalise better than SL under a wide variety of extreme as well as realistic distribution shifts.  ( 2 min )
    Efficient Conditionally Invariant Representation Learning. (arXiv:2212.08645v1 [cs.LG])
    We introduce the Conditional Independence Regression CovariancE (CIRCE), a measure of conditional independence for multivariate continuous-valued variables. CIRCE applies as a regularizer in settings where we wish to learn neural features $\varphi(X)$ of data $X$ to estimate a target $Y$, while being conditionally independent of a distractor $Z$ given $Y$. Both $Z$ and $Y$ are assumed to be continuous-valued but relatively low dimensional, whereas $X$ and its features may be complex and high dimensional. Relevant settings include domain-invariant learning, fairness, and causal learning. The procedure requires just a single ridge regression from $Y$ to kernelized features of $Z$, which can be done in advance. It is then only necessary to enforce independence of $\varphi(X)$ from residuals of this regression, which is possible with attractive estimation properties and consistency guarantees. By contrast, earlier measures of conditional feature dependence require multiple regressions for each step of feature learning, resulting in more severe bias and variance, and greater computational cost. When sufficiently rich features are used, we establish that CIRCE is zero if and only if $\varphi(X) \perp \!\!\! \perp Z \mid Y$. In experiments, we show superior performance to previous methods on challenging benchmarks, including learning conditionally invariant image features.  ( 2 min )
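    A schematic of the recipe with plain linear features of Z standing in for kernelized features, and a squared cross-covariance norm as the independence penalty; this simplifies the paper's estimator but keeps the structure: one ridge regression from Y, then a penalty on the residuals.

        import torch

        def circe_penalty(phi_x, z_feats, y, lam=1e-3):
            # Ridge-regress features of Z on Y (closed form, can be precomputed),
            # then penalize cross-covariance between phi(X) and the residuals.
            Y = torch.cat([y, torch.ones_like(y[:, :1])], dim=1)   # bias column
            A = Y.T @ Y + lam * torch.eye(Y.shape[1])
            resid = z_feats - Y @ torch.linalg.solve(A, Y.T @ z_feats)
            n = phi_x.shape[0]
            cov = (phi_x - phi_x.mean(0)).T @ (resid - resid.mean(0)) / n
            return (cov ** 2).sum()

        pen = circe_penalty(torch.randn(128, 16),  # phi(X)
                            torch.randn(128, 8),   # features of Z
                            torch.randn(128, 1))   # target Y
        print(pen)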
    DAGMA: Learning DAGs via M-matrices and a Log-Determinant Acyclicity Characterization. (arXiv:2209.08037v2 [cs.LG] UPDATED)
    The combinatorial problem of learning directed acyclic graphs (DAGs) from data was recently framed as a purely continuous optimization problem by leveraging a differentiable acyclicity characterization of DAGs based on the trace of a matrix exponential function. Existing acyclicity characterizations are based on the idea that powers of an adjacency matrix contain information about walks and cycles. In this work, we propose a new acyclicity characterization based on the log-determinant (log-det) function, which leverages the nilpotency property of DAGs. To deal with the inherent asymmetries of a DAG, we relate the domain of our log-det characterization to the set of $\textit{M-matrices}$, which is a key difference to the classical log-det function defined over the cone of positive definite matrices. Similar to acyclicity functions previously proposed, our characterization is also exact and differentiable. However, when compared to existing characterizations, our log-det function: (1) Is better at detecting large cycles; (2) Has better-behaved gradients; and (3) Its runtime is in practice about an order of magnitude faster. From the optimization side, we drop the typically used augmented Lagrangian scheme and propose DAGMA ($\textit{DAGs via M-matrices for Acyclicity}$), a method that resembles the central path for barrier methods. Each point in the central path of DAGMA is a solution to an unconstrained problem regularized by our log-det function, then we show that at the limit of the central path the solution is guaranteed to be a DAG. Finally, we provide extensive experiments for $\textit{linear}$ and $\textit{nonlinear}$ SEMs and show that our approach can reach large speed-ups and smaller structural Hamming distances against state-of-the-art methods. Code implementing the proposed method is open-source and publicly available at https://github.com/kevinsbello/dagma.  ( 3 min )
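    The acyclicity function itself is compact enough to state directly. The sketch below follows the paper's log-det form h(W) = -logdet(sI - W∘W) + d log s, where W∘W is the Hadamard square, on a toy 3-node graph.

        import torch

        def h_logdet(W, s=1.0):
            # h(W) = 0 iff the weighted adjacency matrix W is acyclic.
            d = W.shape[0]
            M = s * torch.eye(d) - W * W          # sI minus the Hadamard square
            return -torch.logdet(M) + d * torch.log(torch.tensor(s))

        W = torch.zeros(3, 3)
        W[0, 1], W[1, 2] = 0.8, -0.5
        print(h_logdet(W))     # ~0 for the DAG 0 -> 1 -> 2
        W[2, 0] = 0.7
        print(h_logdet(W))     # > 0 once the cycle 0 -> 1 -> 2 -> 0 is closed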
    Improving uplift model evaluation on RCT data. (arXiv:2210.02152v3 [stat.ME] UPDATED)
    Estimating treatment effects is one of the most challenging and important tasks of data analysts. In many applications, like online marketing and personalized medicine, treatment needs to be allocated to the individuals where it yields a high positive treatment effect. Uplift models help select the right individuals for treatment and maximize the overall treatment effect (uplift). A major challenge in uplift modeling concerns model evaluation. Previous literature suggests methods like the Qini curve and the transformed outcome mean squared error. However, these metrics suffer from variance: their evaluations are strongly affected by random noise in the data, which renders their signals, to a certain degree, arbitrary. We theoretically analyze the variance of uplift evaluation metrics and derive possible methods of variance reduction, which are based on statistical adjustment of the outcome. We derive simple conditions under which the variance reduction methods improve the uplift evaluation metrics and empirically demonstrate their benefits on simulated and real-world data. Our paper provides strong evidence in favor of applying the suggested variance reduction procedures by default when evaluating uplift models on RCT data.  ( 2 min )
    Partially Observable RL with B-Stability: Unified Structural Condition and Sharp Sample-Efficient Algorithms. (arXiv:2209.14990v2 [cs.LG] UPDATED)
    Partial Observability -- where agents can only observe partial information about the true underlying state of the system -- is ubiquitous in real-world applications of Reinforcement Learning (RL). Theoretically, learning a near-optimal policy under partial observability is known to be hard in the worst case due to an exponential sample complexity lower bound. Recent work has identified several tractable subclasses that are learnable with polynomial samples, such as Partially Observable Markov Decision Processes (POMDPs) with certain revealing or decodability conditions. However, this line of research is still in its infancy, where (1) unified structural conditions enabling sample-efficient learning are lacking; (2) existing sample complexities for known tractable subclasses are far from sharp; and (3) fewer sample-efficient algorithms are available than in fully observable RL. This paper advances all three aspects above for Partially Observable RL in the general setting of Predictive State Representations (PSRs). First, we propose a natural and unified structural condition for PSRs called \emph{B-stability}. B-stable PSRs encompass the vast majority of known tractable subclasses such as weakly revealing POMDPs, low-rank future-sufficient POMDPs, decodable POMDPs, and regular PSRs. Next, we show that any B-stable PSR can be learned with polynomial samples in relevant problem parameters. When instantiated in the aforementioned subclasses, our sample complexities improve substantially over the current best ones. Finally, our results are achieved by three algorithms simultaneously: Optimistic Maximum Likelihood Estimation, Estimation-to-Decisions, and Model-Based Optimistic Posterior Sampling. The latter two algorithms are new for sample-efficient learning of POMDPs/PSRs.  ( 2 min )
    Brauer's Group Equivariant Neural Networks. (arXiv:2212.08630v1 [cs.LG])
    We provide a full characterisation of all of the possible group equivariant neural networks whose layers are some tensor power of $\mathbb{R}^{n}$ for three symmetry groups that are missing from the machine learning literature: $O(n)$, the orthogonal group; $SO(n)$, the special orthogonal group; and $Sp(n)$, the symplectic group. In particular, we find a spanning set of matrices for the learnable, linear, equivariant layer functions between such tensor power spaces in the standard basis of $\mathbb{R}^{n}$ when the group is $O(n)$ or $SO(n)$, and in the symplectic basis of $\mathbb{R}^{n}$ when the group is $Sp(n)$. The neural networks that we characterise are simple to implement since our method circumvents the typical requirement when building group equivariant neural networks of having to decompose the tensor power spaces of $\mathbb{R}^{n}$ into irreducible representations. We also describe how our approach generalises to the construction of neural networks that are equivariant to local symmetries. The theoretical background for our results comes from the Schur-Weyl dualities that were established by Brauer in his 1937 paper "On Algebras Which are Connected with the Semisimple Continuous Groups" for each of the three groups in question. We suggest that Schur-Weyl duality is a powerful mathematical concept that could be used to understand the structure of neural networks that are equivariant to groups beyond those considered in this paper.  ( 2 min )
    Learnable Commutative Monoids for Graph Neural Networks. (arXiv:2212.08541v1 [cs.LG])
    Graph neural networks (GNNs) have been shown to be highly sensitive to the choice of aggregation function. While summing over a node's neighbours can approximate any permutation-invariant function over discrete inputs, Cohen-Karlik et al. [2020] proved there are set-aggregation problems for which summing cannot generalise to unbounded inputs, proposing recurrent neural networks regularised towards permutation-invariance as a more expressive aggregator. We show that these results carry over to the graph domain: GNNs equipped with recurrent aggregators are competitive with state-of-the-art permutation-invariant aggregators, on both synthetic benchmarks and real-world problems. However, despite the benefits of recurrent aggregators, their $O(V)$ depth makes them both difficult to parallelise and harder to train on large graphs. Inspired by the observation that a well-behaved aggregator for a GNN is a commutative monoid over its latent space, we propose a framework for constructing learnable, commutative, associative binary operators. And with this, we construct an aggregator of $O(\log V)$ depth, yielding exponential improvements for both parallelism and dependency length while achieving performance competitive with recurrent aggregators. Based on our empirical observations, our proposed learnable commutative monoid (LCM) aggregator represents a favourable tradeoff between efficient and expressive aggregators.  ( 2 min )
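    The core idea, reducing neighbour features with a learnable binary operator arranged in a balanced tree, can be sketched as follows; symmetrizing the operator's inputs is one simple way to encourage commutativity and is our choice here, not necessarily the paper's construction.

        import torch
        import torch.nn as nn

        class BinaryOp(nn.Module):
            # A learnable binary operator on the latent space, symmetrized
            # so that op(a, b) == op(b, a) by construction.
            def __init__(self, d=32):
                super().__init__()
                self.f = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU(), nn.Linear(d, d))

            def forward(self, a, b):
                return 0.5 * (self.f(torch.cat([a, b], -1)) +
                              self.f(torch.cat([b, a], -1)))

        def tree_reduce(op, xs):
            # Balanced O(log V)-depth reduction; the odd leftover is carried over.
            while xs.shape[0] > 1:
                n = xs.shape[0] // 2
                merged = op(xs[:2 * n:2], xs[1:2 * n:2])
                xs = torch.cat([merged, xs[2 * n:]], 0)
            return xs[0]

        print(tree_reduce(BinaryOp(), torch.randn(7, 32)).shape)  # torch.Size([32])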
    Estimating Higher-Order Mixed Memberships via the $\ell_{2,\infty}$ Tensor Perturbation Bound. (arXiv:2212.08642v1 [math.ST])
    Higher-order multiway data is ubiquitous in machine learning and statistics and often exhibits community-like structures, where each component (node) along each different mode has a community membership associated with it. In this paper we propose the tensor mixed-membership blockmodel, a generalization of the tensor blockmodel positing that memberships need not be discrete, but instead are convex combinations of latent communities. We establish the identifiability of our model and propose a computationally efficient estimation procedure based on the higher-order orthogonal iteration algorithm (HOOI) for tensor SVD composed with a simplex corner-finding algorithm. We then demonstrate the consistency of our estimation procedure by providing a per-node error bound, which showcases the effect of higher-order structures on estimation accuracy. To prove our consistency result, we develop the $\ell_{2,\infty}$ tensor perturbation bound for HOOI under independent, possibly heteroskedastic, subgaussian noise that may be of independent interest. Our analysis uses a novel leave-one-out construction for the iterates, and our bounds depend only on spectral properties of the underlying low-rank tensor under nearly optimal signal-to-noise ratio conditions such that tensor SVD is computationally feasible. Whereas other leave-one-out analyses typically focus on sequences constructed by analyzing the output of a given algorithm with a small part of the noise removed, our leave-one-out analysis constructions use both the previous iterates and the additional tensor structure to eliminate a potential additional source of error. Finally, we apply our methodology to real and simulated data, including applications to two flight datasets and a trade network dataset, demonstrating some effects not identifiable from the model with discrete community memberships.  ( 2 min )
    Dynamic Network Sampling for Community Detection. (arXiv:2208.13921v2 [cs.SI] UPDATED)
    We propose a dynamic network sampling scheme to optimize block recovery for the stochastic blockmodel (SBM) in the case where it is prohibitively expensive to observe the entire graph. Theoretically, we provide justification of our proposed Chernoff-optimal dynamic sampling scheme via the Chernoff information. Practically, we evaluate the performance, in terms of block recovery, of our method on several real datasets from different domains. Both theoretical and practical results suggest that our method can identify the vertices that have the most impact on block structure, so that one need only check whether there are edges between them, saving significant resources while still recovering the block structure.  ( 2 min )
    Quantifying the Preferential Direction of the Model Gradient in Adversarial Training With Projected Gradient Descent. (arXiv:2009.04709v4 [stat.ML] UPDATED)
    Adversarial training, especially projected gradient descent (PGD), has proven to be a successful approach for improving robustness against adversarial attacks. After adversarial training, gradients of models with respect to their inputs have a preferential direction. However, the direction of alignment is not mathematically well established, making it difficult to evaluate quantitatively. We propose a novel definition of this direction as the direction of the vector pointing toward the closest point of the support of the closest inaccurate class in decision space. To evaluate the alignment with this direction after adversarial training, we apply a metric that uses generative adversarial networks to produce the smallest residual needed to change the class present in the image. We show that PGD-trained models have a higher alignment than the baseline according to our definition, that our metric presents higher alignment values than a competing metric formulation, and that enforcing this alignment increases the robustness of models.  ( 2 min )
    Softmax Policy Gradient Methods Can Take Exponential Time to Converge. (arXiv:2102.11270v3 [cs.LG] UPDATED)
    The softmax policy gradient (PG) method, which performs gradient ascent under softmax policy parameterization, is arguably one of the de facto implementations of policy optimization in modern reinforcement learning. For $\gamma$-discounted infinite-horizon tabular Markov decision processes (MDPs), remarkable progress has recently been achieved towards establishing global convergence of softmax PG methods in finding a near-optimal policy. However, prior results fall short of delineating clear dependencies of convergence rates on salient parameters such as the cardinality of the state space $\mathcal{S}$ and the effective horizon $\frac{1}{1-\gamma}$, both of which could be excessively large. In this paper, we deliver a pessimistic message regarding the iteration complexity of softmax PG methods, despite assuming access to exact gradient computation. Specifically, we demonstrate that the softmax PG method with stepsize $\eta$ can take \[ \frac{1}{\eta} |\mathcal{S}|^{2^{\Omega\big(\frac{1}{1-\gamma}\big)}} ~\text{iterations} \] to converge, even in the presence of a benign policy initialization and an initial state distribution amenable to exploration (so that the distribution mismatch coefficient is not exceedingly large). This is accomplished by characterizing the algorithmic dynamics over a carefully-constructed MDP containing only three actions. Our exponential lower bound hints at the necessity of carefully adjusting update rules or enforcing proper regularization in accelerating PG methods.  ( 2 min )
    Estimation Contracts for Outlier-Robust Geometric Perception. (arXiv:2208.10521v2 [stat.ML] UPDATED)
    Outlier-robust estimation is a fundamental problem and has been extensively investigated by statisticians and practitioners. The last few years have seen a convergence across research fields towards "algorithmic robust statistics", which focuses on developing tractable outlier-robust techniques for high-dimensional estimation problems. Despite this convergence, research efforts across fields have been mostly disconnected from one another. This monograph bridges recent work on certifiable outlier-robust estimation for geometric perception in robotics and computer vision with parallel work in robust statistics. In particular, we adapt and extend recent results on robust linear regression (applicable to the low-outlier regime with < 50% outliers) to the setup commonly found in robotics and vision, where (i) variables (e.g., rotations, poses) belong to a non-convex domain, (ii) measurements are vector-valued, and (iii) the number of outliers is not known a priori. The emphasis here is on performance guarantees: rather than proposing radically new algorithms, we provide conditions on the input measurements under which modern estimation algorithms (possibly after small modifications) are guaranteed to recover an estimate close to the ground truth in the presence of outliers. These conditions are what we call an "estimation contract". Besides the proposed extensions of existing results, we believe the main contributions of this monograph are (i) to unify parallel research lines by pointing out commonalities and differences, (ii) to introduce advanced material (e.g., sum-of-squares proofs) in an accessible and self-contained presentation for the practitioner, and (iii) to point out a few immediate opportunities and open questions in outlier-robust geometric perception.  ( 2 min )
    Materials Discovery using Max K-Armed Bandit. (arXiv:2212.08225v1 [stat.ML])
    Search algorithms for bandit problems are applicable to materials discovery. However, the objectives of the conventional bandit problem differ from those of materials discovery. The conventional bandit problem aims to maximize the total rewards, whereas materials discovery aims to achieve breakthroughs in material properties. The max K-armed bandit (MKB) problem, which aims to acquire the single best reward, matches discovery tasks better than the conventional bandit. Thus, we propose a search algorithm for materials discovery based on the MKB problem, using a pseudo-value of the upper confidence bound of the expected improvement of the best reward. This approach is pseudo-guaranteed to be an asymptotic oracle that does not depend on the time horizon. In addition, compared with other MKB algorithms, the proposed algorithm has only one hyperparameter, which is advantageous in materials discovery. We applied the proposed algorithm to synthetic problems and molecular-design demonstrations using a Monte Carlo tree search. According to the results, the proposed algorithm stably outperformed other bandit algorithms in the late stage of the search process, where the optimal arm of the MKB could not be determined from its expected reward alone.  ( 2 min )
    Huber-energy measure quantization. (arXiv:2212.08162v1 [stat.ML])
    We describe a measure quantization procedure, i.e., an algorithm which finds the best approximation of a target probability law (and more generally a signed finite variation measure) by a sum of Q Dirac masses (Q being the quantization parameter). The procedure is implemented by minimizing the statistical distance between the original measure and its quantized version; the distance is built from a negative definite kernel and, if necessary, can be computed on the fly and fed to a stochastic optimization algorithm (such as SGD, Adam, ...). We investigate theoretically the fundamental question of the existence of the optimal measure quantizer and identify the kernel properties required to guarantee suitable behavior. We test the procedure, called HEMQ, on several databases: multi-dimensional Gaussian mixtures, Wiener space cubature, Italian wine cultivars and the MNIST image database. The results indicate that the HEMQ algorithm is robust and versatile and, for the class of Huber-energy kernels, matches the expected intuitive behavior.  ( 2 min )
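    As a rough illustration of the idea (a minimal sketch of our own, not the authors' HEMQ code): place Q trainable Dirac masses and move them with Adam to shrink a sample estimate of the energy distance, i.e. the kernel distance built from the energy kernel -|x - y|, a member of the Huber-energy family.

        import torch

        Q = 16                                   # number of Dirac masses
        target = torch.randn(2048, 1)            # samples from the target law
        points = torch.randn(Q, 1, requires_grad=True)
        opt = torch.optim.Adam([points], lr=1e-2)

        def energy_distance(x, y):
            # 2 E|X-Y| - E|X-X'| - E|Y-Y'|, estimated from pairwise distances
            return (2 * torch.cdist(x, y).mean()
                    - torch.cdist(x, x).mean()
                    - torch.cdist(y, y).mean())

        for step in range(2000):
            opt.zero_grad()
            loss = energy_distance(points, target)
            loss.backward()
            opt.step()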
    Penalised regression with multiple sources of prior effects. (arXiv:2212.08581v1 [stat.ME])
    In many high-dimensional prediction or classification tasks, complementary data on the features are available, e.g. prior biological knowledge on (epi)genetic markers. Here we consider tasks with numerical prior information that provide an insight into the importance (weight) and the direction (sign) of the feature effects, e.g. regression coefficients from previous studies. We propose an approach for integrating multiple sources of such prior information into penalised regression. If suitable co-data are available, this improves the predictive performance, as shown by simulation and application. The proposed method is implemented in the R package `transreg' (https://github.com/lcsb-bds/transreg).  ( 2 min )
    Text-to-speech synthesis based on latent variable conversion using diffusion probabilistic model and variational autoencoder. (arXiv:2212.08329v1 [eess.AS])
    Text-to-speech synthesis (TTS) is the task of converting text into speech. Two of the factors that have been driving TTS are the advancement of probabilistic models and latent representation learning. We propose a TTS method based on latent variable conversion using a diffusion probabilistic model and a variational autoencoder (VAE). In our TTS method, we use a waveform model based on the VAE, a diffusion model that predicts the distribution of latent variables in the waveform model from text, and an alignment model that learns alignments between the text and speech latent sequences. Our method integrates diffusion with the VAE by modeling both mean and variance parameters with diffusion, where the target distribution is determined by approximation from the VAE. This latent variable conversion framework potentially enables us to flexibly incorporate various latent feature extractors. Our experiments show that our method is robust to linguistic labels with poor orthography and alignment errors.  ( 2 min )
    A Sieve Quasi-likelihood Ratio Test for Neural Networks with Applications to Genetic Association Studies. (arXiv:2212.08255v1 [stat.ML])
    Neural networks (NN) play a central role in modern artificial intelligence (AI) technology and have been successfully used in areas such as natural language processing and image recognition. While the majority of NN applications focus on prediction and classification, there is increasing interest in studying the statistical inference of neural networks. Studying NN statistical inference can enhance our understanding of NN statistical properties. Moreover, it can facilitate NN-based hypothesis testing, which can be applied to hypothesis-driven clinical and biomedical research. In this paper, we propose a sieve quasi-likelihood ratio test based on an NN with one hidden layer for testing complex associations. The test statistic has an asymptotic chi-squared distribution, and is therefore computationally efficient and easy to implement in real data analysis. The validity of the asymptotic distribution is investigated via simulations. Finally, we demonstrate the use of the proposed test by performing a genetic association analysis of the sequencing data from the Alzheimer's Disease Neuroimaging Initiative (ADNI).  ( 2 min )
    Nested Gradient Codes for Straggler Mitigation in Distributed Machine Learning. (arXiv:2212.08580v1 [cs.IT])
    We consider distributed learning in the presence of slow and unresponsive worker nodes, referred to as stragglers. In order to mitigate the effect of stragglers, gradient coding redundantly assigns partial computations to the workers such that the overall result can be recovered from only the non-straggling workers. Gradient codes are designed to tolerate a fixed number of stragglers. Since the number of stragglers in practice is random and unknown a priori, tolerating a fixed number of stragglers can yield a sub-optimal computation load and can result in higher latency. We propose a gradient coding scheme that can tolerate a flexible number of stragglers by carefully concatenating gradient codes for different straggler tolerances. By proper task scheduling and small additional signaling, our scheme adapts the computation load of the workers to the actual number of stragglers. We analyze the latency of our proposed scheme and show that it has a significantly lower latency than gradient codes.  ( 2 min )
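    To make the fixed-tolerance baseline concrete (a toy sketch of the classic fractional-repetition gradient code, not the paper's nested construction): with n workers and at most s stragglers, split the workers into groups of s+1, assign every worker in group j the same data partition j, and recover the full gradient from one reply per group.

        import numpy as np

        n, s = 6, 2                     # 6 workers, tolerate up to 2 stragglers
        k = n // (s + 1)                # number of data partitions (here 2)
        g = np.random.randn(k, 4)       # partial gradient of each partition
        alive = {1, 2, 4, 5}            # workers 0 and 3 straggle

        recovered = np.zeros(4)
        for j in range(k):
            group = range(j * (s + 1), (j + 1) * (s + 1))
            assert any(w in alive for w in group)  # holds for any <= s stragglers
            recovered += g[j]           # one surviving copy of partition j suffices
        assert np.allclose(recovered, g.sum(axis=0))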
    Capturing Label Characteristics in VAEs. (arXiv:2006.10102v3 [cs.LG] UPDATED)
    We present a principled approach to incorporating labels in VAEs that captures the rich characteristic information associated with those labels. While prior work has typically conflated these by learning latent variables that directly correspond to label values, we argue this is contrary to the intended effect of supervision in VAEs: capturing rich label characteristics with the latents. For example, we may want to capture the characteristics of a face that make it look young, rather than just the age of the person. To this end, we develop the CCVAE, a novel VAE model and concomitant variational objective which captures label characteristics explicitly in the latent space, eschewing direct correspondences between label values and latents. Through judicious structuring of mappings between such characteristic latents and labels, we show that the CCVAE can effectively learn meaningful representations of the characteristics of interest across a variety of supervision schemes. In particular, we show that the CCVAE allows for more effective and more general interventions to be performed, such as smooth traversals within the characteristics for a given label, diverse conditional generation, and transferring characteristics across datapoints.

  • Open

    [Project] the language of modifications : I collected and analyzed how humans use language to modify existing drawings at Neurips 2022. (n=60)
    At big conferences it is fun to get some human labels, because it is super high quality and high density. At NeurIPS 2022 I wanted to study how humans use language to "fix" or "correct" an existing artifact. The current big models, such as stable diffusion, generate from descriptions -- you cannot take an output image and describe precisely how you want it changed to improve it. To quote from the blog post: Imagine describing a task for your friend to perform. It is unlikely they’ll get it right on the first try. Often, additional communications are needed to modify and improve what is being done so far. At NeurIPS 2022, I conducted a small study to get a sense of the following: Q1: How valuable is the modification process? Q2: Are the languages of modification and description different? Check out the blog (5 min read) for the full report: https://evanthebouncy.medium.com/the-language-of-modifications-17fac974c1ef TL;DR: We find that modification is both valuable and distinct from descriptive language. have a good one! --evan submitted by /u/evanthebouncy [link] [comments]  ( 66 min )
    [P] Open source discord bot for Stable Diffusion
    Hey everyone - I made a discord bot awhile ago (now on ~450 servers you can get it here) to run Stable diffusion inference in a discord server and I've open sourced all of the code to run it on GitHub: https://github.com/mystic-ai/pipeline/tree/main/examples/apps/stable_diffusion_discord_bot It has everything needed to run it including some Dockerfiles + docker-compose.yml to run it out of the box once you get your auth tokens from discord. Inference takes normally around 2-3s per 512x512 image submitted by /u/paulcjh [link] [comments]  ( 65 min )
    [D] Is this the right way to perform PCA on the test set?
    I would have thought that, in order to avoid any train/test contamination, we need to take the parameters from the training set PCA and use them to transform the test set, as opposed to just combining train and test and performing PCA at once. Does anyone have any idea how this is done in sklearn? I found this post which offers a solution; could anyone perhaps confirm that this is the correct way to do this? Cheers. submitted by /u/Steve_Sizzou [link] [comments]  ( 65 min )
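    For reference, a minimal sklearn sketch of exactly that recipe (the data here is random placeholder input): fit PCA on the training set only, then reuse the fitted components to transform the test set.

        import numpy as np
        from sklearn.decomposition import PCA

        rng = np.random.default_rng(0)
        X_train = rng.normal(size=(100, 20))
        X_test = rng.normal(size=(30, 20))

        pca = PCA(n_components=10)
        X_train_red = pca.fit_transform(X_train)  # learn components on train only
        X_test_red = pca.transform(X_test)        # project test with those components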
    [R][D] a new community on Machine Learning based Early Decision Making
    Hello :-) ML-EDM is a new field of research that consists in optimizing the decision moments of a Machine Learning model that observes data collected over time. Here is a community dedicated to this topic: r/EarlyMachineLearning Do not hesitate to subscribe, we will add content soon, such as scientific articles, instructional videos, and later a python library and tutorials ... see you soon! submitted by /u/ML-EDM [link] [comments]  ( 66 min )
    [R] The alignment problem from a deep learning perspective
    submitted by /u/hardmaru [link] [comments]  ( 65 min )
    [D] We’re Brian Retford, Jason Morton, and Ryan Cao, various researchers and developers in the ZKML (zero knowledge machine learning) space and we’ve been asked by r/privacy mods to help explain and answer questions about ZKML and why it’s important for the future of data privacy! AMA
    submitted by /u/carrotcypher [link] [comments]  ( 75 min )
    [D] AAMAS 2023 Review Discussion
    Did not find a thread to discuss AAMAS reviews, so made this. submitted by /u/NoOne3051 [link] [comments]  ( 65 min )
  • Open

    Is there a ChatGPT-similar AI which is better suited for research or more current information?
    ChatGPT is remarkable, and I’m interested in learning more about how I can use it within my organization. Is there a similar product or service which can reference more up-to-date databases? submitted by /u/pmmechoccymilk [link] [comments]  ( 47 min )
    I currated the most clicked AI apps/resources from my newsletter
    submitted by /u/anitakirkovska [link] [comments]  ( 46 min )
    This is how Decentralised Organizations are using Artificial Intelligence
    https://nounsummaries.substack.com/p/meet-the-nouns-dao-ai-noc-lite-at submitted by /u/Environmental_Fly691 [link] [comments]  ( 46 min )
    Top universities snap into action over AI cheating
    oooohhhh the irony, Professors be like we are not scared of AI, also Universities proceed to begin placing measures to stop AI cheating, lol Group of Eight (Go8), which comprises the University of Sydney, UNSW, Monash, UniMelb, UWA, ANU, the University of Queensland, and the University of Adelaide, is "proactively tackling the emergence of AI" through redesigning assessments and using new targeted detection strategies, Chief Executive Vicki Thomson said. "Our universities have revised how they will run assessments in 2023," she said. This includes "Live+" exams for offshore and online students, in which a supervisor monitors their computer screen throughout, more in-person supervision for local students, and greater use of pen and paper exams and tests. They will also limit the use of "R…  ( 49 min )
    AI Dream 116 - Incredible Molten Rainbows Vivid Flight by AI
    submitted by /u/LordPewPew777 [link] [comments]  ( 44 min )
    Ai Created a Cinematic Universe for Us
    submitted by /u/GFWaltz [link] [comments]  ( 45 min )
    instagram automation with AI
    Hey guys, Maybe you have a solution for that: I'm searching a program which automatically 1. downloads instagram reels from a folder 2. uploads them again every 3-6 hours to my acc Pls let me know if there is a cheap way to do this submitted by /u/timslck [link] [comments]  ( 48 min )
    What is the “forward-forward” algorithm, Geoffrey Hinton’s new AI technique?
    submitted by /u/bendee983 [link] [comments]  ( 46 min )
    When and How to use AI Solutions in your Product?
    submitted by /u/softcrater [link] [comments]  ( 48 min )
    DALL-E 2 vs The Real World! A game to guess if the image is real or AI generated
    I built this project last weekend. I was impressed by the realism of the generated images from DALL-E, so I created this small game where players guess if the displayed image is real or AI generated. This is my first ever public project, so I appreciate any feedback :) Link: https://real-or-fake-the-ai-game.onrender.com/ submitted by /u/ntsrys [link] [comments]  ( 47 min )
    The largest database for AI
    Here is the largest database for AI that has over 1,200 AI-related resources including tools, books, courses and a lot more. It has been created to help everyone with their search for AI resources so that they can get every available tool, community, course or anything AI at one place. Check it out: https://www.creaitives.com/ submitted by /u/Creaitives [link] [comments]  ( 48 min )
    ChatGPT - A Free Trial of The Future
    submitted by /u/arnolds112 [link] [comments]  ( 48 min )
    How to make a text based on other text?
    Hi, I would like to make an AI-written art critique. What should I use for this idea? I have tried using TensorFlow but it didn’t work. submitted by /u/NolanDeC [link] [comments]  ( 48 min )
    9 Things You Need to Know About Chat Gpt: Chatbots that Provide Answers
    submitted by /u/Techugoltd [link] [comments]  ( 49 min )
    Stable Diffusion Weekly AI Art Images 12.19.22
    submitted by /u/prfitofthesngularity [link] [comments]  ( 45 min )
    Loading a custom plaintext dataset for additional Pre-Processing of a RoBERTa like model, used for text generation on a Pre-Trained model
    Hello, I am trying to train an existing RoBERTa model variant in my own language (Czech) to generate text based on a prompt. I have a dataset of about 70 books prepared and formatted in plain text, but now I am faced with the issue of loading the data for additional training of the already pretrained model (the model I am using is called robeczech-base). When I tried using

        dataset = LineByLineTextDataset(
            tokenizer=tokenizer,
            file_path="./dataset.txt",
            block_size=512,
        )

    I got an error with the index being out of range, which, after some research, seems to indicate that my dataset has a vocabulary size out of the range of the one the model was trained on. This leads me to believe that my method of loading the single plaintext file into a dataset might be incorrect. I wasn't able to find many tutorials on this subject, since most assume the use of a labeled dataset, or multiple datasets for training and testing, which don't seem to apply to my situation, since I want the model to generate text similar to the ones in my dataset, not classify it in some way for question answering (the dataset contains mostly classic fantasy literature in my native language). Do you have any idea what I am doing wrong? Is my approach completely incorrect? Is a RoBERTa-like model even suited for this task? Do you have any resources pertaining to long-form text generation (aka short stories), tutorials and guides, that I could use? I have been stuck for about a week, so any help would be much appreciated :) submitted by /u/Maty_the_Red [link] [comments]  ( 51 min )
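    One common way to load a single plaintext file is the datasets library together with the checkpoint's own tokenizer; a hedged sketch follows (the hub id ufal/robeczech-base and the file path are assumptions based on the post). Out-of-range index errors during embedding lookup typically mean the token ids come from a tokenizer whose vocabulary does not match the model, so reusing the checkpoint's tokenizer is the first thing to check.

        from datasets import load_dataset
        from transformers import AutoTokenizer

        tokenizer = AutoTokenizer.from_pretrained("ufal/robeczech-base")
        raw = load_dataset("text", data_files={"train": "dataset.txt"})

        def tokenize(batch):
            # ids produced here are guaranteed to lie inside this model's vocabulary
            return tokenizer(batch["text"], truncation=True, max_length=512)

        tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])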
    Which hardware company will be the most essential for AI development? Intel, AMD or Nvidia?
    submitted by /u/piranha_studio [link] [comments]  ( 48 min )
    Looking for ai / tool to enhance image from camera feed
    Hello there, a thief stole something from a home that is still under construction; a camera on the street picked him up, but the quality isn't good. I'm looking for an AI/tool/script/algorithm/whatever to enhance and zoom in to get his face details. Online tools where you upload a file for processing are not professional and don't do any good in my case. Thank you for your help. submitted by /u/abdo_shahba [link] [comments]  ( 48 min )
    Sam Altman, OpenAI CEO explains the 'Alignment Problem'
    submitted by /u/Microsis [link] [comments]  ( 53 min )
    I asked AI to make a Music Video… the results are trippy
    submitted by /u/Prior_Appearance_44 [link] [comments]  ( 47 min )
    Uploading research papers to AI
    Every 6 months or so I test AI to see how much it has learned about sandponics/iAVs. The last 6 months have seen a big improvement, but unfortunately there are still a LOT of child-level mistakes. If I ask what sandponics is, I get an average answer. 12 months ago it wouldn't have a clue... it is getting better, but when you ask for deeper info it fails majorly. My question is, can I take research papers and upload them to AI, and then get some useful info out of it? I don't mean a summary or alternative text; is AI ready to understand, or is it still too young? It seems like it can be 'tricked': by asking the same question in different ways, AI will provide different answers. submitted by /u/djdefenda [link] [comments]  ( 49 min )
    IARPA Wants AI to 'Identify Overlooked Info, Auto-Generate Comments' for Intel Reporting
    submitted by /u/Publicize [link] [comments]  ( 49 min )
  • Open

    Question about HalfCheetah observation space
    I am currently running some experiments on the HalfCheetah environment and had a question about the environment's observation space. I found a description of the observation space here but I was wondering if someone could clarify whether the "angles" described in the observation space are angles relative to the other joints or absolute angles relative to the world axis. Thanks! submitted by /u/superkaiba [link] [comments]  ( 54 min )
    Let’s learn about Deep Q-Learning by training our agent to play Space Invaders (Deep Reinforcement Learning Free Course by Hugging Face 🤗)
    Hey there! I’m happy to announce that we just published the third Unit of the Deep Reinforcement Learning Course 🥳 In this Unit, you'll learn about Deep Q-Learning and train a DQN agent to play Atari games using RL-Baselines3-Zoo 🔥 After that, you’re going to learn about Optuna, a hyperparameter search library. You’ll be able to compare the results of your agent using the leaderboard 🏆 The Deep Q-Learning chapter 👉 https://huggingface.co/deep-rl-course/unit3/introduction The leaderboard 👉 https://huggingface.co/spaces/chrisjay/Deep-Reinforcement-Learning-Leaderboard https://preview.redd.it/hbr73gbcpv6a1.jpg?width=1920&format=pjpg&auto=webp&s=42a63276bc544aa547275da663ef1b342505d510 If you didn’t sign up yet, don’t worry. There’s still time, we wrote an introduction unit to help you get started. You can start learning now 👉 https://huggingface.co/deep-rl-course/unit0/introduction If you have questions or feedback I would love to answer them. submitted by /u/cranthir_ [link] [comments]  ( 57 min )
    Question about designing the reward function
    Hi, assuming the task is about reaching a goal position (x,y,z) with a robot with 3 dof (q1, q2, q3). The condition for this task is that q1 cannot be used together with q2 and q3. In other words, if q1 > 0 then q2 and q3 must be 0, and vice versa. Currently, the reward is described as follows: reward = norm(goal_pos - current_pos) + abs(action_q1 - max(action_q2, action_q3)) / (action_q1 + max(action_q2, action_q3)). But the agent only tries to use q2 and q3 while suppressing the use of q1. The goal positions can sometimes be reached; here, the agent utilizes q2 and q3 only, although I see that by using q1 as well the goal position could be reached more easily. In other cases, the rule of using q1 separately is not kept, so that action_q1 > 0 and max(action_q2, action_q3) > 0 at the same time. How could one reformulate this reward function, either with action masking or to encourage more efficient use of q1? submitted by /u/Fun-Moose-3841 [link] [comments]  ( 56 min )
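    One hedged way to express the exclusivity constraint (my own sketch, not a verified fix): note that the ratio term above is already maximized by never using q1 at all, which matches the observed behavior. A penalty on the product of the two magnitudes is zero exactly when the rule is respected and does not favor either joint group.

        import numpy as np

        def reward(goal_pos, current_pos, a_q1, a_q2, a_q3, lam=1.0):
            dist = np.linalg.norm(goal_pos - current_pos)
            overlap = abs(a_q1) * max(abs(a_q2), abs(a_q3))  # > 0 only if both groups used
            return -dist - lam * overlap

    Alternatively, action masking sidesteps reward shaping entirely: at each step, zero out (or mask the policy outputs of) whichever joint group is currently inactive.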
    RL with Growing Action Space
    Hey, I am looking for Reinforcement Learning projects or papers dealing with action spaces where the action space is not statically defined before training but grows over the episode. Let me make an example: An agent is in an environment and needs to interact with a number of objects in a specific order to get a reward. The classic example would be "first, find the key, then go to the door to unlock it", but with more objects. One could now make a list of all the objects that the agent has already discovered and take this list of objects as the action space. In other words, the RL model is supposed to pick the object that promises the most reward in the long run. You could phrase that as a growing action space, or also say that the action space is part of the state space. Does anyone know of other works that deal with such a scenario? submitted by /u/Nescyo [link] [comments]  ( 58 min )
  • Open

    Automatically retrain neural networks with Renate
    Today we announce the general availability of Renate, an open-source Python library for automatic model retraining. The library provides continual learning algorithms able to incrementally train a neural network as more data becomes available. By open-sourcing Renate, we would like to create a venue where practitioners working on real-world machine learning systems and researchers interested […]  ( 6 min )
    Create Amazon SageMaker models using the PyTorch Model Zoo
    Deploying high-quality, trained machine learning (ML) models to perform either batch or real-time inference is a critical piece of bringing value to customers. However, the ML experimentation process can be tedious—there are a lot of approaches requiring a significant amount of time to implement. That’s why pre-trained ML models like the ones provided in the PyTorch […]  ( 10 min )
  • Open

    Research @ Microsoft 2022: A look back at a year of accelerating progress in AI
    2022 has seen remarkable progress in foundational technologies that have helped to advance human knowledge and create new possibilities to address some of society’s most challenging problems. Significant advances in AI have also enabled Microsoft to bring new capabilities to customers through our products and services, including GitHub Copilot, an AI pair programmer capable of turning natural language prompts into code, and a preview of Microsoft Designer, a graphic design app that supports the creation of social media posts, invitations, posters, and one-of-a-kind images. The post Research @ Microsoft 2022: A look back at a year of accelerating progress in AI appeared first on Microsoft Research.  ( 18 min )
  • Open

    Top 5 Edge AI Trends to Watch in 2023
    With the state of the world under constant flux in 2022, some technology trends were put on hold while others were accelerated. Supply chain challenges, labor shortages and economic uncertainty had companies reevaluating their budgets for new technology. For many organizations, AI is viewed as the solution to a lot of the uncertainty bringing improved Read article > The post Top 5 Edge AI Trends to Watch in 2023 appeared first on NVIDIA Blog.  ( 7 min )
  • Open

    Defining a model for neural networks
    Hi, I am not sure if this question belongs here, but anyway: I want to define a model for a neural network in pytorch that takes two inputs, namely a document and a query, and outputs the relevance between these two inputs. The queries and documents have such a format: Format of queries and documents Additionally, there are also relevance indicators, which should be the results to which I compare my predictions. They have this format: 1 indicating relevant, 0 indicating irrelevant Does someone know how you would design a model for this given problem, or have some input on my problem? submitted by /u/serious153 [link] [comments]  ( 58 min )
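    A hedged sketch of one common design (a two-tower encoder with a scoring head; the sizes and the bag-of-tokens encoder are placeholder choices, not a recommendation):

        import torch
        import torch.nn as nn

        class RelevanceModel(nn.Module):
            def __init__(self, vocab_size=10000, dim=128):
                super().__init__()
                self.embed = nn.EmbeddingBag(vocab_size, dim)  # encodes a bag of token ids
                self.score = nn.Sequential(
                    nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

            def forward(self, doc_tokens, query_tokens):
                d, q = self.embed(doc_tokens), self.embed(query_tokens)
                return torch.sigmoid(self.score(torch.cat([d, q], dim=-1)))

        model = RelevanceModel()
        doc = torch.randint(0, 10000, (4, 50))    # batch of 4 docs, 50 token ids each
        query = torch.randint(0, 10000, (4, 8))   # batch of 4 queries
        rel = model(doc, query)                   # in (0, 1); train with BCE on 0/1 labels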
    I asked AI to make a Music Video… the results are trippy
    submitted by /u/Prior_Appearance_44 [link] [comments]  ( 49 min )
  • Open

    Polynomial approximations to sine
    Taylor polynomials are terrific local approximations but poor global approximations. Taylor polynomials are optimal in some sense near their center, but are seldom the best choice over a large interval. This post will look at approximating sin(πx) over [-1, 1] with fifth degree polynomials. First, this plot compares the approximation error for a fifth order […] Polynomial approximations to sine first appeared on John D. Cook.  ( 4 min )
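    A quick numeric illustration of the gap the post describes (our own snippet, not the post's code): on [-1, 1], the degree-5 Taylor polynomial of sin(πx) is excellent near 0 but poor at the endpoints, while a degree-5 Chebyshev least-squares fit spreads the error over the interval.

        import numpy as np

        x = np.linspace(-1, 1, 2001)
        f = np.sin(np.pi * x)

        taylor = np.pi * x - (np.pi * x) ** 3 / 6 + (np.pi * x) ** 5 / 120
        cheb = np.polynomial.Chebyshev.fit(x, f, deg=5)(x)

        print(np.max(np.abs(f - taylor)))  # about 0.52, all of it near x = +-1
        print(np.max(np.abs(f - cheb)))    # far smaller, spread over the interval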
  • Open

    Leveraging Heteroscedastic Uncertainty in Learning Complex Spectral Mapping for Single-channel Speech Enhancement. (arXiv:2211.08624v2 [cs.SD] UPDATED)
    Most speech enhancement (SE) models learn a point estimate and do not make use of uncertainty estimation in the learning process. In this paper, we show that modeling heteroscedastic uncertainty by minimizing a multivariate Gaussian negative log-likelihood (NLL) improves SE performance at no extra cost. During training, our approach augments a model learning complex spectral mapping with a temporary submodel to predict the covariance of the enhancement error at each time-frequency bin. Due to unrestricted heteroscedastic uncertainty, the covariance introduces an undersampling effect, detrimental to SE performance. To mitigate undersampling, our approach inflates the uncertainty lower bound and weights each loss component with its uncertainty, effectively compensating severely undersampled components with larger penalties. Our multivariate setting reveals common covariance assumptions such as scalar and diagonal matrices. By weakening these assumptions, we show that the NLL achieves superior performance compared to popular losses including the mean squared error (MSE), mean absolute error (MAE), and scale-invariant signal-to-distortion ratio (SI-SDR).  ( 2 min )
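    For intuition, the diagonal-covariance special case of such a loss is compact (a minimal sketch, not the paper's full multivariate NLL with an inflated lower bound):

        import torch

        def hetero_nll(mean, logvar, target):
            # per-bin Gaussian NLL; predicting log-variance keeps the variance positive
            return 0.5 * (logvar + (target - mean) ** 2 / logvar.exp()).mean()

        mean = torch.randn(8, 257, 100, requires_grad=True)    # batch x freq x time
        logvar = torch.zeros(8, 257, 100, requires_grad=True)
        clean = torch.randn(8, 257, 100)
        hetero_nll(mean, logvar, clean).backward()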

  • Open

    [D] Will there be a replacement for Machine Learning Twitter?
    It seems like a lot of prominent ML researchers are pretty active on Twitter, and that it's a decent place to hear about new research and promote your own stuff. But, the inmates appear to have taken over the asylum over there. Will there be, for example, an ML Mastodon instance? submitted by /u/MrAcurite [link] [comments]  ( 61 min )
    [D] Usage of GPU profiling tools
    What scenarios have you used GPU profiling tools(Ex. NVIDIA Nsight) to improve model inference speed? Is the primary use case when models are deployed as part of a more complex pipeline? Or can it be useful for improving stand-alone models? Finally, are there any references/books/tutorials that you'd recommend to learn more about the tech/tools/practices for model inference? Definitely leaning more towards practical/engineering aspects rather than model-centric (Eg. Quantization, Pruning). submitted by /u/answersareallyouneed [link] [comments]  ( 72 min )
    [D] Resources to learn and fully understand Diffusion Model Codes
    There are several resources on how to build a GAN from scratch. But with diffusion models, the DDPM paper code and especially the Improved DDPM paper code are hard and complicated to understand. I'm currently doing research on diffusion models. I understand the math very well from the paper "Understanding Diffusion Models: A Unified Perspective". It gives a very intuitive, step-by-step guide to the mathematics and intuition behind diffusion models. I loved the paper. But from that paper alone, I don't have the necessary skill to reproduce the code; it's too big of a project for me to do on my own from scratch. So I wanted help from this community: pointers to articles, GitHub repos or YouTube videos that give a step-by-step guide to building research-level code for diffusion models from scratch. The closest I found was the annotated diffusion model from the Huggingface community, but that was very basic, and when I wanted to reproduce their own PyTorch repo on GitHub (annotated diffusion is the name of the blog, but they have their own research-level diffusion model repo in PyTorch), that was too heavy for me. Other guides would also be helpful on how I can start from scratch, build models, and eventually reach a stage where I can reproduce results like those in the Improved DDPM paper. I have the hardware resources in my laboratory, but no one in my lab has done any projects on diffusion models, so I'm the first one. That's why I had to find resources on my own. It will be a big help for me if I can find some help from this community. Thank you very much in advance. submitted by /u/Itachi_99 [link] [comments]  ( 65 min )
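    For anyone in the same position: the core DDPM training step is small even though full research repos are large. A minimal sketch of the epsilon-prediction objective (the model here is a placeholder; a real one is a time-conditioned U-Net):

        import torch

        T = 1000
        betas = torch.linspace(1e-4, 0.02, T)
        alphas_bar = torch.cumprod(1.0 - betas, dim=0)
        model = lambda x, t: x  # placeholder for a time-conditioned U-Net

        def ddpm_loss(x0):
            t = torch.randint(0, T, (x0.shape[0],))
            eps = torch.randn_like(x0)
            ab = alphas_bar[t].view(-1, 1, 1, 1)
            x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps  # closed-form forward noising
            return ((model(x_t, t) - eps) ** 2).mean()    # learn to predict the noise

        loss = ddpm_loss(torch.randn(8, 3, 32, 32))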
    [R] The Infinite Index: Information Retrieval on Generative Text-To-Image Models
    Hi all, in our recent paper we cast text-to-image generation as a retrieval task, thereby connecting text-to-image models to information retrieval. An essential part of our paper is a case study on game artwork search using Stable Diffusion in which we demonstrate the challenges of prompt engineering. We are curious to hear your feedback! arXiv: https://arxiv.org/pdf/2212.07476.pdf Twitter thread: https://twitter.com/webis_de/status/1604469981043134465 Abstract: The text-to-image model Stable Diffusion has recently become very popular. Only weeks after its open source release, millions are experimenting with image generation. This is due to its ease of use, since all it takes is a brief description of the desired image to “prompt” the generative model. Rarely do the images generated…  ( 64 min )
    [P] Generate transcripts with Whisper AI and automatically translate with LibreTranslate
    Hey guys wanted to show you my app which offers a convenient frontend to use Whisper for transcriptions with Libretranslate to power automatic translations Code is all open-source here: https://github.com/mayeaux/generate-subtitles Also running an instance that you can use for free at https://freesubtitles.ai submitted by /u/meddit_app [link] [comments]  ( 65 min )
    MAML compatible with GAN's? [D]
    There are two recent works that I came across, both of which try to use MAML for generative and image translation tasks respectively. To my eye there are various potential problems pursuing this path, I believe that both papers have been accepted as workshop/conference papers. I took CS 330 this autumn and I have been analyzing these papers as part of my final project. Here are some issues that are worrying me. Let's start with the first paper, Meta-GAN for Few-Shot Image Generation | OpenReview . The first reviewers seem to miss that he did not actually train with just one of (0-8) he trained with all except 9. But that is a relatively minor issue. Here are the three major issues. 1, How does this not suffer from serious memorization issues? MAML does poorly when a single function can so…  ( 66 min )
    [D] Simple Questions Thread
    Please post your questions here instead of creating a new thread. Encourage others who create new posts for questions to post here instead! Thread will stay alive until next one so keep posting after the date in the title. Thanks to everyone for answering questions in the previous thread! submitted by /u/AutoModerator [link] [comments]  ( 62 min )
    [N] Neural Rendering: Reconstruct your city in 3D using only your mobile phone and CitySynth!
    submitted by /u/ydrive-ai [link] [comments]  ( 66 min )
    [D] Is there any good resource to learn about sports analytics
    Hi Everyone, I have good knowledge on standard machine learning models and techniques and have some knowledge of deep learning. I want to tackle the field of sports analytics. Is there any platform for good resources and problems ? submitted by /u/sidney_lumet [link] [comments]  ( 67 min )
    Classic 'Quantum' Transformers? [R]
    Is this possible? Train transformers on a quantum computer to model them for classical computing purposes such as running quantum cross validated regression locally? https://discuss.huggingface.co/t/quantum-transformer/28044 submitted by /u/Thistleknot [link] [comments]  ( 61 min )
    [R][P] Riffusion, music generation with stable diffusion with Gradio Demo
    submitted by /u/Illustrious_Row_9971 [link] [comments]  ( 62 min )
    [P] Unsupervised learning project for social media
    I’ve been toying with the idea of building an unsupervised model to find (demographic and other) patterns between someone’s followers on Instagram/TikTok. Aside from jumping over hurdles to actually acquire training data, does anyone have any relevant experience/insights about this? (ie the best model framework to use, the best data to train the model on to find salient patterns, etc) submitted by /u/iamnotavisionary [link] [comments]  ( 66 min )
    [D] Product assortment/range optimisation for stores
    Hi guys, From your experience, what models would you consider for product assortment/range optimisation for FMCG retail sector ( i.e getting the right product range in the right store/ product placement). I’ve seen some use XGBoost for product placement, but curious to know what other models work well. I know we have to take product substitution into consideration so Customer Decision Tree Model is already taken care of. submitted by /u/Maria_Adel [link] [comments]  ( 69 min )
  • Open

    AI-Generated and Illustrated Sci-Fi Novel - What do you think? Download it and see for yourself!
    Hi everyone. Have you ever read a book written 100% by AI? Now you got the chance. I started to work on my Sci-Fi novel using Artificial Intelligence. Together, we wrote an exciting book and made fantastic illustrations for it. It's called Beyond The Horizon, it takes place in the year 2120. The main character, Jenna, a young astronaut is sent on a mission to explore a new planet. Here is the link for the first chapter (pdf version): https://drive.google.com/file/d/1eaY3EOVW4LucsEfcNvugknxbkti9xaqD/view?usp=share_link Let me know what you think! Thanks, Tom submitted by /u/Character-Bison7282 [link] [comments]  ( 48 min )
    AI Dream 112 - AI created this EPIC ART Animation.
    submitted by /u/LordPewPew777 [link] [comments]  ( 45 min )
    A.I. Submission games - The interrogation
    submitted by /u/goronmask [link] [comments]  ( 45 min )
    A self-driving lab (SDL) at the University of Toronto has discovered organic lasers with state-of-the-art performance⁠ — and it only took 2 days. SDLs work by combining artificial intelligence, automation, and advanced computing to discover new things
    submitted by /u/Aerothermal [link] [comments]  ( 48 min )
    I asked ChatGPT to write a Twitter hook. This was the result:
    submitted by /u/TheVellerShow [link] [comments]  ( 46 min )
    Why censoring the OpenAI/ ChatGPT is useless: It's ''PC'' on surface but you can actually force/submit it into bypassing it's restrictions
    submitted by /u/exoboy1993 [link] [comments]  ( 48 min )
    How to clean artifacts with AI from old animations?
    Hello, I grew up watching the Tom Sawyer movie in my native language. When searching for it I could not buy it in my native language and could only download it from an old forum. I decided to buy the DVD in English (I shared it for free btw https://archive.org/details/a-storybook-classic-tom-sawyer-2005) and then maybe put my native language's lines over it. The problem is, it's kinda dirty with artifacts. Is there any way to clean it using AI? Thanks in advance :) submitted by /u/suicidal_boy_ [link] [comments]  ( 48 min )
    Xmas gift book rec - game theory-related ML/AI book
    Hello AI Reddit, I am writing to ask for help Christmas shopping! : ) My brother has a masters in ML and works at an AI company. He is working on a long-term project, and he explained to me that he is approaching the project through a game theory lens, while other AI teams working on the same project have taken a nat'l language processing-focused approach. The project involves solving a complex board game for real-life applications. I am looking for books that would offer him interesting/useful information about game theory in ML context. I am also open to any other AI/ML/comp sci book recs. He reads textbooks about chemistry and comp sci for pleasure, so this would NOT be a "super lame" gift for him. ​ Thank you! submitted by /u/userofreddit2021777 [link] [comments]  ( 47 min )
    Is AI the public doesn't know about already in use?
    I'm wondering about political candidates. Could they be using AI to find out what to say and not say, and to write speeches to improve their chances of getting elected? Is the military using AI to improve military scenarios and design more effective weapons? Is the Federal Reserve Bank using AI to make financial decisions? Are investment banks using AI to model stock and bond markets, to improve their performance? Are American military experts using AI to help Ukraine succeed on the battlefield? If so, how much of this is really secret, and how much is known to insiders but just catching the public eye? submitted by /u/jollybumpkin [link] [comments]  ( 50 min )
  • Open

    Showing the "good" values does not help the PPO algorithm?
    Hi, in the given environment (https://github.com/NVIDIA-Omniverse/IsaacGymEnvs/blob/main/isaacgymenvs/tasks/franka_cabinet.py), the task for the robot is to open a cabinet. The action values, which are the output of the agent, are the target velocity values for the robot's joints. To accelerate the learning, I manually controlled the robot, saved the corresponding joint velocity values in a separate file, and overwrote the action values from the agent with the recorded values (see below). In this way, I hoped the agent would learn which actions lead to the goal. However, after 100 epochs, when taking the actions from the agent again, I see that the agent has not learned anything. Am I missing something?

        def pre_physics_step(self, actions):
            if global_epoch < 100:
                # recorded_actions: values from manual control
                for i in range(len(recorded_actions)):
                    self.actions = recorded_actions[i]
            else:
                # actions: values from agent
                self.actions = actions.clone().to(self.device)
            targets = self.franka_dof_targets[:, :self.num_franka_dofs] \
                + self.franka_dof_speed_scales * self.dt * self.actions * self.action_scale
            self.franka_dof_targets[:, :self.num_franka_dofs] = tensor_clamp(
                targets, self.franka_dof_lower_limits, self.franka_dof_upper_limits)
            env_ids_int32 = torch.arange(self.num_envs, dtype=torch.int32, device=self.device)
            self.gym.set_dof_position_target_tensor(self.sim, gymtorch.unwrap_tensor(self.franka_dof_targets))

    submitted by /u/Fun-Moose-3841 [link] [comments]  ( 56 min )
  • Open

    Application and Benefits of Business Intelligence in Manufacturing
    Businesses collect a huge volume of data daily from various sources like ERMs, e-commerce platforms, supply chains, and many other internal and external sources. To make use of this data and make data-driven decisions, organizations need business intelligence (BI). What is business intelligence? It refers to a mix of business analytics, data mining,… Read More »Application and Benefits of Business Intelligence in Manufacturing The post Application and Benefits of Business Intelligence in Manufacturing appeared first on Data Science Central.  ( 20 min )
    The Technology Crutch and Crossing the Cultural Learning Chasm
    “The future is already here. It is just unevenly distributed.” – William Gibson The Boston Consulting Group (BCG) released research showing that while 94% of companies have big aspirations to deliver substantial impact via digital transformation, the majority of these digital transformations will fail.  Their research highlighted five challenges that organizations must address to successfully… Read More »The Technology Crutch and Crossing the Cultural Learning Chasm The post The Technology Crutch and Crossing the Cultural Learning Chasm appeared first on Data Science Central.  ( 21 min )
  • Open

    Euler product for sine
    Euler’s product formula for sine is \[ \sin(\pi x) = \pi x \prod_{n=1}^{\infty} \left(1 - \frac{x^2}{n^2}\right). \] To visualize the convergence of the infinite product, let’s look at the error in approximating sin(πx) with the Nth partial product of the infinite product, i.e. \[ \sin(\pi x) - \pi x \prod_{n=1}^{N} \left(1 - \frac{x^2}{n^2}\right). \] Here’s a plot of the partial products. We knew before making the plot that the error had to go to zero as N […] Euler product for sine first appeared on John D. Cook.  ( 4 min )
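    A quick numeric check of the partial products (our own snippet, not from the post):

        import numpy as np

        x = np.linspace(-1, 1, 5)
        for N in (5, 50, 500):
            n = np.arange(1, N + 1)
            partial = np.pi * x * np.prod(1 - x[:, None] ** 2 / n ** 2, axis=1)
            print(N, np.max(np.abs(np.sin(np.pi * x) - partial)))  # error shrinks with N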
  • Open

    Linear AR model
    Can someone please explain to me what is calculated here, and what the output means? submitted by /u/Relevant_Ideal_7014 [link] [comments]  ( 6 min )

  • Open

    [R] Foundation Model is not necessarily helping motor control that much?
    submitted by /u/XiaolongWang [link] [comments]  ( 61 min )
    ML exam question on Naive Bayes [Discussion]
    So i revently attended a ML exam where this question was revoked because of a mistanke in the exam-set. But i dont know Why? Can anybody tell me? submitted by /u/Convhay [link] [comments]  ( 68 min )
    [D] yolov7 not finding GPU device: Solution
    I just installed yolo but there was an issue finding my GPU. I was getting this error:

        AssertionError: Invalid CUDA '--device 0' requested, use '--device cpu' or pass valid CUDA device(s)

    I think the requirements.txt installed by pip is installing the CPU-only pytorch. The easy workaround is to just install the correct pytorch first. Note: I did this in a conda environment, so if you want to do that too then:

        conda create --name yolov7 python=3.9
        conda activate yolov7

    What you should do is install torch first, before the requirements.txt, and get the configuration you want @ https://pytorch.org/. Make sure to choose CUDA. I just selected the pip with the latest cuda but you can do the conda install if you want.

        git clone https://github.com/WongKinYiu/yolov7
        cd yolov7
        pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu117
        pip install -r requirements.txt

    Then when I used yolo it detected my gpu using --device 0:

        YOLOR v0.1-116-g8c0bf3f torch 1.13.1+cu117 CUDA:0 (NVIDIA GeForce RTX 2070, 8191.5625MB)

    If you already installed pytorch through the requirements.txt you need to uninstall it (if you don't want to start from scratch in a different virtual environment):

        pip uninstall torch torchvision

    submitted by /u/VincentFreeman_ [link] [comments]  ( 69 min )
    [P] Using Vertex AI to train a model that automatically turns YouTube videos into TikTok clips
    submitted by /u/Luken58 [link] [comments]  ( 65 min )
    [D] What is the status of JEPA?
    So a few months ago Yann LeCun introduced JEPA as potentially the new BigThingTM . It looked like it was full of good ideas, and I was wondering if anyone was working publicly on that? I would assume some people at Meta probably do under LeCun's direction but I am curious what other people think about it? I would especially love to hear from people who considered it and found some roadblocks :-) As I am wondering if it is worth spending a few days working on it on my side. submitted by /u/keepthepace [link] [comments]  ( 66 min )
    [D] What are the prompts that Lensa AI (and other apps) are using to create the AI avatars?
    We all know that they train dreambooth on the given 10-20 images, but what prompts are they using to generate the images after they train the model? submitted by /u/JClub [link] [comments]  ( 66 min )
    [D] Help regarding AAAI 23
    Our paper has been accepted to AAAI-23. However, none of the authors will be able to attend in-person as the visa waiting times are high in our country. The original mail regarding paper acceptance mentioned that they were planning alternatives for people who won't be able to attend in-person but they didn't give any updates after that. They recently sent a mail regarding video submission where they mentioned that atleast one author is required to attend in person. I have already tried mailing aaai23@aaai(dot)org and aaaireg@aaai(dot)org, but did not receive any response. Did anyone else receive any information regarding this. Can anyone tell me what further steps i can take? Thanks in advance PS. Forgive me if this is not the right place to post, i don't know where else to ask submitted by /u/Numeronext [link] [comments]  ( 69 min )
    ML model integration to Android Application [P]
    [P] Hi! I have an ML model which I have exported as a TensorFlow Lite file. I need to integrate it into an Android application (Java - Android Studio) to generate results according to cough input via the mic. I'm a newbie to this. I have attached the Python code and also the Java code below. Here are the snippets of the ML model code: https://imgur.com/mROdfQI https://imgur.com/vXTBsjc https://imgur.com/k8B8J4S https://imgur.com/YK8JJpM https://imgur.com/tDR17EN Here is the Java code where I need to implement this functionality. Here's the layout SS: https://imgur.com/JQkq1jX This is where I need to implement this model to fetch results: https://imgur.com/jA9RYNS Let me know if you need more details submitted by /u/zeshannaveed568 [link] [comments]  ( 65 min )
    [D] Advances in World Models
    What are the most critical advances in differentiable world models of this year? I haven't read much new in that direction since the Dreamer papers. Is there anything new that you think is promising or that you have tried and that works well? submitted by /u/fedetask [link] [comments]  ( 62 min )
    [N] Google Unveils a New Machine Learning Add-on for Google Sheets, Called Simple ML for Sheets, Which Allows Users to Leverage the Power of Machine Learning Without Any Coding Experience
    submitted by /u/rocky_rowdy [link] [comments]  ( 63 min )
    [P] I implemented an end-to-end MLOps stack example. Tutorial, files, and workspace included.
    A couple of months ago, I posted about reviewing 50+ open-source MLOps tools. Thank you all for the love and fantastic feedback! Many of you asked for stack examples, so I implemented an end-to-end MLOps stack with the most popular tools. Here is the result. Why did I do it? It's hard to select the right tools from so many options, and even after selecting them, it's hard to figure out how to get started. So I created an example stack: I picked the tools among the most popular ones, ensuring they work well together. I wrote a tutorial explaining the workflow step-by-step. I published the files to serve as a template. You can launch the same machine I used, fully configured, to test out the stack. I'm coming here to ask whether you find this helpful, and I would appreciate feedback on how I could make this more useful for you. I am grateful for all and any input you can give me ❤️🙂 submitted by /u/Academic_Arrak [link] [comments]  ( 66 min )
    [D] ChatGPT, crowdsourcing and similar examples
    I was reading a little bit about ChatGPT training, which led me to realize how smart a move making it free to use actually is. We basically know that during training ChatGPT uses human feedback, which is relatively expensive to get. However, making it free to use and providing users an option to give feedback opens the door to massive amounts of training data for a relatively cheap price per training sample (the cost of running the server). This approach is quite fascinating to me, and makes me wonder about other similar examples, so I would like to hear them in the comments if you have any. submitted by /u/mvujas [link] [comments]  ( 67 min )
    [D] Data driven decision making will fail: Here’s why Fascinating and thoughtful talk from Marc Warner, CEO, Faculty
    submitted by /u/chelsea_bear [link] [comments]  ( 67 min )
    [P] Football Player 3D Pose Estimation using YOLOv7
    submitted by /u/RandomForests92 [link] [comments]  ( 64 min )
    [D] A recently published TKDE paper on self supervised learning on graph from Stan Z Li's group at Westlake University, China plagiarized a TPAMI paper from TAMU
    submitted by /u/Fresh-Attorney4131 [link] [comments]  ( 62 min )
    [P] Problem with training and evaluation data from transformers model to Huggingface
    Hi all, We are training a distilBART model to summarize podcasts. We want to be able to properly document the process, and how each decision affects the model. So far that has included using rouge scores to determine the performance. If there are any other things you think we should do, please let me know. But back to the question from the title: For some reason, I just can not figure out how to control the training results. I want to see training- and validation loss after every epoch, but it keeps either putting it at some weird interval like here (code), or not at all like here (code). Will appreciate any help and general criticism of what we are doing! submitted by /u/emmytau [link] [comments]  ( 67 min )
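    A hedged sketch of the knobs that usually control this in the transformers Trainer (argument names as of recent versions; the output directory and epoch count are placeholders):

        from transformers import TrainingArguments

        args = TrainingArguments(
            output_dir="out",
            num_train_epochs=3,
            evaluation_strategy="epoch",  # compute validation loss every epoch
            logging_strategy="epoch",     # log training loss every epoch
            save_strategy="epoch",
        )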
    [P] Remove text from AI-generated images
    submitted by /u/Acceptable_Raisin_55 [link] [comments]  ( 68 min )
    [D] Manufacturing Forecast Model
    Hi all! Let's say you have manufacturing data that shows the full routing history for the past 3 years. For each work order you have the information on when each processing step was started and completed, as well as when the work order was released:

        Order  Article  Process Step  Status  Date
        1      A        01            Start   01.01.2022
        1      A        01            End     05.01.2022

    Based on that historical data I would like to predict, with a given predictability, on which date we can assume that a specific process step is completed, and when the last process step for each order will be. How would you do that? I am not looking for a specific solution, just for your approach and some useful concepts that I could research on my own. Maybe there is even a similar use case somewhere already? submitted by /u/Doctor_Pink [link] [comments]  ( 69 min )
    [D] How can AI contribute to art historical analysis and research?
    submitted by /u/AImSamy [link] [comments]  ( 64 min )
    [R] GPT-Neo 125M or Bloomz-MT 300M pretrained/finetuned with Squad?
    I'd like to use one of those engine's for Q/A I see some nice tools out there like nshepherd and happytransformer, but neither of them use squad, but I do see some GPT-Neo squad models out there (for ex with GPT-NeoX) https://www.forefront.ai/blog-posts/how-to-fine-tune-gpt-neox submitted by /u/Thistleknot [link] [comments]  ( 64 min )
    [D] What is a good architecture for evaluation functions in the game of Go?
    AlphaGo used deep convolutional networks, but what is best for small scale computation like on a laptop? submitted by /u/Alarming-Fly-1679 [link] [comments]  ( 63 min )
    [D] Is softmax a good choice for confidence?
    So I was wondering, basically the title. If my CNN model is trained to classify images into cat and dog, and I show it an image of a horse. My model should be giving either dog or a cat as the answer, however the confidence of this answer (passing horse through a softmax) should be low. But I have found that mostly the models are quite cocky with a high confidence that it is indeed a dog. Or a cat. So is there a better way? Is there a technique or method or algorithm that gives accurate confidence on a classification? submitted by /u/thanderrine [link] [comments]  ( 66 min )
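    Two standard starting points one could try (a hedged sketch of our own; the temperature 2.0 is a placeholder that would normally be tuned on a validation set): temperature scaling softens overconfident probabilities, and the energy score -logsumexp(logits) is a common out-of-distribution signal for inputs like the horse.

        import torch
        import torch.nn.functional as F

        logits = torch.randn(4, 2)                    # cat/dog logits for 4 images

        msp = F.softmax(logits, dim=-1).max(-1).values               # max softmax prob
        calibrated = F.softmax(logits / 2.0, dim=-1).max(-1).values  # temperature scaled
        energy = -torch.logsumexp(logits, dim=-1)     # higher = more OOD-like input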
    Accelerating AI model embodiment project and GPTChat. [Project]
    I have a AI model embodiment project. POC is pretty far along. I need help to accelerate it to the finish: https://www.notion.so/Mind-Machine-Learning-2707060e25ec43978884b5e718c0c0d8 submitted by /u/bhartsb [link] [comments]  ( 66 min )
    [D] Is there a way to download neurips 2022 talks slides ?
    Hi Everyone, I've been catching up with the talks that happened in neurips 2022 as I could not follow it when it was happening because of my quarter finals. However I have not been able to figure out how to download the slides of the talks. Does anyone here know how to download them ? submitted by /u/sidney_lumet [link] [comments]  ( 66 min )
  • Open

    Hey guys, I'm working on a project to create 'self-replicating knowledge agents' inside large language model interfaces like ChatGPT. Check out /r/SelfReplicatingAI if you're interested in exploring this concept!
    submitted by /u/slackermanz [link] [comments]  ( 47 min )
    Is there an image to text AI generator?
    I'm talking about an AI to describe the image not to look for actual text in the image. submitted by /u/wezzeld [link] [comments]  ( 48 min )
    What are the largest companies developing AI other than the obvious like Google
    Who are the little guys doing AI in a big way? Who are the ones to watch? submitted by /u/dust_in_light [link] [comments]  ( 48 min )
    Noted 😌 (not sure about 5)
    submitted by /u/cavemanpiggy [link] [comments]  ( 51 min )
    AI Dream 134 - Incredible Trip: Discovery of Zion - Last hope for Human ...
    submitted by /u/LordPewPew777 [link] [comments]  ( 45 min )
    Cheers to generating infinite game assets during run-time!
    submitted by /u/ytcoinartist [link] [comments]  ( 48 min )
    I put my username as Bobbert_The_3rd and The AI did this lol
    submitted by /u/cheesemuncher1781 [link] [comments]  ( 45 min )
    A quick art piece on the whole AI debate going on. Who made What for Who using What. Original content by @gardehesten https://www.instagram.com/reel/CmM8sIPDNzI/?igshid=YmMyMTA2M2Y=
    submitted by /u/Ambitious_Dig3082 [link] [comments]  ( 49 min )
    Paul Thagard - Substrate-Independent Minds
    submitted by /u/timothy-ventura [link] [comments]  ( 46 min )
    Training AI to speak as someone
    So I just found out that it is possible to train AI to synthesize speech as someone else. How long would it take to train an AI to do this? How much speech data is needed? Are there any tutorials or explanations out there? submitted by /u/Mixtery1 [link] [comments]  ( 51 min )
    A frontend for models like GPT
    Is there an open source web frontend for GPT-Neo models and such that runs locally? Something that lets you:
    - choose models with different configurations (GPT-2, GPT-Neo-1.3B, GPU accelerated, etc.)
    - regenerate chunks of text
    - edit generated messages
    - insert your own sections of text
    - include or exclude chunks of text, with a checkbox, for the generation of the next text
    Those features are heavily inspired by character.ai and based on my very limited experience with gpt-neo-1.3B. I'm thinking about starting such a project, but my time is very limited due to my other projects. Are there other developers with Angular / React / Python Flask / TensorFlow / PyTorch / Hugging Face experience? submitted by /u/sezanzeb [link] [comments]  ( 59 min )
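As a starting point for such a project, a minimal sketch of the backend half: a Flask route wrapping a locally loaded GPT-Neo pipeline. The route name and JSON fields are made-up placeholders; an Angular or React frontend would call this endpoint to implement the regenerate/edit/include features above.

```python
# Sketch: local Flask backend serving GPT-Neo text generation.
from flask import Flask, request, jsonify
from transformers import pipeline

app = Flask(__name__)
generator = pipeline("text-generation", model="EleutherAI/gpt-neo-1.3B")

@app.route("/generate", methods=["POST"])
def generate():
    body = request.get_json()
    out = generator(body["prompt"],
                    max_new_tokens=body.get("max_new_tokens", 64),
                    do_sample=True,
                    temperature=body.get("temperature", 0.8))
    return jsonify({"text": out[0]["generated_text"]})

if __name__ == "__main__":
    app.run(port=5000)
```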
    Is there an AI app that I can throw a song into and it will generate something original based on that?
    submitted by /u/melonhusktwitter [link] [comments]  ( 49 min )
    Could I clone my voice with AI and start a music career while putting in zero effort?
    submitted by /u/AppleFanBoySheesh [link] [comments]  ( 47 min )
    OpenAI Forecasts $1 Billion in Revenue by 2024
    submitted by /u/liquidocelotYT [link] [comments]  ( 45 min )
    I'm taking generating at least 10 prompts a day as seriously as a job. Should I do it? Why?
    I'm taking generating at least 10 prompts a day as seriously as a job. submitted by /u/TheVellerShow [link] [comments]  ( 50 min )
    AI Avatars with Stable Diffusion - Outdo Lensa for Free and Without a GPU or Computer - Comprehensive Video Tutorial Provided
    submitted by /u/CeFurkan [link] [comments]  ( 49 min )
    ChatGPT AI just solved an unsolved math problem - The Collatz Conjecture
    I first asked the chatbot (ChatGPT by OpenAI) to "Prove the Collatz conjecture" straightforwardly, and nothing meaningful came out except what it is and how unproven it is. This was its conclusion: "Despite these efforts, the conjecture remains unsolved. It is considered to be one of the most challenging and intriguing unsolved problems in mathematics." Then I proceeded with "try to solve it" and then "use another method that no one used before to solve it", with no worthy answer. I figured that these cheap shots weren't gonna do it, so I worked around the question and proceeded with this: "3n+1 where n is a positive integer n/2 where n is a non-negative integer. Prove the answer end by cycling through 1,4,2,1,4,2,1,… if n is a positive integer. This is a repetitive process and you will rep…  ( 67 min )
    What's the state-of-the-art upscaling AI for Illustrations?
    I'm currently using RealESRGAN, so I was hoping for some advancements in artifact reduction. submitted by /u/typcalthowawayacount [link] [comments]  ( 47 min )
    I had an hour to spare, so I made ChatGPT and MidJourney create some great testimonials for myself from some of the world's most imaginary leaders.
    submitted by /u/QubaHQ [link] [comments]  ( 48 min )
    Latest AI research at UC Berkeley: a tracking algorithm for the dynamics of the tear film lipid layer
    submitted by /u/ai-lover [link] [comments]  ( 6 min )
  • Open

    [D] Using RL instead of classic programming
    The story is that I'm supposed to review someone's paper that replaces a traditional way of making decisions in an industry with an RL model. The decision is traditionally made by a classic program with some if-clauses. In terms of time complexity the classic solution is not slower than the RL agent; I literally tested both on the same machine. So I'm not sure why we should replace classic programs with an RL agent. But I don't want to be discouraging. Has anyone seen RL used for this purpose? Searching reliable proceedings didn't help. The paper is not submitted anywhere yet. The writer is a new student in another lab and they asked me to review his paper. submitted by /u/curly_crazy_curious [link] [comments]  ( 56 min )
    "Merging enzymatic and synthetic chemistry with computational synthesis planning", Levin et al 2022
    submitted by /u/gwern [link] [comments]  ( 51 min )
    Why does TRPO take only one policy gradient step?
    My question, however, is the opposite: can we shuffle the whole batch before the trust region update and divide it into minibatches to update the policy multiple times, like PPO does? submitted by /u/OutOfCharm [link] [comments]  ( 52 min )
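For comparison, a small sketch of the PPO-style pattern the question describes: shuffle the collected batch, then run several update epochs over minibatches. TRPO takes a single step because its KL trust region is anchored at the policy that collected the data; after the first update the constraint no longer bounds the true KL, which is one standard answer to why TRPO does not reuse the batch this way. The function below is generic, with update_fn standing in for whatever loss step is used.

```python
# Sketch of PPO-style minibatch reuse of one rollout batch.
import numpy as np

def ppo_epochs(batch, update_fn, n_epochs=4, minibatch_size=64, rng=None):
    """batch: dict of equally sized numpy arrays (obs, actions, advantages, ...)."""
    rng = rng or np.random.default_rng()
    n = len(batch["obs"])
    for _ in range(n_epochs):
        idx = rng.permutation(n)                      # reshuffle every epoch
        for start in range(0, n, minibatch_size):
            mb = idx[start:start + minibatch_size]
            minibatch = {k: v[mb] for k, v in batch.items()}
            update_fn(minibatch)                      # clipped surrogate + value loss step
```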
    Best universities or labs for RL related research? Can be from any country, open to all suggestions.
    submitted by /u/FailedMesh [link] [comments]  ( 54 min )
    [Q] Official seed_rl repo is archived... any alternative seed_rl-style DRL repo?
    Hey guys! I was fascinated by the concept of seed_rl when it first came out, because I believe it could accelerate training speed in a local single-machine environment. But I found that the official repo was recently archived and is no longer maintained. So I'm looking for alternatives with which I can use seed_rl-style distributed RL. Ray (or RLlib) is the most used DRL library, but it doesn't seem to use the seed_rl style. Can anyone recommend distributed RL libraries for this, or ones that are good for research and for lots of code modification? Is RLlib worth using for single local machine training despite those cons? Thank you!! submitted by /u/jinPrelude [link] [comments]  ( 52 min )
    stable-baselines3 logging
    Hello, guys! It doesn't seem to be against the rules to ask about library implementation specifics, so here I go: does anyone know how to log extra values in TensorBoard with stable-baselines3? This tutorial shows how to make a proper callback, but it doesn't show how to extract information from the environment with it. I'm already passing the information needed in the info dict to the info_keywords kwarg of VecMonitor, but I don't know how to access that inside the callback. Also, I thought that simply adding is_success to that same dict and kwarg would be enough to start logging success_rate (as described in that documentation reference too), but it doesn't seem to work. What am I doing wrong? submitted by /u/victorsevero [link] [comments]  ( 55 min )
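For what it's worth, a minimal sketch of a stable-baselines3 callback that reads values out of the infos available during rollout collection and records them to TensorBoard; the key name my_metric is an assumption standing in for whatever the environment actually reports.

```python
# Sketch: a stable-baselines3 callback that pulls a value out of the env's
# info dicts and logs it to TensorBoard. The key "my_metric" is an assumption.
from stable_baselines3.common.callbacks import BaseCallback

class InfoLoggerCallback(BaseCallback):
    def _on_step(self) -> bool:
        for info in self.locals["infos"]:   # one info dict per vectorized env
            if "my_metric" in info:
                self.logger.record("custom/my_metric", info["my_metric"])
        return True                          # returning False would stop training

# Usage: model.learn(100_000, callback=InfoLoggerCallback())
```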
  • Open

    How is the right set of neural network hidden layers determined? From every tutorial I've seen, it seems like people are just guessing.
    submitted by /u/PrepxI [link] [comments]  ( 49 min )
    output of the activation function for Sigmoid
    Consider an input given to the activation function as z = 0.5. Compute the output of the sigmoid activation function and show your calculations. Is it this simple? I get confused by the z. 1.0 / (1.0 + e^-0.5) submitted by /u/Relevant_Ideal_7014 [link] [comments]  ( 56 min )
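Yes, it is that simple: z is just the input to the activation, so the answer is sigmoid(0.5) = 1 / (1 + e^{-0.5}) ≈ 0.622. A two-line check:

```python
# Worked out directly: sigmoid(z) = 1 / (1 + e^(-z)) with z = 0.5.
import math

z = 0.5
sigmoid = 1.0 / (1.0 + math.exp(-z))
print(sigmoid)  # ≈ 0.6224593312018546
```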
  • Open

    New performance improvements in Amazon SageMaker model parallel library
    Foundation models are large deep learning models trained on a vast quantity of data at scale. They can be further fine-tuned to perform a variety of downstream tasks and form the core backbone of enabling several AI applications. The most prominent category is large language models (LLMs), including auto-regressive models such as GPT variants trained to complete […]  ( 10 min )
  • Open

    Causes and Cures for Interference in Multilingual Translation. (arXiv:2212.07530v1 [cs.CL])
    Multilingual machine translation models can benefit from synergy between different language pairs, but also suffer from interference. While there is a growing number of sophisticated methods that aim to eliminate interference, our understanding of interference as a phenomenon is still limited. This work identifies the main factors that contribute to interference in multilingual machine translation. Through systematic experimentation, we find that interference (or synergy) is primarily determined by model size, data size, and the proportion of each language pair within the total dataset. We observe that substantial interference occurs mainly when the model is very small with respect to the available training data, and that using standard transformer configurations with less than one billion parameters largely alleviates interference and promotes synergy. Moreover, we show that tuning the sampling temperature to control the proportion of each language pair in the data is key to balancing the amount of interference between low and high resource language pairs effectively, and can lead to superior performance overall.  ( 2 min )
    Smoothness and continuity of cost functionals for ECG mismatch computation. (arXiv:2201.04487v2 [physics.med-ph] UPDATED)
    The field of cardiac electrophysiology tries to abstract, describe and finally model the electrical characteristics of a heartbeat. With recent advances in cardiac electrophysiology, models have become more powerful and descriptive than ever. However, to advance to the field of inverse electrophysiological modeling, i.e. creating models from electrical measurements such as the ECG, the less investigated field of smoothness of the simulated ECGs w.r.t. model parameters needs to be further explored. The present paper discusses smoothness in terms of the whole pipeline which describes how, from physiological parameters, we arrive at the simulated ECG. Employing such a pipeline, we create a test bench of a simplified idealized left ventricle model and demonstrate the most important factors for efficient inverse modeling through smooth cost functionals. Such knowledge will be important for designing and creating inverse models in future optimization and machine learning methods.  ( 2 min )
    Interpolation with the polynomial kernels. (arXiv:2212.07658v1 [math.NA])
    The polynomial kernels are widely used in machine learning and they are one of the default choices to develop kernel-based classification and regression models. However, they are rarely used and considered in numerical analysis due to their lack of strict positive definiteness. In particular they do not enjoy the usual property of unisolvency for arbitrary point sets, which is one of the key properties used to build kernel-based interpolation methods. This paper is devoted to establishing some initial results for the study of these kernels, and their related interpolation algorithms, in the context of approximation theory. We will first prove necessary and sufficient conditions on point sets which guarantee the existence and uniqueness of an interpolant. We will then study the Reproducing Kernel Hilbert Spaces (or native spaces) of these kernels and their norms, and provide inclusion relations between spaces corresponding to different kernel parameters. With these spaces at hand, it will be further possible to derive generic error estimates which apply to sufficiently smooth functions, thus escaping the native space. Finally, we will show how to apply an efficient stable algorithm to these kernels to obtain accurate interpolants, and we will test them in some numerical experiments. After this analysis several computational and theoretical aspects remain open, and we will outline possible further research directions in a concluding section. This work builds some bridges between kernel and polynomial interpolation, two topics to which the authors, to different extents, have been introduced under the supervision or through the work of Stefano De Marchi. For this reason, they wish to dedicate this work to him on the occasion of his 60th birthday.  ( 2 min )
    Decorrelation with conditional normalizing flows. (arXiv:2211.02486v3 [hep-ph] UPDATED)
    The sensitivity of many physics analyses can be enhanced by constructing discriminants that preferentially select signal events. Such discriminants become much more useful if they are uncorrelated with a set of protected attributes. In this paper we show that a normalizing flow conditioned on the protected attributes can be used to find a decorrelated representation for any discriminant. As a normalizing flow is invertible the separation power of the resulting discriminant will be unchanged at any fixed value of the protected attributes. We demonstrate the efficacy of our approach by building supervised jet taggers that produce almost no sculpting in the mass distribution of the background.  ( 2 min )
    Factorized Fourier Neural Operators. (arXiv:2111.13802v3 [cs.LG] UPDATED)
    We propose the Factorized Fourier Neural Operator (F-FNO), a learning-based approach for simulating partial differential equations (PDEs). Starting from a recently proposed Fourier representation of flow fields, the F-FNO bridges the performance gap between pure machine learning approaches and the best numerical or hybrid solvers. This is achieved with new representations - separable spectral layers and improved residual connections - and a combination of training strategies such as the Markov assumption, Gaussian noise, and cosine learning rate decay. On several challenging benchmark PDEs on regular grids, structured meshes, and point clouds, the F-FNO can scale to deeper networks and outperform both the FNO and the geo-FNO, reducing the error by 83% on the Navier-Stokes problem, 31% on the elasticity problem, 57% on the airfoil flow problem, and 60% on the plastic forging problem. Compared to the state-of-the-art pseudo-spectral method, the F-FNO can take a step size that is an order of magnitude larger in time and achieve an order of magnitude speedup to produce the same solution quality.  ( 2 min )
    Put Attention to Temporal Saliency Patterns of Multi-Horizon Time Series. (arXiv:2212.07771v1 [cs.LG])
    Time series, sets of sequences in chronological order, are essential data in statistical research with many forecasting applications. Although recent performance in many Transformer-based models has been noticeable, long multi-horizon time series forecasting remains a very challenging task. Going beyond transformers in sequence translation and transduction research, we observe the effects of down-and-up samplings that can nudge temporal saliency patterns to emerge in time sequences. Motivated by this observation, we propose in this paper a novel architecture, Temporal Saliency Detection (TSD), on top of the attention mechanism and apply it to multi-horizon time series prediction. We renovate the traditional encoder-decoder architecture by introducing a series of deep convolutional blocks that work in tandem with the multi-head self-attention. The proposed TSD approach facilitates the multiresolution of saliency patterns upon condensed multi-heads, thus progressively enhancing complex time series forecasting. Experimental results illustrate that our proposed approach has significantly outperformed existing state-of-the-art methods across multiple standard benchmark datasets in many far-horizon forecasting settings. Overall, TSD achieves 31% and 46% relative improvement over the current state-of-the-art models in multivariate and univariate time series forecasting scenarios on standard benchmarks. The Git repository is available at https://github.com/duongtrung/time-series-temporal-saliency-patterns.  ( 2 min )
    EpiGRAF: Rethinking training of 3D GANs. (arXiv:2206.10535v2 [cs.CV] UPDATED)
    A very recent trend in generative modeling is building 3D-aware generators from 2D image collections. To induce the 3D bias, such models typically rely on volumetric rendering, which is expensive to employ at high resolutions. During the past months, more than 10 works have appeared that address this scaling issue by training a separate 2D decoder to upsample a low-resolution image (or a feature tensor) produced from a pure 3D generator. But this solution comes at a cost: not only does it break multi-view consistency (i.e. shape and texture change when the camera moves), but it also learns the geometry at low fidelity. In this work, we show that it is possible to obtain a high-resolution 3D generator with SotA image quality by following a completely different route of simply training the model patch-wise. We revisit and improve this optimization scheme in two ways. First, we design a location- and scale-aware discriminator to work on patches of different proportions and spatial positions. Second, we modify the patch sampling strategy based on an annealed beta distribution to stabilize training and accelerate the convergence. The resulting model, named EpiGRAF, is an efficient, high-resolution, pure 3D generator, and we test it on four datasets (two introduced in this work) at $256^2$ and $512^2$ resolutions. It obtains state-of-the-art image quality, high-fidelity geometry and trains ${\approx} 2.5 \times$ faster than the upsampler-based counterparts. Project website: https://universome.github.io/epigraf.  ( 2 min )
    Alternating Objectives Generates Stronger PGD-Based Adversarial Attacks. (arXiv:2212.07992v1 [cs.LG])
    Designing powerful adversarial attacks is of paramount importance for the evaluation of $\ell_p$-bounded adversarial defenses. Projected Gradient Descent (PGD) is one of the most effective and conceptually simple algorithms to generate such adversaries. The search space of PGD is dictated by the steepest ascent directions of an objective. Despite the plethora of objective function choices, there is no universally superior option and robustness overestimation may arise from ill-suited objective selection. Driven by this observation, we postulate that the combination of different objectives through a simple loss alternating scheme renders PGD more robust towards design choices. We experimentally verify this assertion on a synthetic-data example and by evaluating our proposed method across 25 different $\ell_{\infty}$-robust models and 3 datasets. The performance improvement is consistent when compared to the single loss counterparts. On the CIFAR-10 dataset, our strongest adversarial attack outperforms all of the white-box components of the AutoAttack (AA) ensemble, as well as the most powerful attacks existing in the literature, achieving state-of-the-art results within the computational budget of our study ($T=100$, no restarts).  ( 2 min )
    Generating Multivariate Load States Using a Conditional Variational Autoencoder. (arXiv:2110.11435v2 [eess.SY] UPDATED)
    For planning of power systems and for the calibration of operational tools, it is essential to analyse system performance in a large range of representative scenarios. When the available historical data is limited, generative models are a promising solution, but modelling high-dimensional dependencies is challenging. In this paper, a multivariate load state generating model on the basis of a conditional variational autoencoder (CVAE) neural network is proposed. Going beyond common CVAE implementations, the model includes stochastic variation of output samples under given latent vectors and co-optimizes the parameters for this output variability. It is shown that this improves statistical properties of the generated data. The quality of generated multivariate loads is evaluated using univariate and multivariate performance metrics. A generation adequacy case study on the European network is used to illustrate the model's ability to generate realistic tail distributions. The experiments demonstrate that the proposed generator outperforms other data generating mechanisms.  ( 2 min )
    Multi-Level Association Rule Mining for Wireless Network Time Series Data. (arXiv:2212.07860v1 [cs.NI])
    Key performance indicators (KPIs) are of great significance in the monitoring of wireless network service quality. The network service quality can be improved by adjusting relevant configuration parameters (CPs) of the base station. However, there are numerous CPs and different cells may affect each other, which brings great challenges to the association analysis of wireless network data. In this paper, we propose an adjustable multi-level association rule mining framework, which can quantitatively mine association rules at each level with environmental information, including engineering parameters and performance management (PMs) data, and which is interpretable at each level. Specifically, we first cluster similar cells, then quantify KPIs and CPs, and integrate expert knowledge into the association rule mining model, which improves the robustness of the model. The experimental results on a real-world dataset prove the effectiveness of our method.  ( 2 min )
    Interactive Concept Bottleneck Models. (arXiv:2212.07430v1 [cs.LG])
    Concept bottleneck models (CBMs) (Koh et al. 2020) are interpretable neural networks that first predict labels for human-interpretable concepts relevant to the prediction task, and then predict the final label based on the concept label predictions. We extend CBMs to interactive prediction settings where the model can query a human collaborator for the label to some concepts. We develop an interaction policy that, at prediction time, chooses which concepts to request a label for so as to maximally improve the final prediction. We demonstrate that a simple policy combining concept prediction uncertainty and influence of the concept on the final prediction achieves strong performance and outperforms a static approach proposed in Koh et al. (2020) as well as active feature acquisition methods proposed in the literature. We show that the interactive CBM can achieve accuracy gains of 5-10% with only 5 interactions over competitive baselines on the Caltech-UCSD Birds, CheXpert and OAI datasets.  ( 2 min )
    Anomaly Detection in Driving by Cluster Analysis Twice. (arXiv:2212.07691v1 [cs.LG])
    Events deviating from normal traffic patterns in driving (anomalies), such as aggressive driving or bumpy roads, may harm delivery efficiency for the transportation and logistics (T&L) business. Thus, detecting anomalies in driving is critical for the T&L industry. So far, numerous studies have used vehicle sensor data to identify anomalies. Most previous works captured anomalies by using deep learning or machine learning algorithms, which require prior training processes and huge computational costs. This study proposes a method named Anomaly Detection in Driving by Cluster Analysis Twice (ADDCAT), which clusters the processed sensor data by different physical properties. An event is said to be an anomaly if it never fits the major cluster, which is considered the pattern of normality in driving. This method provides a way to detect anomalies in driving without the need for prior training processes or huge computational costs. This paper validates the performance of the method on an open dataset.  ( 2 min )
    Combining information-seeking exploration and reward maximization: Unified inference on continuous state and action spaces under partial observability. (arXiv:2212.07946v1 [cs.LG])
    Reinforcement learning (RL) gained considerable attention by creating decision-making agents that maximize rewards received from fully observable environments. However, many real-world problems are partially or noisily observable by nature, where agents do not receive the true and complete state of the environment. Such problems are formulated as partially observable Markov decision processes (POMDPs). Some studies applied RL to POMDPs by recalling previous decisions and observations or inferring the true state of the environment from received observations. Nevertheless, aggregating observations and decisions over time is impractical for environments with high-dimensional continuous state and action spaces. Moreover, so-called inference-based RL approaches require a large number of samples to perform well, since agents eschew uncertainty in the inferred state for decision-making. Active inference is a framework that is naturally formulated in POMDPs and directs agents to select decisions by minimising expected free energy (EFE). This supplements the reward-maximising (exploitative) behaviour of RL with an information-seeking (exploratory) behaviour. Despite this exploratory behaviour of active inference, its usage is limited to discrete state and action spaces due to the computational difficulty of the EFE. We propose a unified principle for joint information-seeking and reward maximization that clarifies a theoretical connection between active inference and RL, unifies active inference and RL, and overcomes their aforementioned limitations. Our findings are supported by strong theoretical analysis. The proposed framework's superior exploration property is also validated by experimental results on partially observable tasks with high-dimensional continuous state and action spaces. Moreover, the results show that our model solves reward-free problems, making task reward design optional.  ( 2 min )
    IMoS: Intent-Driven Full-Body Motion Synthesis for Human-Object Interactions. (arXiv:2212.07555v1 [cs.CV])
    Can we make virtual characters in a scene interact with their surrounding objects through simple instructions? Is it possible to synthesize such motion plausibly with a diverse set of objects and instructions? Inspired by these questions, we present the first framework to synthesize the full-body motion of virtual human characters performing specified actions with 3D objects placed within their reach. Our system takes as input textual instructions specifying the objects and the associated intentions of the virtual characters and outputs diverse sequences of full-body motions. This is in contrast to existing work, where full-body action synthesis methods generally do not consider object interactions, and human-object interaction methods focus mainly on synthesizing hand or finger movements for grasping objects. We accomplish our objective by designing an intent-driven full-body motion generator, which uses a pair of decoupled conditional variational autoencoders (CVAE) to learn the motion of the body parts in an autoregressive manner. We also optimize for the positions of the objects with six degrees of freedom (6DoF) such that they plausibly fit within the hands of the synthesized characters. We compare our proposed method with the existing methods of motion synthesis and establish a new and stronger state-of-the-art for the task of intent-driven motion synthesis. Through a user study, we further show that our synthesized full-body motions appear more realistic to the participants in more than 80% of scenarios compared to the current state-of-the-art methods, and are perceived to be as good as the ground truth on several occasions.
    ROSCOE: A Suite of Metrics for Scoring Step-by-Step Reasoning. (arXiv:2212.07919v1 [cs.CL])
    Large language models show improved downstream task performance when prompted to generate step-by-step reasoning to justify their final answers. These reasoning steps greatly improve model interpretability and verification, but objectively studying their correctness (independent of the final answer) is difficult without reliable methods for automatic evaluation. We simply do not know how often the stated reasoning steps actually support the final end task predictions. In this work, we present ROSCOE, a suite of interpretable, unsupervised automatic scores that improve and extend previous text generation evaluation metrics. To evaluate ROSCOE against baseline metrics, we design a typology of reasoning errors and collect synthetic and human evaluation scores on commonly used reasoning datasets. In contrast with existing metrics, ROSCOE can measure semantic consistency, logicality, informativeness, fluency, and factuality - among other traits - by leveraging properties of step-by-step rationales. We empirically verify the strength of our metrics on five human annotated and six programmatically perturbed diagnostics datasets - covering a diverse set of tasks that require reasoning skills and show that ROSCOE can consistently outperform baseline metrics.
    fMRI from EEG is only Deep Learning away: the use of interpretable DL to unravel EEG-fMRI relationships. (arXiv:2211.02024v2 [physics.med-ph] UPDATED)
    Access to the activity of subcortical structures offers a unique opportunity for building intention-dependent brain-computer interfaces, renders abundant options for exploring a broad range of cognitive phenomena in the realm of affective neuroscience, including complex decision-making processes and the eternal free-will dilemma, and facilitates diagnostics of a range of neurological diseases. So far this was possible only using bulky, expensive and immobile fMRI equipment. Here we present an interpretable, domain-grounded solution to recover the activity of several subcortical regions from multichannel EEG data and demonstrate up to 60% correlation between the actual subcortical blood oxygenation level dependent (sBOLD) signal and its EEG-derived twin. Then, using a novel and theoretically justified weight interpretation methodology, we recover individual spatial and time-frequency patterns of scalp EEG predictive of the hemodynamic signal in the subcortical nuclei. The described results not only pave the road towards wearable subcortical activity scanners but also showcase an automatic knowledge discovery process facilitated by deep learning technology in combination with an interpretable domain-constrained architecture and the appropriate downstream task.
    Construction of a Surrogate Model: Multivariate Time Series Prediction with a Hybrid Model. (arXiv:2212.07918v1 [stat.ML])
    Recent developments of advanced driver-assistance systems necessitate an increasing number of tests to validate new technologies. These tests cannot be carried out on track in a reasonable amount of time and automotive groups rely on simulators to perform most tests. The reliability of these simulators for constantly refined tasks is becoming an issue and, to increase the number of tests, the industry is now developing surrogate models, that should mimic the behavior of the simulator while being much faster to run on specific tasks. In this paper we aim to construct a surrogate model to mimic and replace the simulator. We first test several classical methods such as random forests, ridge regression or convolutional neural networks. Then we build three hybrid models that use all these methods and combine them to obtain an efficient hybrid surrogate model.
    BagPipe: Accelerating Deep Recommendation Model Training. (arXiv:2202.12429v2 [cs.DC] UPDATED)
    Deep learning based recommendation models (DLRM) are widely used in several business-critical applications. Training such recommendation models efficiently is challenging primarily because they consist of billions of embedding-based parameters which are often stored remotely, leading to significant overheads from embedding access. By profiling existing DLRM training, we observe that only 8.5% of the iteration time is spent in the forward/backward pass while the remaining time is spent on embedding and model synchronization. Our key insight in this paper is that access to embeddings has a specific structure and pattern which can be used to accelerate training. We observe that embedding accesses are heavily skewed, with almost 1% of embeddings representing more than 92% of total accesses. Further, we observe that during training we can look ahead at future batches to determine exactly which embeddings will be needed at what iteration in the future. Based on these insights, we propose Bagpipe, a system for training deep recommendation models that uses caching and prefetching to overlap remote embedding accesses with the computation. We designed an Oracle Cacher, a new system component which uses our lookahead algorithm to generate optimal cache update decisions and provide strong consistency guarantees. Our experiments using three datasets and two models show that our approach provides a speedup of up to 6.2x compared to state-of-the-art baselines, while providing the same convergence and reproducibility guarantees as synchronous training.  ( 2 min )
    Differentiating Nonsmooth Solutions to Parametric Monotone Inclusion Problems. (arXiv:2212.07844v1 [cs.LG])
    We leverage path differentiability and a recent result on nonsmooth implicit differentiation calculus to give sufficient conditions ensuring that the solution to a monotone inclusion problem will be path differentiable, with formulas for computing its generalized gradient. A direct consequence of our result is that these solutions happen to be differentiable almost everywhere. Our approach is fully compatible with automatic differentiation and comes with assumptions which are easy to check, roughly speaking: semialgebraicity and strong monotonicity. We illustrate the scope of our results by considering three fundamental composite problem settings: strongly convex problems, dual solutions to convex minimization problems and primal-dual solutions to min-max problems.
    Sim-to-Real Transfer for Quadrupedal Locomotion via Terrain Transformer. (arXiv:2212.07740v1 [cs.RO])
    Deep reinforcement learning has recently emerged as an appealing alternative for legged locomotion over multiple terrains by training a policy in physical simulation and then transferring it to the real world (i.e., sim-to-real transfer). Despite considerable progress, the capacity and scalability of traditional neural networks are still limited, which may hinder their applications in more complex environments. In contrast, the Transformer architecture has shown its superiority in a wide range of large-scale sequence modeling tasks, including natural language processing and decision-making problems. In this paper, we propose Terrain Transformer (TERT), a high-capacity Transformer model for quadrupedal locomotion control on various terrains. Furthermore, to better leverage Transformer in sim-to-real scenarios, we present a novel two-stage training framework consisting of an offline pretraining stage and an online correction stage, which can naturally integrate Transformer with privileged training. Extensive experiments in simulation demonstrate that TERT outperforms state-of-the-art baselines on different terrains in terms of return, energy consumption and control smoothness. In further real-world validation, TERT successfully traverses nine challenging terrains, including sand pit and stair down, which can not be accomplished by strong baselines.
    Silhouette: Toward Performance-Conscious and Transferable CPU Embeddings. (arXiv:2212.08046v1 [cs.LG])
    Learned embeddings are widely used to obtain concise data representation and enable transfer learning between different data sets and tasks. In this paper, we present Silhouette, our approach that leverages publicly-available performance data sets to learn CPU embeddings. We show how these embeddings enable transfer learning between data sets of different types and sizes. Each of these scenarios leads to an improvement in accuracy for the target data set.  ( 2 min )
    Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language. (arXiv:2212.07525v1 [cs.LG])
    Current self-supervised learning algorithms are often modality-specific and require large amounts of computational resources. To address these issues, we increase the training efficiency of data2vec, a learning objective that generalizes across several modalities. We do not encode masked tokens, use a fast convolutional decoder and amortize the effort to build teacher representations. data2vec 2.0 benefits from the rich contextualized target representations introduced in data2vec which enable a fast self-supervised learner. Experiments on ImageNet-1K image classification show that data2vec 2.0 matches the accuracy of Masked Autoencoders in 16.4x lower pre-training time, on Librispeech speech recognition it performs as well as wav2vec 2.0 in 10.6x less time, and on GLUE natural language understanding it matches a retrained RoBERTa model in half the time. Trading some speed for accuracy results in ImageNet-1K top-1 accuracy of 86.8\% with a ViT-L model trained for 150 epochs.  ( 2 min )
    Emergent Behaviors in Multi-Agent Target Acquisition. (arXiv:2212.07891v1 [cs.AI])
    Only limited studies and superficial evaluations are available on agents' behaviors and roles within a Multi-Agent System (MAS). We simulate a MAS using Reinforcement Learning (RL) in a pursuit-evasion (a.k.a predator-prey pursuit) game, which shares task goals with target acquisition, and we create different adversarial scenarios by replacing RL-trained pursuers' policies with two distinct (non-RL) analytical strategies. Using heatmaps of agents' positions (state-space variable) over time, we are able to categorize an RL-trained evader's behaviors. The novelty of our approach entails the creation of an influential feature set that reveals underlying data regularities, which allow us to classify an agent's behavior. This classification may aid in catching the (enemy) targets by enabling us to identify and predict their behaviors, and when extended to pursuers, this approach towards identifying teammates' behavior may allow agents to coordinate more effectively.
    ESPNN: A novel electronic stopping power neural-network code built on the IAEA stopping power database. I. Atomic targets. (arXiv:2210.10950v2 [physics.atm-clus] UPDATED)
    The International Atomic Energy Agency (IAEA) stopping power database is a highly valued public resource compiling most of the experimental measurements published over nearly a century. The database, accessible to the global scientific community, is continuously updated and has been extensively employed in theoretical and experimental research for more than 30 years. This work aims to employ machine learning algorithms on the 2021 IAEA database to predict accurate electronic stopping power cross sections for any ion and target combination in a wide range of incident energies. Unsupervised machine learning methods are applied to clean the database in an automated manner. These techniques purge the data by removing suspicious outliers and old isolated values. A large portion of the remaining data is used to train a deep neural network, while the rest is set aside, constituting the test set. The present work considers collisional systems only with atomic targets. The first version of the ESPNN (electronic stopping power neural-network code), openly available to users, is shown to yield predicted values in excellent agreement with the experimental results of the test set.
    Certified Monotonic Neural Networks. (arXiv:2011.10219v2 [cs.LG] UPDATED)
    Learning monotonic models with respect to a subset of the inputs is a desirable feature to effectively address the fairness, interpretability, and generalization issues in practice. Existing methods for learning monotonic neural networks either require specifically designed model structures to ensure monotonicity, which can be too restrictive/complicated, or enforce monotonicity by adjusting the learning process, which cannot provably guarantee the learned model is monotonic on selected features. In this work, we propose to certify the monotonicity of the general piece-wise linear neural networks by solving a mixed integer linear programming problem. This provides a new general approach for learning monotonic neural networks with arbitrary model structures. Our method allows us to train neural networks with heuristic monotonicity regularizations, and we can gradually increase the regularization magnitude until the learned network is certified monotonic. Compared to prior works, our approach does not require human-designed constraints on the weight space and also yields more accurate approximation. Empirical studies on various datasets demonstrate the efficiency of our approach over the state-of-the-art methods, such as Deep Lattice Networks.
    Decentralized Nonconvex Optimization with Guaranteed Privacy and Accuracy. (arXiv:2212.07534v1 [math.OC])
    Privacy protection and nonconvexity are two challenging problems in decentralized optimization and learning involving sensitive data. Despite some recent advances addressing each of the two problems separately, no results have been reported that have theoretical guarantees on both privacy protection and saddle/maximum avoidance in decentralized nonconvex optimization. We propose a new algorithm for decentralized nonconvex optimization that can enable both rigorous differential privacy and saddle/maximum avoiding performance. The new algorithm allows the incorporation of persistent additive noise to enable rigorous differential privacy for data samples, gradients, and intermediate optimization variables without losing provable convergence, and thus circumventing the dilemma of trading accuracy for privacy in differential privacy design. More interestingly, the algorithm is theoretically proven to be able to efficiently guarantee accuracy by avoiding convergence to local maxima and saddle points, which has not been reported before in the literature on decentralized nonconvex optimization. The algorithm is efficient in both communication (it only shares one variable in each iteration) and computation (it is encryption-free), and hence is promising for large-scale nonconvex optimization and learning involving high-dimensional optimization parameters. Numerical experiments for both a decentralized estimation problem and an Independent Component Analysis (ICA) problem confirm the effectiveness of the proposed approach.
    Hope Speech Detection on Social Media Platforms. (arXiv:2212.07424v1 [cs.CL])
    Since personal computers became widely available in the consumer market, the amount of harmful content on the internet has significantly expanded. In simple terms, harmful content is anything online which causes a person distress or harm. It may include hate speech, violent content, threats, non-hope speech, etc. The online content must be positive, uplifting and supportive. Over the past few years, many studies have focused on solving this problem through hate speech detection, but very few focused on identifying hope speech. This paper discusses various machine learning approaches to identify a sentence as Hope Speech, Non-Hope Speech, or a Neutral sentence. The dataset used in the study contains English YouTube comments and is released as a part of the shared task "EACL-2021: Hope Speech Detection for Equality, Diversity, and Inclusion". Initially, the dataset obtained from the shared task had three classes: Hope Speech, non-Hope speech, and not in English; however, upon deeper inspection, we discovered that dataset relabeling is required. A group of undergraduates was hired to help perform the entire dataset's relabeling task. We experimented with conventional machine learning models (such as Na\"ive Bayes, logistic regression and support vector machine) and pre-trained models (such as BERT) on relabeled data. According to the experimental results, the relabeled data has achieved a better accuracy for Hope speech identification than the original data set.
    DeepJoin: Joinable Table Discovery with Pre-trained Language Models. (arXiv:2212.07588v1 [cs.DB])
    Due to the usefulness in data enrichment for data analysis tasks, joinable table discovery has become an important operation in data lake management. Existing approaches target equi-joins, the most common way of combining tables for creating a unified view, or semantic joins, which tolerate misspellings and different formats to deliver more join results. They are either exact solutions whose running time is linear in the sizes of query column and target table repository or approximate solutions lacking precision. In this paper, we propose Deepjoin, a deep learning model for accurate and efficient joinable table discovery. Our solution is an embedding-based retrieval, which employs a pre-trained language model (PLM) and is designed as one framework serving both equi- and semantic joins. We propose a set of contextualization options to transform column contents to a text sequence. The PLM reads the sequence and is fine-tuned to embed columns to vectors such that columns are expected to be joinable if they are close to each other in the vector space. Since the output of the PLM is fixed in length, the subsequent search procedure becomes independent of the column size. With a state-of-the-art approximate nearest neighbor search algorithm, the search time is logarithmic in the repository size. To train the model, we devise the techniques for preparing training data as well as data augmentation. The experiments on real datasets demonstrate that by training on a small subset of a corpus, Deepjoin generalizes to large datasets and its precision consistently outperforms other approximate solutions'. Deepjoin is even more accurate than an exact solution to semantic joins when evaluated with labels from experts. Moreover, when equipped with a GPU, Deepjoin is up to two orders of magnitude faster than existing solutions.
    Two-stage Contextual Transformer-based Convolutional Neural Network for Airway Extraction from CT Images. (arXiv:2212.07651v1 [eess.IV])
    Accurate airway extraction from computed tomography (CT) images is a critical step for planning navigation bronchoscopy and for the quantitative assessment of airway-related chronic obstructive pulmonary disease (COPD). Existing methods struggle to sufficiently segment the airway, especially the high-generation airways, under the constraint of limited labels, and cannot meet clinical use in COPD. We propose a novel two-stage 3D contextual transformer-based U-Net for airway segmentation using CT images. The method consists of two stages, performing initial and refined airway segmentation. The two stages share the same subnetwork, with different airway masks as input. A contextual transformer block is used in both the encoder and decoder paths of the subnetwork to effectively produce high-quality airway segmentation. In the first stage, the total airway mask and CT images are provided to the subnetwork; in the second stage, the intrapulmonary airway mask and corresponding CT scans are provided. The predictions of the two stages are then merged as the final prediction. Extensive experiments were performed on in-house and multiple public datasets. Quantitative and qualitative analysis demonstrates that our proposed method extracts many more branches and greater lengths of the tree while achieving state-of-the-art airway segmentation performance. The code is available at https://github.com/zhaozsq/airway_segmentation.
    Learning Cooperative Beamforming with Edge-Update Empowered Graph Neural Networks. (arXiv:2212.08020v1 [cs.NI])
    Cooperative beamforming design has been recognized as an effective approach in modern wireless networks to meet the dramatically increasing demand of various wireless data traffics. It is formulated as an optimization problem in conventional approaches and solved iteratively in an instance-by-instance manner. Recently, learning-based methods have emerged with real-time implementation by approximating the mapping function from the problem instances to the corresponding solutions. Among various neural network architectures, graph neural networks (GNNs) can effectively utilize the graph topology in wireless networks to achieve better generalization ability on unseen problem sizes. However, the current GNNs are only equipped with the node-update mechanism, which restricts it from modeling more complicated problems such as the cooperative beamforming design, where the beamformers are on the graph edges of wireless networks. To fill this gap, we propose an edge-graph-neural-network (Edge-GNN) by incorporating an edge-update mechanism into the GNN, which learns the cooperative beamforming on the graph edges. Simulation results show that the proposed Edge-GNN achieves higher sum rate with much shorter computation time than state-of-the-art approaches, and generalizes well to different numbers of base stations and user equipments.
    Spatially-resolved Thermometry from Line-of-Sight Emission Spectroscopy via Machine Learning. (arXiv:2212.07836v1 [cs.LG])
    A methodology is proposed, which addresses the caveat that line-of-sight emission spectroscopy presents in that it cannot provide spatially resolved temperature measurements in nonhomogeneous temperature fields. The aim of this research is to explore the use of data-driven models in measuring temperature distributions in a spatially resolved manner using emission spectroscopy data. Two categories of data-driven methods are analyzed: (i) Feature engineering and classical machine learning algorithms, and (ii) end-to-end convolutional neural networks (CNN). In total, combinations of fifteen feature groups and fifteen classical machine learning models, and eleven CNN models are considered and their performances explored. The results indicate that the combination of feature engineering and machine learning provides better performance than the direct use of CNN. Notably, feature engineering which is comprised of physics-guided transformation, signal representation-based feature extraction and Principal Component Analysis is found to be the most effective. Moreover, it is shown that when using the extracted features, the ensemble-based, light blender learning model offers the best performance with RMSE, RE, RRMSE and R values of 64.3, 0.017, 0.025 and 0.994, respectively. The proposed method, based on feature engineering and the light blender model, is capable of measuring nonuniform temperature distributions from low-resolution spectra, even when the species concentration distribution in the gas mixtures is unknown.
    Stochastic Zeroth order Descent with Structured Directions. (arXiv:2206.05124v2 [math.OC] UPDATED)
    We introduce and analyze Structured Stochastic Zeroth order Descent (S-SZD), a finite difference approach which approximates a stochastic gradient on a set of $l\leq d$ orthogonal directions, where $d$ is the dimension of the ambient space. These directions are randomly chosen, and may change at each step. For smooth convex functions we prove almost sure convergence of the iterates and a convergence rate on the function values of the form $O(d/l k^{-c})$ for every $c<1/2$, which is arbitrarily close to the one of Stochastic Gradient Descent (SGD) in terms of number of iterations. Our bound also shows the benefits of using $l$ multiple directions instead of one. For non-convex functions satisfying the Polyak-{\L}ojasiewicz condition, we establish the first convergence rates for stochastic zeroth order algorithms under such an assumption. We corroborate our theoretical findings in numerical simulations where assumptions are satisfied and on the real-world problem of hyper-parameter optimization, observing that S-SZD has very good practical performances.
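To make the idea concrete, a sketch of the general technique (not the authors' implementation): estimate the gradient by finite differences along l random orthonormal directions obtained from a QR factorization, rescaled by d/l so the estimator is unbiased in expectation, up to finite-difference error.

```python
# Sketch of structured zeroth-order gradient estimation along l orthonormal
# directions, in the spirit of S-SZD; a generic illustration, not the paper's code.
import numpy as np

def structured_zo_grad(f, x, l, h=1e-5, rng=None):
    rng = rng or np.random.default_rng()
    d = x.size
    # l random orthonormal directions via reduced QR of a Gaussian matrix.
    P, _ = np.linalg.qr(rng.standard_normal((d, l)))
    fx = f(x)
    g = np.zeros(d)
    for i in range(l):
        p = P[:, i]
        g += (f(x + h * p) - fx) / h * p  # directional derivative estimate
    return (d / l) * g  # rescale: E[sum p_i p_i^T] = (l/d) I for random frames

# Example: f(x) = ||x||^2 has true gradient 2x.
x = np.ones(4)
print(structured_zo_grad(lambda v: v @ v, x, l=2))
```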
    Reward Shaping for Human Learning via Inverse Reinforcement Learning. (arXiv:2002.10904v3 [cs.LG] UPDATED)
    Humans are spectacular reinforcement learners, constantly learning from and adjusting to experience and feedback. Unfortunately, this doesn't necessarily mean humans are fast learners. When tasks are challenging, learning can become unacceptably slow. Fortunately, humans do not have to learn tabula rasa, and learning speed can be greatly increased with learning aids. In this work we validate a new type of learning aid -- reward shaping for humans via inverse reinforcement learning (IRL). The goal of this aid is to increase the speed with which humans can learn good policies for specific tasks. Furthermore, this approach complements alternative machine learning techniques such as safety features that try to prevent individuals from making poor decisions. To achieve our results we first extend a well known IRL algorithm via kernel methods. Afterwards we conduct two human subjects experiments using an online game where players have limited time to learn a good policy. We show with statistical significance that players who receive our learning aid are able to approach desired policies more quickly than the control group.
    Transformers learn in-context by gradient descent. (arXiv:2212.07677v1 [cs.LG])
    Transformers have become the state-of-the-art neural network architecture across numerous domains of machine learning. This is partly due to their celebrated ability to transfer and to learn in-context based on few examples. Nevertheless, the mechanisms by which Transformers become in-context learners are not well understood and remain mostly an intuition. Here, we argue that training Transformers on auto-regressive tasks can be closely related to well-known gradient-based meta-learning formulations. We start by providing a simple weight construction that shows the equivalence of data transformations induced by 1) a single linear self-attention layer and by 2) gradient-descent (GD) on a regression loss. Motivated by that construction, we show empirically that when training self-attention-only Transformers on simple regression tasks either the models learned by GD and Transformers show great similarity or, remarkably, the weights found by optimization match the construction. Thus we show how trained Transformers implement gradient descent in their forward pass. This allows us, at least in the domain of regression problems, to mechanistically understand the inner workings of optimized Transformers that learn in-context. Furthermore, we identify how Transformers surpass plain gradient descent by an iterative curvature correction and learn linear models on deep data representations to solve non-linear regression tasks. Finally, we discuss intriguing parallels to a mechanism identified to be crucial for in-context learning termed induction-head (Olsson et al., 2022) and show how it could be understood as a specific case of in-context learning by gradient descent learning within Transformers.
    Automated Reachability Analysis of Neural Network-Controlled Systems via Adaptive Polytopes. (arXiv:2212.07553v1 [eess.SY])
    Over-approximating the reachable sets of dynamical systems is a fundamental problem in safety verification and robust control synthesis. The representation of these sets is a key factor that affects the computational complexity and the approximation error. In this paper, we develop a new approach for over-approximating the reachable sets of neural network dynamical systems using adaptive template polytopes. We use the singular value decomposition of linear layers along with the shape of the activation functions to adapt the geometry of the polytopes at each time step to the geometry of the true reachable sets. We then propose a branch-and-bound method to compute accurate over-approximations of the reachable sets by the inferred templates. We illustrate the utility of the proposed approach in the reachability analysis of linear systems driven by neural network controllers.
    SHAQ: Incorporating Shapley Value Theory into Multi-Agent Q-Learning. (arXiv:2105.15013v6 [cs.LG] UPDATED)
    Value factorisation is a useful technique for multi-agent reinforcement learning (MARL) in the global reward game; however, its underlying mechanism is not yet fully understood. This paper studies a theoretical framework for value factorisation with interpretability via Shapley value theory. We generalise the Shapley value to the Markov convex game, calling the result the Markov Shapley value (MSV), and apply it as a value factorisation method in the global reward game, which is obtained by the equivalence between the two games. Based on the properties of MSV, we derive the Shapley-Bellman optimality equation (SBOE) to evaluate the optimal MSV, which corresponds to an optimal joint deterministic policy. Furthermore, we propose the Shapley-Bellman operator (SBO), which is proved to solve the SBOE. With a stochastic approximation and some transformations, a new MARL algorithm called Shapley Q-learning (SHAQ) is established, the implementation of which is guided by the theoretical results of SBO and MSV. We also discuss the relationship between SHAQ and relevant value factorisation methods. In the experiments, SHAQ exhibits not only superior performance on all tasks but also interpretability that agrees with the theoretical analysis. The implementation of this paper is at https://github.com/hsvgbkhgbv/shapley-q-learning.
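    For readers unfamiliar with the quantity being generalised here, the following sketch computes exact Shapley values for a toy three-agent cooperative game by averaging each agent's marginal contribution over all join orders; the characteristic function v is made up for illustration and is unrelated to the paper's experiments.

        from itertools import permutations

        # Characteristic function of a toy 3-agent game (illustrative values)
        v = {frozenset(): 0, frozenset('A'): 1, frozenset('B'): 2, frozenset('C'): 2,
             frozenset('AB'): 4, frozenset('AC'): 4, frozenset('BC'): 5, frozenset('ABC'): 8}

        def shapley(agents):
            phi = {a: 0.0 for a in agents}
            orders = list(permutations(agents))
            for order in orders:
                coalition = frozenset()
                for a in order:                  # marginal contribution of agent a
                    phi[a] += v[coalition | {a}] - v[coalition]
                    coalition = coalition | {a}
            return {a: phi[a] / len(orders) for a in agents}

        print(shapley('ABC'))  # efficiency: the three values sum to v(ABC) = 8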
    Driver Assistance Eco-driving and Transmission Control with Deep Reinforcement Learning. (arXiv:2212.07594v1 [eess.SY])
    With the growing need to reduce energy consumption and greenhouse gas emissions, Eco-driving strategies provide a significant opportunity for additional fuel savings on top of other technological solutions being pursued in the transportation sector. In this paper, a model-free deep reinforcement learning (RL) control agent is proposed for active Eco-driving assistance that trades off fuel consumption against other driver-accommodation objectives, and learns optimal traction torque and transmission shifting policies from experience. The training scheme for the proposed RL agent uses an off-policy actor-critic architecture that iteratively performs policy evaluation with a multi-step return and policy improvement with the maximum a posteriori policy optimization algorithm for hybrid action spaces. The proposed Eco-driving RL agent is implemented on a commercial vehicle in car-following traffic. It shows superior performance in minimizing fuel consumption compared to a baseline controller that has full knowledge of fuel-efficiency tables.  ( 2 min )
    AirfRANS: High Fidelity Computational Fluid Dynamics Dataset for Approximating Reynolds-Averaged Navier-Stokes Solutions. (arXiv:2212.07564v1 [cs.LG])
    Surrogate models are necessary to optimize meaningful quantities in physical dynamics as their recursive numerical resolutions are often prohibitively expensive. This is mainly the case for fluid dynamics and the resolution of the Navier-Stokes equations. However, despite the fast-growing field of data-driven models for physical systems, reference datasets representing real-world phenomena are lacking. In this work, we develop AirfRANS, a dataset for studying the two-dimensional incompressible steady-state Reynolds-Averaged Navier-Stokes equations over airfoils at a subsonic regime and for different angles of attack. We also introduce metrics on the stress forces at the surface of geometries and visualization of boundary layers to assess the capabilities of models to accurately predict the meaningful information of the problem. Finally, we propose deep learning baselines on four machine learning tasks to study AirfRANS under different constraints for generalization considerations: big and scarce data regimes, Reynolds number extrapolation, and angle of attack extrapolation.
    Ungeneralizable Contextual Logistic Bandit in Credit Scoring. (arXiv:2212.07632v1 [stat.ML])
    The application of reinforcement learning in credit scoring has created a unique setting for the contextual logistic bandit that does not conform to the usual exploration-exploitation tradeoff but rather favors exploration-free algorithms. Through sufficient randomness in a pool of observable contexts, the reinforcement learning agent can simultaneously exploit the action with the highest reward while still learning more about the structure governing that environment. Thus, greedy algorithms consistently outperform algorithms with efficient exploration, such as Thompson sampling. However, in a more pragmatic credit-scoring scenario, lenders can, to a degree, classify each borrower as a separate group, and learning about the characteristics of one group conveys no information about another. Through extensive simulations, we show that Thompson sampling dominates greedy algorithms given enough timesteps, a number that grows with the complexity of the underlying features.
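    The greedy-versus-Thompson contrast at the heart of the abstract can be reproduced in a much simpler, non-contextual setting. The sketch below is a two-armed Bernoulli bandit with Beta(1, 1) priors, not the paper's logistic model: greedy acts on posterior means and can lock onto the inferior arm, while Thompson sampling keeps exploring via posterior draws.

        import numpy as np

        rng = np.random.default_rng(1)
        p = np.array([0.45, 0.55])              # true success rates; arm 1 is better

        def run(policy, T=5000):
            s, f = np.ones(2), np.ones(2)       # Beta posterior parameters per arm
            total = 0
            for _ in range(T):
                est = s / (s + f) if policy == 'greedy' else rng.beta(s, f)
                a = int(np.argmax(est))
                r = rng.random() < p[a]
                s[a] += r
                f[a] += 1 - r
                total += r
            return total

        for policy in ('greedy', 'thompson'):
            print(policy, np.mean([run(policy) for _ in range(20)]))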
    The effects of gender bias in word embeddings on depression prediction. (arXiv:2212.07852v1 [cs.CL])
    Word embeddings are extensively used in various NLP problems as a state-of-the-art semantic feature vector representation. Despite their success on various tasks and domains, they might exhibit an undesired bias for stereotypical categories due to statistical and societal biases that exist in the dataset they are trained on. In this study, we analyze the gender bias in four different pre-trained word embeddings specifically for the depression category in the mental disorder domain. We use contextual and non-contextual embeddings that are trained on domain-independent as well as clinical domain-specific data. We observe that embeddings carry bias for depression towards different gender groups depending on the type of embeddings. Moreover, we demonstrate that these undesired correlations are transferred to the downstream task for depression phenotype recognition. We find that data augmentation by simply swapping gender words mitigates the bias significantly in the downstream task.
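    A minimal sketch of the counterfactual augmentation mentioned in the last sentence: each training text is mirrored by swapping gendered words. The whitespace tokenization and the tiny swap list are illustrative; a real pipeline needs POS information to disambiguate words like "her" (him/his) and a far larger lexicon.

        # Illustrative gender-swap augmentation (naive tokenization, toy lexicon)
        SWAPS = {'he': 'she', 'she': 'he', 'him': 'her', 'her': 'him',
                 'his': 'her', 'man': 'woman', 'woman': 'man', 'mr.': 'ms.', 'ms.': 'mr.'}

        def swap_gender(text):
            return ' '.join(SWAPS.get(tok, tok) for tok in text.lower().split())

        corpus = ["He said his depression worsened", "She talked to her doctor"]
        augmented = corpus + [swap_gender(t) for t in corpus]
        print(augmented)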
    Curriculum Learning Meets Weakly Supervised Modality Correlation Learning. (arXiv:2212.07619v1 [cs.LG])
    In the field of multimodal sentiment analysis (MSA), a few studies have leveraged the inherent modality correlation information stored in samples for self-supervised learning. However, they feed the training pairs in a random order without consideration of difficulty. Without human annotation, the generated training pairs of self-supervised learning often contain noise. If noisy or hard pairs are used for training at the easy stage, the model might get stuck in a bad local optimum. In this paper, we inject curriculum learning into weakly supervised modality correlation learning. The weakly supervised correlation learning leverages the label information to generate scores for negative pairs to learn a more discriminative embedding space, where negative pairs are defined as two unimodal embeddings from different samples. To assist the correlation learning, we feed the training pairs to the model according to difficulty via the proposed curriculum learning, which consists of elaborately designed scoring and feeding functions. The scoring function computes the difficulty of pairs using pre-trained and current correlation predictors, where pairs with large losses are defined as hard pairs. Notably, the hardest pairs are discarded in our algorithm, as they are assumed to be noisy pairs. Moreover, the feeding function takes the difference of correlation losses as feedback to determine the feeding actions (`stay', `step back', or `step forward'); a schematic version is sketched below. The proposed method reaches state-of-the-art performance on MSA.
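    A schematic version of that feeding function, with the decision driven by the change in correlation loss; the action names are taken from the abstract, while the numeric tolerance is an invented placeholder.

        # Toy feeding function: compare consecutive correlation losses and decide
        # whether the curriculum stays, steps back, or steps forward.
        def feeding_action(prev_loss, curr_loss, tol=0.01):
            delta = curr_loss - prev_loss
            if delta > tol:          # loss rising: current pairs too hard
                return 'step back'
            if delta < -tol:         # loss falling fast: ready for harder pairs
                return 'step forward'
            return 'stay'            # plateau: keep training at this difficulty

        assert feeding_action(0.50, 0.40) == 'step forward'
        assert feeding_action(0.40, 0.48) == 'step back'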
    Distributed-Training-and-Execution Multi-Agent Reinforcement Learning for Power Control in HetNet. (arXiv:2212.07967v1 [eess.SY])
    In heterogeneous networks (HetNets), the overlap of small cells and the macro cell causes severe cross-tier interference. Although there exist some approaches to address this problem, they usually require global channel state information, which is hard to obtain in practice, and get the sub-optimal power allocation policy with high computational complexity. To overcome these limitations, we propose a multi-agent deep reinforcement learning (MADRL) based power control scheme for the HetNet, where each access point makes power control decisions independently based on local information. To promote cooperation among agents, we develop a penalty-based Q learning (PQL) algorithm for MADRL systems. By introducing regularization terms in the loss function, each agent tends to choose an experienced action with high reward when revisiting a state, and thus the policy updating speed slows down. In this way, an agent's policy can be learned by other agents more easily, resulting in a more efficient collaboration process. We then implement the proposed PQL in the considered HetNet and compare it with other distributed-training-and-execution (DTE) algorithms. Simulation results show that our proposed PQL can learn the desired power control policy from a dynamic environment where the locations of users change episodically and outperform existing DTE MADRL algorithms.
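    The abstract does not give the exact penalty, so the following is only a guess at its shape for illustration: a standard TD loss plus a regularizer that pulls Q(s, a) toward the best value already experienced in that state, which slows policy churn and makes each agent's behavior easier for others to track. The network interface, batch format, and penalty weight are assumptions, not the paper's PQL.

        import torch

        # Hypothetical penalty-based Q-learning loss (sketch, not the paper's exact PQL)
        def pql_loss(q_net, target_net, batch, gamma=0.99, penalty_weight=0.1):
            s, a, r, s_next, best_seen_q = batch   # best_seen_q: running max Q seen in s
            q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
            with torch.no_grad():
                td_target = r + gamma * target_net(s_next).max(dim=1).values
            td_loss = torch.nn.functional.mse_loss(q, td_target)
            penalty = torch.nn.functional.mse_loss(q, best_seen_q)  # regularization term
            return td_loss + penalty_weight * penalty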
    Real-Time Neural Light Field on Mobile Devices. (arXiv:2212.08057v1 [cs.CV])
    Recent efforts in Neural Radiance Fields (NeRF) have shown impressive results on novel view synthesis by utilizing implicit neural representation to represent 3D scenes. Due to the process of volumetric rendering, the inference speed for NeRF is extremely slow, limiting the application scenarios of utilizing NeRF on resource-constrained hardware, such as mobile devices. Many works have been conducted to reduce the latency of running NeRF models. However, most of them still require a high-end GPU for acceleration or extra storage memory, neither of which is available on mobile devices. Another emerging direction utilizes the neural light field (NeLF) for speedup, as only one forward pass is performed on a ray to predict the pixel color. Nevertheless, to reach a similar rendering quality as NeRF, the network in NeLF is designed with intensive computation, which is not mobile-friendly. In this work, we propose an efficient network that runs in real-time on mobile devices for neural rendering. We follow the setting of NeLF to train our network. Unlike existing works, we introduce a novel network architecture that runs efficiently on mobile devices with low latency and small size, i.e., saving $15\times \sim 24\times$ storage compared with MobileNeRF. Our model achieves high-resolution generation while maintaining real-time inference for both synthetic and real-world scenes on mobile devices, e.g., $18.04$ms (iPhone 13) for rendering one $1008\times756$ image of real 3D scenes. Additionally, we achieve similar image quality as NeRF and better quality than MobileNeRF (PSNR $26.15$ vs. $25.91$ on the real-world forward-facing dataset).  ( 2 min )
    Generative structured normalizing flow Gaussian processes applied to spectroscopic data. (arXiv:2212.07554v1 [cs.LG])
    In this work, we propose a novel generative model for mapping inputs to structured, high-dimensional outputs using structured conditional normalizing flows and Gaussian process regression. The model is motivated by the need to characterize uncertainty in the input/output relationship when making inferences on new data. In particular, in the physical sciences, limited training data may not adequately characterize future observed data; it is critical that models adequately indicate uncertainty, particularly when they may be asked to extrapolate. In our proposed model, structured conditional normalizing flows provide parsimonious latent representations that relate to the inputs through a Gaussian process, providing exact likelihood calculations and uncertainty that naturally increases away from the training data inputs. We demonstrate the methodology on laser-induced breakdown spectroscopy data from the ChemCam instrument onboard the Mars rover Curiosity. ChemCam was designed to recover the chemical composition of rock and soil samples by measuring the spectral properties of plasma atomic emissions induced by a laser pulse. We show that our model can generate realistic spectra conditional on a given chemical composition and that we can use the model to perform uncertainty quantification of chemical compositions for new observed spectra. Based on our results, we anticipate that our proposed modeling approach may be useful in other scientific domains with high-dimensional, complex structure where it is important to quantify predictive uncertainty.  ( 2 min )
    DOC-NAD: A Hybrid Deep One-class Classifier for Network Anomaly Detection. (arXiv:2212.07558v1 [cs.CR])
    Machine Learning (ML) approaches have been used to enhance the detection capabilities of Network Intrusion Detection Systems (NIDSs). Recent work has achieved near-perfect performance on binary- and multi-class network anomaly detection tasks. Such systems depend on the availability of both network data classes (benign and malicious) during the training phase. However, attack data samples are often challenging to collect in most organisations due to security controls preventing the penetration of known malicious traffic into their networks. Therefore, this paper proposes a Deep One-Class (DOC) classifier for network intrusion detection trained only on benign network data samples. The novel one-class classification architecture consists of a histogram-based deep feed-forward classifier that extracts useful network data features and applies efficient outlier detection. The DOC classifier has been extensively evaluated using two benchmark NIDS datasets. The results demonstrate its superiority over current state-of-the-art one-class classifiers in terms of detection and false positive rates.  ( 2 min )
    Residual Policy Learning for Powertrain Control. (arXiv:2212.07611v1 [eess.SY])
    Eco-driving strategies have been shown to provide significant reductions in fuel consumption. This paper outlines an active driver assistance approach that uses a residual policy learning (RPL) agent trained to provide residual actions to default powertrain controllers while balancing fuel consumption against other driver-accommodation objectives. Using previous experiences, our RPL agent learns improved traction torque and gear shifting residual policies to adapt the operation of the powertrain to variations and uncertainties in the environment. For comparison, we consider a traditional reinforcement learning (RL) agent trained from scratch. Both agents employ the off-policy Maximum A Posteriori Policy Optimization algorithm with an actor-critic architecture. By implementing both agents on a simulated commercial vehicle in various car-following scenarios, we find that the RPL agent quickly learns policies that significantly improve on a baseline source policy, though on some measures they fall short of what the RL agent trained from scratch eventually achieves.
    Py-Feat: Python Facial Expression Analysis Toolbox. (arXiv:2104.03509v2 [cs.CV] UPDATED)
    Studying facial expressions is a notoriously difficult endeavor. Recent advances in the field of affective computing have yielded impressive progress in automatically detecting facial expressions from pictures and videos. However, much of this work has yet to be widely disseminated in social science domains such as psychology. Current state-of-the-art models require considerable domain expertise that is not traditionally incorporated into social science training programs. Furthermore, there is a notable absence of user-friendly and open-source software that provides a comprehensive set of tools and functions that support facial expression research. In this paper, we introduce Py-Feat, an open-source Python toolbox that provides support for detecting, preprocessing, analyzing, and visualizing facial expression data. Py-Feat makes it easy for domain experts to disseminate and benchmark computer vision models and also for end users to quickly process, analyze, and visualize facial expression data. We hope this platform will facilitate increased use of facial expression data in human behavior research.
    Physics-Informed Neural Networks for Material Model Calibration from Full-Field Displacement Data. (arXiv:2212.07723v1 [cs.LG])
    The identification of material parameters occurring in constitutive models has a wide range of applications in practice. One of these applications is the monitoring and assessment of the actual condition of infrastructure buildings, as the material parameters directly reflect the resistance of the structures to external impacts. Physics-informed neural networks (PINNs) have recently emerged as a suitable method for solving inverse problems. One advantage of this method is the straightforward inclusion of observation data: unlike grid-based methods, such as the finite element model updating (FEMU) approach, no computational grid and no interpolation of the data are required. In the current work, we aim to further develop PINNs towards the calibration of the linear-elastic constitutive model from full-field displacement and global force data in a realistic regime. We show that normalization and conditioning of the optimization problem play a crucial role in this process. Therefore, among others, we identify the material parameters for initial estimates and balance the individual terms in the loss function. In order to reduce the dependence of the identified material parameters on local errors in the displacement approximation, we base the identification not on the stress boundary conditions but instead on the global balance of internal and external work. In addition, we found that we obtain a better-posed inverse problem if we reformulate it in terms of bulk and shear modulus instead of Young's modulus and Poisson's ratio. We demonstrate that the enhanced PINNs are capable of identifying material parameters from both experimental one-dimensional data and synthetic full-field displacement data in a realistic regime. Since displacement data measured by, e.g., a digital image correlation (DIC) system is noisy, we additionally investigate the robustness of the method to different levels of noise.
    Multimodal Teacher Forcing for Reconstructing Nonlinear Dynamical Systems. (arXiv:2212.07892v1 [cs.LG])
    Many, if not most, systems of interest in science are naturally described as nonlinear dynamical systems (DS). Empirically, we commonly access these systems through time series measurements, where often we have time series from different types of data modalities simultaneously. For instance, we may have event counts in addition to some continuous signal. While by now there are many powerful machine learning (ML) tools for integrating different data modalities into predictive models, this has rarely been approached so far from the perspective of uncovering the underlying, data-generating DS (aka DS reconstruction). Recently, sparse teacher forcing (TF) has been suggested as an efficient control-theoretic method for dealing with exploding loss gradients when training ML models on chaotic DS. Here we incorporate this idea into a novel recurrent neural network (RNN) training framework for DS reconstruction based on multimodal variational autoencoders (MVAE). The forcing signal for the RNN is generated by the MVAE which integrates different types of simultaneously given time series data into a joint latent code optimal for DS reconstruction. We show that this training method achieves significantly better reconstructions on multimodal datasets generated from chaotic DS benchmarks than various alternative methods.
    Dual Quaternion Ambisonics Array for Six-Degree-of-Freedom Acoustic Representation. (arXiv:2204.01851v2 [eess.AS] UPDATED)
    Spatial audio methods are gaining a growing interest due to the spread of immersive audio experiences and applications, such as virtual and augmented reality. For these purposes, 3D audio signals are often acquired through arrays of Ambisonics microphones, each comprising four capsules that decompose the sound field in spherical harmonics. In this paper, we propose a dual quaternion representation of the spatial sound field acquired through an array of two First Order Ambisonics (FOA) microphones. The audio signals are encapsulated in a dual quaternion that leverages quaternion algebra properties to exploit correlations among them. This augmented representation with 6 degrees of freedom (6DOF) involves a more accurate coverage of the sound field, resulting in a more precise sound localization and a more immersive audio experience. We evaluate our approach on a sound event localization and detection (SELD) benchmark. We show that our dual quaternion SELD model with temporal convolution blocks (DualQSELD-TCN) achieves better results with respect to real and quaternion-valued baselines thanks to our augmented representation of the sound field. Full code is available at: https://github.com/ispamm/DualQSELD-TCN.
    SMACv2: An Improved Benchmark for Cooperative Multi-Agent Reinforcement Learning. (arXiv:2212.07489v1 [cs.LG])
    The availability of challenging benchmarks has played a key role in the recent progress of machine learning. In cooperative multi-agent reinforcement learning, the StarCraft Multi-Agent Challenge (SMAC) has become a popular testbed for centralised training with decentralised execution. However, after years of sustained improvement on SMAC, algorithms now achieve near-perfect performance. In this work, we conduct new analysis demonstrating that SMAC is not sufficiently stochastic to require complex closed-loop policies. In particular, we show that an open-loop policy conditioned only on the timestep can achieve non-trivial win rates for many SMAC scenarios. To address this limitation, we introduce SMACv2, a new version of the benchmark where scenarios are procedurally generated and require agents to generalise to previously unseen settings (from the same distribution) during evaluation. We show that these changes ensure the benchmark requires the use of closed-loop policies. We evaluate state-of-the-art algorithms on SMACv2 and show that it presents significant challenges not present in the original benchmark. Our analysis illustrates that SMACv2 addresses the discovered deficiencies of SMAC and can help benchmark the next generation of MARL methods. Videos of training are available at https://sites.google.com/view/smacv2
    Machine Learning Coarse-Grained Potentials of Protein Thermodynamics. (arXiv:2212.07492v1 [q-bio.BM])
    A generalized understanding of protein dynamics is an unsolved scientific problem, the solution of which is critical to the interpretation of the structure-function relationships that govern essential biological processes. Here, we approach this problem by constructing coarse-grained molecular potentials based on artificial neural networks and grounded in statistical mechanics. For training, we build a unique dataset of unbiased all-atom molecular dynamics simulations of approximately 9 ms for twelve different proteins with multiple secondary structure arrangements. The coarse-grained models are capable of accelerating the dynamics by more than three orders of magnitude while preserving the thermodynamics of the systems. Coarse-grained simulations identify relevant structural states in the ensemble with comparable energetics to the all-atom systems. Furthermore, we show that a single coarse-grained potential can integrate all twelve proteins and can capture experimental structural features of mutated proteins. These results indicate that machine learning coarse-grained potentials could provide a feasible approach to simulate and understand protein dynamics.
    A Data Source Dependency Analysis Framework for Large Scale Data Science Projects. (arXiv:2212.07951v1 [cs.SE])
    Dependency hell is a well-known pain point in the development of large software projects, and machine learning (ML) code bases are not immune from it. In fact, ML applications suffer from an additional form, namely, "data source dependency hell". This term refers to the central role played by data and its unique quirks that often lead to unexpected failures of ML models which cannot be explained by code changes. In this paper, we present an automated dependency mapping framework that allows MLOps engineers to monitor the whole dependency map of their models in a fast-paced engineering environment and thus mitigate ahead of time the consequences of any data source changes (e.g., re-training the model, ignoring the data, setting default data, etc.). Our system is based on a unified and generic approach, employing techniques from static analysis, with which data sources can be identified reliably for any type of dependency on a wide range of source languages and artefacts. The dependency mapping framework is exposed as a REST web API where the only input is the path to the Git repository hosting the code base. Currently used by MLOps engineers at Microsoft, we expect such dependency map APIs to be adopted more widely by MLOps engineers in the future.
    ReDDIT: Regret Detection and Domain Identification from Text. (arXiv:2212.07549v1 [cs.CL])
    In this paper, we present a study of regret and its expression on social media platforms. Specifically, we present a novel dataset of Reddit texts that have been classified into three classes: Regret by Action, Regret by Inaction, and No Regret. We then use this dataset to investigate the language used to express regret on Reddit and to identify the domains of text that are most commonly associated with regret. Our findings show that Reddit users are most likely to express regret for past actions, particularly in the domain of relationships. We also found that deep learning models using GloVe embedding outperformed other models in all experiments, indicating the effectiveness of GloVe for representing the meaning and context of words in the domain of regret. Overall, our study provides valuable insights into the nature and prevalence of regret on social media, as well as the potential of deep learning and word embeddings for analyzing and understanding emotional language in online text. These findings have implications for the development of natural language processing algorithms and the design of social media platforms that support emotional expression and communication.
    Convergent Data-driven Regularizations for CT Reconstruction. (arXiv:2212.07786v1 [math.NA])
    The reconstruction of images from their corresponding noisy Radon transform is a typical example of an ill-posed linear inverse problem as arising in the application of computerized tomography (CT). As the (na\"{\i}ve) solution does not depend on the measured data continuously, regularization is needed to re-establish a continuous dependence. In this work, we investigate simple, but yet still provably convergent approaches to learning linear regularization methods from data. More specifically, we analyze two approaches: One generic linear regularization that learns how to manipulate the singular values of the linear operator in an extension of [1], and one tailored approach in the Fourier domain that is specific to CT-reconstruction. We prove that such approaches become convergent regularization methods as well as the fact that the reconstructions they provide are typically much smoother than the training data they were trained on. Finally, we compare the spectral as well as the Fourier-based approaches for CT-reconstruction numerically, discuss their advantages and disadvantages and investigate the effect of discretization errors at different resolutions.
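    The "learn how to manipulate the singular values" idea admits a compact illustration. Below, a diagonal spectral filter is fit to training pairs by least squares, one coefficient per singular component, and then used for reconstruction; this is a generic instance of the approach, not the paper's estimator or its convergence analysis.

        import numpy as np

        rng = np.random.default_rng(0)
        n = 50
        A = rng.normal(size=(n, n)) / np.sqrt(n)      # stand-in forward operator
        U, S, Vt = np.linalg.svd(A)

        x_train = rng.normal(size=(n, 100))
        y_train = A @ x_train + 0.05 * rng.normal(size=(n, 100))

        # Fit one filter coefficient f_i per singular value by least squares;
        # reconstruction is x_hat = V diag(f / S) U^T y
        cy = U.T @ y_train                            # data components
        cx = Vt @ x_train                             # ground-truth components
        f = np.sum(cx * cy, axis=1) * S / np.sum(cy ** 2, axis=1)
        x_hat = Vt.T @ ((f / S)[:, None] * cy)
        print('relative error:', np.linalg.norm(x_hat - x_train) / np.linalg.norm(x_train))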
    Learning threshold neurons via the "edge of stability". (arXiv:2212.07469v1 [cs.LG])
    Existing analyses of neural network training often operate under the unrealistic assumption of an extremely small learning rate. This lies in stark contrast to practical wisdom and empirical studies, such as the work of J. Cohen et al. (ICLR 2021), which exhibit startling new phenomena (the "edge of stability" or "unstable convergence") and potential benefits for generalization in the large learning rate regime. Despite a flurry of recent works on this topic, however, the latter effect is still poorly understood. In this paper, we take a step towards understanding genuinely non-convex training dynamics with large learning rates by performing a detailed analysis of gradient descent for simplified models of two-layer neural networks. For these models, we provably establish the edge of stability phenomenon and discover a sharp phase transition for the step size below which the neural network fails to learn "threshold-like" neurons (i.e., neurons with a non-zero first-layer bias). This elucidates one possible mechanism by which the edge of stability can in fact lead to better generalization, as threshold neurons are basic building blocks with useful inductive bias for many tasks.
    Guiding continuous operator learning through Physics-based boundary constraints. (arXiv:2212.07477v1 [cs.LG])
    Boundary conditions (BCs) are important groups of physics-enforced constraints that are necessary for solutions of Partial Differential Equations (PDEs) to satisfy at specific spatial locations. These constraints carry important physical meaning, and guarantee the existence and the uniqueness of the PDE solution. Current neural-network based approaches that aim to solve PDEs rely only on training data to help the model learn BCs implicitly. There is no guarantee of BC satisfaction by these models during evaluation. In this work, we propose Boundary enforcing Operator Network (BOON) that enables the BC satisfaction of neural operators by making structural changes to the operator kernel. We provide our refinement procedure, and demonstrate the satisfaction of physics-based BCs, e.g. Dirichlet, Neumann, and periodic by the solutions obtained by BOON. Numerical experiments based on multiple PDEs with a wide variety of applications indicate that the proposed approach ensures satisfaction of BCs, and leads to more accurate solutions over the entire domain. The proposed correction method exhibits a (2X-20X) improvement over a given operator model in relative $L^2$ error (0.000084 relative $L^2$ error for Burgers' equation).
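    BOON enforces BCs by modifying the operator kernel itself; a simpler structural trick in the same spirit, sketched below, bakes a Dirichlet condition u(0)=a, u(1)=b into the output transformation so it holds exactly for any network weights. The MLP and the 1-D domain are illustrative assumptions, not the paper's method.

        import torch

        net = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.Tanh(),
                                  torch.nn.Linear(32, 1))

        def u(x, a=0.0, b=1.0):
            # boundary interpolant plus a factor x(1-x) that vanishes on the boundary
            return a * (1 - x) + b * x + x * (1 - x) * net(x)

        x = torch.tensor([[0.0], [0.5], [1.0]])
        print(u(x))   # first and last entries equal a and b exactly, by construction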
    FreCDo: A Large Corpus for French Cross-Domain Dialect Identification. (arXiv:2212.07707v1 [cs.CL])
    We present a novel corpus for French dialect identification comprising 413,522 French text samples collected from public news websites in Belgium, Canada, France and Switzerland. To ensure an accurate estimation of the dialect identification performance of models, we designed the corpus to eliminate potential biases related to topic, writing style, and publication source. More precisely, the training, validation and test splits are collected from different news websites, while searching for different keywords (topics). This leads to a French cross-domain (FreCDo) dialect identification task. We conduct experiments with four competitive baselines, a fine-tuned CamemBERT model, an XGBoost based on fine-tuned CamemBERT features, a Support Vector Machines (SVM) classifier based on fine-tuned CamemBERT features, and an SVM based on word n-grams. Aside from presenting quantitative results, we also make an analysis of the most discriminative features learned by CamemBERT. Our corpus is available at https://github.com/MihaelaGaman/FreCDo.
    Decomposing a Recurrent Neural Network into Modules for Enabling Reusability and Replacement. (arXiv:2212.05970v2 [cs.SE] UPDATED)
    Can we take a recurrent neural network (RNN) trained to translate between languages and augment it to support a new natural language without retraining the model from scratch? Can we fix the faulty behavior of the RNN by replacing portions associated with the faulty behavior? Recent works on decomposing a fully connected neural network (FCNN) and convolutional neural network (CNN) into modules have shown the value of engineering deep models in this manner, which is standard in traditional SE but foreign for deep learning models. However, prior works focus on the image-based multiclass classification problems and cannot be applied to RNN due to (a) different layer structures, (b) loop structures, (c) different types of input-output architectures, and (d) usage of both nonlinear and logistic activation functions. In this work, we propose the first approach to decompose an RNN into modules. We study different types of RNNs, i.e., Vanilla, LSTM, and GRU. Further, we show how such RNN modules can be reused and replaced in various scenarios. We evaluate our approach against 5 canonical datasets (i.e., Math QA, Brown Corpus, Wiki-toxicity, Clinc OOS, and Tatoeba) and 4 model variants for each dataset. We found that decomposing a trained model has a small cost (Accuracy: -0.6%, BLEU score: +0.10%). Also, the decomposed modules can be reused and replaced without needing to retrain.
    Vision Transformers are Parameter-Efficient Audio-Visual Learners. (arXiv:2212.07983v1 [cs.CV])
    Vision transformers (ViTs) have achieved impressive results on various computer vision tasks in the last several years. In this work, we study the capability of frozen ViTs, pretrained only on visual data, to generalize to audio-visual data without finetuning any of its original parameters. To do so, we propose a latent audio-visual hybrid (LAVISH) adapter that adapts pretrained ViTs to audio-visual tasks by injecting a small number of trainable parameters into every layer of a frozen ViT. To efficiently fuse visual and audio cues, our LAVISH adapter uses a small set of latent tokens, which form an attention bottleneck, thus, eliminating the quadratic cost of standard cross-attention. Compared to the existing modality-specific audio-visual methods, our approach achieves competitive or even better performance on various audio-visual tasks while using fewer tunable parameters and without relying on costly audio pretraining or external audio encoders. Our code is available at https://genjib.github.io/project_page/LAVISH/
    Can REF output quality scores be assigned by AI? Experimental evidence. (arXiv:2212.08041v1 [cs.CY])
    This document describes strategies for using Artificial Intelligence (AI) to predict some journal article scores in future research assessment exercises. Five strategies have been assessed.
    DAMP: Doubly Aligned Multilingual Parser for Task-Oriented Dialogue. (arXiv:2212.08054v1 [cs.CL])
    Modern virtual assistants use internal semantic parsing engines to convert user utterances to actionable commands. However, prior work has demonstrated that semantic parsing is a difficult multilingual transfer task with low transfer efficiency compared to other tasks. In global markets such as India and Latin America, this is a critical issue as switching between languages is prevalent for bilingual users. In this work we dramatically improve the zero-shot performance of a multilingual and codeswitched semantic parsing system using two stages of multilingual alignment. First, we show that contrastive alignment pretraining improves both English performance and transfer efficiency. We then introduce a constrained optimization approach for hyperparameter-free adversarial alignment during finetuning. Our Doubly Aligned Multilingual Parser (DAMP) improves mBERT transfer performance by 3x, 6x, and 81x on the Spanglish, Hinglish and Multilingual Task Oriented Parsing benchmarks respectively and outperforms XLM-R and mT5-Large using 3.2x fewer parameters.
    Scaling Marginalized Importance Sampling to High-Dimensional State-Spaces via State Abstraction. (arXiv:2212.07486v1 [cs.LG])
    We consider the problem of off-policy evaluation (OPE) in reinforcement learning (RL), where the goal is to estimate the performance of an evaluation policy, $\pi_e$, using a fixed dataset, $\mathcal{D}$, collected by one or more policies that may be different from $\pi_e$. Current OPE algorithms may produce poor OPE estimates under policy distribution shift, i.e., when the probability of a particular state-action pair occurring under $\pi_e$ is very different from the probability of that same pair occurring in $\mathcal{D}$ (Voloshin et al. 2021, Fu et al. 2021). In this work, we propose to improve the accuracy of OPE estimators by projecting the high-dimensional state-space into a low-dimensional state-space using concepts from the state abstraction literature. Specifically, we consider marginalized importance sampling (MIS) OPE algorithms which compute state-action distribution correction ratios to produce their OPE estimate. In the original ground state-space, these ratios may have high variance which may lead to high variance OPE. However, we prove that in the lower-dimensional abstract state-space the ratios can have lower variance, resulting in lower variance OPE. We then highlight the challenges that arise when estimating the abstract ratios from data, identify sufficient conditions to overcome these issues, and present a minimax optimization problem whose solution yields these abstract ratios. Finally, our empirical evaluation on difficult, high-dimensional state-space OPE tasks shows that the abstract ratios can make MIS OPE estimators achieve lower mean-squared error and be more robust to hyperparameter tuning than the ground ratios.
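    The shape of an abstracted MIS estimate is easy to convey in code: apply a state abstraction phi before looking up the distribution-correction ratio, then average ratio-weighted rewards. The abstraction, the hard-coded ratio table, and the per-step weighting below are toy stand-ins for the paper's minimax-learned abstract ratios.

        import numpy as np

        def mis_estimate(states, actions, rewards, ratio, phi):
            # ratio[(phi(s), a)] estimates the d^{pi_e}/d^D correction weight
            w = np.array([ratio[(phi(s), a)] for s, a in zip(states, actions)])
            return float(np.mean(w * np.array(rewards)))

        phi = lambda s: s // 10        # collapse ground states into buckets of 10
        ratio = {(0, 0): 1.2, (0, 1): 0.8, (1, 0): 1.0, (1, 1): 0.5}
        print(mis_estimate([3, 7, 12], [0, 1, 1], [1.0, 0.0, 2.0], ratio, phi))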
    JAX-Accelerated Neuroevolution of Physics-informed Neural Networks: Benchmarks and Experimental Results. (arXiv:2212.07624v1 [cs.NE])
    This paper introduces the use of evolutionary algorithms for solving differential equations. The solution is obtained by optimizing a deep neural network whose loss function is defined by the residual terms from the differential equations. Recent studies have used stochastic gradient descent (SGD) variants to train these physics-informed neural networks (PINNs), but these methods can struggle to find accurate solutions due to optimization challenges. When solving differential equations, it is important to find the globally optimum parameters of the network, rather than just finding a solution that works well during training. SGD only searches along a single gradient direction, so it may not be the best approach for training PINNs with their accompanying complex optimization landscapes. In contrast, evolutionary algorithms perform a parallel exploration of different solutions in order to avoid getting stuck in local optima and can potentially find more accurate solutions. However, evolutionary algorithms can be slow, which can make them difficult to use in practice. To address this, we provide a set of five benchmark problems with associated performance metrics and baseline results to support the development of evolutionary algorithms for enhanced PINN training. As a baseline, we evaluate the performance and speed of using the widely adopted Covariance Matrix Adaptation Evolution Strategy (CMA-ES) for solving PINNs. We provide the loss and training time for CMA-ES run on TensorFlow, and CMA-ES and SGD run on JAX (with GPU acceleration) for the five benchmark problems. Our results show that JAX-accelerated evolutionary algorithms, particularly CMA-ES, can be a useful approach for solving differential equations. We hope that our work will support the exploration and development of alternative optimization algorithms for the complex task of optimizing PINNs.
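    To make the setup concrete, here is a bare-bones evolutionary optimization of a PINN-style residual, using a simple cross-entropy-style strategy in NumPy rather than the paper's CMA-ES/JAX stack: solve u' = u with u(0) = 1 on [0, 1], where the ansatz u(x) = 1 + x * net(x) enforces the initial condition and derivatives are taken by central differences. Population size, sigma, and network width are arbitrary choices.

        import numpy as np

        rng = np.random.default_rng(0)
        H = 8   # hidden units; parameters packed as [w1 (H), b1 (H), w2 (H)]

        def u(x, p):
            w1, b1, w2 = p[:H], p[H:2*H], p[2*H:]
            return 1.0 + x * (np.tanh(np.outer(x, w1) + b1) @ w2)

        def loss(p, xs=np.linspace(0.01, 1.0, 50), h=1e-4):
            du = (u(xs + h, p) - u(xs - h, p)) / (2 * h)   # central differences
            return np.mean((du - u(xs, p)) ** 2)           # residual of u' = u

        pop, elite, sigma = 64, 8, 0.1
        mean = 0.1 * rng.normal(size=3 * H)
        for gen in range(300):
            cands = mean + sigma * rng.normal(size=(pop, 3 * H))
            mean = cands[np.argsort([loss(c) for c in cands])[:elite]].mean(axis=0)
        # u(1) should drift toward e ~ 2.718 as the residual shrinks
        print('residual:', loss(mean), '| u(1):', u(np.array([1.0]), mean))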
    Demonstration of machine-learning-enhanced Bayesian quantum state estimation. (arXiv:2212.08032v1 [quant-ph])
    Machine learning (ML) has found broad applicability in quantum information science in topics as diverse as experimental design, state classification, and even studies on quantum foundations. Here, we experimentally realize an approach for defining custom prior distributions that are automatically tuned using ML for use with Bayesian quantum state estimation methods. Previously, researchers have looked to Bayesian quantum state tomography due to its unique advantages like natural uncertainty quantification, the return of reliable estimates under any measurement condition, and minimal mean-squared error. However, practical challenges related to long computation times and conceptual issues concerning how to incorporate prior knowledge most suitably can overshadow these benefits. Using both simulated and experimental measurement results, we demonstrate that ML-defined prior distributions reduce net convergence times and provide a natural way to incorporate both implicit and explicit information directly into the prior distribution. These results constitute a promising path toward practical implementations of Bayesian quantum state tomography.
    Dissecting Distribution Inference. (arXiv:2212.07591v1 [cs.LG])
    A distribution inference attack aims to infer statistical properties of data used to train machine learning models. These attacks are sometimes surprisingly potent, but the factors that impact distribution inference risk are not well understood and demonstrated attacks often rely on strong and unrealistic assumptions such as full knowledge of training environments even in supposedly black-box threat scenarios. To improve understanding of distribution inference risks, we develop a new black-box attack that even outperforms the best known white-box attack in most settings. Using this new attack, we evaluate distribution inference risk while relaxing a variety of assumptions about the adversary's knowledge under black-box access, like known model architectures and label-only access. Finally, we evaluate the effectiveness of previously proposed defenses and introduce new defenses. We find that although noise-based defenses appear to be ineffective, a simple re-sampling defense can be highly effective. Code is available at https://github.com/iamgroot42/dissecting_distribution_inference
    Calibrating AI Models for Wireless Communications via Conformal Prediction. (arXiv:2212.07775v1 [cs.LG])
    When used in complex engineered systems, such as communication networks, artificial intelligence (AI) models should be not only as accurate as possible, but also well calibrated. A well-calibrated AI model is one that can reliably quantify the uncertainty of its decisions, assigning high confidence levels to decisions that are likely to be correct and low confidence levels to decisions that are likely to be erroneous. This paper investigates the application of conformal prediction as a general framework to obtain AI models that produce decisions with formal calibration guarantees. Conformal prediction transforms probabilistic predictors into set predictors that are guaranteed to contain the correct answer with a probability chosen by the designer. Such formal calibration guarantees hold irrespective of the true, unknown, distribution underlying the generation of the variables of interest, and can be defined in terms of ensemble or time-averaged probabilities. In this paper, conformal prediction is applied for the first time to the design of AI for communication systems in conjunction with both frequentist and Bayesian learning, focusing on demodulation, modulation classification, and channel prediction.
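    As a concrete instance of the set predictors described here, the sketch below implements plain split conformal classification: calibrate a score threshold on held-out softmax outputs so the prediction sets cover the true label with probability at least 1 - alpha, regardless of the underlying model. The Dirichlet "probabilities" stand in for a real classifier's outputs.

        import numpy as np

        def conformal_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
            n = len(cal_labels)
            scores = 1.0 - cal_probs[np.arange(n), cal_labels]     # nonconformity
            q = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n, method='higher')
            return [np.where(1.0 - p <= q)[0] for p in test_probs] # label sets

        rng = np.random.default_rng(0)
        cal_probs = rng.dirichlet(np.ones(4), size=200)            # fake model outputs
        cal_labels = rng.integers(0, 4, size=200)
        print(conformal_sets(cal_probs, cal_labels, rng.dirichlet(np.ones(4), size=3)))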
    Interpretable ML for Imbalanced Data. (arXiv:2212.07743v1 [cs.LG])
    Deep learning models are being increasingly applied to imbalanced data in high stakes fields such as medicine, autonomous driving, and intelligence analysis. Imbalanced data compounds the black-box nature of deep networks because the relationships between classes may be highly skewed and unclear. This can reduce trust by model users and hamper the progress of developers of imbalanced learning algorithms. Existing methods that investigate imbalanced data complexity are geared toward binary classification, shallow learning models and low dimensional data. In addition, current eXplainable Artificial Intelligence (XAI) techniques mainly focus on converting opaque deep learning models into simpler models (e.g., decision trees) or mapping predictions for specific instances to inputs, instead of examining global data properties and complexities. Therefore, there is a need for a framework that is tailored to modern deep networks, that incorporates large, high dimensional, multi-class datasets, and uncovers data complexities commonly found in imbalanced data (e.g., class overlap, sub-concepts, and outlier instances). We propose a set of techniques that can be used by both deep learning model users to identify, visualize and understand class prototypes, sub-concepts and outlier instances; and by imbalanced learning algorithm developers to detect features and class exemplars that are key to model performance. Our framework also identifies instances that reside on the border of class decision boundaries, which can carry highly discriminative information. Unlike many existing XAI techniques which map model decisions to gray-scale pixel locations, we use saliency through back-propagation to identify and aggregate image color bands across entire classes. Our framework is publicly available at \url{https://github.com/dd1github/XAI_for_Imbalanced_Learning}
    A Study on the Intersection of GPU Utilization and CNN Inference. (arXiv:2212.07936v1 [cs.LG])
    There has been significant progress in developing neural network architectures that both achieve high predictive performance and that also achieve high application-level inference throughput (e.g., frames per second). Another metric of increasing importance is GPU utilization during inference: the measurement of how well a deployed neural network uses the computational capabilities of the GPU on which it runs. Achieving high GPU utilization is critical to increasing application-level throughput and ensuring a good return on investment for deploying GPUs. This paper analyzes the GPU utilization of convolutional neural network (CNN) inference. We first survey the GPU utilization of CNNs to show that there is room to improve the GPU utilization of many of these CNNs. We then investigate the GPU utilization of networks within a neural architecture search (NAS) search space, and explore how using GPU utilization as a metric could potentially be used to accelerate NAS itself. Our study makes the case that there is room to improve the inference-time GPU utilization of CNNs and that knowledge of GPU utilization has the potential to benefit even applications that do not target utilization itself. We hope that the results of this study will spur future innovation in designing GPU-efficient neural networks.
    FlexiViT: One Model for All Patch Sizes. (arXiv:2212.08013v1 [cs.CV])
    Vision Transformers convert images to sequences by slicing them into patches. The size of these patches controls a speed/accuracy tradeoff, with smaller patches leading to higher accuracy at greater computational cost, but changing the patch size typically requires retraining the model. In this paper, we demonstrate that simply randomizing the patch size at training time leads to a single set of weights that performs well across a wide range of patch sizes, making it possible to tailor the model to different compute budgets at deployment time. We extensively evaluate the resulting model, which we call FlexiViT, on a wide range of tasks, including classification, image-text retrieval, open-world detection, panoptic segmentation, and semantic segmentation, concluding that it usually matches, and sometimes outperforms, standard ViT models trained at a single patch size in an otherwise identical setup. Hence, FlexiViT training is a simple drop-in improvement for ViT that makes it easy to add compute-adaptive capabilities to most models relying on a ViT backbone architecture. Code and pre-trained models are available at https://github.com/google-research/big_vision
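    The core trick is easy to sketch: keep one underlying patch-embedding kernel and resize it to whatever patch size is sampled at each training step, so the same weights serve every patch granularity. Bilinear resizing is used below for brevity; the paper resizes with a pseudo-inverse scheme, and all shapes are illustrative.

        import torch
        import torch.nn.functional as F

        base_kernel = torch.randn(192, 3, 32, 32)      # (embed_dim, channels, 32, 32)

        def embed(images, patch):
            k = F.interpolate(base_kernel, size=(patch, patch),
                              mode='bilinear', align_corners=False)
            return F.conv2d(images, k, stride=patch)   # non-overlapping patches -> tokens

        images = torch.randn(2, 3, 240, 240)
        for patch in (8, 16, 30):                      # sizes dividing 240
            print(patch, tuple(embed(images, patch).shape))  # token grid shrinks as patch grows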
    MABSplit: Faster Forest Training Using Multi-Armed Bandits. (arXiv:2212.07473v1 [cs.LG])
    Random forests are some of the most widely used machine learning models today, especially in domains that necessitate interpretability. We present an algorithm that accelerates the training of random forests and other popular tree-based learning methods. At the core of our algorithm is a novel node-splitting subroutine, dubbed MABSplit, used to efficiently find split points when constructing decision trees. Our algorithm borrows techniques from the multi-armed bandit literature to judiciously determine how to allocate samples and computational power across candidate split points. We provide theoretical guarantees that MABSplit improves the sample complexity of each node split from linear to logarithmic in the number of data points. In some settings, MABSplit leads to 100x faster training (a 99% reduction in training time) without any decrease in generalization performance. We demonstrate similar speedups when MABSplit is used across a variety of forest-based variants, such as Extremely Random Forests and Random Patches. We also show our algorithm can be used in both classification and regression tasks. Finally, we show that MABSplit outperforms existing methods in generalization performance and feature importance calculations under a fixed computational budget. All of our experimental results are reproducible via a one-line script at https://github.com/ThrunGroup/FastForest.
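    The bandit idea translates into a short successive-elimination loop: score every candidate threshold on small random batches, keep a confidence interval per candidate, and drop candidates whose interval is dominated. Everything numeric below (batch size, the crude 1/sqrt(rounds) interval width, the synthetic data) is an illustrative assumption, not MABSplit's actual estimator.

        import numpy as np

        rng = np.random.default_rng(0)
        x = rng.normal(size=10000)
        y = (x > 0.3).astype(float)              # best split threshold is near 0.3

        def impurity(th, idx):                   # weighted Gini impurity of split th
            out = 0.0
            for side in (y[idx][x[idx] <= th], y[idx][x[idx] > th]):
                if len(side):
                    p = side.mean()
                    out += len(side) * (1 - p**2 - (1 - p)**2)
            return out / len(idx)

        alive = list(np.linspace(-1, 1, 21))     # candidate thresholds
        hist = {th: [] for th in alive}
        pulls = 0
        while len(alive) > 1 and pulls < len(x):
            idx = rng.integers(0, len(x), size=256)
            pulls += 256
            for th in alive:
                hist[th].append(impurity(th, idx))
            mu = {th: np.mean(hist[th]) for th in alive}
            ci = {th: 0.5 / np.sqrt(len(hist[th])) for th in alive}
            best = min(mu, key=mu.get)
            alive = [th for th in alive if mu[th] - ci[th] <= mu[best] + ci[best]]
        print('surviving thresholds:', np.round(alive, 2), '| samples drawn:', pulls)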
    PALBERT: Teaching ALBERT to Ponder. (arXiv:2204.03276v3 [cs.LG] UPDATED)
    Currently, pre-trained models can be considered the default choice for a wide range of NLP tasks. Despite their SoTA results, there is practical evidence that these models may require a different number of computing layers for different input sequences, since evaluating all layers leads to overconfidence in wrong predictions (namely overthinking). This problem can potentially be solved by implementing adaptive computation time approaches, which were first designed to improve inference speed. Recently proposed PonderNet may be a promising solution for performing an early exit by treating the exit layer's index as a latent variable. However, the originally proposed exit criterion, relying on sampling from trained posterior distribution on the probability of exiting from the $i$-th layer, introduces major variance in exit layer indices, significantly reducing the resulting model's performance. In this paper, we propose improving PonderNet with a novel deterministic Q-exit criterion and a revisited model architecture. We adapted the proposed mechanism to ALBERT and RoBERTa and compared it with recent methods for performing an early exit. We observed that the proposed changes can be considered significant improvements on the original PonderNet architecture and outperform PABEE on a wide range of GLUE tasks. In addition, we also performed an in-depth ablation study of the proposed architecture to further understand Lambda layers and their performance.
    Sliced Optimal Partial Transport. (arXiv:2212.08049v1 [cs.LG])
    Optimal transport (OT) has become exceedingly popular in machine learning, data science, and computer vision. The core assumption in the OT problem is the equal total amount of mass in source and target measures, which limits its application. Optimal Partial Transport (OPT) is a recently proposed solution to this limitation. Similar to the OT problem, the computation of OPT relies on solving a linear programming problem (often in high dimensions), which can become computationally prohibitive. In this paper, we propose an efficient algorithm for calculating the OPT problem between two non-negative measures in one dimension. Next, following the idea of sliced OT distances, we utilize slicing to define the sliced OPT distance. Finally, we demonstrate the computational and accuracy benefits of the sliced OPT-based method in various numerical experiments. In particular, we show an application of our proposed Sliced-OPT in noisy point cloud registration.
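    For very small instances, the OPT problem the paper accelerates can be written directly as a linear program: transport a fixed mass s at minimal cost while allowing part of each measure to stay unmatched. The tiny 1-D example below uses scipy's generic LP solver, which is exactly the expensive route the paper's dedicated 1-D algorithm avoids; the measures and s are made up.

        import numpy as np
        from scipy.optimize import linprog

        xs, a = np.array([0.0, 1.0, 2.0]), np.array([0.4, 0.4, 0.2])   # source
        yt, b = np.array([0.5, 2.5]), np.array([0.5, 0.5])             # target
        s = 0.6                                    # total mass to transport
        C = np.abs(xs[:, None] - yt[None, :])      # |x - y| ground cost
        n, m = C.shape

        A_ub, b_ub = [], []
        for i in range(n):                         # row sums <= a_i
            row = np.zeros(n * m); row[i*m:(i+1)*m] = 1
            A_ub.append(row); b_ub.append(a[i])
        for j in range(m):                         # column sums <= b_j
            col = np.zeros(n * m); col[j::m] = 1
            A_ub.append(col); b_ub.append(b[j])

        res = linprog(C.ravel(), A_ub=A_ub, b_ub=b_ub,
                      A_eq=[np.ones(n * m)], b_eq=[s], bounds=(0, None))
        print('OPT cost:', res.fun)
        print('plan:\n', res.x.reshape(n, m).round(3))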
    The Effects of Character-Level Data Augmentation on Style-Based Dating of Historical Manuscripts. (arXiv:2212.07923v1 [cs.CV])
    Identifying the production dates of historical manuscripts is one of the main goals for paleographers when studying ancient documents. Automatized methods can provide paleographers with objective tools to estimate dates more accurately. Previously, statistical features have been used to date digitized historical manuscripts based on the hypothesis that handwriting styles change over periods. However, the sparse availability of such documents poses a challenge in obtaining robust systems. Hence, this article explores the influence of data augmentation on the dating of historical manuscripts. Linear Support Vector Machines were trained with k-fold cross-validation on textural and grapheme-based features extracted from historical manuscripts of different collections, including the Medieval Paleographical Scale, early Aramaic manuscripts, and the Dead Sea Scrolls. Results show that training models with augmented data improves the performance of historical manuscript dating by 1% - 3% in cumulative scores. Additionally, this indicates further enhancement possibilities by considering models specific to the features and the documents' scripts.
    Explainable Machine Learning for Hydrocarbon Prospect Risking. (arXiv:2212.07563v1 [cs.LG])
    Hydrocarbon prospect risking is a critical application in geophysics predicting well outcomes from a variety of data including geological, geophysical, and other information modalities. Traditional routines require interpreters to go through a long process to arrive at the probability of success of specific outcomes. AI has the capability to automate the process but its adoption has been limited thus far owing to a lack of transparency in the way complicated, black box models generate decisions. We demonstrate how LIME -- a model-agnostic explanation technique -- can be used to inject trust in model decisions by uncovering the model's reasoning process for individual predictions. It generates these explanations by fitting interpretable models in the local neighborhood of specific datapoints being queried. On a dataset of well outcomes and corresponding geophysical attribute data, we show how LIME can induce trust in model's decisions by revealing the decision-making process to be aligned to domain knowledge. Further, it has the potential to debug mispredictions made due to anomalous patterns in the data or faulty training datasets.
    Chaotic Variational Auto Encoder based One Class Classifier for Insurance Fraud Detection. (arXiv:2212.07802v1 [cs.LG])
    Of late, insurance fraud detection has assumed immense significance owing to the huge financial and reputational losses fraud entails and the phenomenal success of fraud detection techniques. Insurance is majorly divided into two categories: (i) Life and (ii) Non-life. Non-life insurance in turn includes health insurance and auto insurance among other things. In either of the categories, the fraud detection techniques should be designed in such a way that they capture as many fraudulent transactions as possible. Owing to the rarity of fraudulent transactions, in this paper, we propose a chaotic variational autoencoder (C-VAE) to perform one-class classification (OCC) on genuine transactions. Here, we employed the logistic chaotic map to generate random noise in the latent space. The effectiveness of C-VAE is demonstrated on the health insurance fraud and auto insurance datasets. We considered the vanilla Variational Auto Encoder (VAE) as the baseline. It is observed that C-VAE outperformed VAE on both datasets, achieving classification rates of 77.9% and 87.25% on the health and automobile insurance datasets, respectively. Further, a t-test conducted at the 1% level of significance with 18 degrees of freedom indicates that C-VAE performs significantly better than VAE.
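    The chaotic ingredient is simple to reproduce: iterate the logistic map x <- r x (1 - x) at r = 4 and use the (rescaled) orbit in place of Gaussian draws in the reparameterization step. How exactly the orbit replaces or mixes with the Gaussian noise in C-VAE is not specified in the abstract, so the combination below is an assumption.

        import numpy as np

        def logistic_noise(n, x0=0.123, r=4.0):
            xs = np.empty(n)
            x = x0
            for i in range(n):
                x = r * x * (1 - x)        # chaotic logistic map
                xs[i] = x
            return 2 * xs - 1              # rescale from (0, 1) to (-1, 1)

        mu, log_var = np.zeros(8), np.zeros(8)                 # toy encoder outputs
        z = mu + np.exp(0.5 * log_var) * logistic_noise(8)     # chaotic reparameterization
        print(z)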
    Co-Learning with Pre-Trained Networks Improves Source-Free Domain Adaptation. (arXiv:2212.07585v1 [cs.CV])
    Source-free domain adaptation aims to adapt a source model trained on fully-labeled source domain data to a target domain with unlabeled target domain data. Source data is assumed inaccessible due to proprietary or privacy reasons. Existing works use the source model to pseudolabel target data, but the pseudolabels are unreliable due to data distribution shift between source and target domain. In this work, we propose to leverage an ImageNet pre-trained feature extractor in a new co-learning framework to improve target pseudolabel quality for finetuning the source model. Benefits of the ImageNet feature extractor include that it is not source-biased and it provides an alternate view of features and classification decisions different from the source model. Such pre-trained feature extractors are also publicly available, which allows us to readily leverage modern network architectures that have strong representation learning ability. After co-learning, we sharpen predictions of non-pseudolabeled samples by entropy minimization. Evaluation on 3 benchmark datasets show that our proposed method can outperform existing source-free domain adaptation methods, as well as unsupervised domain adaptation methods which assume joint access to source and target data.
    A comparison of LSTM and GRU networks for learning symbolic sequences. (arXiv:2107.02248v2 [cs.LG] UPDATED)
    We explore relations between the hyper-parameters of a recurrent neural network (RNN) and the complexity of string sequences it is able to memorize. We compare long short-term memory (LSTM) networks and gated recurrent units (GRUs). We find that an increase in RNN depth does not necessarily result in better memorization capability when the training time is constrained. Our results also indicate that the learning rate and the number of units per layer are among the most important hyper-parameters to be tuned. Generally, GRUs outperform LSTM networks on low-complexity sequences, while on high-complexity sequences LSTMs perform better.
    Towards Linguistically Informed Multi-Objective Pre-Training for Natural Language Inference. (arXiv:2212.07428v1 [cs.CL])
    We introduce a linguistically enhanced combination of pre-training methods for transformers. The pre-training objectives include POS-tagging, synset prediction based on semantic knowledge graphs, and parent prediction based on dependency parse trees. Our approach achieves competitive results on the Natural Language Inference task compared to the state of the art. For smaller models in particular, the method yields a significant performance boost, emphasizing that intelligent pre-training can make up for fewer parameters and help build more efficient models. Combining POS-tagging and synset prediction yields the overall best results.
    Harmonic (Quantum) Neural Networks. (arXiv:2212.07462v1 [cs.LG])
    Harmonic functions are abundant in nature, appearing in limiting cases of the Maxwell and Navier-Stokes equations as well as in the heat and wave equations. Consequently, harmonic functions have many applications, from industrial process optimisation to robotic path planning and the calculation of first exit times of random walks. Despite their ubiquity and relevance, there have been few attempts to develop effective means of representing harmonic functions in the context of machine learning architectures, either in machine learning on classical computers or in the nascent field of quantum machine learning. Architectures which impose or encourage an inductive bias towards harmonic functions would facilitate data-driven modelling and the solution of inverse problems in a range of applications. For classical neural networks, it has already been established how leveraging inductive biases can in general lead to improved performance of learning algorithms. The introduction of such inductive biases within a quantum machine learning setting is still in its nascent stages. In this work, we derive exactly-harmonic (conventional and quantum) neural networks in two dimensions for simply-connected domains by leveraging the characteristics of holomorphic complex functions. We then demonstrate how these can be approximately extended to multiply-connected two-dimensional domains using techniques inspired by domain decomposition in physics-informed neural networks. We further provide architectures and training protocols to effectively impose approximately harmonic constraints in three dimensions and higher, and as a corollary we report divergence-free network architectures in arbitrary dimensions. Our approaches are demonstrated with applications to heat transfer, electrostatics and robot navigation, with comparisons to physics-informed neural networks included.
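    The two-dimensional construction rests on a classical fact: the real part of any holomorphic function is exactly harmonic. A minimal sketch of such an exactly-harmonic "network" (a learnable complex polynomial rather than the paper's architectures), with a finite-difference check that its Laplacian vanishes:

        import numpy as np

        def harmonic_net(xy, coeffs):
            # u(x, y) = Re( sum_k c_k * (x + i*y)^k ) is harmonic for any coefficients c_k.
            z = xy[:, 0] + 1j * xy[:, 1]
            powers = np.stack([z ** k for k in range(len(coeffs))], axis=1)
            return (powers @ coeffs).real

        rng = np.random.default_rng(0)
        coeffs = rng.normal(size=5) + 1j * rng.normal(size=5)  # stand-in for trained weights
        p, h = np.array([[0.3, -0.7]]), 1e-4
        lap = (harmonic_net(p + [[h, 0]], coeffs) + harmonic_net(p - [[h, 0]], coeffs)
               + harmonic_net(p + [[0, h]], coeffs) + harmonic_net(p - [[0, h]], coeffs)
               - 4 * harmonic_net(p, coeffs)) / h ** 2
        print(abs(lap))  # ~0, up to finite-difference error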
    A large-scale and PCR-referenced vocal audio dataset for COVID-19. (arXiv:2212.07738v1 [cs.SD])
    The UK COVID-19 Vocal Audio Dataset is designed for the training and evaluation of machine learning models that classify SARS-CoV-2 infection status or associated respiratory symptoms using vocal audio. The UK Health Security Agency recruited voluntary participants through the national Test and Trace programme and the REACT-1 survey in England from March 2021 to March 2022, during dominant transmission of the Alpha and Delta SARS-CoV-2 variants and some Omicron variant sublineages. Audio recordings of volitional coughs, exhalations, and speech were collected in the 'Speak up to help beat coronavirus' digital survey alongside demographic, self-reported symptom and respiratory condition data, and linked to SARS-CoV-2 test results. The UK COVID-19 Vocal Audio Dataset represents the largest collection of SARS-CoV-2 PCR-referenced audio recordings to date. PCR results were linked to 70,794 of 72,999 participants and 24,155 of 25,776 positive cases. Respiratory symptoms were reported by 45.62% of participants. This dataset has additional potential uses for bioacoustics research, with 11.30% of participants reporting asthma, and 27.20% having linked influenza PCR test results.
    Bridging POMDPs and Bayesian decision making for robust maintenance planning under model uncertainty: An application to railway systems. (arXiv:2212.07933v1 [cs.AI])
    Structural Health Monitoring (SHM) describes a process for inferring quantifiable metrics of structural condition, which can serve as input to support decisions on the operation and maintenance of infrastructure assets. Given the long lifespan of critical structures, this problem can be cast as a sequential decision making problem over prescribed horizons. Partially Observable Markov Decision Processes (POMDPs) offer a formal framework to solve the underlying optimal planning task. However, two issues can undermine the POMDP solutions: firstly, the need for a model that can adequately describe the evolution of the structural condition under deterioration or corrective actions; and secondly, the non-trivial task of recovering the observation process parameters from available monitoring data. Despite these potential challenges, the adopted POMDP models do not typically account for uncertainty on model parameters, leading to solutions which can be unrealistically confident. In this work, we address both key issues. We present a framework to estimate POMDP transition and observation model parameters directly from available data, via Markov Chain Monte Carlo (MCMC) sampling of a Hidden Markov Model (HMM) conditioned on actions. The MCMC inference estimates distributions of the involved model parameters. We then form and solve the POMDP problem by exploiting the inferred distributions, to derive solutions that are robust to model uncertainty. We successfully apply our approach on maintenance planning for railway track assets on the basis of a "fractal value" indicator, which is computed from actual railway monitoring data.  ( 2 min )
    Towards Hardware-Specific Automatic Compression of Neural Networks. (arXiv:2212.07818v1 [cs.LG])
    Compressing neural network architectures is important to allow the deployment of models to embedded or mobile devices, and pruning and quantization are the major approaches to compress neural networks nowadays. Both methods benefit when compression parameters are selected specifically for each layer. Finding good combinations of compression parameters, so-called compression policies, is hard, as the problem spans an exponentially large search space. Effective compression policies consider the influence of the specific hardware architecture on the used compression methods. We propose an algorithmic framework called Galen to search for such policies using reinforcement learning utilizing pruning and quantization, thus providing automatic compression for neural networks. Contrary to other approaches, we use the inference latency measured on the target hardware device as an optimization goal. With that, the framework supports the compression of models specific to a given hardware target. We validate our approach using three different reinforcement learning agents for pruning, quantization, and joint pruning and quantization. Besides proving the functionality of our approach, we were able to compress a ResNet18 for CIFAR-10, on an embedded ARM processor, to 20% of the original inference latency without significant loss of accuracy. Moreover, we demonstrate that a joint search and compression using pruning and quantization is superior to an individual search for policies using a single compression method.  ( 2 min )
    A new trigonometric kernel function for support vector machine. (arXiv:2210.08585v3 [cs.LG] UPDATED)
    In the last few years, various types of machine learning algorithms, such as Support Vector Machine (SVM), Support Vector Regression (SVR), and Non-negative Matrix Factorization (NMF), have been introduced. The kernel approach is an effective method for increasing the classification accuracy of machine learning algorithms. This paper introduces a family of one-parameter kernel functions for improving the accuracy of SVM classification. The proposed kernel function consists of a trigonometric term and differs from all existing kernel functions. We show that this function is a positive definite kernel. Finally, we evaluate the SVM method based on the new trigonometric kernel, the Gaussian kernel, the polynomial kernel, and a convex combination of the new kernel function and the Gaussian kernel function on various types of datasets. Empirical results show that the SVMs based on the new trigonometric kernel function and the mixed kernel function achieve the best classification accuracy. Moreover, some numerical results of performing SVR based on the new trigonometric kernel function and the mixed kernel function are presented.  ( 2 min )
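    The abstract does not give the kernel's closed form, but the recipe of mixing a positive definite trigonometric term with a Gaussian kernel is easy to sketch. Below, cos(w * sum(x - y)) serves as a stand-in trigonometric kernel (positive definite by Bochner's theorem, and a convex combination of PD kernels is again PD) and is passed to scikit-learn's SVC as a callable Gram-matrix kernel; the specific form and parameters are assumptions, not the paper's kernel:

        import numpy as np
        from sklearn.svm import SVC
        from sklearn.datasets import make_classification
        from sklearn.model_selection import train_test_split

        def mixed_trig_rbf_kernel(X, Y, w=1.0, gamma=0.5, lam=0.3):
            # Gram matrix of k(x,y) = lam*cos(w*sum(x-y)) + (1-lam)*exp(-gamma*||x-y||^2).
            diff = X[:, None, :] - Y[None, :, :]
            sq = (diff ** 2).sum(-1)
            trig = np.cos(w * diff.sum(-1))
            return lam * trig + (1.0 - lam) * np.exp(-gamma * sq)

        X, y = make_classification(n_samples=300, n_features=10, random_state=0)
        Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)
        clf = SVC(kernel=mixed_trig_rbf_kernel).fit(Xtr, ytr)
        print("test accuracy:", clf.score(Xte, yte))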
    Rethinking Vision Transformers for MobileNet Size and Speed. (arXiv:2212.08059v1 [cs.CV])
    With the success of Vision Transformers (ViTs) in computer vision tasks, recent works try to optimize the performance and complexity of ViTs to enable efficient deployment on mobile devices. Multiple approaches are proposed to accelerate the attention mechanism, improve inefficient designs, or incorporate mobile-friendly lightweight convolutions to form hybrid architectures. However, ViT and its variants still have higher latency or considerably more parameters than lightweight CNNs; this holds even against the years-old MobileNet. In practice, latency and size are both crucial for efficient deployment on resource-constrained hardware. In this work, we investigate a central question: can transformer models run as fast as MobileNet and maintain a similar size? We revisit the design choices of ViTs and propose an improved supernet with low latency and high parameter efficiency. We further introduce a fine-grained joint search strategy that can find efficient architectures by optimizing latency and number of parameters simultaneously. The proposed models, EfficientFormerV2, achieve about $4\%$ higher top-1 accuracy than MobileNetV2 and MobileNetV2$\times1.4$ on ImageNet-1K with similar latency and parameters. We demonstrate that properly designed and optimized vision transformers can achieve high performance with MobileNet-level size and speed.  ( 2 min )
    Spatial-Temporal Anomaly Detection for Sensor Attacks in Autonomous Vehicles. (arXiv:2212.07757v1 [eess.SY])
    Time-of-flight (ToF) distance measurement devices such as ultrasonics, LiDAR and radar are widely used in autonomous vehicles for environmental perception, navigation and assisted braking control. Despite their relative importance in making safer driving decisions, these devices are vulnerable to multiple attack types including spoofing, triggering and false data injection. When these attacks are successful, they can compromise the security of autonomous vehicles, leading to severe consequences for the driver, nearby vehicles and pedestrians. To handle these attacks and protect the measurement devices, we propose a spatial-temporal anomaly detection model \textit{STAnDS} which incorporates a residual error spatial detector, with a time-based expected change detection. This approach is evaluated using a simulated quantitative environment and the results show that \textit{STAnDS} is effective at detecting multiple attack types.  ( 2 min )
    Extending Universal Approximation Guarantees: A Theoretical Justification for the Continuity of Real-World Learning Tasks. (arXiv:2212.07934v1 [stat.ML])
    Universal Approximation Theorems establish the density of various classes of neural network function approximators in $C(K, \mathbb{R}^m)$, where $K \subset \mathbb{R}^n$ is compact. In this paper, we aim to extend these guarantees by establishing conditions on learning tasks that guarantee their continuity. We consider learning tasks given by conditional expectations $x \mapsto \mathrm{E}\left[Y \mid X = x\right]$, where the learning target $Y = f \circ L$ is a potentially pathological transformation of some underlying data-generating process $L$. Under a factorization $L = T \circ W$ for the data-generating process where $T$ is thought of as a deterministic map acting on some random input $W$, we establish conditions (that might be easily verified using knowledge of $T$ alone) that guarantee the continuity of practically \textit{any} derived learning task $x \mapsto \mathrm{E}\left[f \circ L \mid X = x\right]$. We motivate the realism of our conditions using the example of randomized stable matching, thus providing a theoretical justification for the continuity of real-world learning tasks.  ( 2 min )
    Faster Maximum Inner Product Search in High Dimensions. (arXiv:2212.07551v1 [cs.LG])
    Maximum Inner Product Search (MIPS) is a popular problem in the machine learning literature due to its applicability in a wide array of applications, such as recommender systems. In high-dimensional settings, however, MIPS queries can become computationally expensive as most existing solutions do not scale well with data dimensionality. In this work, we present a state-of-the-art algorithm for the MIPS problem in high dimensions, dubbed BanditMIPS. BanditMIPS is a randomized algorithm that borrows techniques from multi-armed bandits to reduce the MIPS problem to a best-arm identification problem. BanditMIPS reduces the complexity of state-of-the-art algorithms from $O(\sqrt{d})$ to $O(\log d)$, where $d$ is the dimension of the problem data vectors. On high-dimensional real-world datasets, BanditMIPS runs approximately 12 times faster than existing approaches and returns the same solution. BanditMIPS requires no preprocessing of the data and includes a hyperparameter that practitioners may use to trade off accuracy and runtime. We also propose a variant of our algorithm, named BanditMIPS-$\alpha$, which employs non-uniform sampling across the data dimensions to provide further speedups.  ( 2 min )
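    The reduction the abstract describes can be sketched compactly: treat each atom as an arm, estimate its inner product with the query by sampling coordinates, and successively eliminate arms via confidence bounds. This is a simplified illustration of the idea under generic UCB widths, not the paper's exact algorithm or constants:

        import numpy as np

        def bandit_mips(query, atoms, batch=32, delta=0.01, c=1.0, seed=0):
            # Best-arm identification over atoms: each pull samples `batch` coordinates
            # and forms an unbiased estimate of <query, atom>.
            rng = np.random.default_rng(seed)
            d, n = query.shape[0], atoms.shape[0]
            alive = np.arange(n)
            est_sum = np.zeros(n)  # running sum of per-batch estimates
            t = 0                  # number of batches drawn so far
            while len(alive) > 1 and t * batch < d:
                idx = rng.integers(0, d, size=batch)
                est_sum[alive] += (d / batch) * atoms[alive][:, idx] @ query[idx]
                t += 1
                mean = est_sum[alive] / t
                width = c * np.sqrt(np.log(2 * n * t * t / delta) / t)
                # Keep only arms whose upper bound reaches the best lower bound.
                alive = alive[mean + width >= (mean - width).max()]
            return alive[np.argmax(est_sum[alive])]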
    Analytical Engines With Context-Rich Processing: Towards Efficient Next-Generation Analytics. (arXiv:2212.07517v1 [cs.DB])
    As modern data pipelines continue to collect, produce, and store a variety of data formats, extracting and combining value from traditional and context-rich sources such as strings, text, video, audio, and logs becomes a manual process, since such formats are unsuitable for an RDBMS. To tap into this dark data, domain experts analyze and extract insights and integrate them into the data repositories. This process can involve out-of-DBMS, ad-hoc analysis and processing, resulting in ETL overhead, engineering effort, and suboptimal performance. While AI systems based on ML models can automate the analysis process, they often further generate context-rich answers. Using multiple sources of truth, either for training the models or in the form of knowledge bases, further exacerbates the problem of consolidating the data of interest. We envision an analytical engine co-optimized with components that enable context-rich analysis. Firstly, as the data from different sources or resulting from model answering cannot be cleaned ahead of time, we propose using online data integration via model-assisted similarity operations. Secondly, we aim for a holistic, pipeline cost- and rule-based optimization across relational and model-based operators. Thirdly, with increasingly heterogeneous hardware and equally heterogeneous workloads ranging from traditional relational analytics to generative model inference, we envision a system that adapts just-in-time to complex analytical query requirements. To solve increasingly complex analytical problems, ML offers attractive solutions that must be combined with traditional analytical processing and benefit from decades of database community research to achieve scalability and performance effortlessly for the end user.  ( 2 min )
    Multi-Agent Dynamic Pricing in a Blockchain Protocol Using Gaussian Bandits. (arXiv:2212.07942v1 [q-fin.CP])
    The Graph Protocol indexes historical blockchain transaction data and makes it available for querying. As the protocol is decentralized, there are many independent Indexers that index and compete with each other for serving queries to the Consumers. One dimension along which Indexers compete is pricing. In this paper, we propose a bandit-based algorithm for maximization of Indexers' revenue via Consumer budget discovery. We present the design and the considerations we had to make for a dynamic pricing algorithm being used by multiple agents simultaneously. We discuss the results achieved by our dynamic pricing bandits both in simulation and deployed into production on one of the Indexers operating on Ethereum. We have open-sourced both the simulation framework and tools we created, which other Indexers have since started to adapt into their own workflows.  ( 2 min )
    Man-recon: manifold learning for reconstruction with deep autoencoder for smart seismic interpretation. (arXiv:2212.07568v1 [cs.LG])
    Deep learning can extract rich data representations if provided sufficient quantities of labeled training data. For many tasks, however, annotating data has significant costs in terms of time and money, owing to the high standards of subject matter expertise required, for example in medical and geophysical image interpretation tasks. Active learning can identify the most informative training examples for the interpreter to annotate, leading to higher efficiency. We propose an active learning method based on jointly learning representations for supervised and unsupervised tasks. The learned manifold structure is later utilized to identify the informative training samples most dissimilar from the learned manifold, using the error profiles on the unsupervised task. We verify the efficiency of the proposed method on a seismic facies segmentation dataset from the Netherlands F3 block survey, significantly outperforming contemporary methods and achieving the highest mean Intersection-Over-Union value of 0.773.  ( 2 min )
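    The sample-selection step is easy to illustrate. In the sketch below, PCA stands in for the paper's jointly trained deep autoencoder: samples with the largest reconstruction error lie farthest from the learned manifold and are proposed to the interpreter for annotation. The component count and scoring rule are assumptions for illustration:

        import numpy as np
        from sklearn.decomposition import PCA

        def select_informative(X_train, X_unlabeled, k=10, n_components=8):
            # Learn a manifold on available data (PCA as a stand-in autoencoder).
            ae = PCA(n_components=n_components).fit(X_train)
            # Reconstruction error = distance from the learned manifold.
            recon = ae.inverse_transform(ae.transform(X_unlabeled))
            errors = np.linalg.norm(X_unlabeled - recon, axis=1)
            # Return the k most dissimilar (most informative) samples.
            return np.argsort(errors)[-k:]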
    RWEN-TTS: Relation-aware Word Encoding Network for Natural Text-to-Speech Synthesis. (arXiv:2212.07939v1 [cs.CL])
    With the advent of deep learning, a huge number of text-to-speech (TTS) models which produce human-like speech have emerged. Recently, by introducing syntactic and semantic information w.r.t. the input text, various approaches have been proposed to enrich the naturalness and expressiveness of TTS models. Although these strategies showed impressive results, they still have some limitations in utilizing language information. First, most approaches only use graph networks to utilize syntactic and semantic information, without considering linguistic features. Second, most previous works do not explicitly consider adjacent words when encoding syntactic and semantic information, even though it is obvious that adjacent words are usually meaningful when encoding the current word. To address these issues, we propose the Relation-aware Word Encoding Network (RWEN), which effectively incorporates syntactic and semantic information based on two modules (i.e., Semantic-level Relation Encoding and Adjacent Word Relation Encoding). Experimental results show substantial improvements compared to previous works.  ( 2 min )
    Hybrid Quantum Generative Adversarial Networks for Molecular Simulation and Drug Discovery. (arXiv:2212.07826v1 [quant-ph])
    In molecular research, the simulation and design of molecules are key areas with significant implications for drug development, material science, and other fields. Current classical computational power is inadequate for simulating anything beyond small molecules, let alone protein chains of hundreds of peptides. These experiments are therefore carried out physically in wet labs, which takes a great deal of time, and it is not possible to examine every molecule given the size of the search space; tens of billions of dollars are spent on such research experiments every year. Molecule simulation and design have lately advanced significantly thanks to machine learning models: deep generative models for graph-structured data provide a fresh perspective on the problem of chemical synthesis. By optimising differentiable models that produce molecular graphs directly, it is feasible to avoid costly search techniques in the discrete and huge space of chemical structures. However, these models also suffer from computational limitations when dimensions become large, and they consume huge amounts of resources. In recent years, quantum generative machine learning has shown empirical results promising significant advantages over classical counterparts.  ( 2 min )
    Multi-Agent Reinforcement Learning with Shared Resources for Inventory Management. (arXiv:2212.07684v1 [cs.AI])
    In this paper, we consider the inventory management (IM) problem where we need to make replenishment decisions for a large number of stock keeping units (SKUs) to balance their supply and demand. In our setting, the constraint on the shared resources (such as the inventory capacity) couples the otherwise independent control for each SKU. We formulate the problem with this structure as a Shared-Resource Stochastic Game (SRSG) and propose an efficient algorithm called Context-aware Decentralized PPO (CD-PPO). Through extensive experiments, we demonstrate that CD-PPO can accelerate the learning procedure compared with standard MARL algorithms.  ( 2 min )
    Forgetful Forests: high performance learning data structures for streaming data under concept drift. (arXiv:2212.07876v1 [cs.LG])
    Database research can help machine learning performance in many ways. One way is to design better data structures. This paper combines the use of incremental computation with sequential and probabilistic filtering to enable "forgetful" tree-based learning algorithms to cope with concept drift data (i.e., data whose function from input to classification changes over time). The forgetful algorithms described in this paper achieve high time performance while maintaining high quality predictions on streaming data. Specifically, the algorithms are up to 24 times faster than state-of-the-art incremental algorithms with at most a 2% loss of accuracy, or at least twice as fast without any loss of accuracy. This makes such structures suitable for high volume streaming applications.  ( 2 min )
    TeTIm-Eval: a novel curated evaluation data set for comparing text-to-image models. (arXiv:2212.07839v1 [cs.CV])
    Evaluating and comparing text-to-image models is a challenging problem. Significant advances in the field have recently been made, piquing the interest of various industrial sectors. As a consequence, a gold standard in the field should cover a variety of tasks and application contexts. In this paper, a novel evaluation approach is investigated, on the basis of: (i) a curated data set, made of high-quality royalty-free image-text pairs, divided into ten categories; (ii) a quantitative metric, the CLIP-score; and (iii) a human evaluation task to distinguish, for a given text, the real and the generated images. The proposed method has been applied to the most recent models, i.e., DALLE2, Latent Diffusion, Stable Diffusion, GLIDE and Craiyon. Early experimental results show that the accuracy of the human judgement is fully coherent with the CLIP-score. The dataset has been made available to the public.  ( 2 min )
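    One common way to compute the CLIP-score is the cosine similarity between CLIP's text and image embeddings; whether the paper uses exactly this normalization is not stated in the abstract, so the snippet below is a plausible sketch using a Hugging Face transformers checkpoint, with a hypothetical generated image file:

        import torch
        from PIL import Image
        from transformers import CLIPModel, CLIPProcessor

        model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
        processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

        image = Image.open("generated.png")  # hypothetical generated image
        inputs = processor(text=["a photo of a red bicycle"], images=image,
                           return_tensors="pt", padding=True)
        with torch.no_grad():
            out = model(**inputs)
            img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
            txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
        print("CLIP-score:", (img @ txt.T).item())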
    Monitoring MBE substrate deoxidation via RHEED image-sequence analysis by deep learning. (arXiv:2210.03430v2 [cond-mat.mes-hall] UPDATED)
    Reflection high-energy electron diffraction (RHEED) is a powerful tool in molecular beam epitaxy (MBE), but RHEED images are often difficult to interpret, requiring experienced operators. We present an approach for automated surveillance of GaAs substrate deoxidation in MBE reactors using deep-learning-based RHEED image-sequence classification. Our approach consists of an unsupervised auto-encoder (AE) for feature extraction, combined with a supervised convolutional classifier network. We demonstrate that our lightweight network model can accurately identify the exact deoxidation moment. Furthermore, we show that the approach is very robust and allows accurate deoxidation detection over months without requiring re-training. The main advantage of the approach is that it can be applied to raw RHEED images without requiring further information such as the rotation angle, temperature, etc.  ( 2 min )
    PulseImpute: A Novel Benchmark Task for Pulsative Physiological Signal Imputation. (arXiv:2212.07514v1 [cs.LG])
    The promise of Mobile Health (mHealth) is the ability to use wearable sensors to monitor participant physiology at high frequencies during daily life to enable temporally-precise health interventions. However, a major challenge is frequent missing data. Despite a rich imputation literature, existing techniques are ineffective for the pulsative signals which comprise many mHealth applications, and a lack of available datasets has stymied progress. We address this gap with PulseImpute, the first large-scale pulsative signal imputation challenge which includes realistic mHealth missingness models, an extensive set of baselines, and clinically-relevant downstream tasks. Our baseline models include a novel transformer-based architecture designed to exploit the structure of pulsative signals. We hope that PulseImpute will enable the ML community to tackle this significant and challenging task.  ( 2 min )
    Tensions Between the Proxies of Human Values in AI. (arXiv:2212.07508v1 [cs.LG])
    Motivated by mitigating potentially harmful impacts of technologies, the AI community has formulated and accepted mathematical definitions for certain pillars of accountability: e.g. privacy, fairness, and model transparency. Yet, we argue this is fundamentally misguided because these definitions are imperfect, siloed constructions of the human values they hope to proxy, while giving the guise that those values are sufficiently embedded in our technologies. Under popularized methods, tensions arise when practitioners attempt to achieve each pillar of fairness, privacy, and transparency in isolation or simultaneously. In this position paper, we push for redirection. We argue that the AI community needs to consider all the consequences of choosing certain formulations of these pillars -- not just the technical incompatibilities, but also the effects within the context of deployment. We point towards sociotechnical research for frameworks for the latter, but push for broader efforts into implementing these in practice.  ( 2 min )
    Mod-Squad: Designing Mixture of Experts As Modular Multi-Task Learners. (arXiv:2212.08066v1 [cs.CV])
    Optimization in multi-task learning (MTL) is more challenging than single-task learning (STL), as the gradient from different tasks can be contradictory. When tasks are related, it can be beneficial to share some parameters among them (cooperation). However, some tasks require additional parameters with expertise in a specific type of data or discrimination (specialization). To address the MTL challenge, we propose Mod-Squad, a new model that is Modularized into groups of experts (a 'Squad'). This structure allows us to formalize cooperation and specialization as the process of matching experts and tasks. We optimize this matching process during the training of a single model. Specifically, we incorporate mixture of experts (MoE) layers into a transformer model, with a new loss that incorporates the mutual dependence between tasks and experts. As a result, only a small set of experts are activated for each task. This prevents the sharing of the entire backbone model between all tasks, which strengthens the model, especially when the training set size and the number of tasks scale up. More interestingly, for each task, we can extract the small set of experts as a standalone model that maintains the same performance as the large model. Extensive experiments on the Taskonomy dataset with 13 vision tasks and the PASCAL-Context dataset with 5 vision tasks show the superiority of our approach.  ( 2 min )
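    The expert-task matching can be pictured with a toy layer: give each task its own routing logits so only a few experts fire per task. The sketch below is a generic task-conditioned MoE in PyTorch, omitting the paper's mutual-dependence loss and transformer integration; all sizes are illustrative:

        import torch
        import torch.nn as nn

        class TaskMoE(nn.Module):
            # Mixture-of-experts layer with task-conditioned top-k routing.
            def __init__(self, dim, n_experts, n_tasks, top_k=2):
                super().__init__()
                self.experts = nn.ModuleList(
                    [nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                   nn.Linear(4 * dim, dim)) for _ in range(n_experts)])
                self.gate = nn.Embedding(n_tasks, n_experts)  # per-task routing logits
                self.top_k = top_k

            def forward(self, x, task_id):
                logits = self.gate(task_id)
                w, idx = torch.topk(torch.softmax(logits, -1), self.top_k)
                w = w / w.sum()  # renormalize over the selected experts
                return sum(wi * self.experts[i](x) for wi, i in zip(w, idx.tolist()))

        layer = TaskMoE(dim=64, n_experts=8, n_tasks=13)
        y = layer(torch.randn(4, 64), torch.tensor(3))  # task 3 activates only 2 experts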
    Population Template-Based Brain Graph Augmentation for Improving One-Shot Learning Classification. (arXiv:2212.07790v1 [q-bio.NC])
    The challenges of collecting medical data for neurological disorder diagnosis problems paved the way for learning methods that work with a scarce number of samples. For this reason, one-shot learning remains one of the most challenging and trending concepts of deep learning, as it proposes to simulate a human-like learning approach in classification problems. Previous studies have focused on generating more accurate fingerprints of the population using graph neural networks (GNNs) with connectomic brain graph data. The generated population fingerprints, named connectional brain templates (CBTs), enabled the detection of discriminative bio-markers of the population in classification tasks. However, the reverse problem of data augmentation from a single graph representing brain connectivity has never been tackled before. In this paper, we propose an augmentation pipeline in order to provide improved metrics on our binary classification problem. In contrast to previous studies, we examine augmentation from a single population template by utilizing a graph-based generative adversarial network (gGAN) architecture for a classification problem. We benchmarked our proposed solution on an AD/LMCI dataset consisting of brain connectomes with Alzheimer's Disease (AD) and Late Mild Cognitive Impairment (LMCI). In order to evaluate our model's generalizability, we used a cross-validation strategy and randomly sampled the folds multiple times. Our results on classification not only provide better accuracy when augmented data generated from one sample is introduced, but also yield more balanced results on the other metrics.
    Scalable Bayesian Uncertainty Quantification for Neural Network Potentials: Promise and Pitfalls. (arXiv:2212.07959v1 [physics.chem-ph])
    Neural network (NN) potentials promise highly accurate molecular dynamics (MD) simulations within the computational complexity of classical MD force fields. However, when applied outside their training domain, NN potential predictions can be inaccurate, increasing the need for Uncertainty Quantification (UQ). Bayesian modeling provides the mathematical framework for UQ, but classical Bayesian methods based on Markov chain Monte Carlo (MCMC) are computationally intractable for NN potentials. By training graph NN potentials for coarse-grained systems of liquid water and alanine dipeptide, we demonstrate here that scalable Bayesian UQ via stochastic gradient MCMC (SG-MCMC) yields reliable uncertainty estimates for MD observables. We show that cold posteriors can reduce the required training data size and that for reliable UQ, multiple Markov chains are needed. Additionally, we find that SG-MCMC and the Deep Ensemble method achieve comparable results, despite shorter training and less hyperparameter tuning of the latter. We show that both methods can capture aleatoric and epistemic uncertainty reliably, but not systematic uncertainty, which needs to be minimized by adequate modeling to obtain accurate credible intervals for MD observables. Our results represent a step towards accurate UQ that is of vital importance for trustworthy NN potential-based MD simulations required for decision-making in practice.  ( 2 min )
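    The simplest sampler in the SG-MCMC family the paper draws on is stochastic gradient Langevin dynamics (SGLD), whose update is a half gradient step on the negative log-posterior plus matched Gaussian noise. A minimal sketch (the paper trains graph NN potentials; here `loss` is assumed to be an unbiased minibatch estimate of the full-data negative log-posterior):

        import torch

        def sgld_step(params, loss, lr=1e-4):
            # theta <- theta - (lr/2) * grad U(theta) + N(0, lr)
            grads = torch.autograd.grad(loss, params)
            with torch.no_grad():
                for p, g in zip(params, grads):
                    p.add_(-0.5 * lr * g + lr ** 0.5 * torch.randn_like(p))

    Iterating this step with a decaying step size, and running several independent chains as the paper recommends, yields the posterior samples behind the uncertainty estimates.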
    Robustness Evaluation of Regression Tasks with Skewed Domain Preferences. (arXiv:2212.07562v1 [cs.LG])
    In natural phenomena, data distributions often deviate from normality. One can think of cataclysms as a self-explanatory example: events that occur almost never, and at the same time are many standard deviations away from the common outcome. In many scientific contexts, it is exactly these tail events that researchers are most interested in anticipating, so that adequate measures can be taken to prevent or attenuate a major impact on society. Despite such efforts, we have yet to provide definite answers to crucial issues in evaluating predictive solutions in domains such as weather, pollution, and health. In this paper, we deal with two encapsulated problems simultaneously. First, assessing the performance of regression models when non-uniform preferences apply: not all values are equally relevant concerning the accuracy of their prediction, and there is a particular interest in the most extreme values. Second, assessing the robustness of models when dealing with uncertainty regarding the actual underlying distribution of values relevant to such problems. We show how different levels of relevance associated with target values may impact experimental conclusions, and demonstrate the practical utility of the proposed methods.  ( 2 min )
    FIS-GAN: GAN with Flow-based Importance Sampling. (arXiv:1910.02519v3 [cs.LG] UPDATED)
    The training process of Generative Adversarial Networks (GANs), in most cases, applies uniform or Gaussian sampling methods in the latent space, which probably spends most of the computation on examples that can be properly handled and are easy to generate. Theoretically, importance sampling speeds up stochastic optimization in supervised learning by prioritizing training examples. In this paper, we explore the possibility of adapting importance sampling to adversarial learning. We use importance sampling to replace the uniform and Gaussian sampling methods in the latent space, and employ a normalizing flow to approximate the latent-space posterior distribution by density estimation. Empirically, results on MNIST and Fashion-MNIST demonstrate that our method significantly accelerates GAN optimization while retaining visual fidelity in generated samples.  ( 2 min )
    Transposed Variational Auto-encoder with Intrinsic Feature Learning for Traffic Forecasting. (arXiv:2211.00641v4 [cs.LG] UPDATED)
    In this technical report, we present our solutions to the Traffic4cast 2022 core challenge and extended challenge. In this competition, the participants are required to predict the traffic states for the future 15-minute based on the vehicle counter data in the previous hour. Compared to other competitions in the same series, this year focuses on the prediction of different data sources and sparse vertex-to-edge generalization. To address these issues, we introduce the Transposed Variational Auto-encoder (TVAE) model to reconstruct the missing data and Graph Attention Networks (GAT) to strengthen the correlations between learned representations. We further apply feature selection to learn traffic patterns from diverse but easily available data. Our solutions have ranked first in both challenges on the final leaderboard. The source code is available at \url{https://github.com/Daftstone/Traffic4cast}  ( 2 min )
    CLAM: Selective Clarification for Ambiguous Questions with Large Language Models. (arXiv:2212.07769v1 [cs.CL])
    State-of-the-art language models are often accurate on many question-answering benchmarks with well-defined questions. Yet, in real settings questions are often unanswerable without asking the user for clarifying information. We show that current SotA models often do not ask the user for clarification when presented with imprecise questions and instead provide incorrect answers or "hallucinate". To address this, we introduce CLAM, a framework that first uses the model to detect ambiguous questions, and if an ambiguous question is detected, prompts the model to ask the user for clarification. Furthermore, we show how to construct a scalable and cost-effective automatic evaluation protocol using an oracle language model with privileged information to provide clarifying information. We show that our method achieves a 20.15 percentage point accuracy improvement over SotA on a novel ambiguous question-answering data set derived from TriviaQA.  ( 2 min )
    Let's consider more general nonlinear approaches to study teleconnections of climate variables. (arXiv:2212.07635v1 [cs.LG])
    The recent work by (Rieger et al., 2021) is concerned with the problem of extracting features from spatio-temporal geophysical signals. The authors introduce the complex rotated MCA (xMCA) to deal with lagged effects and non-orthogonality of the feature representation. This method essentially (1) transforms the signals to the complex plane with the Hilbert transform; (2) applies an oblique (Varimax and Promax) rotation to remove the orthogonality constraint; and (3) performs the eigendecomposition in this complex space (Horel et al., 1984). We argue that this method is essentially a particular case of the method called rotated complex kernel principal component analysis (ROCK-PCA) introduced in (Bueso et al., 2019, 2020), where we proposed the same approach: first transform the data to the complex plane with the Hilbert transform and then apply the Varimax rotation, with the only difference being that the eigendecomposition is performed in the dual (kernel) Hilbert space. The latter allows us to generalize the xMCA solution by extracting nonlinear (curvilinear) features when nonlinear kernel functions are used. Hence, the solution of xMCA boils down to ROCK-PCA when the inner product is computed in the input data space instead of in the high-dimensional (possibly infinite) kernel Hilbert space to which the data has been mapped. In this short correspondence, we give a theoretical proof that xMCA is a special case of ROCK-PCA, and provide quantitative evidence that more expressive and informative features can be extracted when working with kernels; results of the decomposition of global sea surface temperature (SST) fields are shown to illustrate the capabilities of ROCK-PCA to cope with nonlinear processes, unlike xMCA.  ( 2 min )
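    The shared first steps of both methods are compact enough to sketch: form the analytic signal with a Hilbert transform, then eigendecompose the Hermitian covariance. The sketch below covers only this common linear core (the Varimax/Promax rotation of xMCA and the kernelization of ROCK-PCA are omitted):

        import numpy as np
        from scipy.signal import hilbert

        def complex_pca(X, n_modes=3):
            # X: (time, space) array of detrended time series.
            Z = hilbert(X, axis=0)            # analytic signal captures lagged covariances
            C = Z.conj().T @ Z / Z.shape[0]   # Hermitian covariance
            vals, vecs = np.linalg.eigh(C)
            order = np.argsort(vals)[::-1][:n_modes]
            return vals[order], vecs[:, order]  # leading complex spatial modes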
    Generating multivariate time series with COmmon Source CoordInated GAN (COSCI-GAN). (arXiv:2205.13741v2 [cs.LG] UPDATED)
    Generating multivariate time series is a promising approach for sharing sensitive data in many medical, financial, and IoT applications. A common type of multivariate time series originates from a single source, such as the biometric measurements of a medical patient. This leads to complex dynamical patterns between individual time series that are hard to learn by typical generation models such as GANs. There is valuable information in those patterns that machine learning models can use to better classify, predict, or perform other downstream tasks. We propose a novel framework that takes time series' common origin into account and favors the preservation of channel/feature relationships. The two key points of our method are: 1) the individual time series are generated from a common point in latent space, and 2) a central discriminator favors the preservation of inter-channel/feature dynamics. We demonstrate empirically that our method helps preserve channel/feature correlations and that our synthetic data performs very well in downstream tasks with medical and financial data.  ( 2 min )
    Robust Policy Optimization in Deep Reinforcement Learning. (arXiv:2212.07536v1 [cs.LG])
    The policy gradient method enjoys the simplicity of an objective in which the agent optimizes the cumulative reward directly. Moreover, in the continuous action domain, a parameterized action distribution allows easy control of exploration through the variance of the representing distribution. Entropy can play an essential role in policy optimization by selecting the stochastic policy, which eventually helps better explore the environment in reinforcement learning (RL). However, the stochasticity often reduces as training progresses; thus, the policy becomes less exploratory. Additionally, certain parametric distributions might only work for some environments and require extensive hyperparameter tuning. This paper aims to mitigate these issues. In particular, we propose an algorithm called Robust Policy Optimization (RPO), which leverages a perturbed distribution. We hypothesize that our method encourages high-entropy actions and provides a way to represent the action space better. We further provide empirical evidence to verify our hypothesis. We evaluated our methods on various continuous control tasks from DeepMind Control, OpenAI Gym, Pybullet, and IsaacGym. We observed that in many settings, RPO increases the policy entropy early in training and then maintains a certain level of entropy throughout the training period. Eventually, our agent RPO shows consistently improved performance compared to PPO and other techniques: entropy regularization, different distributions, and data augmentation. Furthermore, in several settings, our method stays robust in performance, while other baseline mechanisms fail to improve and even worsen the performance.  ( 2 min )
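    The abstract describes RPO as acting on a perturbed distribution; one way to realize this, sketched below, is to shift the network-predicted Gaussian mean by uniform noise before sampling, which keeps the effective entropy from collapsing. This follows the abstract's description rather than the paper's exact code, and alpha is an assumed hyperparameter:

        import torch

        def rpo_action(mean, log_std, alpha=0.5):
            # Perturb the predicted mean with uniform noise in [-alpha, alpha].
            noise = torch.empty_like(mean).uniform_(-alpha, alpha)
            dist = torch.distributions.Normal(mean + noise, log_std.exp())
            action = dist.sample()
            return action, dist.log_prob(action).sum(-1)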
    Counterfactual Explanations for Support Vector Machine Models. (arXiv:2212.07432v1 [cs.LG])
    We tackle the problem of computing counterfactual explanations -- minimal changes to the features that flip an undesirable model prediction. We propose a solution to this question for linear Support Vector Machine (SVM) models. Moreover, we introduce a way to account for weighted actions that allow for more changes in certain features than in others. In particular, we show how to find counterfactual explanations with the purpose of increasing model interpretability. These explanations are valid, change only actionable features, are close to the data distribution, are sparse, and take into account correlations between features. We cast this as a mixed integer programming optimization problem. Additionally, we introduce two novel scale-invariant cost functions for assessing the quality of counterfactual explanations and use them to evaluate the quality of our approach on a real medical dataset. Finally, we build a support vector machine model to predict whether law students will pass the Bar exam using protected features, and use our algorithms to uncover the inherent biases of the SVM.  ( 2 min )
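    For the unweighted, unconstrained case, the linear-SVM counterfactual has a closed form worth noting: for f(x) = w.x + b, the minimal L2 change that flips the prediction moves x along w just past the hyperplane. The paper's MIP formulation adds sparsity, actionability, and plausibility constraints that this sketch ignores:

        import numpy as np

        def linear_svm_counterfactual(clf, x, eps=1e-3):
            # clf: any fitted linear classifier exposing coef_ and intercept_.
            w, b = clf.coef_.ravel(), clf.intercept_[0]
            f = w @ x + b
            # Project onto the decision hyperplane, then step slightly beyond it.
            return x - (1 + eps) * f / (w @ w) * w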
    Neural Neural Textures Make Sim2Real Consistent. (arXiv:2206.13500v2 [cs.CV] UPDATED)
    Unpaired image translation algorithms can be used for sim2real tasks, but many fail to generate temporally consistent results. We present a new approach that combines differentiable rendering with image translation to achieve temporal consistency over indefinite timescales, using surface consistency losses and \emph{neural neural textures}. We call this algorithm TRITON (Texture Recovering Image Translation Network): an unsupervised, end-to-end, stateless sim2real algorithm that leverages the underlying 3D geometry of input scenes by generating realistic-looking learnable neural textures. By settling on a particular texture for the objects in a scene, we ensure consistency between frames statelessly. Unlike previous algorithms, TRITON is not limited to camera movements -- it can handle the movement of objects as well, making it useful for downstream tasks such as robotic manipulation.  ( 2 min )
    Multiclass classification utilising an estimated algorithmic probability prior. (arXiv:2212.07426v1 [cs.LG])
    Methods of pattern recognition and machine learning are applied extensively in science, technology, and society. Hence, any advances in related theory may translate into large-scale impact. Here we explore how algorithmic information theory, especially algorithmic probability, may aid in a machine learning task. We study a multiclass supervised classification problem, namely learning the RNA molecule sequence-to-shape map, where the different possible shapes are taken to be the classes. The primary motivation for this work is a proof of concept example, where a concrete, well-motivated machine learning task can be aided by approximations to algorithmic probability. Our approach is based on directly estimating the class (i.e., shape) probabilities from shape complexities, and using the estimated probabilities as a prior in a Gaussian process learning problem. Naturally, with a large amount of training data, the prior has no significant influence on classification accuracy, but in the very small training data regime, we show that using the prior can substantially improve classification accuracy. To our knowledge, this work is one of the first to demonstrate how algorithmic probability can aid in a concrete, real-world, machine learning problem.  ( 2 min )
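    The prior construction can be illustrated with a crude proxy: algorithmic probability assigns P(c) ~ 2^(-K(c)), and since the Kolmogorov complexity K is uncomputable, a compressed length can serve as an upper-bound stand-in. The paper estimates RNA shape complexities differently; the zlib-based sketch below only illustrates turning complexities into a class prior:

        import zlib
        import numpy as np

        def complexity_prior(class_descriptions):
            # Approximate K(c) by the compressed length of each class description.
            k = np.array([len(zlib.compress(d.encode())) for d in class_descriptions])
            p = np.exp2(-k.astype(float))  # P(c) ~ 2^(-K(c))
            return p / p.sum()

        # Hypothetical dot-bracket RNA shape strings; simpler shapes get more prior mass.
        print(complexity_prior(["((((...))))", "(((..)))..((..))", ".((...))."]))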
    DUIDD: Deep-Unfolded Interleaved Detection and Decoding for MIMO Wireless Systems. (arXiv:2212.07816v1 [cs.IT])
    Iterative detection and decoding (IDD) is known to achieve near-capacity performance in multi-antenna wireless systems. We propose deep-unfolded interleaved detection and decoding (DUIDD), a new paradigm that reduces the complexity of IDD while achieving even lower error rates. DUIDD interleaves the inner stages of the data detector and channel decoder, which expedites convergence and reduces complexity. Furthermore, DUIDD applies deep unfolding to automatically optimize algorithmic hyperparameters, soft-information exchange, message damping, and state forwarding. We demonstrate the efficacy of DUIDD using NVIDIA's Sionna link-level simulator in a 5G-near multi-user MIMO-OFDM wireless system with a novel low-complexity soft-input soft-output data detector, an optimized low-density parity-check decoder, and channel vectors from a commercial ray-tracer. Our results show that DUIDD outperforms classical IDD both in terms of block error rate and computational complexity.  ( 2 min )
    Output-Dependent Gaussian Process State-Space Model. (arXiv:2212.07608v1 [cs.LG])
    Gaussian process state-space model (GPSSM) is a fully probabilistic state-space model that has attracted much attention over the past decade. However, the outputs of the transition function in existing GPSSMs are assumed to be independent, meaning that the GPSSMs cannot exploit the inductive biases between different outputs and thus lose some model capacity. To address this issue, this paper proposes an output-dependent and more realistic GPSSM by utilizing the well-known, simple yet practical linear model of coregionalization (LMC) framework to represent the output dependency. To jointly learn the output-dependent GPSSM and infer the latent states, we propose a variational sparse GP-based learning method that only gently increases the computational complexity. Experiments on both synthetic and real datasets demonstrate the superiority of the output-dependent GPSSM in terms of learning and inference performance.  ( 2 min )
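    For reference, the LMC construction the paper builds on can be stated in one line: each of the $D$ outputs is a linear mixture of $Q$ shared latent GPs, $f_d(x) = \sum_{q=1}^{Q} a_{d,q}\, u_q(x)$ with $u_q \sim \mathcal{GP}(0, k_q)$, which induces the cross-output covariance $\mathrm{Cov}[f_d(x), f_{d'}(x')] = \sum_{q=1}^{Q} a_{d,q} a_{d',q}\, k_q(x, x')$; the mixing weights $a_{d,q}$ are what encode the output dependency (generic notation, not necessarily the paper's).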
    Variable Clustering via Distributionally Robust Nodewise Regression. (arXiv:2212.07944v1 [cs.LG])
    We study a multi-factor block model for variable clustering and connect it to the regularized subspace clustering by formulating a distributionally robust version of the nodewise regression. To solve the latter problem, we derive a convex relaxation, provide guidance on selecting the size of the robust region, and hence the regularization weighting parameter, based on the data, and propose an ADMM algorithm for implementation. We validate our method in an extensive simulation study. Finally, we propose and apply a variant of our method to stock return data, obtain interpretable clusters that facilitate portfolio selection and compare its out-of-sample performance with other clustering methods in an empirical study.  ( 2 min )
    Projection-Domain Self-Supervision for Volumetric Helical CT Reconstruction. (arXiv:2212.07431v1 [eess.IV])
    We propose a deep learning method for three-dimensional reconstruction in low-dose helical cone-beam computed tomography. We reconstruct the volume directly, i.e., not from 2D slices, guaranteeing consistency along all axes. In a crucial step beyond prior work, we train our model in a self-supervised manner in the projection domain using noisy 2D projection data, without relying on 3D reference data or the output of a reference reconstruction method. This means the fidelity of our results is not limited by the quality and availability of such data. We evaluate our method on real helical cone-beam projections and simulated phantoms. Our reconstructions are sharper and less noisy than those of previous methods, and several decibels better in quantitative PSNR measurements. When applied to full-dose data, our method produces high-quality results orders of magnitude faster than iterative techniques.  ( 2 min )
    Class-Aware Adversarial Transformers for Medical Image Segmentation. (arXiv:2201.10737v5 [cs.CV] UPDATED)
    Transformers have made remarkable progress towards modeling long-range dependencies within the medical image analysis domain. However, current transformer-based models suffer from several disadvantages: (1) existing methods fail to capture the important features of the images due to the naive tokenization scheme; (2) the models suffer from information loss because they only consider single-scale feature representations; and (3) the segmentation label maps generated by the models are not accurate enough without considering rich semantic contexts and anatomical textures. In this work, we present CASTformer, a novel type of adversarial transformers, for 2D medical image segmentation. First, we take advantage of the pyramid structure to construct multi-scale representations and handle multi-scale variations. We then design a novel class-aware transformer module to better learn the discriminative regions of objects with semantic structures. Lastly, we utilize an adversarial training strategy that boosts segmentation accuracy and correspondingly allows a transformer-based discriminator to capture high-level semantically correlated contents and low-level anatomical features. Our experiments demonstrate that CASTformer dramatically outperforms previous state-of-the-art transformer-based approaches on three benchmarks, obtaining 2.54%-5.88% absolute improvements in Dice over previous models. Further qualitative experiments provide a more detailed picture of the model's inner workings, shed light on the challenges in improved transparency, and demonstrate that transfer learning can greatly improve performance and reduce the size of medical image datasets in training, making CASTformer a strong starting point for downstream medical image analysis tasks.  ( 3 min )
    A machine learning model to identify corruption in México's public procurement contracts. (arXiv:2211.01478v2 [cs.CY] UPDATED)
    The costs and impacts of government corruption range from impairing a country's economic growth to affecting its citizens' well-being and safety. Public contracting between government agencies and private-sector entities, referred to as public procurement, is fertile ground for corrupt practices, generating substantial monetary losses worldwide. Thus, identifying and deterring corrupt activities between the government and the private sector is paramount. However, due to several factors, corruption in public procurement is challenging to identify and track, leading to corrupt practices going unnoticed. This paper proposes a machine learning model based on an ensemble of random forest classifiers, which we call a hyper-forest, to identify and predict corrupt contracts in México's public procurement data. The method correctly detects most of the corrupt and non-corrupt contracts evaluated in the dataset. Furthermore, we found that the most critical predictors considered in the model are those related to the relationship between buyers and suppliers rather than those related to features of individual contracts. The method proposed here is also general enough to be trained with data from other countries. Overall, our work presents a tool that can help in the decision-making process to identify, predict and analyze corruption in public procurement contracts.  ( 2 min )
    Runtime Monitoring for Out-of-Distribution Detection in Object Detection Neural Networks. (arXiv:2212.07773v1 [cs.LG])
    Runtime monitoring provides a more realistic and applicable alternative to verification in the setting of real neural networks used in industry. It is particularly useful for detecting out-of-distribution (OOD) inputs, for which the network was not trained and can yield erroneous results. We extend a runtime-monitoring approach previously proposed for classification networks to perception systems capable of identification and localization of multiple objects. Furthermore, we analyze its adequacy experimentally on different kinds of OOD settings, documenting the overall efficacy of our approach.  ( 2 min )
    Invariant Lipschitz Bandits: A Side Observation Approach. (arXiv:2212.07524v1 [cs.LG])
    Symmetry arises in many optimization and decision-making problems, and has attracted considerable attention from the optimization community: by utilizing the existence of such symmetries, the process of searching for optimal solutions can be improved significantly. Despite its success in (offline) optimization, the utilization of symmetries has not been well examined within online optimization settings, especially in the bandit literature. As such, in this paper we study the invariant Lipschitz bandit setting, a subclass of the Lipschitz bandits where the reward function and the set of arms are preserved under a group of transformations. We introduce an algorithm named \texttt{UniformMesh-N}, which naturally integrates side observations using group orbits into the \texttt{UniformMesh} algorithm (Kleinberg, 2005), which uniformly discretizes the set of arms. Using the side-observation approach, we prove an improved regret upper bound, which depends on the cardinality of the group, given that the group is finite. We also prove a matching lower bound on the regret for the invariant Lipschitz bandit class (up to logarithmic factors). We hope that our work will ignite further investigation of symmetry in bandit theory and sequential decision-making theory in general.  ( 2 min )
    A scalable framework for annotating photovoltaic cell defects in electroluminescence images. (arXiv:2212.07768v1 [cs.CV])
    The correct functioning of photovoltaic (PV) cells is critical to ensuring the optimal performance of a solar plant. Anomaly detection techniques for PV cells can result in significant cost savings in operation and maintenance (O&M). Recent research has focused on deep learning techniques for automatically detecting anomalies in Electroluminescence (EL) images. Automated anomaly annotations can improve current O&M methodologies and help develop decision-making systems to extend the life-cycle of PV cells and predict failures. This paper addresses the lack of anomaly segmentation annotations in the literature by proposing a combination of state-of-the-art data-driven techniques to create a Golden Standard benchmark. The proposed method stands out for (1) its adaptability to new PV cell types, (2) its cost-efficient fine-tuning, and (3) its leveraging of public datasets to generate advanced annotations. The methodology has been validated on the annotation of a widely used dataset, reducing the annotation cost by 60%.  ( 2 min )
  • Open

    Construction of a Surrogate Model: Multivariate Time Series Prediction with a Hybrid Model. (arXiv:2212.07918v1 [stat.ML])
    Recent developments of advanced driver-assistance systems necessitate an increasing number of tests to validate new technologies. These tests cannot be carried out on track in a reasonable amount of time and automotive groups rely on simulators to perform most tests. The reliability of these simulators for constantly refined tasks is becoming an issue and, to increase the number of tests, the industry is now developing surrogate models, that should mimic the behavior of the simulator while being much faster to run on specific tasks. In this paper we aim to construct a surrogate model to mimic and replace the simulator. We first test several classical methods such as random forests, ridge regression or convolutional neural networks. Then we build three hybrid models that use all these methods and combine them to obtain an efficient hybrid surrogate model.
    Modelling stellar activity with Gaussian process regression networks. (arXiv:2205.06627v2 [astro-ph.EP] UPDATED)
    Stellar photospheric activity is known to limit the detection and characterisation of extra-solar planets. In particular, the study of Earth-like planets around Sun-like stars requires data analysis methods that can accurately model the stellar activity phenomena affecting radial velocity (RV) measurements. Gaussian Process Regression Networks (GPRNs) offer a principled approach to the analysis of simultaneous time-series, combining the structural properties of Bayesian neural networks with the non-parametric flexibility of Gaussian Processes. Using HARPS-N solar spectroscopic observations encompassing three years, we demonstrate that this framework is capable of jointly modelling RV data and traditional stellar activity indicators. Although we consider only the simplest GPRN configuration, we are able to describe the behaviour of solar RV data at least as accurately as previously published methods. We confirm the correlation between the RV and stellar activity time series reaches a maximum at separations of a few days, and find evidence of non-stationary behaviour in the time series, associated with an approaching solar activity minimum.
    Ungeneralizable Contextual Logistic Bandit in Credit Scoring. (arXiv:2212.07632v1 [stat.ML])
    The application of reinforcement learning in credit scoring has created a unique setting for contextual logistic bandit that does not conform to the usual exploration-exploitation tradeoff but rather favors exploration-free algorithms. Through sufficient randomness in a pool of observable contexts, the reinforcement learning agent can simultaneously exploit an action with the highest reward while still learning more about the structure governing that environment. Thus, it is the case that greedy algorithms consistently outperform algorithms with efficient exploration, such as Thompson sampling. However, in a more pragmatic credit-scoring scenario, lenders can, to a degree, classify each borrower into a separate group, and learning about the characteristics of one group conveys no information about another. Through extensive simulations, we show that Thompson sampling dominates greedy algorithms given enough timesteps, where the required number of timesteps increases with the complexity of the underlying features.
    Extending Universal Approximation Guarantees: A Theoretical Justification for the Continuity of Real-World Learning Tasks. (arXiv:2212.07934v1 [stat.ML])
    Universal Approximation Theorems establish the density of various classes of neural network function approximators in $C(K, \mathbb{R}^m)$, where $K \subset \mathbb{R}^n$ is compact. In this paper, we aim to extend these guarantees by establishing conditions on learning tasks that guarantee their continuity. We consider learning tasks given by conditional expectations $x \mapsto \mathrm{E}\left[Y \mid X = x\right]$, where the learning target $Y = f \circ L$ is a potentially pathological transformation of some underlying data-generating process $L$. Under a factorization $L = T \circ W$ for the data-generating process where $T$ is thought of as a deterministic map acting on some random input $W$, we establish conditions (that might be easily verified using knowledge of $T$ alone) that guarantee the continuity of practically \textit{any} derived learning task $x \mapsto \mathrm{E}\left[f \circ L \mid X = x\right]$. We motivate the realism of our conditions using the example of randomized stable matching, thus providing a theoretical justification for the continuity of real-world learning tasks.
    Towards Hardware-Specific Automatic Compression of Neural Networks. (arXiv:2212.07818v1 [cs.LG])
    Compressing neural network architectures is important to allow the deployment of models to embedded or mobile devices, and pruning and quantization are the major approaches to compress neural networks nowadays. Both methods benefit when compression parameters are selected specifically for each layer. Finding good combinations of compression parameters, so-called compression policies, is hard, as the problem spans an exponentially large search space. Effective compression policies consider the influence of the specific hardware architecture on the used compression methods. We propose an algorithmic framework called Galen to search for such policies using reinforcement learning utilizing pruning and quantization, thus providing automatic compression for neural networks. Contrary to other approaches, we use inference latency measured on the target hardware device as an optimization goal. With that, the framework supports the compression of models specific to a given hardware target. We validate our approach using three different reinforcement learning agents for pruning, quantization and joint pruning and quantization. Besides proving the functionality of our approach, we were able to compress a ResNet18 for CIFAR-10, on an embedded ARM processor, to 20% of the original inference latency without significant loss of accuracy. Moreover, we can demonstrate that a joint search and compression using pruning and quantization is superior to an individual search for policies using a single compression method.
    Reward Shaping for Human Learning via Inverse Reinforcement Learning. (arXiv:2002.10904v3 [cs.LG] UPDATED)
    Humans are spectacular reinforcement learners, constantly learning from and adjusting to experience and feedback. Unfortunately, this doesn't necessarily mean humans are fast learners. When tasks are challenging, learning can become unacceptably slow. Fortunately, humans do not have to learn tabula rasa, and learning speed can be greatly increased with learning aids. In this work we validate a new type of learning aid -- reward shaping for humans via inverse reinforcement learning (IRL). The goal of this aid is to increase the speed with which humans can learn good policies for specific tasks. Furthermore this approach complements alternative machine learning techniques such as safety features that try to prevent individuals from making poor decisions. To achieve our results we first extend a well known IRL algorithm via kernel methods. Afterwards we conduct two human subjects experiments using an online game where players have limited time to learn a good policy. We show with statistical significance that players who receive our learning aid are able to approach desired policies more quickly than the control group.
    Identifying AGN host galaxies with convolutional neural networks. (arXiv:2212.07881v1 [astro-ph.GA])
    Active galactic nuclei (AGN) are supermassive black holes with luminous accretion disks found in some galaxies, and are thought to play an important role in galaxy evolution. However, traditional optical spectroscopy for identifying AGN requires time-intensive observations. We train a convolutional neural network (CNN) to distinguish AGN host galaxies from non-active galaxies using a sample of 210,000 Sloan Digital Sky Survey galaxies. We evaluate the CNN on 33,000 galaxies that are spectrally classified as composites, and find correlations between galaxy appearances and their CNN classifications, which hint at evolutionary processes that affect both galaxy morphology and AGN activity. With the advent of the Vera C. Rubin Observatory, Nancy Grace Roman Space Telescope, and other wide-field imaging telescopes, deep learning methods will be instrumental for quickly and reliably shortlisting AGN samples for future analyses.
    Privately Estimating a Gaussian: Efficient, Robust and Optimal. (arXiv:2212.08018v1 [cs.DS])
    In this work, we give efficient algorithms for privately estimating a Gaussian distribution in both pure and approximate differential privacy (DP) models with optimal dependence on the dimension in the sample complexity. In the pure DP setting, we give an efficient algorithm that estimates an unknown $d$-dimensional Gaussian distribution up to an arbitrary tiny total variation error using $\widetilde{O}(d^2 \log \kappa)$ samples while tolerating a constant fraction of adversarial outliers. Here, $\kappa$ is the condition number of the target covariance matrix. The sample bound matches best non-private estimators in the dependence on the dimension (up to a polylogarithmic factor). We prove a new lower bound on differentially private covariance estimation to show that the dependence on the condition number $\kappa$ in the above sample bound is also tight. Prior to our work, only identifiability results (yielding inefficient super-polynomial time algorithms) were known for the problem. In the approximate DP setting, we give an efficient algorithm to estimate an unknown Gaussian distribution up to an arbitrarily tiny total variation error using $\widetilde{O}(d^2)$ samples while tolerating a constant fraction of adversarial outliers. Prior to our work, all efficient approximate DP algorithms incurred a super-quadratic sample cost or were not outlier-robust. For the special case of mean estimation, our algorithm achieves the optimal sample complexity of $\widetilde O(d)$, improving on a $\widetilde O(d^{1.5})$ bound from prior work. Our pure DP algorithm relies on a recursive private preconditioning subroutine that utilizes the recent work on private mean estimation [Hopkins et al., 2022]. Our approximate DP algorithms are based on a substantial upgrade of the method of stabilizing convex relaxations introduced in [Kothari et al., 2022].
    Sliced Optimal Partial Transport. (arXiv:2212.08049v1 [cs.LG])
    Optimal transport (OT) has become exceedingly popular in machine learning, data science, and computer vision. The core assumption in the OT problem is the equal total amount of mass in source and target measures, which limits its application. Optimal Partial Transport (OPT) is a recently proposed solution to this limitation. Similar to the OT problem, the computation of OPT relies on solving a linear programming problem (often in high dimensions), which can become computationally prohibitive. In this paper, we propose an efficient algorithm for calculating the OPT problem between two non-negative measures in one dimension. Next, following the idea of sliced OT distances, we utilize slicing to define the sliced OPT distance. Finally, we demonstrate the computational and accuracy benefits of the sliced OPT-based method in various numerical experiments. In particular, we show an application of our proposed Sliced-OPT in noisy point cloud registration.
    FIS-GAN: GAN with Flow-based Importance Sampling. (arXiv:1910.02519v3 [cs.LG] UPDATED)
    The training process of Generative Adversarial Networks (GANs), in most cases, applies uniform or Gaussian sampling in the latent space, which likely spends most of the computation on examples that can be properly handled and are easy to generate. Theoretically, importance sampling speeds up stochastic optimization in supervised learning by prioritizing training examples. In this paper, we explore the possibility of adapting importance sampling to adversarial learning. We use importance sampling to replace uniform and Gaussian sampling in the latent space and employ normalizing flows to approximate the latent-space posterior distribution via density estimation. Empirically, results on MNIST and Fashion-MNIST demonstrate that our method significantly accelerates GAN optimization while retaining visual fidelity in generated samples.

  • Open

    Making Sense of Artificial Intelligence: Eliezer Yudkowsky, Nick Bostrom, Max Tegmark, Stuart Russell, Eric Schmidt, Paul Bloom, Alison Gopnik, and David Deutsch | The Essential Sam Harris (Episode 1)
    submitted by /u/palsh7 [link] [comments]  ( 53 min )
    Can I use AI to come up with ideas similar to these?
    submitted by /u/TheblackRook3 [link] [comments]  ( 53 min )
    Stable diffusion for High(er) resolution images. How to?
    submitted by /u/Seahorsejockey [link] [comments]  ( 54 min )
    Nicklas Hansen, UCSD: On long-horizon planning and why algorithms don't drive research progress
    Listen to the podcast episode with Nicklas Hansen from UC San Diego where we discuss adapting reinforcement learning policies during deployment, why algorithms don't drive research progress, and much more! submitted by /u/thejashGI [link] [comments]  ( 53 min )
    I Asked ChatGPT if AI will replace Humans
    submitted by /u/Mk_Makanaki [link] [comments]  ( 57 min )
    Adversarial Discriminative Domain Adaptation (ADDA) Paper Explained
    Hello everyone, I wanted to share a new video I just released on YouTube here called "Adversarial Discriminative Domain Adaptation (ADDA) Paper Explained." It's a deep dive into the concepts and techniques of ADDA, which is a powerful method for adapting machine learning models to new domains. If you're interested in machine learning and domain adaptation, I think you'll really enjoy it. Thanks for considering giving it a watch, and I hope you find it helpful! As always, feedback is extremely welcomed! submitted by /u/Personal-Trainer-541 [link] [comments]  ( 56 min )
    Business adoption of AI has doubled in the last five years – McKinsey
    The annual survey on the state of AI from McKinsey's QuantumBlack artificial intelligence division shows business adoption of the technology has doubled in the last five years. Half of the survey's respondents said their business has adopted AI in at least one business area. That's up from 20% of respondents in 2017. McKinsey also found that AI is being embedded into a wider range of business capabilities. The average user of AI in business is now using the technology in 3.8 applications, compared to 1.9 in 2018. The use of AI in business covers a spectrum of applications, including process automation, digital twins, and facial recognition. This is from the AI With Vibes Newsletter, read the full issue here: https://aiwithvibes.beehiiv.com/p/ai-porn-billie-eilish-goes-viral-tiktok-chatgpt-brutally-destroyed-pun-competition submitted by /u/Mk_Makanaki [link] [comments]  ( 50 min )
    AI anime question
    I've seen an AI on the internet that makes photos look like anime, but I don't want the app; maybe an API or library to use in my own program instead. Can anyone help me? submitted by /u/gattolfo_EUG_ [link] [comments]  ( 53 min )
    AI Dream 59 - Towards the Sun - DJ Wizard69
    submitted by /u/LordPewPew777 [link] [comments]  ( 53 min )
    Soul mate AI possible within 10 years
    I wonder when we will have the perfect soul mate AI bot. A bot that functions as your best friend/soulmate, who can communicate with you, cheer you up, support you, entertain you, debate with you, make jokes that are perfect for your particular sense of humor, order your favourite food online and have it arrive a few minutes after you get home, etc. I believe this would be a game changer not just for old people who live alone, but for everyone. People would spend more time with that bot than with social media. When I see how good some of the recent AIs have become, I wonder how long it will take until we get this. What do you think? submitted by /u/Consistent-Put-6551 [link] [comments]  ( 53 min )
    Can the new AI tool ChatGPT replace human work?
    A recent article I read discussed the most current AI news. A brand-new AI product on the market is getting a lot of attention. ChatGPT is a piece of software that allows people to converse with a computer by typing in a question or task, and the program will create a response that is intended to appear human. It was trained on billions of text samples sourced from the Internet. One of its distinguishing features is its ability to grasp and generate natural language. This means it can answer naturally and conversationally, making it a helpful tool in various settings. However, how well the processing technology "understands" language is unknown. But still, it is getting people talking. According to the researchers, "you can have what looks dangerously similar to a human dialogue with it." ChatGPT has grown in popularity as a result of its ease of use. It has only been open to the public for 11 days, yet it already has over a million users, reaching that milestone faster than Facebook did. Still, it faces challenges that even the company behind it admits, such as a proclivity to produce "nonsense" as it advances. Please let me know if you have any recommendations or believe this topic might interest you. https://www.cbc.ca/news/business/chatgpt-artificial-intelligence-1.6681401 submitted by /u/ricks_cloud [link] [comments]  ( 50 min )
    Jeffrey Funk - Technology Innovation & Economics
    submitted by /u/timothy-ventura [link] [comments]  ( 50 min )
    How Positive Emotions in AI Chatbots Fall Flat
    submitted by /u/liquidocelotYT [link] [comments]  ( 51 min )
    who do you think has bigger chances of dominating the AI industry - Microsoft or Alphabet?
    submitted by /u/piranha_studio [link] [comments]  ( 49 min )
    Easy In-Depth Tutorial to Generate High Quality Seamless Textures with Stable Diffusion with Maps and importing into Unity, Link In Post!
    submitted by /u/AnonTopat [link] [comments]  ( 48 min )
    OpenAI's GPT-4 Coming Soon With 100,000,000,000,000 Parameters And Multimodal Input/Output, Meaning It Will Output Text, Audio, And Video
    submitted by /u/kenickh [link] [comments]  ( 50 min )
    Best Stable Diffusion 2.1 Style Embeddings Guide!
    submitted by /u/PuppetHere [link] [comments]  ( 51 min )
    OpenAI releases ChatGPT update and new embedding model
    submitted by /u/Number_5_alive [link] [comments]  ( 6 min )
    I created a short Quiz called "Which AI prompt was used?"
    submitted by /u/Mk_Makanaki [link] [comments]  ( 50 min )
    Asking AI to Visualize Itself: Dall-E
    I asked Dall-E what it thinks it looks like. It's pretty cute. https://preview.redd.it/0soebkoyw86a1.jpg?width=447&format=pjpg&auto=webp&s=b2f35ec2537e2f10a033cd4c87be1d5d7eb3029f submitted by /u/ReclusiveEagle [link] [comments]  ( 46 min )
    Is it possible to get into artificial intelligence with no coding knowledge?
    Very curious about AI and it seems like such an interesting concept, with seemingly infinite possibilities to help us with daily tasks, businesses, and so forth. I really want to get involved in AI and start creating my own projects but I did not study coding and feel like this would be a must for AI? Is it even remotely possible to get into AI and learn how to utilise its power without knowing fundamental coding principles? submitted by /u/-Adapted [link] [comments]  ( 52 min )
    AI Voice for Movie Translations
    I'm almost certain someone has already thought of this idea, but I had to write it out here anyway, just in case. I think a super interesting use for AI voice generation/cloning of someone else's voice could be to generate the vocal audio for movies in different languages. This could probably apply to the video too, since that's possible now as well. I think real acting and vocals are the way to go first, but this could also be a great fill-in if the actor doesn't know different languages, or if the movie's budget is too low to record in different languages. Then you get the actor's same exact voice, but in different languages! Sounds surreal, and super cool. There's a lot more CGI in movies nowadays, and that has become the norm, so I imagine this could have a similar trajectory (relating this to CGI: from feeling foreign to becoming more popular and part of the norm). What are your thoughts? Not sure if I like it, but I do like it as an idea at least 😆 Edit: Inspired by the Tailosive Tech live stream, when Drew and his dad were talking about Star Trek https://m.youtube.com/watch?v=SZikYMGEXe4 submitted by /u/Offroaders123 [link] [comments]  ( 53 min )
  • Open

    Technique for classifying a collection of words into a class [P]
    Hello, I am working on a problem where I have a collection of product descriptions (each a collection of 5-20 words) which I then need to classify into one of 110 classes. The description will sometimes vary between instances of the same class, and the descriptions sometimes contain incomplete and/or misspelled words. For example:

        Description                                 Class
        black rubber watch with water proof band    black 108 hex watch
        waterproof band                             black 108 hex watch
        black waterproof atch                       black 108 hex watch
        watch black                                 black 108 hex watch
        blue watch                                  blue 110 bit watch
        bit watch                                   blue 110 bit watch
        watch blue                                  blue 110 bit watch

    I have roughly 1,800 classified training data points, which are heavily skewed toward a few more popular products. Does anyone have advice for solving this classification problem? Right now I'm using BlazingText with word_ngrams = 5 and am achieving 50% accuracy. However, I am wondering if there is a simpler way to solve this? If you're interested, these are the current hyperparameters I'm using:

        bt_model = sagemaker.estimator.Estimator(
            container,
            role,
            instance_count=1,
            instance_type="ml.c4.4xlarge",
            volume_size=30,
            max_run=360000,
            input_mode="File",
            output_path=s3_output_location,
            hyperparameters={
                "mode": "supervised",
                "epochs": 500,
                "min_count": 1,
                "learning_rate": 0.05,
                "vector_dim": 10,
                "early_stopping": True,
                "patience": 5,
                "min_epochs": 10,
                "word_ngrams": 5,
            },
        )

    submitted by /u/Melampus123 [link] [comments]  ( 67 min )
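    One simpler baseline worth trying first, sketched under the assumption that the data is just parallel lists of description strings and class labels: character n-gram TF-IDF with a linear classifier, which trains in seconds on 1,800 examples and is tolerant of misspellings such as "atch".

        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.linear_model import LogisticRegression
        from sklearn.pipeline import make_pipeline

        # Illustrative rows taken from the example table above.
        descriptions = [
            "black rubber watch with water proof band",
            "waterproof band",
            "black waterproof atch",
            "watch black",
            "blue watch",
            "bit watch",
            "watch blue",
        ]
        labels = [
            "black 108 hex watch", "black 108 hex watch", "black 108 hex watch",
            "black 108 hex watch", "blue 110 bit watch", "blue 110 bit watch",
            "blue 110 bit watch",
        ]

        # Character n-grams make the features robust to typos:
        # "atch" still shares most of its 3-grams with "watch".
        model = make_pipeline(
            TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
            LogisticRegression(max_iter=1000),
        )
        model.fit(descriptions, labels)
        print(model.predict(["blak waterproof atch"]))  # likely: ['black 108 hex watch']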
    [D] Is this a fair analogy of a neural network's workings?
    For a blog post on the topic of AI art, aimed at a complete laypeople audience, I'm trying to roughly explain the training phase of neural networks, and by extension, how a GAN creates images, without going into much technical detail. The takeaway should be that a GAN does not copy parts of images in the training data, but extrapolates their common features. Could you tell me if my analogy below is acceptable, and if not, suggest how to change it? Imagine a neural network AI as a sea of numbers, the numbers representing interconnecting currents of various strengths. The AI is shown an image of a cat, pixels converted to numerical values, each pixel stuffed into a bottle and thrown into the sea, all along the length of the shore. Each bottle is caught in a current and is carried on to following currents. Some bottles are carried to the left, some to the right, and some sink. Across the sea, a number of bottles beach on cat island, which represents the word "cat". A second image of a different cat is divided and thrown into the sea, but this time not all the same bottles beach on cat island. Noticing this, the algorithm semi-randomly changes the strengths of the sea's currents until a majority of bottles reach cat island in both cases. The process is repeated with a million images until the streams are so aligned that they carry the most bottled pixels to cat island in most cases. The resulting optimised neural network can now recognise cat images by the sum of bottles that end up in the "cat" category. Similarly, an image generating AI can learn from which spots on the shore it should throw bottled pixels into the sea to reach cat island, i.e. how to form an image that the other image recognition AI will label "cat". submitted by /u/Don_Patrick [link] [comments]  ( 66 min )
    [R] Silent Bugs in Deep Learning Frameworks: An Empirical Study of Keras and TensorFlow
    https://arxiv.org/pdf/2112.13314.pdf Deep Learning (DL) frameworks are now widely used, simplifying the creation of complex models as well as their integration into various applications, even for non-DL experts. However, like any other program, they are prone to bugs. This paper deals with the subcategory of bugs named silent bugs: they lead to wrong behavior but they do not cause system crashes or hangs, nor show an error message to the user. Such bugs are even more dangerous in DL applications and frameworks due to the “black-box” and stochastic nature of the systems (the end user can not understand how the model makes decisions). This paper presents the first empirical study of Keras and TensorFlow silent bugs, and their impact on users’ programs. We extracted closed issues related to Keras from the TensorFlow GitHub repository. Out of the 1,168 issues that we gathered, 77 were reproducible silent bugs affecting users’ programs. We categorized the bugs based on the effects on the users’ programs and the components where the issues occurred, using information from the issue reports. We then derived a threat level for each of the issues, based on the impact they had on the users’ programs. To assess the relevance of identified categories and the impact scale, we conducted an online survey with 103 DL developers. The participants generally agreed with the significant impact of silent bugs in DL libraries and acknowledged our findings (i.e., categories of silent bugs and the proposed impact scale). Finally, leveraging our analysis, we provide a set of guidelines to facilitate safeguarding against such bugs in DL frameworks. submitted by /u/Ok-Teacher-22 [link] [comments]  ( 66 min )
    [D] Can You Generate Realistic Data With GPT-3?
    ChatGPT has taken the tech world by storm, but its older cousin GPT-3 is still relevant. Being able to connect to the text completion API through Python allows you to use the large language model to generate synthetic data with bespoke distributions and relationships. The application is constrained, however: without on-prem deployment, privacy concerns limit your ability to show the model your proprietary data to learn from. Real data is complex; what do people think about using LLMs to generate synthetic data? Should they just stick to writing stories and jokes? submitted by /u/Djinn_Tonic4DataSci [link] [comments]  ( 61 min )
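    As a concrete illustration of the pattern described above, a minimal sketch using the pre-1.0 openai Python package (the prompt, model name, and column schema are all illustrative, not from the post):

        import openai  # pip install "openai<1"

        openai.api_key = "sk-..."  # your API key

        # Describe the bespoke distribution/relationship you want in plain text.
        prompt = (
            "Generate 5 synthetic customer records as CSV with columns age,income. "
            "Older customers should tend to have higher incomes."
        )
        response = openai.Completion.create(
            model="text-davinci-003",
            prompt=prompt,
            max_tokens=200,
            temperature=0.7,
        )
        print(response["choices"][0]["text"])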
    [P] Possible NLP approaches to extract 'goals' from text
    I am planning to take up an interesting NLP project, but due to my limited exposure to NLP I am stuck at the moment. I want to extract 'goal' statements from lengthy reports. For example, the goals can be "We would be reducing our carbon footprint by 50% by 2025" or "Our company aims to increase the diversity in the work-force in upcoming months." See the image below for example text with highlighted goals. https://preview.redd.it/z6houyh7ra6a1.png?width=970&format=png&auto=webp&s=b3f6032bf14bff0932a6eee44444f86e5b82c67b How can I go about the process of goal extraction? I would like to get some pointers on possible NLP approaches I can start with. submitted by /u/8hubham [link] [comments]  ( 60 min )
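    One low-effort starting point, offered as a sketch rather than a definitive answer: run zero-shot classification over candidate sentences with a pretrained NLI model, assuming the transformers library is available.

        from transformers import pipeline

        classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

        sentences = [
            "We would be reducing our carbon footprint by 50% by 2025.",
            "The company was founded in 1998.",
        ]
        for sentence in sentences:
            result = classifier(sentence, candidate_labels=["goal or commitment", "other"])
            # Keep sentences the model confidently labels as goals.
            if result["labels"][0] == "goal or commitment" and result["scores"][0] > 0.8:
                print("GOAL:", sentence)

    Splitting reports into sentences first (e.g., with nltk or spacy) and then filtering this way gives a quick baseline that can later be improved with fine-tuning.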
    [D] What kind of effects ChatGPT or future developments may have on job market?
    I am actively using ChatGPT nowadays to seek assistance with various tasks, such as fixing grammatical errors in manuscripts and providing simplified/coherent explanations of technical jargon. This is giving me the impression that future jobs related to "writing", such as proofreaders, might be put out of business. submitted by /u/ureepamuree [link] [comments]  ( 66 min )
    [P] XetHub: We scaled Git to support 1 TB repos
    Thanks to everyone who replied to our earlier post requesting pre-launch product feedback! We’re excited to announce that we’ve now publicly launched XetHub, a collaborative storage platform for data management.

    I’ve been in the MLOps space for ~10 years, and data is still the hardest unsolved open problem. Code is versioned using Git, data is stored somewhere else, and context often lives in a 3rd location like Slack or GDocs. This is why we built XetHub, a platform that enables teams to treat data like code, using Git.

    Unlike Git LFS, XetHub doesn’t just store the files. It uses content-defined chunking and Merkle Trees to dedupe against everything in history, allowing small changes in large files to be stored compactly. Here’s how it works: https://xethub.com/assets/docs/how-xet-deduplication-works

    XetHub includes a GitHub-like web interface that provides automatic CSV summaries and allows custom visualizations using Vega. And we know how painful downloading a huge repository can get, so we built Git-Xet mount, which, in seconds, provides a user-mode filesystem view over the repo.

    Today, XetHub works for 1 TB repositories, and we plan to scale to 100 TB in the next year. Our implementation is in Rust (client & cache + storage) and our web application is written in Go. XetHub is available today for Linux & Mac (Windows coming soon) and we’d love for you to try it out!

    More info here: https://xetdata.com/blog/2022/12/13/introducing-xethub https://xetdata.com/blog/2022/10/15/why-xetdata Hacker News discussion (launched on Show HN at #1): https://news.ycombinator.com/item?id=33969908

    submitted by /u/rajatarya [link] [comments]  ( 69 min )
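    For intuition about content-defined chunking, here is a toy Python sketch (illustrative only; XetHub's actual implementation is in Rust and far more sophisticated). The key property is that chunk boundaries depend only on a small window of local content, so boundaries resynchronize after an edit and unchanged regions dedupe against history:

        import random

        def cdc_chunks(data: bytes, window: int = 16, mask: int = 0x3FF, min_size: int = 64):
            """Cut wherever a hash of the last `window` bytes matches a bit pattern."""
            chunks, start = [], 0
            for i in range(len(data)):
                if i + 1 < window or i + 1 - start < min_size:
                    continue
                h = 0
                for byte in data[i + 1 - window : i + 1]:  # hash only the local window
                    h = (h * 31 + byte) & 0xFFFFFFFF
                if h & mask == 0:  # boundary pattern hit (~1/1024 positions)
                    chunks.append(data[start : i + 1])
                    start = i + 1
            if start < len(data):
                chunks.append(data[start:])
            return chunks

        random.seed(0)
        a = bytes(random.randrange(256) for _ in range(20_000))
        b = a[:5_000] + b"EDITED" + a[5_000:]  # small insert near the middle
        ca, cb = cdc_chunks(a), cdc_chunks(b)
        print(f"{len(set(ca) & set(cb))} of {len(cb)} chunks unchanged after the edit")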
    [P] PyCM 3.7 released: ROC curve and Precision-Recall curve are added
    Hi ML practitioners, We wanted to bring to your attention another release of PyCM (the multi-class confusion matrix library in Python). In this version, the ROCCurve and PRCurve classes have been added to calculate and plot the ROC curve and the precision-recall curve, respectively. From now on, PyCM is able to calculate the area under the ROC and precision-recall curves for different threshold values using these new classes.

    ROC curve:

        >>> crv = ROCCurve(actual_vector=np.array([1, 1, 2, 2]), probs=np.array([[0.1, 0.9], [0.4, 0.6], [0.35, 0.65], [0.8, 0.2]]), classes=[2, 1])
        >>> crv.thresholds
        [0.1, 0.2, 0.35, 0.4, 0.6, 0.65, 0.8, 0.9]
        >>> auc_trp = crv.area()
        >>> auc_trp[1]
        0.75
        >>> auc_trp[2]
        0.75

    Precision-Recall curve:

        >>> crv = PRCurve(actual_vector=np.array([1, 1, 2, 2]), probs=np.array([[0.1, 0.9], [0.4, 0.6], [0.35, 0.65], [0.8, 0.2]]), classes=[2, 1])
        >>> crv.thresholds
        [0.1, 0.2, 0.35, 0.4, 0.6, 0.65, 0.8, 0.9]
        >>> auc_trp = crv.area()
        >>> auc_trp[1]
        0.29166666666666663
        >>> auc_trp[2]
        0.29166666666666663

    The complete change log of this version is available here. Website: www.pycm.io Repo: https://github.com/sepandhaghighi/pycm Hope you find it useful! submitted by /u/alirezazolanvari [link] [comments]  ( 63 min )
    [R] Ideas to combine multiple time series datasets into a single trainable dataset?
    I am working on a research project and am looking for some ideas from fellow researchers. I have track sensor data with several variables at millisecond resolution. Each 'dataset' is a session on the race track, and the variables are temporally and spatially correlated. I have data from multiple sessions, and I am training a classification model on these time series. How can I combine them into a single dataset? Is it even appropriate to do it this way, or should I rather train multiple models? I should add that the different sessions have high variance with respect to one another. submitted by /u/GingeryGnetum [link] [comments]  ( 64 min )
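    One common pattern, sketched under the assumption that each session is a (timesteps, n_vars) array: window each session separately so that no training example crosses a session boundary, and keep a group id per window so that whole sessions stay on one side of a train/validation split (which matters given the high between-session variance).

        import numpy as np

        def windows_per_session(sessions, win=200, stride=100):
            """Cut each session into fixed-length windows and stack them into one
            training set, remembering which session each window came from."""
            X, groups = [], []
            for sid, arr in enumerate(sessions):  # arr has shape (timesteps, n_vars)
                for start in range(0, len(arr) - win + 1, stride):
                    X.append(arr[start:start + win])
                    groups.append(sid)
            return np.stack(X), np.array(groups)

        sessions = [np.random.randn(1000, 6), np.random.randn(800, 6)]
        X, groups = windows_per_session(sessions)
        print(X.shape, groups.shape)  # (16, 200, 6) (16,)

    Using sklearn's GroupKFold on `groups` then ensures validation sessions are never seen during training.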
    [R] Image-and-Language Understanding from Pixels Only
    submitted by /u/recidivistic_shitped [link] [comments]  ( 61 min )
    [R] Are there open research problems in random forests?
    I'm intrigued by random forests but it looks like there's really no open problems in this area. A quick skim on Google Scholar shows, mostly, applications of random forests in various industries/problems. Are there research groups working on random forests? submitted by /u/SpookyTardigrade [link] [comments]  ( 66 min )
  • Open

    Next generation Amazon SageMaker Experiments – Organize, track, and compare your machine learning trainings at scale
    Today, we’re happy to announce updates to our Amazon SageMaker Experiments capability of Amazon SageMaker that lets you organize, track, compare and evaluate machine learning (ML) experiments and model versions from any integrated development environment (IDE) using the SageMaker Python SDK or boto3, including local Jupyter Notebooks. Machine learning (ML) is an iterative process. When solving […]  ( 11 min )
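    For readers who want the shape of the new API, a minimal sketch assuming a recent sagemaker SDK that ships the updated Experiments capability (experiment, run, and metric names are illustrative):

        from sagemaker.experiments.run import Run

        # Each Run groups the parameters and metrics of one training attempt.
        with Run(experiment_name="churn-model", run_name="baseline-1") as run:
            run.log_parameter("learning_rate", 0.05)
            # ... train the model here ...
            run.log_metric(name="val_accuracy", value=0.91, step=1)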
    Introducing Fortuna: A library for uncertainty quantification
    Proper estimation of predictive uncertainty is fundamental in applications that involve critical decisions. Uncertainty can be used to assess the reliability of model predictions, trigger human intervention, or decide whether a model can be safely deployed in the wild. We introduce Fortuna, an open-source library for uncertainty quantification. Fortuna provides calibration methods, such as conformal […]  ( 7 min )
    Best practices for Amazon SageMaker Training Managed Warm Pools
    Amazon SageMaker Training Managed Warm Pools gives you the flexibility to opt in to reuse and hold on to the underlying infrastructure for a user-defined period of time. This is done while also maintaining the benefit of passing the undifferentiated heavy lifting of managing compute instances in to Amazon SageMaker Model Training. In this post, […]  ( 10 min )
    How to evaluate the quality of the synthetic data – measuring from the perspective of fidelity, utility, and privacy
    In an increasingly data-centric world, enterprises must focus on gathering both valuable physical information and generating the information that they need but can’t easily capture. Data access, regulation, and compliance are an increasing source of friction for innovation in analytics and artificial intelligence (AI). For highly regulated sectors such as Financial Services, Healthcare, Life Sciences, […]  ( 11 min )
    Augment fraud transactions using synthetic data in Amazon SageMaker
    Developing and training successful machine learning (ML) fraud models requires access to large amounts of high-quality data. Sourcing this data is challenging because available datasets are sometimes not large enough or sufficiently unbiased to usefully train the ML model and may require significant cost and time. Regulation and privacy requirements further prevent data use or […]  ( 8 min )
  • Open

    Postdoc Opportunity in the space of RL
    Dear all, My research lab is looking to hire a full-time postdoctoral researcher to perform research on projects centered around applications of Reinforcement Learning in collaborative energy systems. Interested persons will be associated with the Design Informatics Laboratory at Stevens Institute of Technology, a tech focused university in Hoboken, NJ (right across the river from Manhattan). You will be responsible for researching, designing, and developing novel methods and algorithms for applications in the energy domain. Salary will be in the $65-80K range (well above the area PostDoc average of 57K). If you are interested, please feel free to reach out to me to discuss further. Unfortunately, due to a tight timeline, preference will be given to applicants who can start in Jan 2023. Thank you for your time. Regards, Phil submitted by /u/Design_Informatics [link] [comments]  ( 54 min )
    Controlling inner loop of MAML
    Hey everyone, I am wondering how to control the inner loop gradient update in MAML. For my further explanations, I refer to the ray rllib nomenclature. In PPO, we have the parameters train_batch_size, sgd_minibatch_size, and num_sgd_iter to control the batch size and the number of SGD iterations for the training process. In MAML, however, we have the parameters inner_adaptation_steps and train_batch_size. How do they control the inner loop gradient update? My understanding is that if inner_adaptation_steps = 1, we collect all the samples of one episode and then perform a gradient update. But how exactly does the update process work? Do we perform several iterations of updating with minibatches, as in PPO? Or is it just one gradient update with all samples of one episode? And what is the role of train_batch_size then? Can someone help, please? Thanks in advance! Best, Elektrochan submitted by /u/ElektroChan [link] [comments]  ( 58 min )
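    Independent of rllib's internals (worth checking against the source), the canonical MAML inner loop from Finn et al. is one or more full-batch gradient steps on the task's adaptation data, not minibatched SGD as in PPO. A framework-agnostic sketch, with all names illustrative:

        import numpy as np

        def inner_adapt(theta, grad_fn, batch, inner_lr=0.1, steps=1):
            """MAML inner loop: `steps` full-batch gradient updates on the
            adaptation data, starting from the meta-parameters `theta`."""
            adapted = dict(theta)
            for _ in range(steps):  # this is what inner_adaptation_steps counts
                grads = grad_fn(adapted, batch)  # gradient over the WHOLE batch
                adapted = {k: v - inner_lr * grads[k] for k, v in adapted.items()}
            return adapted

        # Toy task: fit w to minimize mean (w*x - y)^2 over the whole batch.
        def grad_fn(params, batch):
            x, y = batch
            return {"w": np.mean(2 * (params["w"] * x - y) * x)}

        x = np.linspace(-1, 1, 50)
        print(inner_adapt({"w": np.array(0.0)}, grad_fn, (x, 3.0 * x)))  # w moves toward 3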
    ACE, a multi-agent #reinforcementlearning algorithm proposed by OpenDILab, has achieved state-of-the-art results in SMAC and GRF, and the related paper has been accepted at #AAAI2023.
    submitted by /u/OpenDILab [link] [comments]  ( 52 min )
    Stable Baselines: how can we sample reproducible env values with each episode?
    Hello, I am trying to use DDPG (Stable Baselines3) to solve a problem. I would like to know how we can change the values sampled by the env with every episode, in a reproducible way, using Stable Baselines. For example, assume we have an env where we harvest energy, and the harvested energy is normally distributed; then in every episode I will sample DIFFERENT values of my harvested energy. I would just like to emphasize again that the different values of my harvested energy should be reproducible, so I can compare the RL method to other methods. PS: using Stable Baselines. submitted by /u/EnvironmentCrazy6381 [link] [comments]  ( 55 min )
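    One way to get this, sketched as a custom env rather than anything Stable Baselines-specific (the distribution and reward are placeholders): derive a deterministic per-episode seed inside reset(), so every run replays the same sequence of sampled energy values.

        import numpy as np
        import gym
        from gym import spaces

        class HarvestEnv(gym.Env):
            """Toy energy-harvesting env with reproducible per-episode randomness."""

            def __init__(self, base_seed=42, horizon=100):
                super().__init__()
                self.base_seed, self.horizon, self.episode = base_seed, horizon, 0
                self.observation_space = spaces.Box(-np.inf, np.inf, shape=(1,), dtype=np.float32)
                self.action_space = spaces.Box(-1.0, 1.0, shape=(1,), dtype=np.float32)

            def reset(self):
                # Fresh RNG each episode, but deterministic across runs.
                rng = np.random.default_rng(self.base_seed + self.episode)
                self.episode += 1
                self.energy = rng.normal(loc=1.0, scale=0.2, size=self.horizon)
                self.t = 0
                return np.array([self.energy[0]], dtype=np.float32)

            def step(self, action):
                reward = float(self.energy[self.t]) * float(action[0])  # placeholder reward
                self.t += 1
                done = self.t >= self.horizon
                obs = np.array([self.energy[min(self.t, self.horizon - 1)]], dtype=np.float32)
                return obs, reward, done, {}

    Because the seed depends only on base_seed and the episode index, any baseline method run against the same env sees exactly the same energy sequences.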
    Formal definition of a combination between POMDP & SMDP
    Hey, I am currently writing my thesis, and for that I am utilizing HRL. While the definitions of an SMDP and a POMDP are individually clear to me, I am still struggling to define both together. On the one hand, the agent is running under an SMDP, because the sub-behaviors it can execute operate at different time scales; on the other hand, the state space is only partially observable, because the agent needs to reveal the environment in the form of a map. Any ideas or hints? Thaaaaaaaanks submitted by /u/Pitiful_Cloud437 [link] [comments]  ( 64 min )
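    For what it's worth, one formalization that merges the two is the partially observable semi-Markov decision process (POSMDP); a hedged sketch of the tuple, with notation illustrative rather than taken from a specific reference:

        \[
        \mathcal{M} = \langle S, A, O, P, R, \Omega, F \rangle
        \]

    where $S$, $A$, and $R$ are as in an MDP, $\Omega(o \mid s', a)$ is the POMDP observation model, and $F(\tau \mid s, a)$ is the SMDP ingredient: the distribution over the random sojourn time $\tau$ of an action or option. The transition kernel $P(s' \mid s, a)$ and the belief update are then conditioned on both the observation $o$ and the elapsed time $\tau$.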
    Any example or tutorial for hyperparameter tuning using Optuna for PPO stable baselines3
    submitted by /u/last_2_brain_cells97 [link] [comments]  ( 55 min )
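    Not a full tutorial, but a minimal sketch of the usual pattern (the search space, training budget, and CartPole environment are placeholders to adapt):

        import optuna
        from stable_baselines3 import PPO
        from stable_baselines3.common.evaluation import evaluate_policy

        def objective(trial):
            # Sample candidate hyperparameters for this trial.
            lr = trial.suggest_float("learning_rate", 1e-5, 1e-3, log=True)
            gamma = trial.suggest_float("gamma", 0.9, 0.9999)
            model = PPO("MlpPolicy", "CartPole-v1", learning_rate=lr, gamma=gamma, verbose=0)
            model.learn(total_timesteps=20_000)
            mean_reward, _ = evaluate_policy(model, model.get_env(), n_eval_episodes=10)
            return mean_reward

        study = optuna.create_study(direction="maximize")
        study.optimize(objective, n_trials=20)
        print(study.best_params)

    The rl-baselines3-zoo repository also contains a more complete tuning setup built on the same idea.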
    Does it make sense to use a confusion matrix in DRL? Or something close?
    Does it make sense to apply confusion-matrix tools like PyCM to retrieve some useful information from a DRL algorithm, for example during the learning process, for algorithm comparison, or anything else? Since I only use RL/DRL, which is an optimization task, I'm not very familiar with (un)supervised learning and these kinds of tools. (If yes) any use case? Or (if not) any obvious reason that makes it worthless? I'll appreciate any answer, of course, so don't hesitate to share some insights or impressions about the topic. Note: I'm not a native English speaker, so I apologize if something is hard to understand (ask me if so and I'll try to improve the explanation). I'm also a newbie at posting questions, so I'd appreciate help on how to ask/structure things better or add helpful tags to the question. Thank you a lot everyone submitted by /u/DavisEX33 [link] [comments]  ( 56 min )
    Is there any theoretical reasoning behind 4 frames stacking?
    Most Atari environment implementations stack 4 frames, but isn't that too much? I mean, if you want to determine velocity, you just need 2 samples; if you want to determine acceleration, you just need 3. Why use 4? Isn't it less efficient? Also, wouldn't it be cleaner to take the difference between consecutive frames? That way, most of the image will be black (zeros) and only the things that are moving will have non-zero values. Wouldn't that be more efficient? submitted by /u/victorsevero [link] [comments]  ( 54 min )
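    The frame-differencing idea from the post, sketched as a wrapper (assumes the classic 4-tuple gym step API; names are illustrative):

        import numpy as np
        import gym

        class FrameDiff(gym.Wrapper):
            """Emit the difference between consecutive frames instead of raw frames."""

            def reset(self, **kwargs):
                obs = self.env.reset(**kwargs)
                self._prev = obs.astype(np.int16)
                return np.zeros_like(self._prev)  # no motion information yet

            def step(self, action):
                obs, reward, done, info = self.env.step(action)
                frame = obs.astype(np.int16)
                diff = frame - self._prev  # mostly zeros; moving objects stand out
                self._prev = frame
                return diff, reward, done, info

    One caveat: a single difference encodes velocity but not acceleration, which is one argument for keeping at least three frames of history.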
  • Open

    Exploring the Benefits of Transfer Learning While Building Brain Tumor Classifiers
    A comparative analysis of DL techniques ⚖️  ( 21 min )
    Why large language models like ChatGPT are bullshit artists
    And how to use them effectively anyway  ( 18 min )
    Why The Mind Wires The Brain For Creativity Or Conformity
    Our reality is a framework in which the brain exists to harbor the mind and the biological body exists to move the brain. Essentially, the…  ( 22 min )
    How to Write a Rap Song using AI in 5 Minutes
    A step-by-step guide to quickly writing any song using AI without any code  ( 9 min )
    AI Analytics: Better Insights with Smart Algorithms
    A recent survey by McKinsey revealed that the respondents whom the consultancy classified as “AI high performers” attribute at least 20% of…  ( 20 min )
    How IoT is Changing the Fraud Landscape?
    There’s no denying that the Internet of Things (IoT) is here to stay. It has changed how we live, work and play, making our lives more…  ( 13 min )
    Insightful Interpretation of Machine Learning Datasets
    It is possible to simulate human intelligence in machines with artificial intelligence (AI) and machine learning (ML). These simulations…  ( 13 min )
  • Open

    Is a Neural Network Appropriate For This Situation?
    I am having some trouble thinking about the architecture of a NN for this problem I am trying to solve. The actual problem is quite complex, so for simplicity's sake, I'll use the example of a 100 v 100 game of dodgeball. At any point in the game, I would take a sample of the 10 players in the game with the most enemies eliminated. Note: This is not guaranteed to be 5 players per team, it may be uneven. For each of these 10 players, I would take in the following inputs (# of eliminations, their reaction time, and how quickly they can run a 40 yard dash). I was planning on each of the 10 players having their own separate NN that takes in these 3 inputs and gives an output between 0 and 1, where being closer to 1 means the player is better. My reasoning for this is that each player's attributes are independent: player 1's reaction time isn't related to player 2's reaction time. It logically also seems to make the problem simpler: the game is affected by how good the players from each team are. The players' skills are gauged by these 3 attributes. But the problem I am having is, now that I have gotten to this point and have rated the 10 players 0-1, how would I design a NN that takes in these 10 inputs and gives 1 output, which is the probability that Team A will win the game? It doesn't really make sense to make a fully connected neural network, because I am under the assumption that each player is independent and does not affect the other 9 players. On top of this, even if the players did have some relation to each other, the order of the players would be constantly switching. For example, the top 4 in one sample might be from teams ABAB, but in another sample, it might be BAAB. Should I even be using a NN for this problem? If I am not using a NN, what should I be using? submitted by /u/Ceraphen [link] [comments]  ( 52 min )
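    One architecture that matches these constraints is a Deep Sets-style model: a single shared network scores every player, scores are pooled per team (so player order doesn't matter and uneven team splits are fine), and a small head maps the pooled summaries to a win probability. A hedged PyTorch sketch, with all layer sizes illustrative:

        import torch
        import torch.nn as nn

        class TeamWinProb(nn.Module):
            """Shared per-player encoder + per-team sum pooling + win-probability head."""

            def __init__(self):
                super().__init__()
                self.player_net = nn.Sequential(  # shared across all 10 players
                    nn.Linear(3, 16), nn.ReLU(), nn.Linear(16, 8)
                )
                self.head = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 1))

            def forward(self, stats, team_a_mask):
                # stats: (batch, 10, 3) = eliminations, reaction time, 40-yard dash
                # team_a_mask: (batch, 10) = 1.0 if the player is on team A, else 0.0
                z = self.player_net(stats)                           # (batch, 10, 8)
                team_a = (z * team_a_mask.unsqueeze(-1)).sum(dim=1)  # pooled team A
                team_b = (z * (1 - team_a_mask).unsqueeze(-1)).sum(dim=1)
                return torch.sigmoid(self.head(torch.cat([team_a, team_b], dim=-1)))

        model = TeamWinProb()
        stats = torch.randn(4, 10, 3)
        mask = (torch.rand(4, 10) > 0.5).float()
        print(model(stats, mask).shape)  # torch.Size([4, 1])

    Because the encoder is shared and the pooling is a sum, permuting the players (or changing how many land on each team) changes nothing about how each individual is scored.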
    Riffusion – Stable Diffusion fine-tuned to generate Music
    submitted by /u/nickb [link] [comments]  ( 52 min )
  • Open

    Taking Control of Your Online Presence with Data for All
    Data for All by John K. Thompson covers data in the most holistic sense. For someone that is not involved in data science, defining the term data can be difficult. If data science is the study of data, what exactly is being studied? Thompson starts at the beginning when defining a term that has increased… The post Taking Control of Your Online Presence with Data for All appeared first on Data Science Central.  ( 19 min )
  • Open

    Accelerating Text Generation with Confident Adaptive Language Modeling (CALM)
    Posted by Tal Schuster, Research Scientist, Google Research Language models (LMs) are the driving force behind many recent breakthroughs in natural language processing. Models like T5, LaMDA, GPT-3, and PaLM have demonstrated impressive performance on various language tasks. While multiple factors can contribute to improving the performance of LMs, some recent studies suggest that scaling up the model’s size is crucial for revealing emergent capabilities. In other words, some instances can be solved by small models, while others seem to benefit from increased scale. Despite recent efforts that enabled the efficient training of LMs over large amounts of data, trained models can still be slow and costly for practical use. When generating text at inference time, most autoregressive LMs …  ( 93 min )
  • Open

    Subtle biases in AI can influence emergency decisions
    But the harm from a discriminatory AI system can be minimized if the advice it delivers is properly framed, an MIT team has shown.  ( 9 min )
  • Open

    AI’s Highlight Reel: Top 5 NVIDIA Videos of 2022
    If AI had a highlight reel, the NVIDIA YouTube channel might just be it. The channel showcases the latest breakthroughs in artificial intelligence, with demos, keynotes and other videos that help viewers see and believe the astonishing ways in which the technology is changing the world. NVIDIA’s most popular videos of 2022 put spotlights on Read article > The post AI’s Highlight Reel: Top 5 NVIDIA Videos of 2022 appeared first on NVIDIA Blog.  ( 4 min )
    Accelerated Computing, AI and Digital Twins: A Recipe for US Manufacturing Leadership
    A national initiative in semiconductors provides a once-in-a-generation opportunity to energize manufacturing in the U.S. The CHIPS and Science Act includes a $13 billion R&D investment in the chip industry. Done right, it’s a recipe for bringing advanced manufacturing techniques to every industry and cultivating a highly skilled workforce. The semiconductor industry uses the most Read article > The post Accelerated Computing, AI and Digital Twins: A Recipe for US Manufacturing Leadership appeared first on NVIDIA Blog.  ( 6 min )
    Safe Travels: NVIDIA DRIVE OS Receives Premier Safety Certification
    To make transportation safer, autonomous vehicles (AVs) must have processes and underlying systems that meet the highest standards. NVIDIA DRIVE OS is the operating system for in-vehicle accelerated computing powered by the NVIDIA DRIVE platform. DRIVE OS 5.2 is now functional safety-certified by TÜV SÜD, one of the most experienced and rigorous assessment bodies in Read article > The post Safe Travels: NVIDIA DRIVE OS Receives Premier Safety Certification appeared first on NVIDIA Blog.  ( 5 min )
  • Open

    Surprisingly not that surprising
    World record marathon times have been falling in increments of roughly 30 seconds: each new record shaves about half a minute off the previous one. If someone were to set a new record, taking 20 seconds off the previous record, this would be exciting, but not suspicious. If someone were to take 5 minutes off the […] Surprisingly not that surprising first appeared on John D. Cook.  ( 6 min )
  • Open

    Hyper-Representations: Self-Supervised Representation Learning on Neural Network Weights for Model Characteristic Prediction. (arXiv:2110.15288v5 [cs.LG] UPDATED)
    Self-Supervised Learning (SSL) has been shown to learn useful and information-preserving representations. Neural Networks (NNs) are widely applied, yet their weight space is still not fully understood. Therefore, we propose to use SSL to learn hyper-representations of the weights of populations of NNs. To that end, we introduce domain specific data augmentations and an adapted attention architecture. Our empirical evaluation demonstrates that self-supervised representation learning in this domain is able to recover diverse NN model characteristics. Further, we show that the proposed learned representations outperform prior work for predicting hyper-parameters, test accuracy, and generalization gap as well as transfer to out-of-distribution settings.  ( 2 min )
    Domain Generalization by Learning and Removing Domain-specific Features. (arXiv:2212.07101v1 [cs.CV])
    Deep Neural Networks (DNNs) suffer from domain shift when the test dataset follows a distribution different from the training dataset. Domain generalization aims to tackle this issue by learning a model that can generalize to unseen domains. In this paper, we propose a new approach that aims to explicitly remove domain-specific features for domain generalization. Following this approach, we propose a novel framework called Learning and Removing Domain-specific features for Generalization (LRDG) that learns a domain-invariant model by tactically removing domain-specific features from the input images. Specifically, we design a classifier for each source domain to effectively learn that domain's specific features. We then develop an encoder-decoder network to map each input image into a new image space where the learned domain-specific features are removed. With the images output by the encoder-decoder network, another classifier is designed to learn the domain-invariant features to conduct image classification. Extensive experiments demonstrate that our framework achieves superior performance compared with state-of-the-art methods.  ( 2 min )
    Amortized Inference for Causal Structure Learning. (arXiv:2205.12934v3 [cs.LG] UPDATED)
    Inferring causal structure poses a combinatorial search problem that typically involves evaluating structures with a score or independence test. The resulting search is costly, and designing suitable scores or tests that capture prior knowledge is difficult. In this work, we propose to amortize causal structure learning. Rather than searching over structures, we train a variational inference model to directly predict the causal structure from observational or interventional data. This allows our inference model to acquire domain-specific inductive biases for causal discovery solely from data generated by a simulator, bypassing both the hand-engineering of suitable score functions and the search over graphs. The architecture of our inference model emulates permutation invariances that are crucial for statistical efficiency in structure learning, which facilitates generalization to significantly larger problem instances than seen during training. On synthetic data and semisynthetic gene expression data, our models exhibit robust generalization capabilities when subject to substantial distribution shifts and significantly outperform existing algorithms, especially in the challenging genomics domain. Our code and models are publicly available at: https://github.com/larslorch/avici.  ( 2 min )
    Deep Image Style Transfer from Freeform Text. (arXiv:2212.06868v1 [cs.CV])
    This paper presents a novel method of deep neural style transfer by generating style images from freeform user text input. The language model and style transfer model form a seamless pipeline that can create output images with similar losses and improved quality when compared to baseline style transfer methods. The language model returns a closely matching image given a style text and description input, which is then passed to the style transfer model with an input content image to create a final output. A proof-of-concept tool is also developed to integrate the models and demonstrate the effectiveness of deep image style transfer from freeform text.  ( 2 min )
    A review of Generative Adversarial Networks for Electronic Health Records: applications, evaluation measures and data sources. (arXiv:2203.07018v2 [cs.LG] UPDATED)
    Electronic Health Records (EHRs) are a valuable asset to facilitate clinical research and point of care applications; however, many challenges such as data privacy concerns impede its optimal utilization. Deep generative models, particularly, Generative Adversarial Networks (GANs) show great promise in generating synthetic EHR data by learning underlying data distributions while achieving excellent performance and addressing these challenges. This work aims to review the major developments in various applications of GANs for EHRs and provides an overview of the proposed methodologies. For this purpose, we combine perspectives from healthcare applications and machine learning techniques in terms of source datasets and the fidelity and privacy evaluation of the generated synthetic datasets. We also compile a list of the metrics and datasets used by the reviewed works, which can be utilized as benchmarks for future research in the field. We conclude by discussing challenges in GANs for EHRs development and proposing recommended practices. We hope that this work motivates novel research development directions in the intersection of healthcare and machine learning.  ( 2 min )
    Time-aware Random Walk Diffusion to Improve Dynamic Graph Learning. (arXiv:2211.01214v4 [cs.LG] UPDATED)
    How can we augment a dynamic graph for improving the performance of dynamic graph neural networks? Graph augmentation has been widely utilized to boost the learning performance of GNN-based models. However, most existing approaches only enhance spatial structure within an input static graph by transforming the graph, and do not consider dynamics caused by time such as temporal locality, i.e., recent edges are more influential than earlier ones, which remains challenging for dynamic graph augmentation. In this work, we propose TiaRa (Time-aware Random Walk Diffusion), a novel diffusion-based method for augmenting a dynamic graph represented as a discrete-time sequence of graph snapshots. For this purpose, we first design a time-aware random walk proximity so that a surfer can walk along the time dimension as well as edges, resulting in spatially and temporally localized scores. We then derive our diffusion matrices based on the time-aware random walk, and show that they act as enhanced adjacency matrices in which both spatial and temporal localities are augmented. Throughout extensive experiments, we demonstrate that TiaRa effectively augments a given dynamic graph, and leads to significant improvements in dynamic GNN models for various graph datasets and tasks.  ( 2 min )
    3rd Continual Learning Workshop Challenge on Egocentric Category and Instance Level Object Understanding. (arXiv:2212.06833v1 [cs.CV])
    Continual Learning, also known as Lifelong or Incremental Learning, has recently gained renewed interest among the Artificial Intelligence research community. Recent research efforts have quickly led to the design of novel algorithms able to reduce the impact of the catastrophic forgetting phenomenon in deep neural networks. Due to this surge of interest in the field, many competitions have been held in recent years, as they are an excellent opportunity to stimulate research in promising directions. This paper summarizes the ideas, design choices, rules, and results of the challenge held at the 3rd Continual Learning in Computer Vision (CLVision) Workshop at CVPR 2022. The focus of this competition is the complex continual object detection task, which is still underexplored in literature compared to classification tasks. The challenge is based on the challenge version of the novel EgoObjects dataset, a large-scale egocentric object dataset explicitly designed to benchmark continual learning algorithms for egocentric category-/instance-level object understanding, which covers more than 1k unique main objects and 250+ categories in around 100k video frames.
    MCP: Self-supervised Pre-training for Personalized Chatbots with Multi-level Contrastive Sampling. (arXiv:2210.08753v4 [cs.CL] UPDATED)
    Personalized chatbots focus on endowing chatbots with a consistent personality so that they behave like real users and can further act as personal assistants. Previous studies have explored generating implicit user profiles from the user's dialogue history for building personalized chatbots. However, these studies only use the response generation loss to train the entire model, making it prone to data sparsity. Besides, they overemphasize the quality of the final generated response while ignoring the correlations and fusions within the user's dialogue history, leading to rough data representations and performance degradation. To tackle these problems, we propose MCP, a self-supervised learning framework for capturing better representations from users' dialogue history for personalized chatbots. Specifically, we apply contrastive sampling methods to leverage the supervised signals hidden in user dialogue history and generate pre-training samples for enhancing the model. We design three pre-training tasks based on three types of contrastive pairs from user dialogue history, namely response pairs, sequence augmentation pairs, and user pairs. We pre-train the utterance encoder and the history encoder towards the contrastive objectives and use these pre-trained encoders for generating user profiles during personalized response generation. Experimental results on two real-world datasets show a significant improvement of our proposed model MCP over existing methods.
    Segmentation-guided Domain Adaptation for Efficient Depth Completion. (arXiv:2210.09213v2 [cs.CV] UPDATED)
    Complete depth information and efficient estimators have become vital ingredients in scene understanding for automated driving tasks. A major problem for LiDAR-based depth completion is the inefficient utilization of convolutions due to the lack of coherent information in the sparse, uncorrelated LiDAR point clouds, which often leads to complex and resource-demanding networks. The problem is reinforced by the expensive acquisition of depth data for supervised training. In this work, we propose an efficient depth completion model based on a vgg05-like CNN architecture and a semi-supervised domain adaptation approach to transfer knowledge from synthetic to real-world data, improving data efficiency and reducing the need for a large database. To boost spatial coherence, we guide the learning process using segmentations as an additional source of information. The efficiency and accuracy of our approach are evaluated on the KITTI dataset. Our approach improves on previous efficient, low-parameter state-of-the-art approaches while having a noticeably lower computational footprint.
    Quantum Clustering with k-Means: a Hybrid Approach. (arXiv:2212.06691v1 [quant-ph] CROSS LISTED)
    Quantum computing is a promising paradigm based on quantum theory for performing fast computations. Quantum algorithms are expected to surpass their classical counterparts in terms of computational complexity for certain tasks, including machine learning. In this paper, we design, implement, and evaluate three hybrid quantum k-Means algorithms, exploiting different degrees of parallelism. Indeed, each algorithm incrementally leverages quantum parallelism to reduce the complexity of the cluster assignment step, down to a constant cost. In particular, we exploit quantum phenomena to speed up the computation of distances. The core idea is that the distances between records and centroids can be computed simultaneously, thus saving time, especially for big datasets. We show that our hybrid quantum k-Means algorithms can be more efficient than the classical version while obtaining comparable clustering results.
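    The hybrid structure can be sketched as ordinary k-Means with the distance computation factored into an oracle that a quantum subroutine (e.g., swap-test-based distance estimation) could replace; the NumPy oracle below is purely a classical stand-in, not the paper's quantum circuits.
```python
import numpy as np

def distance_oracle(points, centroids):
    # classical stand-in for a quantum distance-estimation subroutine
    return np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=-1)

def hybrid_kmeans(points, k, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # the assignment step is where quantum parallelism would be spent
        assign = distance_oracle(points, centroids).argmin(axis=1)
        centroids = np.stack([
            points[assign == j].mean(axis=0) if np.any(assign == j)
            else centroids[j]
            for j in range(k)
        ])
    return centroids, assign

rng = np.random.default_rng(1)
pts = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])
centroids, assign = hybrid_kmeans(pts, k=2)
print(centroids.round(2))
```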
    Lower Bounds for the Convergence of Tensor Power Iteration on Random Overcomplete Models. (arXiv:2211.03827v2 [cs.LG] UPDATED)
    Tensor decomposition serves as a powerful primitive in statistics and machine learning. In this paper, we focus on using power iteration to decompose an overcomplete random tensor. Past work studying the properties of tensor power iteration either requires a non-trivial data-independent initialization, or is restricted to the undercomplete regime. Moreover, several papers implicitly suggest that logarithmically many iterations (in terms of the input dimension) are sufficient for the power method to recover one of the tensor components. In this paper, we analyze the dynamics of tensor power iteration from random initialization in the overcomplete regime. Surprisingly, we show that polynomially many steps are necessary for convergence of tensor power iteration to any of the true components, which refutes the previous conjecture. On the other hand, our numerical experiments suggest that tensor power iteration successfully recovers tensor components for a broad range of parameters, even though it takes at least polynomially many steps to converge. To further complement our empirical evidence, we prove that a popular objective function for tensor decomposition is strictly increasing along the power iteration path. Our proof is based on the Gaussian conditioning technique, which has been applied to analyze the approximate message passing (AMP) algorithm. The major ingredient of our argument is a conditioning lemma that allows us to generalize AMP-type analysis to the non-proportional limit and polynomially many iterations of the power method.
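    For reference, the dynamics under study are those of the simple update $x \leftarrow T(x, x, \cdot)/\|T(x, x, \cdot)\|$; the sketch below runs it on a random overcomplete third-order tensor (the dimension, rank, and iteration budget are illustrative assumptions).
```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 20, 40                        # overcomplete: k > d
A = rng.standard_normal((d, k))
A /= np.linalg.norm(A, axis=0)       # unit-norm components a_1, ..., a_k
T = np.einsum('ia,ja,ka->ijk', A, A, A)   # T = sum_i a_i (x) a_i (x) a_i

x = rng.standard_normal(d)
x /= np.linalg.norm(x)
for step in range(1000):             # x <- T(x, x, .) / ||T(x, x, .)||
    x = np.einsum('ijk,j,k->i', T, x, x)
    x /= np.linalg.norm(x)

# correlation with the best-matching true component
print(f"max |<a_i, x>| after 1000 steps: {np.max(np.abs(A.T @ x)):.3f}")
```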
    Conservative SPDEs as fluctuating mean field limits of stochastic gradient descent. (arXiv:2207.05705v2 [math.PR] UPDATED)
    The convergence of stochastic interacting particle systems in the mean-field limit to solutions of conservative stochastic partial differential equations is established, with optimal rate of convergence. As a second main result, a quantitative central limit theorem for such SPDEs is derived, again, with optimal rate of convergence. The results apply, in particular, to the convergence in the mean-field scaling of stochastic gradient descent dynamics in overparametrized, shallow neural networks to solutions of SPDEs. It is shown that the inclusion of fluctuations in the limiting SPDE improves the rate of convergence, and retains information about the fluctuations of stochastic gradient descent in the continuum limit.
    FeDXL: Provable Federated Learning for Deep X-Risk Optimization. (arXiv:2210.14396v2 [cs.LG] UPDATED)
    In this paper, we tackle a novel federated learning (FL) problem for optimizing a family of X-risks, to which no existing FL algorithms are applicable. In particular, the objective has the form of $\mathbb E_{z\sim S_1} f(\mathbb E_{z'\sim S_2} \ell(w; z, z'))$, where two sets of data $S_1, S_2$ are distributed over multiple machines, $\ell(\cdot)$ is a pairwise loss that only depends on the prediction outputs of the input data pairs $(z, z')$, and $f(\cdot)$ is possibly a non-linear non-convex function. This problem has important applications in machine learning, e.g., AUROC maximization with a pairwise loss, and partial AUROC maximization with a compositional loss. The challenges in designing an FL algorithm lie in the non-decomposability of the objective over multiple machines and the interdependency between different machines. To address these challenges, we propose an active-passive decomposition framework that decouples the gradient's components into two types, namely active parts and passive parts, where the active parts depend on local data computed with the local model, and the passive parts depend on other machines and are communicated/computed based on historical models and samples. Under this framework, we develop two provable FL algorithms (FeDXL) for handling linear and nonlinear $f$, respectively, based on federated averaging and merging. We develop a novel theoretical analysis to handle the latency of the passive parts and the interdependency between the local model parameters and the involved data for computing local gradient estimators. We establish both iteration and communication complexities and show that using historical samples and models for computing the passive parts does not degrade the complexities. We conduct empirical studies of FeDXL for deep AUROC and partial AUROC maximization, and demonstrate their performance compared with several baselines.
    Do Not Sleep on Traditional Machine Learning: Simple and Interpretable Techniques Are Competitive to Deep Learning for Sleep Scoring. (arXiv:2207.07753v3 [stat.ML] UPDATED)
    Over the last few years, research in automatic sleep scoring has mainly focused on developing increasingly complex deep learning architectures. However, these approaches have recently achieved only marginal improvements, often at the expense of requiring more data and more expensive training procedures. Despite all these efforts and their satisfactory performance, automatic sleep staging solutions are not widely adopted in a clinical context yet. We argue that most deep learning solutions for sleep scoring are limited in their real-world applicability as they are hard to train, deploy, and reproduce. Moreover, these solutions lack interpretability and transparency, which are often key to increasing adoption rates. In this work, we revisit the problem of sleep stage classification using classical machine learning. Results show that competitive performance can be achieved with a conventional machine learning pipeline consisting of preprocessing, feature extraction, and a simple machine learning model. In particular, we analyze the performance of a linear model and a non-linear (gradient boosting) model. Our approach surpasses the state-of-the-art (using the same data) on two public datasets: Sleep-EDF SC-20 (MF1 0.810) and Sleep-EDF ST (MF1 0.795), while achieving competitive results on Sleep-EDF SC-78 (MF1 0.775) and MASS SS3 (MF1 0.817). We show that, for the sleep stage scoring task, the expressiveness of an engineered feature vector is on par with the internally learned representations of deep learning models. This observation opens the door to clinical adoption, as a representative feature vector makes it possible to leverage both the interpretability and the successful track record of traditional machine learning models.
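    A hedged sketch of the kind of conventional pipeline argued for here: hand-engineered spectral and time-domain features per epoch, followed by a gradient boosting classifier. The feature set, window length, and toy data are illustrative assumptions, not the authors' exact feature vector.
```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

def epoch_features(eeg_epoch, fs=100):
    """Spectral band powers (delta..beta) plus two time-domain statistics."""
    freqs = np.fft.rfftfreq(len(eeg_epoch), d=1 / fs)
    power = np.abs(np.fft.rfft(eeg_epoch)) ** 2
    bands = [(0.5, 4), (4, 8), (8, 12), (12, 30)]
    feats = [power[(freqs >= lo) & (freqs < hi)].sum() for lo, hi in bands]
    feats += [eeg_epoch.std(), np.abs(np.diff(eeg_epoch)).mean()]
    return np.array(feats)

# toy data: 200 random 30 s "epochs" at 100 Hz with random 5-class labels
rng = np.random.default_rng(0)
X = np.stack([epoch_features(rng.standard_normal(3000)) for _ in range(200)])
y = rng.integers(0, 5, size=200)
clf = GradientBoostingClassifier()
print(f"5-fold CV accuracy: {cross_val_score(clf, X, y, cv=5).mean():.2f}")
```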
    SSMTL++: Revisiting Self-Supervised Multi-Task Learning for Video Anomaly Detection. (arXiv:2207.08003v3 [cs.CV] UPDATED)
    A self-supervised multi-task learning (SSMTL) framework for video anomaly detection was recently introduced in the literature. Due to its highly accurate results, the method attracted the attention of many researchers. In this work, we revisit the self-supervised multi-task learning framework, proposing several updates to the original method. First, we study various detection methods, e.g., based on detecting high-motion regions using optical flow or background subtraction, since we believe the currently used pre-trained YOLOv3 is suboptimal, e.g., objects in motion or objects from unknown classes are never detected. Second, we modernize the 3D convolutional backbone by introducing multi-head self-attention modules, inspired by the recent success of vision transformers. As such, we alternatively introduce both 2D and 3D convolutional vision transformer (CvT) blocks. Third, in our attempt to further improve the model, we study additional self-supervised learning tasks, such as predicting segmentation maps through knowledge distillation, solving jigsaw puzzles, estimating body pose through knowledge distillation, predicting masked regions (inpainting), and adversarial learning with pseudo-anomalies. We conduct experiments to assess the performance impact of the introduced changes. Upon finding more promising configurations of the framework, dubbed SSMTL++v1 and SSMTL++v2, we extend our preliminary experiments to more data sets, demonstrating that our performance gains are consistent across all data sets. In most cases, our results on Avenue, ShanghaiTech and UBnormal raise the state-of-the-art performance bar to a new level.
    Accounting for Temporal Variability in Functional Magnetic Resonance Imaging Improves Prediction of Intelligence. (arXiv:2211.07429v2 [q-bio.NC] UPDATED)
    Neuroimaging-based prediction methods for intelligence and cognitive abilities have seen rapid development in the literature. Among different neuroimaging modalities, prediction based on functional connectivity (FC) has shown great promise. Most of the literature has focused on prediction using static FC, but there are limited investigations on the merits of such analysis compared to prediction based on dynamic FC or region-level functional magnetic resonance imaging (fMRI) time series that encode temporal variability. To account for the temporal dynamics in fMRI data, we propose a deep neural network involving a bi-directional long short-term memory (bi-LSTM) approach that also incorporates a feature selection mechanism. The proposed pipeline is implemented via an efficient GPU computation framework and applied to predict intelligence scores based on region-level fMRI time series as well as dynamic FC. We compare the prediction performance for different intelligence measures based on static FC, dynamic FC, and region-level time series acquired from the Adolescent Brain Cognitive Development (ABCD) study involving close to 7000 individuals. Our detailed analysis illustrates that static FC consistently has inferior prediction performance compared to region-level time series or dynamic FC for unimodal rest and task fMRI experiments, and in almost all cases when using a combination of task and rest features. In addition, the proposed bi-LSTM pipeline based on region-level time series identifies several shared and differential important brain regions across task and rest fMRI experiments that drive intelligence prediction. A test-retest analysis of the selected features shows strong reliability across cross-validation folds. Given the large sample size of the ABCD study, our results provide strong evidence that superior prediction of intelligence can be achieved by accounting for temporal variations in fMRI.
    Reliable amortized variational inference with physics-based latent distribution correction. (arXiv:2207.11640v2 [stat.ML] UPDATED)
    Bayesian inference for high-dimensional inverse problems is computationally costly and requires selecting a suitable prior distribution. Amortized variational inference addresses these challenges via a neural network that acts as a surrogate conditional distribution, matching the posterior distribution not only for one instance of data, but for a distribution of data pertaining to a specific inverse problem. During inference, the neural network -- in our case a conditional normalizing flow -- provides posterior samples at virtually no cost. However, the accuracy of amortized variational inference relies on the availability of high-fidelity training data, which seldom exists in geophysical inverse problems due to the Earth's heterogeneity. In addition, the network is prone to errors if evaluated over out-of-distribution data. As such, we propose to increase the resilience of amortized variational inference in the presence of moderate data distribution shifts. We achieve this via a correction to the latent distribution that improves the posterior distribution approximation for the data at hand. The correction involves relaxing the standard Gaussian assumption on the latent distribution and parameterizing it via a Gaussian distribution with an unknown mean and (diagonal) covariance. These unknowns are then estimated by minimizing the Kullback-Leibler divergence between the corrected and (physics-based) true posterior distributions. While generic and applicable to other inverse problems, by means of a linearized seismic imaging example, we show that our correction step improves the robustness of amortized variational inference with respect to changes in the number of seismic sources, noise variance, and shifts in the prior distribution. This approach provides a seismic image with limited artifacts and an assessment of its uncertainty at approximately the same cost as five reverse-time migrations.
    Large-Scale Chemical Language Representations Capture Molecular Structure and Properties. (arXiv:2106.09553v3 [cs.LG] UPDATED)
    Models based on machine learning can enable accurate and fast molecular property predictions, which is of interest in drug discovery and material design. Various supervised machine learning models have demonstrated promising performance, but the vast chemical space and the limited availability of property labels make supervised learning challenging. Recently, unsupervised transformer-based language models pretrained on a large unlabelled corpus have produced state-of-the-art results in many downstream natural language processing tasks. Inspired by this development, we present molecular embeddings obtained by training an efficient transformer encoder model, MoLFormer, which uses rotary positional embeddings. This model employs a linear attention mechanism, coupled with highly distributed training, on SMILES sequences of 1.1 billion unlabelled molecules from the PubChem and ZINC datasets. We show that the learned molecular representation outperforms existing baselines, including supervised and self-supervised graph neural networks and language models, on several downstream tasks from ten benchmark datasets, and performs competitively on two others. Further analyses, specifically through the lens of attention, demonstrate that MoLFormer trained on chemical SMILES indeed learns the spatial relationships between atoms within a molecule. These results provide encouraging evidence that large-scale molecular language models can capture sufficient chemical and structural information to predict various distinct molecular properties, including quantum-chemical properties.
    ARCADE: Adversarially Regularized Convolutional Autoencoder for Network Anomaly Detection. (arXiv:2205.01432v3 [cs.LG] UPDATED)
    As the number of heterogeneous IP-connected devices and the traffic volume increase, so does the potential for security breaches. The undetected exploitation of these breaches can bring severe cybersecurity and privacy risks. Anomaly-based intrusion detection systems (IDSs) play an essential role in network security. In this paper, we present a practical unsupervised anomaly-based deep learning detection system called ARCADE (Adversarially Regularized Convolutional Autoencoder for unsupervised network anomaly DEtection). With a convolutional autoencoder (AE), ARCADE automatically builds a profile of the normal traffic using a subset of raw bytes of a few initial packets of network flows, so that potential network anomalies and intrusions can be efficiently detected before they cause more damage to the network. ARCADE is trained exclusively on normal traffic. An adversarial training strategy is proposed to regularize and decrease the AE's capability to reconstruct network flows that are outside the normal distribution, thereby improving its anomaly detection capabilities. The proposed approach is more effective than state-of-the-art deep learning approaches for network anomaly detection. Even when examining only two initial packets of a network flow, ARCADE can effectively detect malware infections and network attacks. ARCADE has 20 times fewer parameters than the baselines, achieving significantly faster detection speed and reaction time.
    The alignment problem from a deep learning perspective. (arXiv:2209.00626v2 [cs.AI] UPDATED)
    Within the coming decades, artificial general intelligence (AGI) may surpass human capabilities at a wide range of important tasks. We outline a case for expecting that, without substantial effort to prevent it, AGIs could learn to pursue goals which are very undesirable (in other words, misaligned) from a human perspective. We argue that AGIs trained in similar ways as today's most capable models could learn to act deceptively to receive higher reward; learn internally-represented goals which generalize beyond their training distributions; and pursue those goals using power-seeking strategies. We outline how the deployment of misaligned AGIs might irreversibly undermine human control over the world, and briefly review research directions aimed at preventing these problems.
    Hypercomplex Neural Architectures for Multi-View Breast Cancer Classification. (arXiv:2204.05798v2 [cs.CV] UPDATED)
    Traditionally, deep learning methods for breast cancer classification perform a single-view analysis. However, radiologists simultaneously analyze all four views that compose a mammography exam, owing to the correlations contained in mammography views, which present crucial information for identifying tumors. In light of this, some studies have started to propose multi-view methods. Nevertheless, in such existing architectures, mammogram views are processed as independent images by separate convolutional branches, thus losing correlations among them. To overcome such limitations, in this paper we propose a novel approach for multi-view breast cancer classification based on parameterized hypercomplex neural networks. Thanks to hypercomplex algebra properties, our networks are able to model, and thus leverage, existing correlations between the different views that comprise a mammogram, thus mimicking the reading process performed by clinicians. The proposed methods are able to handle the information of a patient altogether without breaking the multi-view nature of the exam. We define architectures designed to process two-view exams, namely PHResNets, and four-view exams, i.e., PHYSEnet and PHYBOnet. Through an extensive experimental evaluation conducted with publicly available datasets, we demonstrate that our proposed models clearly outperform real-valued counterparts and also state-of-the-art methods, proving that breast cancer classification benefits from the proposed multi-view architectures. We also assess the method's robustness beyond mammogram analysis by considering different benchmarks, as well as a finer-scaled task such as segmentation. Full code and pretrained models for complete reproducibility of our experiments are freely available at: https://github.com/ispamm/PHBreast.
    Fast Bayesian Inference with Batch Bayesian Quadrature via Kernel Recombination. (arXiv:2206.04734v3 [cs.LG] UPDATED)
    Calculation of Bayesian posteriors and model evidences typically requires numerical integration. Bayesian quadrature (BQ), a surrogate-model-based approach to numerical integration, is capable of superb sample efficiency, but its lack of parallelisation has hindered its practical applications. In this work, we propose a parallelised (batch) BQ method, employing techniques from kernel quadrature, that possesses an empirically exponential convergence rate. Additionally, just as with Nested Sampling, our method permits simultaneous inference of both posteriors and model evidence. Samples from our BQ surrogate model are re-selected to give a sparse set of samples, via a kernel recombination algorithm, requiring negligible additional time to increase the batch size. Empirically, we find that our approach significantly outperforms the sampling efficiency of both state-of-the-art BQ techniques and Nested Sampling in various real-world datasets, including lithium-ion battery analytics.
    Anti-Spoofing Using Transfer Learning with Variational Information Bottleneck. (arXiv:2204.01387v2 [eess.AS] UPDATED)
    Recent advances in sophisticated synthetic speech generated from text-to-speech (TTS) or voice conversion (VC) systems pose threats to existing automatic speaker verification (ASV) systems. Since such synthetic speech is generated from diverse algorithms, generalization ability using limited training data is indispensable for a robust anti-spoofing system. In this work, we propose a transfer learning scheme based on the wav2vec 2.0 pretrained model with a variational information bottleneck (VIB) for the speech anti-spoofing task. Evaluation on the ASVspoof 2019 logical access (LA) database shows that our method improves the performance of distinguishing unseen spoofed and genuine speech, outperforming current state-of-the-art anti-spoofing systems. Furthermore, we show that the proposed system significantly improves performance in low-resource and cross-dataset settings of the anti-spoofing task, demonstrating that our system is also robust in terms of data size and data distribution.
    Principal-Agent Hypothesis Testing. (arXiv:2205.06812v2 [cs.GT] UPDATED)
    Consider the relationship between a regulator (the principal) and a pharmaceutical company (the agent). The pharmaceutical company wishes to sell a product to make a profit, and the FDA wishes to ensure that only efficacious drugs are released to the public. The efficacy of the drug is not known to the FDA, so the pharmaceutical company must run a costly trial to prove efficacy to the FDA. Critically, the statistical protocol used to establish efficacy affects the behavior of a strategic, self-interested pharmaceutical company; a lower standard of statistical evidence incentivizes the pharmaceutical company to run more trials for drugs that are less likely to be effective, since the drug may pass the trial by chance, resulting in large profits. The interaction between the statistical protocol and the incentives of the pharmaceutical company is crucial to understanding this system and designing protocols with high social utility. In this work, we discuss how the principal and agent can enter into a contract with payoffs based on statistical evidence. When there is stronger evidence for the quality of the product, the principal allows the agent to make a larger profit. We show how to design contracts that are robust to an agent's strategic actions, and derive the optimal contract in the presence of strategic behavior.
    The Role of Lookahead and Approximate Policy Evaluation in Reinforcement Learning with Linear Value Function Approximation. (arXiv:2109.13419v7 [cs.LG] UPDATED)
    Function approximation is widely used in reinforcement learning to handle the computational difficulties associated with very large state spaces. However, function approximation introduces errors which may lead to instabilities when using approximate dynamic programming techniques to obtain the optimal policy. Therefore, techniques such as lookahead for policy improvement and m-step rollout for policy evaluation are used in practice to improve the performance of approximate dynamic programming with function approximation. We quantitatively characterize, for the first time, the impact of lookahead and m-step rollout on the performance of approximate dynamic programming (DP) with function approximation: (i) without a sufficient combination of lookahead and m-step rollout, approximate DP may not converge, (ii) both lookahead and m-step rollout improve the convergence rate of approximate DP, and (iii) lookahead helps mitigate the effect of function approximation and the discount factor on the asymptotic performance of the algorithm. Our results are presented for two approximate DP methods: one which uses least-squares regression to perform function approximation and another which performs several steps of gradient descent of the least-squares objective in each iteration.
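    As a toy illustration of the evaluation side, the sketch below performs m-step rollout policy evaluation with a linear value-function approximation fitted by least squares, matching the first of the two approximate DP variants in spirit; the MDP, features, and constants are illustrative assumptions.
```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma, m = 5, 2, 0.9, 3
P = rng.dirichlet(np.ones(nS), size=(nS, nA))  # P[s, a] = next-state dist.
R = rng.standard_normal((nS, nA))              # rewards
phi = rng.standard_normal((nS, 2))             # 2-dim state features
theta = np.zeros(2)                            # linear value weights
policy = rng.integers(0, nA, size=nS)          # fixed policy to evaluate

def rollout_target(s):
    """m-step rollout: accumulate m rewards, then bootstrap with phi @ theta."""
    g, disc = 0.0, 1.0
    for _ in range(m):
        a = policy[s]
        g += disc * R[s, a]
        disc *= gamma
        s = rng.choice(nS, p=P[s, a])
    return g + disc * phi[s] @ theta

# one evaluation sweep: least-squares regression of targets onto features
targets = np.array([rollout_target(s) for s in range(nS)])
theta, *_ = np.linalg.lstsq(phi, targets, rcond=None)
print(theta)
```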
    Comparing Sequential Forecasters. (arXiv:2110.00115v4 [stat.ME] UPDATED)
    Consider two forecasters, each making a single prediction for a sequence of events over time. We ask a relatively basic question: how might we compare these forecasters, either online or post-hoc, while avoiding unverifiable assumptions on how the forecasts and outcomes were generated? In this paper, we present a rigorous answer to this question by designing novel sequential inference procedures for estimating the time-varying difference in forecast scores. To do this, we employ confidence sequences (CS), which are sequences of confidence intervals that can be continuously monitored and are valid at arbitrary data-dependent stopping times ("anytime-valid"). The widths of our CSs are adaptive to the underlying variance of the score differences. Underlying their construction is a game-theoretic statistical framework, in which we further identify e-processes and p-processes for sequentially testing a weak null hypothesis -- whether one forecaster outperforms another on average (rather than always). Our methods do not make distributional assumptions on the forecasts or outcomes; our main theorems apply to any bounded scores, and we later provide alternative methods for unbounded scores. We empirically validate our approaches by comparing real-world baseball and weather forecasters.
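    A simplified sketch of the comparison protocol: track the running mean of score differences together with an anytime-valid confidence radius. For brevity this uses a conservative union-bound Hoeffding construction rather than the paper's variance-adaptive confidence sequences; the forecasters and scoring rule are toy assumptions.
```python
import numpy as np

rng = np.random.default_rng(0)
T, alpha = 2000, 0.05
outcomes = rng.integers(0, 2, size=T)
p1 = np.clip(outcomes + rng.normal(0, 0.3, size=T), 0, 1)  # sharp forecaster
p2 = np.full(T, 0.5)                                        # uninformative one

d = (p2 - outcomes) ** 2 - (p1 - outcomes) ** 2  # Brier differences in [-1, 1]
t = np.arange(1, T + 1)
mean = np.cumsum(d) / t
# spend alpha/(t(t+1)) at time t; Hoeffding radius for range-2 variables
radius = 2 * np.sqrt(np.log(2 * t * (t + 1) / alpha) / (2 * t))
better = mean - radius > 0
print("forecaster 1 provably better from t =",
      int(np.argmax(better)) + 1 if better.any() else "never")
```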
    Sample Complexity of Offline Reinforcement Learning with Deep ReLU Networks. (arXiv:2103.06671v6 [stat.ML] UPDATED)
    Offline reinforcement learning (RL) leverages previously collected data for policy optimization without any further active exploration. Despite the recent interest in this problem, its theoretical results in neural network function approximation settings remain elusive. In this paper, we study the statistical theory of offline RL with deep ReLU network function approximation. In particular, we establish the sample complexity of $n = \tilde{\mathcal{O}}( H^{4 + 4 \frac{d}{\alpha}} \kappa_{\mu}^{1 + \frac{d}{\alpha}} \epsilon^{-2 - 2\frac{d}{\alpha}} )$ for offline RL with deep ReLU networks, where $\kappa_{\mu}$ is a measure of distributional shift, $H = (1-\gamma)^{-1}$ is the effective horizon length, $d$ is the dimension of the state-action space, $\alpha$ is a (possibly fractional) smoothness parameter of the underlying Markov decision process (MDP), and $\epsilon$ is a user-specified error. Notably, our sample complexity holds under two novel considerations: the Besov dynamic closure and the correlated structure. While the Besov dynamic closure subsumes the dynamic conditions for offline RL in prior works, the correlated structure renders prior works on offline RL with general/neural network function approximation improper or inefficient in long (effective) horizon problems. To the best of our knowledge, this is the first theoretical characterization of the sample complexity of offline RL with deep neural network function approximation under the general Besov regularity condition, which goes beyond the linearity regime of traditional reproducing kernel Hilbert spaces and Neural Tangent Kernels.
    Counterfactual Explanations Using Optimization With Constraint Learning. (arXiv:2209.10997v2 [cs.LG] UPDATED)
    To increase the adoption of counterfactual explanations in practice, several criteria that these should adhere to have been put forward in the literature. We propose counterfactual explanations using optimization with constraint learning (CE-OCL), a generic and flexible approach that addresses all these criteria and allows room for further extensions. Specifically, we discuss how we can leverage an optimization with constraint learning framework for the generation of counterfactual explanations, and how components of this framework readily map to the criteria. We also propose two novel modeling approaches to address data manifold closeness and diversity, which are two key criteria for practical counterfactual explanations. We test CE-OCL on several datasets and present our results in a case study. Compared against the current state-of-the-art methods, CE-OCL allows for more flexibility and has an overall superior performance in terms of several evaluation metrics proposed in related work.
    Post-hoc Uncertainty Learning using a Dirichlet Meta-Model. (arXiv:2212.07359v1 [cs.LG])
    It is known that neural networks have the problem of being over-confident when directly using the output label distribution to generate uncertainty measures. Existing methods mainly resolve this issue by retraining the entire model to impose the uncertainty quantification capability so that the learned model can achieve desired performance in accuracy and uncertainty prediction simultaneously. However, training the model from scratch is computationally expensive and may not be feasible in many situations. In this work, we consider a more practical post-hoc uncertainty learning setting, where a well-trained base model is given, and we focus on the uncertainty quantification task at the second stage of training. We propose a novel Bayesian meta-model to augment pre-trained models with better uncertainty quantification abilities, which is effective and computationally efficient. Our proposed method requires no additional training data and is flexible enough to quantify different uncertainties and easily adapt to different application settings, including out-of-domain data detection, misclassification detection, and trustworthy transfer learning. We demonstrate our proposed meta-model approach's flexibility and superior empirical performance on these applications over multiple representative image classification benchmarks.
    Policy Evaluation for Temporal and/or Spatial Dependent Experiments in Ride-sourcing Platforms. (arXiv:2202.10887v4 [stat.ME] UPDATED)
    The aim of this paper is to establish a causal relationship between a ride-sourcing platform's policies and outcomes of interest under complex temporal and/or spatial dependent experiments. We propose a temporal/spatio-temporal varying coefficient decision process (VCDP) model to capture the dynamic treatment effects in temporal/spatio-temporal dependent experiments. We characterize the average treatment effect by decomposing it as the sum of the direct effect (DE) and the indirect effect (IE), and develop estimation and inference procedures for both. We also establish the statistical properties (e.g., weak convergence and asymptotic power) of our models. We conduct extensive simulations and real data analyses to verify the usefulness of the proposed method.
    Learning soft interventions in complex equilibrium systems. (arXiv:2112.05729v2 [cs.LG] UPDATED)
    Complex systems often contain feedback loops that can be described as cyclic causal models. Intervening in such systems may lead to counterintuitive effects, which cannot be inferred directly from the graph structure. After establishing a framework for differentiable soft interventions based on Lie groups, we take advantage of modern automatic differentiation techniques and their application to implicit functions in order to optimize interventions in cyclic causal models. We illustrate the use of this framework by investigating scenarios of transition to sustainable economies.
    Deep Learning with Functional Inputs. (arXiv:2006.09590v2 [stat.ML] UPDATED)
    We present a methodology for integrating functional data into deep densely connected feed-forward neural networks. The model is defined for scalar responses with multiple functional and scalar covariates. A by-product of the method is a set of dynamic functional weights that can be visualized during the optimization process. This visualization leads to greater interpretability of the relationship between the covariates and the response relative to conventional neural networks. The model is shown to perform well in a number of contexts, including prediction of new data and recovery of the true underlying functional weights; these results were confirmed through real applications and simulation studies. A forthcoming R package, developed on top of a popular deep learning library (Keras), allows for general use of the approach.
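    The core idea can be sketched by expanding each functional covariate on a fixed basis and feeding the basis scores to a dense network, so that the first-layer weights induce a functional weight $\beta(t)$. The Fourier basis, toy data, and network size below are illustrative assumptions, and the sketch is in Python rather than the authors' forthcoming R package.
```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
tgrid = np.linspace(0, 1, 100)
basis = np.stack([np.ones_like(tgrid)]
                 + [np.sin(2 * np.pi * k * tgrid) for k in (1, 2, 3)]
                 + [np.cos(2 * np.pi * k * tgrid) for k in (1, 2, 3)])  # (7, 100)

Xfun = rng.standard_normal((300, 4)) @ basis[:4]      # random smooth curves
y = (Xfun * np.sin(2 * np.pi * tgrid)).mean(axis=1)   # ~ int x(t) beta(t) dt
scores = Xfun @ basis.T / len(tgrid)                  # basis scores (300, 7)
net = MLPRegressor(hidden_layer_sizes=(16,), max_iter=3000,
                   random_state=0).fit(scores, y)
print(f"in-sample R^2: {net.score(scores, y):.3f}")
```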
    Demystifying Randomly Initialized Networks for Evaluating Generative Models. (arXiv:2208.09218v2 [cs.LG] UPDATED)
    Evaluation of generative models is mostly based on the comparison between the estimated distribution and the ground truth distribution in a certain feature space. To embed samples into informative features, previous works often use convolutional neural networks optimized for classification, a practice criticized by recent studies. Therefore, various feature spaces have been explored to discover alternatives. Among them, a surprising approach is to use a randomly initialized neural network for feature embedding. However, the fundamental basis for employing random features has not been sufficiently justified. In this paper, we rigorously investigate the feature space of models with random weights in comparison to that of trained models. Furthermore, we provide empirical evidence for how to choose networks for random features to obtain consistent and reliable results. Our results indicate that features from random networks can evaluate generative models similarly well to those from trained networks, and furthermore, the two types of features can be used together in a complementary way.
    Learning Invariant Subspaces of Koopman Operators--Part 1: A Methodology for Demonstrating a Dictionary's Approximate Subspace Invariance. (arXiv:2212.07358v1 [eess.SY])
    Koopman operators model nonlinear dynamics as a linear dynamic system acting on a nonlinear function as the state. This nonstandard state is often called a Koopman observable and is usually approximated numerically by a superposition of functions drawn from a dictionary. In a widely used algorithm, Extended Dynamic Mode Decomposition (EDMD), the dictionary functions are drawn from a fixed class of functions. Recently, deep learning combined with EDMD has been used to learn novel dictionary functions in an algorithm called deep dynamic mode decomposition (deepDMD). The learned representation both (1) accurately models and (2) scales well with the dimension of the original nonlinear system. In this paper we analyze the learned dictionaries from deepDMD and explore the theoretical basis for their strong performance. We explore State-Inclusive Logistic Lifting (SILL) dictionary functions to approximate Koopman observables. Error analysis of these dictionary functions shows that they satisfy a property of subspace approximation, which we define as uniform finite approximate closure. Our results provide a hypothesis to explain the success of deep neural networks in learning numerical approximations to Koopman operators. Part 2 of this paper will extend this explanation by demonstrating the subspace invariance of heterogeneous dictionaries and presenting a head-to-head numerical comparison of deepDMD and low-parameter heterogeneous dictionary learning.
    Reconstruction of Multivariate Sparse Signals from Mismatched Samples. (arXiv:2212.07368v1 [eess.SP])
    Erroneous correspondences between samples and their respective channels or targets commonly arise in several real-world applications. For instance, whole-brain calcium imaging of freely moving organisms, multiple target tracking, or multi-person contactless vital sign monitoring may be severely affected by mismatched sample-channel assignments. To systematically address this fundamental problem, we pose it as a signal reconstruction problem where we have lost correspondences between the samples and their respective channels. We show that, under the assumption that the signals of interest admit a sparse representation over an overcomplete dictionary, unique signal recovery is possible. Our derivations reveal that the problem is equivalent to a structured unlabeled sensing problem without precise knowledge of the sensing matrix. Unfortunately, existing methods are neither robust to errors in the regressors nor able to exploit the structure of the problem. Therefore, we propose a novel robust two-step approach for the reconstruction of shuffled sparse signals. The performance and robustness of the proposed approach are illustrated in an application to whole-brain calcium imaging in computational neuroscience. The proposed framework can be generalized to sparse signal representations beyond those considered in this work, to be applied in a variety of real-world problems with imprecise measurement or channel assignment.
    Maximal Initial Learning Rates in Deep ReLU Networks. (arXiv:2212.07295v1 [stat.ML])
    Training a neural network requires choosing a suitable learning rate, involving a trade-off between speed and effectiveness of convergence. While there has been considerable theoretical and empirical analysis of how large the learning rate can be, most prior work focuses only on late-stage training. In this work, we introduce the maximal initial learning rate $\eta^{\ast}$ - the largest learning rate at which a randomly initialized neural network can successfully begin training and achieve (at least) a given threshold accuracy. Using a simple approach to estimate $\eta^{\ast}$, we observe that in constant-width fully-connected ReLU networks, $\eta^{\ast}$ demonstrates different behavior to the maximum learning rate later in training. Specifically, we find that $\eta^{\ast}$ is well predicted as a power of $(\text{depth} \times \text{width})$, provided that (i) the width of the network is sufficiently large compared to the depth, and (ii) the input layer of the network is trained at a relatively small learning rate. We further analyze the relationship between $\eta^{\ast}$ and the sharpness $\lambda_{1}$ of the network at initialization, indicating that they are closely though not inversely related. We formally prove bounds for $\lambda_{1}$ in terms of $(\text{depth} \times \text{width})$ that align with our empirical results.
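    A toy sketch of the estimation idea: bisect for the largest step size at which a freshly initialized model still trains to a threshold accuracy. A linear model with squared loss stands in for the paper's fully-connected ReLU networks, so the recovered value is simply $2/\lambda_{\max}$ of the data second-moment matrix; all constants are illustrative assumptions.
```python
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (100, 10)), rng.normal(1, 1, (100, 10))])
y = np.repeat([-1.0, 1.0], 100)

def trains_to_threshold(lr, threshold=0.9, steps=200):
    """Full-batch gradient descent on squared loss from a fresh init."""
    w = np.zeros(10)
    for _ in range(steps):
        w -= lr * X.T @ (X @ w - y) / len(y)
        if not np.isfinite(w).all() or np.abs(w).max() > 1e6:
            return False                     # diverged
    return (np.sign(X @ w) == y).mean() >= threshold

lo, hi = 1e-4, 1e3                           # assumed: lo trains, hi diverges
for _ in range(30):                          # log-space bisection
    mid = float(np.sqrt(lo * hi))
    lo, hi = (mid, hi) if trains_to_threshold(mid) else (lo, mid)
print(f"estimated maximal initial learning rate ~ {lo:.3g}")
```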
    Hierarchical Over-the-Air FedGradNorm. (arXiv:2212.07414v1 [cs.LG])
    Multi-task learning (MTL) is a learning paradigm to learn multiple related tasks simultaneously with a single shared network where each task has a distinct personalized header network for fine-tuning. MTL can be integrated into a federated learning (FL) setting if tasks are distributed across clients and clients have a single shared network, leading to personalized federated learning (PFL). To cope with statistical heterogeneity in the federated setting across clients which can significantly degrade the learning performance, we use a distributed dynamic weighting approach. To perform the communication between the remote parameter server (PS) and the clients efficiently over the noisy channel in a power and bandwidth-limited regime, we utilize over-the-air (OTA) aggregation and hierarchical federated learning (HFL). Thus, we propose hierarchical over-the-air (HOTA) PFL with a dynamic weighting strategy which we call HOTA-FedGradNorm. Our algorithm considers the channel conditions during the dynamic weight selection process. We conduct experiments on a wireless communication system dataset (RadComDynamic). The experimental results demonstrate that the training speed with HOTA-FedGradNorm is faster compared to the algorithms with a naive static equal weighting strategy. In addition, HOTA-FedGradNorm provides robustness against the negative channel effects by compensating for the channel conditions during the dynamic weight selection process.
    Active Learning for Regression by Inverse Distance Weighting. (arXiv:2204.07177v3 [cs.LG] UPDATED)
    This paper proposes an active learning (AL) algorithm to solve regression problems based on inverse-distance weighting functions for selecting the feature vectors to query. The algorithm has the following features: (i) it supports both pool-based and population-based sampling; (ii) it is not tailored to a particular class of predictors; (iii) it can handle known and unknown constraints on the queryable feature vectors; and (iv) it can run either sequentially or in batch mode, depending on how often the predictor is retrained. The potential of the method is shown in numerical tests on illustrative synthetic problems and real-world datasets. An implementation of the algorithm, which we call IDEAL (Inverse-Distance based Exploration for Active Learning), is available at this http URL
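    A hedged sketch of the pool-based loop with an inverse-distance flavor: the predictor is an inverse-distance-weighting interpolant, and the acquisition (distance to the nearest labeled point) is a simplified stand-in for IDEAL's actual criterion.
```python
import numpy as np

rng = np.random.default_rng(0)
pool = rng.uniform(-2, 2, size=(200, 1))    # pool-based sampling

def f(x):                                   # unknown target with noise
    return np.sin(3 * x[:, 0]) + 0.05 * rng.standard_normal(len(x))

labeled = list(rng.choice(200, size=3, replace=False))
ylab = list(f(pool[labeled]))

def idw_predict(Xq, X, y, eps=1e-9):
    d2 = ((Xq[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    w = 1.0 / (d2 + eps)
    return (w * np.asarray(y)).sum(axis=1) / w.sum(axis=1)

for _ in range(15):                         # sequential active queries
    d2 = ((pool[:, None, :] - pool[labeled][None, :, :]) ** 2).sum(-1)
    score = d2.min(axis=1)                  # distance to nearest labeled point
    score[labeled] = -np.inf                # never re-query
    nxt = int(score.argmax())
    labeled.append(nxt)
    ylab.append(f(pool[[nxt]])[0])

pred = idw_predict(pool, pool[labeled], ylab)
print(f"mean abs error: {np.abs(pred - np.sin(3 * pool[:, 0])).mean():.3f}")
```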
    A deep learning approach to data-driven model-free pricing and to martingale optimal transport. (arXiv:2103.11435v3 [q-fin.CP] UPDATED)
    We introduce a novel and highly tractable supervised learning approach based on neural networks that can be applied to the computation of model-free price bounds of, potentially high-dimensional, financial derivatives and to the determination of optimal hedging strategies attaining these bounds. In particular, our methodology allows us to train a single neural network offline and then use it online for the fast determination of model-free price bounds of a whole class of financial derivatives with current market data. We show the applicability of this approach and highlight its accuracy in several examples involving real market data. Further, we show how a neural network can be trained to solve martingale optimal transport problems involving fixed marginal distributions instead of financial market data.
    AI-enabled exploration of Instagram profiles predicts soft skills and personality traits to empower hiring decisions. (arXiv:2212.07069v1 [cs.LG])
    It does not matter whether it is a job interview with Tech Giants, Wall Street firms, or a small startup; all candidates want to demonstrate their best selves or even present themselves better than they really are. Meanwhile, recruiters want to know candidates' authentic selves and detect the soft skills that prove an expert candidate would be a great fit in any company. Recruiters worldwide usually struggle to find employees with the highest level of these skills. Digital footprints can assist recruiters in this process by providing candidates' unique sets of online activities, and social media delivers one of the largest digital footprints for tracking people. In this study, for the first time, we show that a wide range of behavioral competencies, consisting of 16 in-demand soft skills, can be automatically predicted from Instagram profiles based on following lists and other quantitative features using machine learning algorithms. We also provide predictions on Big Five personality traits. Models were built based on a sample of 400 Iranian volunteer users who answered an online questionnaire and provided their Instagram usernames, which allowed us to crawl their public profiles. We applied several machine learning algorithms to the uniform data. Deep learning models mostly outperformed the alternatives, demonstrating 70% and 69% average accuracy in two-level and three-level classifications, respectively. Creating a large pool of people with the highest level of soft skills, and making more accurate evaluations of job candidates, is possible with the application of AI to social media user-generated data.
    MA-GCL: Model Augmentation Tricks for Graph Contrastive Learning. (arXiv:2212.07035v1 [cs.LG])
    Contrastive learning (CL), which can extract the information shared between different contrastive views, has become a popular paradigm for vision representation learning. Inspired by this success in computer vision, recent work introduces CL into graph modeling, dubbed graph contrastive learning (GCL). However, generating contrastive views in graphs is more challenging than in images, since we have little prior knowledge on how to significantly augment a graph without changing its labels. We argue that typical data augmentation techniques (e.g., edge dropping) in GCL cannot generate diverse enough contrastive views to filter out noise. Moreover, previous GCL methods employ two view encoders with exactly the same neural architecture and tied parameters, which further harms the diversity of augmented views. To address this limitation, we propose a novel paradigm named model-augmented GCL (MA-GCL), which focuses on manipulating the architectures of view encoders instead of perturbing graph inputs. Specifically, we present three easy-to-implement model augmentation tricks for GCL, namely asymmetric, random and shuffling, which can respectively help alleviate high-frequency noise, enrich training instances and bring safer augmentations. All three tricks are compatible with typical data augmentations. Experimental results show that MA-GCL can achieve state-of-the-art performance on node classification benchmarks by applying the three tricks on a simple base model. Extensive studies also validate our motivation and the effectiveness of each trick. (Code, data and appendix are available at https://github.com/GXM1141/MA-GCL. )
    Reproducible scaling laws for contrastive language-image learning. (arXiv:2212.07143v1 [cs.LG])
    Scaling up neural networks has led to remarkable performance across a wide range of tasks. Moreover, performance often follows reliable scaling laws as a function of training set size, model size, and compute, which offers valuable guidance as large-scale experiments are becoming increasingly expensive. However, previous work on scaling laws has primarily used private data & models or focused on uni-modal language or vision learning. To address these limitations, we investigate scaling laws for contrastive language-image pre-training (CLIP) with the public LAION dataset and the open-source OpenCLIP repository. Our large-scale experiments involve models trained on up to two billion image-text pairs and identify power law scaling for multiple downstream tasks including zero-shot classification, retrieval, linear probing, and end-to-end fine-tuning. We find that the training distribution plays a key role in scaling laws as the OpenAI and OpenCLIP models exhibit different scaling behavior despite identical model architectures and similar training recipes. We open-source our evaluation workflow and all models, including the largest public CLIP models, to ensure reproducibility and make scaling laws research more accessible. Source code and instructions to reproduce this study will be available at https://github.com/LAION-AI/scaling-laws-openclip
    Learning useful representations for shifting tasks and distributions. (arXiv:2212.07346v1 [cs.LG])
    Does the dominant approach to learning representations (as a side effect of optimizing an expected cost for a single training distribution) remain a good approach when we are dealing with multiple distributions? Our thesis is that such scenarios are better served by representations that are "richer" than those obtained with a single optimization episode. This is supported by a collection of empirical results obtained with an apparently naïve ensembling technique: concatenating the representations obtained from multiple training episodes using the same data, model, algorithm, and hyper-parameters, but different random seeds. These independently trained networks perform similarly. Yet, in a number of scenarios involving new distributions, the concatenated representation performs substantially better than an equivalently sized network trained from scratch. This proves that the representations constructed by multiple training episodes are in fact different. Although their concatenation carries little additional information about the training task under the training distribution, it becomes substantially more informative when tasks or distributions change. Meanwhile, a single training episode is unlikely to yield such a redundant representation, because the optimization process has no reason to accumulate features that do not incrementally improve the training performance.
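    The ensembling recipe is easy to reproduce in miniature: train the same architecture twice with different seeds, concatenate the penultimate representations, and fit a linear probe on top. The dataset and sizes below are illustrative assumptions.
```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X / 16.0, y, random_state=0)

def hidden_features(seed):
    net = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300,
                        random_state=seed).fit(Xtr, ytr)
    # penultimate (ReLU) activations of the trained network
    h = lambda Z: np.maximum(0, Z @ net.coefs_[0] + net.intercepts_[0])
    return h(Xtr), h(Xte)

H1tr, H1te = hidden_features(0)
H2tr, H2te = hidden_features(1)
probe = LogisticRegression(max_iter=1000)
single = probe.fit(H1tr, ytr).score(H1te, yte)
concat = probe.fit(np.hstack([H1tr, H2tr]), ytr).score(
    np.hstack([H1te, H2te]), yte)
print(f"single-seed probe: {single:.3f}  concatenated probe: {concat:.3f}")
```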
    Scheduling and Aggregation Design for Asynchronous Federated Learning over Wireless Networks. (arXiv:2212.07356v1 [cs.LG])
    Federated Learning (FL) is a collaborative machine learning (ML) framework that combines on-device training and server-based aggregation to train a common ML model among distributed agents. In this work, we propose an asynchronous FL design with periodic aggregation to tackle the straggler issue in FL systems. To cope with statistical heterogeneity across clients in the federated setting, which can significantly degrade the learning performance, we use a distributed dynamic weighting approach. To perform the communication between the remote parameter server (PS) and the clients efficiently over a noisy channel in a power- and bandwidth-limited regime, we utilize over-the-air (OTA) aggregation and hierarchical federated learning (HFL). Thus, we propose hierarchical over-the-air (HOTA) PFL with a dynamic weighting strategy, which we call HOTA-FedGradNorm. Our algorithm considers the channel conditions during the dynamic weight selection process. We conduct experiments on a wireless communication system dataset (RadComDynamic). The experimental results demonstrate that training with HOTA-FedGradNorm is faster than with algorithms using a naive static equal weighting strategy. In addition, HOTA-FedGradNorm provides robustness against negative channel effects by compensating for the channel conditions during the dynamic weight selection process.
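    As a minimal sketch of what an "age-aware" style of aggregation weighting might look like in asynchronous FL, the function below down-weights client updates computed on staler copies of the global model; the exponential decay law is an illustrative assumption, not the paper's exact design.
```python
import numpy as np

def age_aware_aggregate(global_w, updates, ages, decay=0.5):
    """updates: list of client model deltas; ages: rounds since each client
    last synced with the server. Staler updates receive smaller weights."""
    weights = np.array([decay ** a for a in ages], dtype=float)
    weights /= weights.sum()
    delta = sum(w * u for w, u in zip(weights, updates))
    return global_w + delta

w = np.zeros(4)
updates = [np.ones(4), 2 * np.ones(4), -np.ones(4)]
print(age_aware_aggregate(w, updates, ages=[0, 3, 1]))
```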
    Generating extreme quantum scattering in graphene with machine learning. (arXiv:2212.06929v1 [cond-mat.mes-hall])
    Graphene quantum dots provide a platform for manipulating electron behaviors in two-dimensional (2D) Dirac materials. Most previous works were of the "forward" type, in that the objective was to solve various confinement, transport and scattering problems with given structures that can be generated by, e.g., applying an external electrical field. There are applications such as cloaking or superscattering where the challenging problem of inverse design needs to be solved: finding a quantum-dot structure according to certain desired functional characteristics. A brute-force search of the system configuration based directly on the solutions of the Dirac equation is computationally infeasible. We articulate a machine-learning approach to the inverse-design problem, where artificial neural networks subject to physical constraints are exploited to replace the rigorous Dirac equation solver. In particular, we focus on the problem of designing a quantum-dot structure to generate both cloaking and superscattering in terms of the scattering efficiency as a function of energy. We construct a physical loss function that enables accurate prediction of the scattering characteristics. We demonstrate that, in the regime of Klein tunneling, the scattering efficiency can be designed to vary over two orders of magnitude, allowing any scattering curve to be generated from a proper combination of the gate potentials. Our physics-based machine-learning approach can be a powerful design tool for 2D Dirac material-based electronics.
    Explaining Agent's Decision-making in a Hierarchical Reinforcement Learning Scenario. (arXiv:2212.06967v1 [cs.AI])
    Reinforcement learning is a machine learning approach based on behavioral psychology. It is focused on learning agents that can acquire knowledge and learn to carry out new tasks by interacting with the environment. However, a problem occurs when reinforcement learning is used in critical contexts where the users of the system need more information about, and more reliability from, the actions executed by an agent. In this regard, explainable reinforcement learning seeks to provide an agent in training with methods to explain its behavior in such a way that users with no experience in machine learning can understand the agent's behavior. One of these is the memory-based explainable reinforcement learning method, which computes probabilities of success for each state-action pair using an episodic memory. In this work, we propose to make use of the memory-based explainable reinforcement learning method in a hierarchical environment composed of sub-tasks that need to be addressed first in order to solve a more complex task. The end goal is to verify whether it is possible to provide the agent with the ability to explain its actions in the global task as well as in the sub-tasks. The results obtained show that it is possible to use the memory-based method in hierarchical environments with high-level tasks and to compute the probabilities of success as a basis for explaining the agent's behavior.
    Hybrid Multi-agent Deep Reinforcement Learning for Autonomous Mobility on Demand Systems. (arXiv:2212.07313v1 [cs.LG])
    We consider the sequential decision-making problem of making proactive request assignment and rejection decisions for a profit-maximizing operator of an autonomous mobility-on-demand system. We formalize this problem as a Markov decision process and propose a novel combination of multi-agent Soft Actor-Critic and weighted bipartite matching to obtain an anticipative control policy. Thereby, we factorize the operator's otherwise intractable action space, while still obtaining a globally coordinated decision. Experiments based on real-world taxi data show that our method outperforms state-of-the-art benchmarks with respect to performance, stability, and computational tractability.
    Generative Robust Classification. (arXiv:2212.07283v1 [cs.LG])
    Training adversarially robust discriminative (i.e., softmax) classifiers has been the dominant approach to robust classification. Building on recent work on adversarial training (AT)-based generative models, we investigate using AT to learn unnormalized class-conditional density models and then performing generative robust classification. Our results show that, under the condition of similar model capacities, the generative robust classifier achieves performance comparable to a baseline softmax robust classifier when the test data is clean or when the test perturbation is of limited size, and much better performance when the test perturbation size exceeds the training perturbation size. The generative classifier is also able to generate samples or counterfactuals that more closely resemble the training data, suggesting that the generative classifier can better capture the class-conditional distributions. In contrast to standard discriminative adversarial training, where advanced data augmentation techniques are only effective when combined with weight averaging, we find it straightforward to apply advanced data augmentation to achieve better robustness in our approach. Our results suggest that the generative classifier is a competitive alternative to robust classification, especially for problems with a limited number of classes.
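    Generative classification itself is easy to sketch: fit a class-conditional density per class and predict $\arg\max_y \log p(x \mid y) + \log p(y)$. Gaussian densities below stand in for the paper's adversarially trained class-conditional models, so this illustrates only the decision rule, not the robust training.
```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
X0 = rng.normal(-1, 1, (200, 2))          # class-0 training samples
X1 = rng.normal(1, 1, (200, 2))           # class-1 training samples

def fit_gaussian(X):
    # class-conditional density estimate (regularized covariance)
    return multivariate_normal(mean=X.mean(0), cov=np.cov(X.T) + 1e-6 * np.eye(2))

g0, g1 = fit_gaussian(X0), fit_gaussian(X1)
Xte = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
yte = np.repeat([0, 1], 50)
pred = (g1.logpdf(Xte) > g0.logpdf(Xte)).astype(int)   # equal priors assumed
print(f"test accuracy: {(pred == yte).mean():.2f}")
```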
    Speech and Natural Language Processing Technologies for Pseudo-Pilot Simulator. (arXiv:2212.07164v1 [cs.CL])
    This paper describes a simple yet efficient repetition-based modular system for speeding up air-traffic controller (ATCo) training. For example, a human pilot is still required in EUROCONTROL's ESCAPE lite simulator (see https://www.eurocontrol.int/simulator/escape) during ATCo training. However, this need can be met by an automatic system acting as a pilot. In this paper, we aim to develop and integrate a pseudo-pilot agent into the ATCo training pipeline by merging diverse artificial intelligence (AI) powered modules. The system understands the voice communications issued by the ATCo and, in turn, generates a spoken prompt that follows the pilot's phraseology in response to the initial communication. Our system mainly relies on open-source AI tools and air traffic control (ATC) databases, demonstrating its simplicity and ease of replication. The overall pipeline is composed of the following: (1) a submodule that receives and pre-processes the input stream of raw audio, (2) an automatic speech recognition (ASR) system that transforms audio into a sequence of words, (3) a high-level ATC-related entity parser that extracts relevant information from the communication, i.e., callsigns and commands, and (4) a speech synthesizer submodule that generates responses based on the high-level ATC entities previously extracted. Overall, we show that this system could pave the way toward developing a real proof-of-concept pseudo-pilot system, speeding up ATCo training while drastically reducing its overall cost.
    Traffic Flow Prediction via Variational Bayesian Inference-based Encoder-Decoder Framework. (arXiv:2212.07194v1 [cs.LG])
    Accurate traffic flow prediction, a hotspot of intelligent transportation research, is a prerequisite for mastering traffic and making travel plans. The speed of traffic flow can be affected by road conditions, weather, holidays, etc. Furthermore, the sensors that capture traffic flow information are subject to interference from environmental factors such as illumination, collection time, occlusion, etc. Therefore, the traffic flow in a practical transportation system is complicated, uncertain, and challenging to predict accurately. This paper proposes a deep encoder-decoder prediction framework based on variational Bayesian inference. A Bayesian neural network is constructed by combining variational inference with gated recurrent units (GRU) and used as the deep neural network unit of the encoder-decoder framework to mine the intrinsic dynamics of traffic flow. Then, variational inference is introduced into the multi-head attention mechanism to avoid noise-induced deterioration of prediction accuracy. The proposed model achieves superior prediction performance on the Guangzhou urban traffic flow dataset over the benchmarks, particularly for long-term prediction.
    SMSMix: Sense-Maintained Sentence Mixup for Word Sense Disambiguation. (arXiv:2212.07072v1 [cs.CL])
    Word Sense Disambiguation (WSD) is an NLP task aimed at determining the correct sense of a word in a sentence from discrete sense choices. Although current systems have attained unprecedented performance for such tasks, the nonuniform distribution of word senses during training generally results in systems performing poorly on rare senses. To this end, we consider data augmentation to increase the frequency of these least frequent senses (LFS) and reduce the distributional bias of senses during training. We propose Sense-Maintained Sentence Mixup (SMSMix), a novel word-level mixup method that maintains the sense of a target word. SMSMix smoothly blends two sentences using mask prediction while preserving the relevant span determined by saliency scores to maintain a specific word's sense. To the best of our knowledge, this is the first attempt to apply mixup in NLP while preserving the meaning of a specific word. With extensive experiments, we validate that our augmentation method can effectively provide more information about rare senses during training while maintaining the target sense label.
    APOLLO: An Optimized Training Approach for Long-form Numerical Reasoning. (arXiv:2212.07249v1 [cs.CL])
    Long-form numerical reasoning in financial analysis aims to generate a reasoning program to calculate the correct answer for a given question. Previous work followed a retriever-generator framework, where the retriever selects key facts from a long-form document, and the generator generates a reasoning program based on the retrieved facts. However, such work treated all facts equally without considering the different contributions of facts with and without numbers. Meanwhile, program consistency was ignored under supervised training, resulting in lower training accuracy and diversity. To solve these problems, we propose APOLLO to improve the long-form numerical reasoning framework. For the retriever, we adopt a number-aware negative sampling strategy to make the retriever more discriminative on key numerical facts. For the generator, we design a consistency-based reinforcement learning and target program augmentation strategy based on the consistency of program execution results. Experimental results on the FinQA and ConvFinQA leaderboards verify the effectiveness of our proposed method, achieving a new state of the art.
    AsPOS: Assamese Part of Speech Tagger using Deep Learning Approach. (arXiv:2212.07043v1 [cs.CL])
    Part of Speech (POS) tagging is crucial to Natural Language Processing (NLP). It is a well-studied topic in several resource-rich languages. However, the development of computational linguistic resources is still in its infancy for many languages that are historically and literarily rich. Assamese, an Indian scheduled language spoken by more than 25 million people, falls under this category. In this paper, we present a Deep Learning (DL)-based POS tagger for Assamese. The development process is divided into two stages. In the first phase, several pre-trained word embeddings are employed to train several tagging models. This allows us to evaluate the performance of the word embeddings in the POS tagging task. The top-performing model from the first phase is employed to annotate another set of new sentences. In the second phase, the model is trained further using the fresh dataset. Finally, we attain an F1 score of 86.52%. The model may serve as a baseline for further study on DL-based Assamese POS tagging.
    Establishing a stronger baseline for lightweight contrastive models. (arXiv:2212.07158v1 [cs.CV])
    Recent research has reported a performance degradation in self-supervised contrastive learning for specially designed efficient networks, such as MobileNet and EfficientNet. A common practice to address this problem is to introduce a pretrained contrastive teacher model and train the lightweight networks with distillation signals generated by the teacher. However, it is time- and resource-consuming to pretrain a teacher model when one is not available. In this work, we aim to establish a stronger baseline for lightweight contrastive models without using a pretrained teacher model. Specifically, we show that the optimal recipe for efficient models is different from that of larger models, and using the same training settings as ResNet50, as previous research does, is inappropriate. Additionally, we observe a common issue in contrastive learning where either the positive or negative views can be noisy, and propose a smoothed version of the InfoNCE loss to alleviate this problem. As a result, we successfully improve the linear evaluation results from 36.3\% to 62.3\% for MobileNet-V3-Large and from 42.2\% to 65.8\% for EfficientNet-B0 on ImageNet, closing the accuracy gap to ResNet50 with $5\times$ fewer parameters. We hope our research will facilitate the usage of lightweight contrastive models.
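    To make the smoothing idea concrete, here is a minimal PyTorch sketch of an InfoNCE loss with smoothed targets over in-batch negatives. The function name, the in-batch-negatives setup, and the use of cross-entropy label smoothing as the smoothing mechanism are illustrative assumptions; the paper's exact formulation may differ.

        import torch
        import torch.nn.functional as F

        def smoothed_info_nce(z1, z2, temperature=0.1, smoothing=0.1):
            # z1, z2: (N, D) L2-normalized embeddings of two views of the
            # same batch; diagonal entries are the positive pairs.
            logits = z1 @ z2.t() / temperature
            targets = torch.arange(z1.size(0), device=z1.device)
            # Label smoothing spreads a little target mass onto the
            # negatives, softening the penalty when a view is noisy.
            return F.cross_entropy(logits, targets, label_smoothing=smoothing)

    In practice the inputs would come from something like z = F.normalize(encoder(x), dim=1) for each augmented view.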
    Explainable Artificial Intelligence in Retinal Imaging for the detection of Systemic Diseases. (arXiv:2212.07058v1 [eess.IV])
    We present Explainable Artificial Intelligence (AI) in the form of an interpretable and semiautomatic approach to stage grading ocular pathologies such as diabetic retinopathy, hypertensive retinopathy, and other retinopathies against the backdrop of major systemic diseases. The experimental study aims to evaluate an explainable staged grading process without using deep Convolutional Neural Networks (CNNs) directly. Many current CNN-based deep neural networks used for diagnosing retinal disorders achieve appreciable performance but fail to pinpoint the basis for their decisions. To improve the transparency of these decisions, we propose a clinician-in-the-loop assisted intelligent workflow that performs a retinal vascular assessment on fundus images to derive quantifiable and descriptive parameters. The retinal vessel parameter meta-data serve as hyper-parameters for better interpretation and explainability of decisions. The semiautomatic methodology aims at a federated approach to AI in healthcare applications with more inputs and interpretations from clinicians. The baseline machine learning pipeline uses image processing techniques for optic disc detection, vessel segmentation, and arteriole/venule identification.
    Uncertainty Quantification for Deep Neural Networks: An Empirical Comparison and Usage Guidelines. (arXiv:2212.07118v1 [cs.SE])
    Deep Neural Networks (DNN) are increasingly used as components of larger software systems that need to process complex data, such as images, written texts, and audio/video signals. DNN predictions cannot be assumed to be always correct for several reasons, among them the huge input space that is dealt with, the ambiguity of some input data, and the intrinsic properties of learning algorithms, which can provide only statistical guarantees. Hence, developers have to cope with some residual error probability. An architectural pattern commonly adopted to manage failure-prone components is the supervisor: an additional component that can estimate the reliability of the predictions made by untrusted (e.g., DNN) components and can activate an automated healing procedure when these are likely to fail, ensuring that the Deep Learning based System (DLS) does not cause damage, despite its main functionality being suspended. In this paper, we consider DLS that implement a supervisor by means of uncertainty estimation. After overviewing the main approaches to uncertainty estimation and discussing their pros and cons, we motivate the need for a specific empirical assessment method that can deal with the experimental setting in which supervisors are used, where the accuracy of the DNN matters only as long as the supervisor lets the DLS continue to operate. Then we present a large empirical study conducted to compare the alternative approaches to uncertainty estimation. We distill a set of guidelines that help developers incorporate a supervisor based on uncertainty monitoring into a DLS.
    Particle-Based Score Estimation for State Space Model Learning in Autonomous Driving. (arXiv:2212.06968v1 [cs.RO])
    Multi-object state estimation is a fundamental problem for robotic applications where a robot must interact with other moving objects. Typically, other objects' relevant state features are not directly observable and must instead be inferred from observations. Particle filtering can perform such inference given approximate transition and observation models. However, these models are often unknown a priori, yielding a difficult parameter estimation problem since observations jointly carry transition and observation noise. In this work, we consider learning maximum-likelihood parameters using particle methods. Recent methods addressing this problem typically differentiate through time in a particle filter, which requires workarounds for the non-differentiable resampling step that yield biased or high-variance gradient estimates. By contrast, we exploit Fisher's identity to obtain a particle-based approximation of the score function (the gradient of the log likelihood) that yields a low-variance estimate while only requiring stepwise differentiation through the transition and observation models. We apply our method to real data collected from autonomous vehicles (AVs) and show that it learns better models than existing techniques and is more stable in training, yielding an effective smoother for tracking the trajectories of vehicles around an AV.
    AI Ethics on Blockchain: Topic Analysis on Twitter Data for Blockchain Security. (arXiv:2212.06951v1 [cs.CR])
    Blockchain has empowered computer systems to be more secure using a distributed network. However, the current blockchain design suffers from fairness issues in transaction ordering. Miners are able to reorder transactions to generate profits, the so-called miner extractable value (MEV). Existing research recognizes MEV as a severe security issue and proposes potential solutions, including the prominent Flashbots. However, previous studies have mostly analyzed blockchain data, which might not capture the impacts of MEV in the much broader AI society. Thus, in this research, we applied natural language processing (NLP) methods to comprehensively analyze topics in tweets on MEV. We collected more than 20,000 tweets with #MEV and #Flashbots hashtags and analyzed their topics. Our results show that the tweets discussed profound topics of ethical concern, including security, equity, emotional sentiments, and the desire for solutions to MEV. We also identify the co-movements of MEV activities on blockchain and social media platforms. Our study contributes to the literature at the interface of blockchain security, MEV solutions, and AI ethics.
    On the Relationship Between Explanation and Prediction: A Causal View. (arXiv:2212.06925v1 [cs.LG])
    Explainability has become a central requirement for the development, deployment, and adoption of machine learning (ML) models, and we are yet to understand what explanation methods can and cannot do. Several factors, such as data, model prediction, hyperparameters used in training the model, and random initialization, can all influence downstream explanations. While previous work empirically hinted that explanations (E) may have little relationship with the prediction (Y), there is a lack of conclusive study to quantify this relationship. Our work borrows tools from causal inference to systematically assay this relationship. More specifically, we measure the relationship between E and Y by measuring the treatment effect when intervening on their causal ancestors, i.e., the hyperparameters and inputs used to generate saliency-based explanations (E) or predictions (Y). We discover that Y's relative direct influence on E follows an odd pattern: the influence is higher in the lowest-performing models than in mid-performing models, and it then decreases in the top-performing models. We believe our work is a promising first step towards providing better guidance for practitioners, who can make more informed decisions in utilizing these explanations by knowing what factors are at play and how they relate to their end task.
    Simulating 2+1D Lattice Quantum Electrodynamics at Finite Density with Neural Flow Wavefunctions. (arXiv:2212.06835v1 [hep-lat])
    We present a neural flow wavefunction, Gauge-Fermion FlowNet, and use it to simulate 2+1D lattice compact quantum electrodynamics with finite-density dynamical fermions. The gauge field is represented by a neural network which parameterizes a discretized flow-based transformation of the amplitude, while the fermionic sign structure is represented by a neural net backflow. This approach directly represents the $U(1)$ degree of freedom without any truncation, obeys Gauss's law by construction, samples autoregressively avoiding any equilibration time, and variationally simulates Gauge-Fermion systems with sign problems accurately. In this model, we investigate confinement and string breaking phenomena in different fermion density and hopping regimes. We study the phase transition from the charge crystal phase to the vacuum phase at zero density, and observe the phase separation and the net charge penetration blocking effect under magnetic interaction at finite density. In addition, we investigate a magnetic phase transition due to the competition between the kinetic energy of fermions and the magnetic energy of the gauge field. With our method, we further note potential differences in the order of the phase transitions between a continuous $U(1)$ system and one with finite truncation. Our state-of-the-art neural network approach opens up new possibilities to study different gauge theories coupled to dynamical matter in higher dimensions.
    Error-Aware B-PINNs: Improving Uncertainty Quantification in Bayesian Physics-Informed Neural Networks. (arXiv:2212.06965v1 [cs.LG])
    Physics-Informed Neural Networks (PINNs) are gaining popularity as a method for solving differential equations. While more feasible in some contexts than classical numerical techniques, PINNs still lack credibility. A remedy can be found in Uncertainty Quantification (UQ), which is just beginning to emerge in the context of PINNs. Assessing how well the trained PINN complies with the imposed differential equation is key to tackling uncertainty, yet there is a lack of a comprehensive methodology for this task. We propose a framework for UQ in Bayesian PINNs (B-PINNs) that incorporates the discrepancy between the B-PINN solution and the unknown true solution. We exploit recent results on error bounds for PINNs on linear dynamical systems and demonstrate the predictive uncertainty on a class of linear ODEs.
    Deep Negative Correlation Classification. (arXiv:2212.07070v1 [cs.CV])
    Ensemble learning serves as a straightforward way to improve the performance of almost any machine learning algorithm. Existing deep ensemble methods usually naively train many different models and then aggregate their predictions. In our view, this is suboptimal in two respects: i) naively training multiple models adds a much larger computational burden, especially in the deep learning era; ii) purely optimizing each base model without considering their interactions limits the diversity of the ensemble and the performance gains. We tackle these issues by proposing deep negative correlation classification (DNCC), in which the accuracy and diversity trade-off is systematically controlled by seamlessly decomposing the loss function into individual accuracy and the correlation between individual models and the ensemble. DNCC yields a deep classification ensemble where each individual estimator is both accurate and negatively correlated. Thanks to the optimized diversity, DNCC works well even when utilizing a shared network backbone, which significantly improves its efficiency compared with most existing ensemble systems. Extensive experiments on multiple benchmark datasets and network structures demonstrate the superiority of the proposed method.
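    As a rough illustration of the accuracy/diversity decomposition, the sketch below implements classical negative correlation learning over a list of member logits; the hypothetical dncc_loss function and the choice of penalty are assumptions, not the paper's exact loss.

        import torch
        import torch.nn.functional as F

        def dncc_loss(logits_list, targets, lam=0.1):
            # Ensemble prediction is the mean of the members' logits.
            ensemble = torch.stack(logits_list).mean(dim=0)
            loss = 0.0
            for logits in logits_list:
                dev = logits - ensemble
                # Deviations sum to zero across members, so the classical
                # negative-correlation penalty dev_i * sum_{j != i} dev_j
                # reduces to -dev_i**2; subtracting it rewards diversity.
                loss = loss + F.cross_entropy(logits, targets) \
                            - lam * (dev ** 2).sum(dim=1).mean()
            return loss

    With a shared backbone, logits_list would simply be the outputs of the ensemble's separate heads on the same features.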
    Cutting Plane Selection with Analytic Centers and Multiregression. (arXiv:2212.07231v1 [math.OC])
    Cutting planes are a crucial component of state-of-the-art mixed-integer programming solvers, with the choice of which subset of cuts to add being vital for solver performance. We propose new distance-based measures to qualify the value of a cut by quantifying the extent to which it separates relevant parts of the relaxed feasible set. For this purpose, we use the analytic centers of the relaxation polytope or of its optimal face, as well as alternative optimal solutions of the linear programming relaxation. We assess the impact of the choice of distance measure on root node performance and throughout the whole branch-and-bound tree, comparing our measures against those prevalent in the literature. Finally, using multi-output regression, we predict the relative performance of each measure from static features readily available before the separation process. Our results indicate that analytic center-based methods help to significantly reduce the number of branch-and-bound nodes needed to explore the search space and that our multiregression approach can further improve on any individual method.
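    For intuition, a distance-based cut measure of this kind can be as simple as the Euclidean depth by which a cut separates a reference point such as an analytic center. The following numpy sketch, with an illustrative scoring rule, shows the idea; the paper's measures are more refined.

        import numpy as np

        def cutoff_distance(a, b, x_ref):
            # Signed distance from reference point x_ref (e.g., an analytic
            # center) to the cut a^T x <= b; positive means x_ref is cut
            # off, and deeper cuts get larger scores.
            return (a @ x_ref - b) / np.linalg.norm(a)

        # Rank candidate cuts by how deeply they separate the analytic center.
        cuts = [(np.array([1.0, 0.0]), 0.5), (np.array([1.0, 1.0]), 1.0)]
        x_center = np.array([1.0, 1.0])
        best = max(cuts, key=lambda cut: cutoff_distance(cut[0], cut[1], x_center))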
    Improving group robustness under noisy labels using predictive uncertainty. (arXiv:2212.07026v1 [cs.LG])
    The standard empirical risk minimization (ERM) can underperform on certain minority groups (i.e., waterbirds on land or landbirds on water) due to the spurious correlation between the input and its label. Several studies have improved the worst-group accuracy by focusing on high-loss samples. The hypothesis behind this is that such high-loss samples are \textit{spurious-cue-free} (SCF) samples. However, these approaches can be problematic since high-loss samples may also be samples with noisy labels in real-world scenarios. To resolve this issue, we utilize the predictive uncertainty of a model to improve the worst-group accuracy under noisy labels. To motivate this, we theoretically show that high-uncertainty samples are the SCF samples in the binary classification problem. This theoretical result implies that predictive uncertainty is an adequate indicator for identifying SCF samples in a noisy label setting. Motivated by this, we propose a novel ENtropy based Debiasing (END) framework that prevents models from learning the spurious cues while remaining robust to noisy labels. In the END framework, we first train the \textit{identification model} to obtain the SCF samples from a training set using its predictive uncertainty. Then, another model is trained on the dataset augmented with an oversampled SCF set. The experimental results show that our END framework outperforms other strong baselines on several real-world benchmarks that consider both noisy labels and spurious cues.
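    A minimal sketch of the uncertainty-based identification step, assuming predictive entropy as the uncertainty measure and a simple quantile threshold for selecting candidate SCF samples (the helper names and the threshold rule are illustrative, not the paper's exact procedure):

        import torch

        def predictive_entropy(probs, eps=1e-12):
            # probs: (N, C) softmax outputs of the identification model.
            return -(probs * (probs + eps).log()).sum(dim=1)

        def select_scf_indices(probs, quantile=0.8):
            # Flag the highest-entropy samples as candidate spurious-cue-free
            # (SCF) samples; the debiased model is then trained on a set in
            # which these indices are oversampled.
            ent = predictive_entropy(probs)
            threshold = torch.quantile(ent, quantile)
            return torch.nonzero(ent >= threshold).squeeze(1)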
    Significantly improving zero-shot X-ray pathology classification via fine-tuning pre-trained image-text encoders. (arXiv:2212.07050v1 [cs.LG])
    Deep neural networks have been successfully adopted in diverse domains, including pathology classification based on medical images. However, large-scale, high-quality data to train powerful neural networks are rare in the medical domain, as the labeling must be done by qualified experts. Researchers recently tackled this problem with some success by taking advantage of models pre-trained on large-scale general domain data. Specifically, researchers took contrastive image-text encoders (e.g., CLIP) and fine-tuned them with chest X-ray images and paired reports to perform zero-shot pathology classification, thus completely removing the need for pathology-annotated images to train a classification model. Existing studies, however, fine-tuned the pre-trained model with the same contrastive learning objective and failed to exploit the multi-labeled nature of medical image-report pairs. In this paper, we propose a new fine-tuning strategy based on sentence sampling and positive-pair loss relaxation for improving downstream zero-shot pathology classification performance, which can be applied to any pre-trained contrastive image-text encoder. Our method consistently showed dramatically improved zero-shot pathology classification performance on four different chest X-ray datasets and three different pre-trained models (5.77% average AUROC increase). In particular, fine-tuning CLIP with our method performed comparably to, or marginally better than, board-certified radiologists (0.619 vs. 0.625 in F1 score and 0.530 vs. 0.544 in MCC) in zero-shot classification of five prominent diseases from the CheXpert dataset.
    Statistical Safety and Robustness Guarantees for Feedback Motion Planning of Unknown Underactuated Stochastic Systems. (arXiv:2212.06874v1 [cs.RO])
    We present a method for providing statistical guarantees on runtime safety and goal reachability for integrated planning and control of a class of systems with unknown nonlinear stochastic underactuated dynamics. Specifically, given a dynamics dataset, our method jointly learns a mean dynamics model, a spatially-varying disturbance bound that captures the effect of noise and model mismatch, and a feedback controller based on contraction theory that stabilizes the learned dynamics. We propose a sampling-based planner that uses the mean dynamics model and simultaneously bounds the closed-loop tracking error via a learned disturbance bound. We employ techniques from Extreme Value Theory (EVT) to estimate, to a specified level of confidence, several constants which characterize the learned components and govern the size of the tracking error bound. This ensures plans are guaranteed to be safely tracked at runtime. We validate that our guarantees translate to empirical safety in simulation on a 10D quadrotor, and in the real world on a physical CrazyFlie quadrotor and Clearpath Jackal robot, whereas baselines that ignore the model error and stochasticity are unsafe.
    Controlling Commercial Cooling Systems Using Reinforcement Learning. (arXiv:2211.07357v2 [cs.LG] UPDATED)
    This paper is a technical overview of DeepMind and Google's recent work on reinforcement learning for controlling commercial cooling systems. Building on expertise that began with cooling Google's data centers more efficiently, we recently conducted live experiments on two real-world facilities in partnership with Trane Technologies, a building management system provider. These live experiments had a variety of challenges in areas such as evaluation, learning from offline data, and constraint satisfaction. Our paper describes these challenges in the hope that awareness of them will benefit future applied RL work. We also describe the way we adapted our RL system to deal with these challenges, resulting in energy savings of approximately 9% and 13% respectively at the two live experiment sites.
    Toroidal Coordinates: Decorrelating Circular Coordinates With Lattice Reduction. (arXiv:2212.07201v1 [cs.CG])
    The circular coordinates algorithm of de Silva, Morozov, and Vejdemo-Johansson takes as input a dataset together with a cohomology class representing a $1$-dimensional hole in the data; the output is a map from the data into the circle that captures this hole, and that is of minimum energy in a suitable sense. However, when applied to several cohomology classes, the output circle-valued maps can be "geometrically correlated" even if the chosen cohomology classes are linearly independent. It is shown in the original work that less correlated maps can be obtained with suitable integer linear combinations of the cohomology classes, with the linear combinations being chosen by inspection. In this paper, we identify a formal notion of geometric correlation between circle-valued maps which, in the Riemannian manifold case, corresponds to the Dirichlet form, a bilinear form derived from the Dirichlet energy. We describe a systematic procedure for constructing low energy torus-valued maps on data, starting from a set of linearly independent cohomology classes. We showcase our procedure with computational examples. Our main algorithm is based on the Lenstra--Lenstra--Lov\'asz algorithm from computational number theory.
    Task-Adaptive Meta-Learning Framework for Advancing Spatial Generalizability. (arXiv:2212.06864v1 [cs.LG])
    Spatio-temporal machine learning is critically needed for a variety of societal applications, such as agricultural monitoring, hydrological forecast, and traffic management. These applications greatly rely on regional features that characterize spatial and temporal differences. However, spatio-temporal data are often complex and pose several unique challenges for machine learning models: 1) multiple models are needed to handle region-based data patterns that have significant spatial heterogeneity across different locations; 2) local models trained on region-specific data have limited ability to adapt to other regions that have large diversity and abnormality; 3) spatial and temporal variations entangle data complexity that requires more robust and adaptive models; 4) limited spatial-temporal data in real scenarios (e.g., crop yield data is collected only once a year) makes the problems intrinsically challenging. To bridge these gaps, we propose task-adaptive formulations and a model-agnostic meta-learning framework that ensembles regionally heterogeneous data into location-sensitive meta tasks. We conduct task adaptation following an easy-to-hard task hierarchy in which different meta models are adapted to tasks of different difficulty levels. One major advantage of our proposed method is that it improves the model adaptation to a large number of heterogeneous tasks. It also enhances the model generalization by automatically adapting the meta model of the corresponding difficulty level to any new tasks. We demonstrate the superiority of our proposed framework over a diverse set of baselines and state-of-the-art meta-learning frameworks. Our extensive experiments on real crop yield data show the effectiveness of the proposed method in handling spatial-related heterogeneous tasks in real societal applications.
    In-Season Crop Progress in Unsurveyed Regions using Networks Trained on Synthetic Data. (arXiv:2212.06896v1 [cs.LG])
    Many commodity crops have growth stages during which they are particularly vulnerable to stress-induced yield loss. In-season crop progress information is useful for quantifying crop risk, and satellite remote sensing (RS) can be used to track progress at regional scales. At present, all existing RS-based crop progress estimation (CPE) methods which target crop-specific stages rely on ground truth data for training/calibration. This reliance on ground survey data confines CPE methods to surveyed regions, limiting their utility. In this study, a new method is developed for conducting RS-based in-season CPE in unsurveyed regions by combining data from surveyed regions with synthetic crop progress data generated for an unsurveyed region. Corn-growing zones in Argentina were used as surrogate 'unsurveyed' regions. Existing weather generation, crop growth, and optical radiative transfer models were linked to produce synthetic weather, crop progress, and canopy reflectance data. A neural network (NN) method based upon bi-directional Long Short-Term Memory was trained separately on surveyed data, synthetic data, and two different combinations of surveyed and synthetic data. A stopping criterion was developed which uses the weighted divergence of surveyed and synthetic data validation loss. Net F1 scores across all crop progress stages increased by 8.7% when trained on a combination of surveyed region and synthetic data, and overall performance was only 21% lower than when the NN was trained on surveyed data and applied in the US Midwest. Performance gain from synthetic data was greatest in zones with dual planting windows, while the inclusion of surveyed region data from the US Midwest helped mitigate NN sensitivity to noise in NDVI data. Overall results suggest in-season CPE in other unsurveyed regions may be possible with increased quantity and variety of synthetic crop progress data.
    Towards Efficient and Domain-Agnostic Evasion Attack with High-dimensional Categorical Inputs. (arXiv:2212.06836v1 [cs.LG])
    Our work targets searching for feasible adversarial perturbations to attack a classifier with high-dimensional categorical inputs in a domain-agnostic setting. This is intrinsically an NP-hard knapsack problem where the exploration space grows explosively as the feature dimension increases. Without the help of domain knowledge, solving this problem via heuristic methods, such as Branch-and-Bound, suffers from exponential complexity, yet can yield arbitrarily bad attack results. We address the challenge through the lens of multi-armed bandit based combinatorial search. Our proposed method, namely FEAT, treats modifying each categorical feature as pulling an arm in multi-armed bandit programming. Our objective is to achieve a highly efficient and effective attack using an Orthogonal Matching Pursuit (OMP)-enhanced Upper Confidence Bound (UCB) exploration strategy. Our theoretical analysis bounding the regret gap of FEAT guarantees its practical attack performance. In empirical analysis, we compare FEAT with other state-of-the-art domain-agnostic attack methods over various real-world categorical data sets of different applications. Substantial experimental observations confirm the expected efficiency and attack effectiveness of FEAT applied in different application scenarios. Our work further hints at the applicability of FEAT for assessing the adversarial vulnerability of classification systems with high-dimensional categorical inputs.
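    The bandit view can be made concrete with the standard UCB1 rule; in a FEAT-style attack each arm would correspond to modifying one categorical feature, and the running means would track the observed attack reward. This sketch omits the OMP enhancement described in the paper, so it is only a baseline illustration.

        import math

        def ucb_arm(counts, values, t, c=2.0):
            # counts[arm]: number of pulls so far; values[arm]: running mean
            # reward; t: total number of rounds played.
            best, best_score = None, float("-inf")
            for arm, n in enumerate(counts):
                if n == 0:
                    return arm  # explore each untried feature first
                score = values[arm] + math.sqrt(c * math.log(t) / n)
                if score > best_score:
                    best, best_score = arm, score
            return best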
    Interactive Learning with Pricing for Optimal and Stable Allocations in Markets. (arXiv:2212.06891v1 [cs.LG])
    Large-scale online recommendation systems must facilitate the allocation of a limited number of items among competing users while learning their preferences from user feedback. As a principled way of incorporating market constraints and user incentives in the design, we consider our objectives to be two-fold: maximal social welfare with minimal instability. To maximize social welfare, our proposed framework enhances the quality of recommendations by exploring allocations that optimistically maximize the rewards. To minimize instability, a measure of users' incentives to deviate from recommended allocations, the algorithm prices the items based on a scheme derived from the Walrasian equilibria. Though it is known that these equilibria yield stable prices for markets with known user preferences, our approach accounts for the inherent uncertainty in the preferences and further ensures that the users accept their recommendations under offered prices. To the best of our knowledge, our approach is the first to integrate techniques from combinatorial bandits, optimal resource allocation, and collaborative filtering to obtain an algorithm that achieves sub-linear social welfare regret as well as sub-linear instability. Empirical studies on synthetic and real-world data also demonstrate the efficacy of our strategy compared to approaches that do not fully incorporate all these aspects.
    Learning and Predicting Multimodal Vehicle Action Distributions in a Unified Probabilistic Model Without Labels. (arXiv:2212.07013v1 [cs.RO])
    We present a unified probabilistic model that learns a representative set of discrete vehicle actions and predicts the probability of each action given a particular scenario. Our model also enables us to estimate the distribution over continuous trajectories conditioned on a scenario, representing what each discrete action would look like if executed in that scenario. While our primary objective is to learn representative action sets, these capabilities combine to produce accurate multimodal trajectory predictions as a byproduct. Although our learned action representations closely resemble semantically meaningful categories (e.g., "go straight", "turn left", etc.), our method is entirely self-supervised and does not utilize any manually generated labels or categories. Our method builds upon recent advances in variational inference and deep unsupervised clustering, resulting in full distribution estimates based on deterministic model evaluations.
    Reinforcement Learning in System Identification. (arXiv:2212.07123v1 [cs.LG])
    System identification, also known as learning forward models, transfer functions, system dynamics, etc., has a long tradition both in science and engineering in different fields. In particular, it is a recurring theme in Reinforcement Learning research, where forward models approximate the state transition function of a Markov Decision Process by learning a mapping function from the current state and action to the next state. This problem is commonly defined as a Supervised Learning problem in a direct way. This common approach faces several difficulties due to the inherent complexities of the dynamics to learn, for example, delayed effects, high non-linearity, non-stationarity, partial observability and, more importantly, error accumulation when using bootstrapped predictions (predictions based on past predictions) over large time horizons. Here we explore the use of Reinforcement Learning for this problem. We elaborate on why and how this problem fits naturally and soundly as a Reinforcement Learning problem, and present experimental results that demonstrate RL is a promising technique to solve these kinds of problems.
    Backdoor Mitigation in Deep Neural Networks via Strategic Retraining. (arXiv:2212.07278v1 [cs.CR])
    Deep Neural Networks (DNN) are becoming increasingly important in assisted and automated driving. Using such entities obtained through machine learning is inevitable: tasks such as recognizing traffic signs cannot reasonably be developed using traditional software development methods. DNNs, however, have the problem that they are mostly black boxes and therefore hard to understand and debug. One particular problem is that they are prone to hidden backdoors. This means that the DNN misclassifies its input because it considers properties that should not be decisive for the output. Backdoors may either be introduced by malicious attackers or by inappropriate training. In any case, detecting and removing them is important in the automotive area, as they might lead to safety violations with potentially severe consequences. In this paper, we introduce a novel method to remove backdoors. Our method works for both intentional as well as unintentional backdoors. We also do not require prior knowledge about the shape or distribution of backdoors. Experimental evidence shows that our method performs well on several medium-sized examples.
    Losses over Labels: Weakly Supervised Learning via Direct Loss Construction. (arXiv:2212.06921v1 [cs.LG])
    Owing to the prohibitive costs of generating large amounts of labeled data, programmatic weak supervision is a growing paradigm within machine learning. In this setting, users design heuristics that provide noisy labels for subsets of the data. These weak labels are combined (typically via a graphical model) to form pseudolabels, which are then used to train a downstream model. In this work, we question a foundational premise of the typical weakly supervised learning pipeline: given that the heuristics provide all "label" information, why do we need to generate pseudolabels at all? Instead, we propose to directly transform the heuristics themselves into corresponding loss functions that penalize differences between our model and the heuristic. By constructing losses directly from the heuristics, we can incorporate more information than is used in the standard weakly supervised pipeline, such as how the heuristics make their decisions, which explicitly informs feature selection during training. We call our method Losses over Labels (LoL), as it creates losses directly from heuristics without going through the intermediate step of a label. We show that LoL improves upon existing weak supervision methods on several benchmark text and image classification tasks, and further demonstrate that incorporating gradient information leads to better performance on almost every task.
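    One simple way to realize the losses-from-heuristics idea is sketched below, assuming classification heuristics that either vote a class id or abstain; the function and vote encoding are illustrative, and the paper also exploits richer information such as how the heuristics make their decisions.

        import torch
        import torch.nn.functional as F

        def losses_over_labels(model_logits, heuristic_votes, weight=1.0):
            # model_logits: (N, C); heuristic_votes: (H, N) long tensor where
            # each heuristic votes a class id or -1 to abstain. The model is
            # penalized, per heuristic, wherever it disagrees with a vote --
            # no intermediate pseudolabel aggregation step is needed.
            loss = model_logits.new_zeros(())
            for votes in heuristic_votes:
                mask = votes >= 0  # the heuristic fired on these samples
                if mask.any():
                    loss = loss + F.cross_entropy(model_logits[mask], votes[mask])
            return weight * loss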
    Simplification of Forest Classifiers and Regressors. (arXiv:2212.07103v1 [cs.LG])
    We study the problem of sharing as many branching conditions of a given forest classifier or regressor as possible while preserving classification performance. As a constraint to prevent accuracy degradation, we first require that the decision paths of all the given feature vectors must not change. For a branching condition stating that the value of a certain feature is at most a given threshold, the set of thresholds satisfying this constraint can be represented as an interval. The problem is thus reduced to finding the minimum set of points intersecting all the constraint-satisfying intervals, for each set of branching conditions on the same feature. We propose an algorithm for the original problem that uses an algorithm solving this interval problem efficiently. The constraint is later relaxed to promote further sharing of branching conditions, by allowing the decision paths of a certain ratio of the given feature vectors to change, or by allowing a certain number of non-intersected constraint-satisfying intervals. We also extend our algorithm to both relaxations. The effectiveness of our method is demonstrated through comprehensive experiments using 21 datasets (13 classification and 8 regression datasets from the UCI machine learning repository) and 4 classifiers/regressors (random forest, extremely randomized trees, AdaBoost and gradient boosting).
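    The core subproblem, finding a minimum set of points that intersects all constraint-satisfying intervals, admits the classic greedy solution sketched below; variable names are illustrative.

        def min_stabbing_points(intervals):
            # Greedy minimum set of points intersecting all closed intervals
            # [lo, hi]: sort by right endpoint and take that endpoint
            # whenever the last chosen point misses the next interval. Each
            # chosen point becomes one shared threshold for the branching
            # conditions on that feature.
            points = []
            for lo, hi in sorted(intervals, key=lambda iv: iv[1]):
                if not points or points[-1] < lo:
                    points.append(hi)
            return points

        # e.g. min_stabbing_points([(0, 2), (1, 3), (5, 6)]) -> [2, 6]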
    LidarCLIP or: How I Learned to Talk to Point Clouds. (arXiv:2212.06858v1 [cs.CV])
    Research connecting text and images has recently seen several breakthroughs, with models like CLIP, DALL-E 2, and Stable Diffusion. However, the connection between text and other visual modalities, such as lidar data, has received less attention, hindered by the lack of text-lidar datasets. In this work, we propose LidarCLIP, a mapping from automotive point clouds to a pre-existing CLIP embedding space. Using image-lidar pairs, we supervise a point cloud encoder with the image CLIP embeddings, effectively relating text and lidar data with the image domain as an intermediary. We show the effectiveness of LidarCLIP by demonstrating that lidar-based retrieval is generally on par with image-based retrieval, but with complementary strengths and weaknesses. By combining image and lidar features, we improve upon both single-modality methods and enable a targeted search for challenging detection scenarios under adverse sensor conditions. We also use LidarCLIP as a tool to investigate fundamental lidar capabilities through natural language. Finally, we leverage our compatibility with CLIP to explore a range of applications, such as point cloud captioning and lidar-to-image generation, without any additional training. We hope LidarCLIP can inspire future work to dive deeper into connections between text and point cloud understanding. Code and trained models available at https://github.com/atonderski/lidarclip.
    Efficient Exploration in Resource-Restricted Reinforcement Learning. (arXiv:2212.06988v1 [cs.LG])
    In many real-world applications of reinforcement learning (RL), performing actions requires consuming certain types of resources that are non-replenishable in each episode. Typical applications include robotic control with limited energy and video games with consumable items. In tasks with non-replenishable resources, we observe that popular RL methods such as soft actor-critic suffer from poor sample efficiency. The major reason is that they tend to exhaust resources fast, and thus the subsequent exploration is severely restricted due to the absence of resources. To address this challenge, we first formalize the aforementioned problem as resource-restricted reinforcement learning, and then propose a novel resource-aware exploration bonus (RAEB) to make reasonable usage of resources. An appealing feature of RAEB is that it can significantly reduce unnecessary resource-consuming trials while effectively encouraging the agent to explore unvisited states. Experiments demonstrate that the proposed RAEB significantly outperforms state-of-the-art exploration strategies in resource-restricted reinforcement learning environments, improving the sample efficiency by up to an order of magnitude.
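    As a hedged sketch of what a resource-aware exploration bonus could look like, the snippet below scales a count-based novelty term by the fraction of the resource budget still available, discouraging trials that would exhaust scarce resources; the exact functional form used in the paper may differ.

        import math

        def raeb_bonus(visit_count, resource_left, resource_budget, beta=0.1):
            # Count-based novelty: rarely visited states get a larger bonus.
            novelty = 1.0 / math.sqrt(visit_count + 1)
            # Shrink the bonus as the remaining resource is depleted, so the
            # agent avoids spending scarce resources on low-novelty trials.
            resource_factor = resource_left / max(resource_budget, 1e-8)
            return beta * novelty * resource_factor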
    Fast and Provably Convergent Algorithms for Gromov-Wasserstein in Graph Data. (arXiv:2205.08115v2 [cs.LG] UPDATED)
    In this paper, we study the design and analysis of a class of efficient algorithms for computing the Gromov-Wasserstein (GW) distance tailored to large-scale graph learning tasks. Armed with the Luo-Tseng error bound condition (Luo and Tseng, 1992), the two proposed algorithms, called Bregman Alternating Projected Gradient (BAPG) and hybrid Bregman Proximal Gradient (hBPG), enjoy convergence guarantees. Based on task-specific properties, our analysis further provides novel theoretical insights to guide how to select the best-fit method. As a result, we are able to provide comprehensive experiments to validate the effectiveness of our methods on a host of tasks, including graph alignment, graph partition, and shape matching. In terms of both wall-clock time and modeling performance, the proposed methods achieve state-of-the-art results.
    Safety Correction from Baseline: Towards the Risk-aware Policy in Robotics via Dual-agent Reinforcement Learning. (arXiv:2212.06998v1 [cs.LG])
    Learning a risk-aware policy is essential but rather challenging in unstructured robotic tasks. Safe reinforcement learning methods open up new possibilities to tackle this problem. However, the conservative policy updates make it intractable to achieve sufficient exploration and desirable performance in complex, sample-expensive environments. In this paper, we propose a dual-agent safe reinforcement learning strategy consisting of a baseline and a safe agent. Such a decoupled framework enables high flexibility, data efficiency and risk-awareness for RL-based control. Concretely, the baseline agent is responsible for maximizing rewards under standard RL settings. Thus, it is compatible with off-the-shelf training techniques of unconstrained optimization, exploration and exploitation. On the other hand, the safe agent mimics the baseline agent for policy improvement and learns to fulfill safety constraints via off-policy RL tuning. In contrast to training from scratch, safe policy correction requires significantly fewer interactions to obtain a near-optimal policy. The dual policies can be optimized synchronously via a shared replay buffer, or leveraging the pre-trained model or the non-learning-based controller as a fixed baseline agent. Experimental results show that our approach can learn feasible skills without prior knowledge as well as deriving risk-averse counterparts from pre-trained unsafe policies. The proposed method outperforms the state-of-the-art safe RL algorithms on difficult robot locomotion and manipulation tasks with respect to both safety constraint satisfaction and sample efficiency.  ( 2 min )
    Generating Synthetic Mobility Networks with Generative Adversarial Networks. (arXiv:2202.11028v2 [cs.LG] UPDATED)
    The increasingly crucial role of human displacements in complex societal phenomena, such as traffic congestion, segregation, and the diffusion of epidemics, is attracting the interest of scientists from several disciplines. In this article, we address mobility network generation, i.e., generating a city's entire mobility network, a weighted directed graph in which nodes are geographic locations and weighted edges represent people's movements between those locations, thus describing the entire set of mobility flows within a city. Our solution is MoGAN, a model based on Generative Adversarial Networks (GANs) to generate realistic mobility networks. We conduct extensive experiments on public datasets of bike and taxi rides to show that MoGAN outperforms the classical Gravity and Radiation models regarding the realism of the generated networks. Our model can be used for data augmentation and for performing simulations and what-if analyses.
    Counterfactual Learning of Stochastic Policies with Continuous Actions: from Models to Offline Evaluation. (arXiv:2004.11722v6 [stat.ML] UPDATED)
    Counterfactual reasoning from logged data has become increasingly important for many applications such as web advertising or healthcare. In this paper, we address the problem of learning stochastic policies with continuous actions from the viewpoint of counterfactual risk minimization (CRM). While the CRM framework is appealing and well studied for discrete actions, the continuous action case raises new challenges about modelization, optimization, and offline model selection with real data, which turns out to be particularly challenging. Our paper contributes to these three aspects of the CRM estimation pipeline. First, we introduce a modelling strategy based on a joint kernel embedding of contexts and actions, which overcomes the shortcomings of previous discretization approaches. Second, we empirically show that the optimization aspect of counterfactual learning is important, and we demonstrate the benefits of proximal point algorithms and differentiable estimators. Finally, we propose an evaluation protocol for offline policies in real-world logged systems, which is challenging since policies cannot be replayed on test data, and we release a new large-scale dataset along with multiple synthetic, yet realistic, evaluation setups.
    Covariance-Generalized Matching Component Analysis for Data Fusion and Transfer Learning. (arXiv:2110.13194v3 [cs.LG] UPDATED)
    In order to encode additional statistical information in data fusion and transfer learning applications, we introduce a generalized covariance constraint for the matching component analysis (MCA) transfer learning technique. We provide a closed-form solution to the resulting covariance-generalized optimization problem and an algorithm for its computation. We call the resulting technique -- applicable to both data fusion and transfer learning -- covariance-generalized MCA (CGMCA). We also demonstrate via numerical experiments that CGMCA is capable of meaningfully encoding into its maps more information than MCA.
    Meta-Generalization for Multiparty Privacy Learning to Identify Anomaly Multimedia Traffic in Graynet. (arXiv:2201.03027v2 [cs.CR] UPDATED)
    Identifying anomaly multimedia traffic in cyberspace is a big challenge in distributed service systems, multiple generation networks and the future internet of everything. This letter explores meta-generalization for a multiparty privacy learning model in graynet to improve the performance of anomaly multimedia traffic identification. The multiparty privacy learning model in graynet is a globally shared model that is partitioned, distributed and trained by exchanging multiparty parameter updates while preserving private data. Meta-generalization refers to discovering the inherent attributes of a learning model to reduce its generalization error. In experiments, three meta-generalization principles are tested as follows. First, the generalization error of the multiparty privacy learning model in graynet is reduced by changing the dimension of the byte-level embedding. Second, the error is reduced by adapting the depth for extracting packet-level features. Third, the error is reduced by adjusting the size of the support set for preprocessing traffic-level data. Experimental results demonstrate that the proposal outperforms state-of-the-art learning models for identifying anomaly multimedia traffic.  ( 2 min )
    Decision-making at Unsignalized Intersection for Autonomous Vehicles: Left-turn Maneuver with Deep Reinforcement Learning. (arXiv:2008.06595v3 [cs.AI] UPDATED)
    The decision-making module enables autonomous vehicles to select appropriate maneuvers in complex urban environments, especially in intersection situations. This work proposes a deep reinforcement learning (DRL) based left-turn decision-making framework at unsignalized intersections for autonomous vehicles. The objective of the studied automated vehicle is to make an efficient and safe left-turn maneuver at a four-way unsignalized intersection. The exploited DRL methods include deep Q-learning (DQL) and double DQL. Simulation results indicate that the presented decision-making strategy can efficaciously reduce the collision rate and improve transport efficiency. This work also reveals that the constructed left-turn control structure has great potential to be applied in real time.
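    For reference, the double deep Q-learning target used to mitigate overestimation decouples action selection (online network) from action evaluation (target network). A minimal PyTorch sketch follows, with the network handles and shapes as assumptions rather than the paper's implementation.

        import torch

        @torch.no_grad()
        def double_dql_target(q_online, q_target, next_states, rewards, dones, gamma=0.99):
            # The online network picks the greedy next action; the target
            # network evaluates it. rewards, dones: (N,) float tensors with
            # dones in {0, 1}; next_states: (N, ...) batch of states.
            next_actions = q_online(next_states).argmax(dim=1, keepdim=True)
            next_q = q_target(next_states).gather(1, next_actions).squeeze(1)
            return rewards + gamma * (1.0 - dones) * next_q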
    Two Measures of Non-Probabilistic Uncertainty. (arXiv:2201.05818v2 [cs.AI] UPDATED)
    There are two reasons why uncertainty about the future yield of investments may not be adequately described by Probability Theory. The first is due to unique or nearly-unique events, which either never materialized or occurred too seldom for probabilities to be reliable. The second arises when one fears that something may happen that one is not even able to figure out, e.g., if one asks: "Climate change, financial crises, pandemic, war, what next?" In both cases, simple one-to-one causal mappings between available alternatives and possible consequences eventually melt down. However, such destructions are reflected in the changing narratives of business executives, employees and other stakeholders in specific, identifiable and differential ways. In particular, texts such as consultants' reports or letters to shareholders can be analysed to detect the impact of both sorts of uncertainty on the causal relations that normally guide decision-making. We propose structural measures of causal mappings as a means to measure non-probabilistic uncertainty, suggesting that automated text analysis can greatly augment the possibilities offered by these techniques. Prospective applications may concern statistical institutes, stock market traders, as well as businesses wishing to compare their own vision to those prevailing in their industry.  ( 2 min )
    Towards mapping the contemporary art world with ArtLM: an art-specific NLP model. (arXiv:2212.07127v1 [cs.CL])
    With an increasing amount of data in the art world, discovering artists and artworks suitable to collectors' tastes becomes a challenge. It is no longer enough to use visual information, as contextual information about the artist has become just as important in contemporary art. In this work, we present a generic Natural Language Processing framework (called ArtLM) to discover the connections among contemporary artists based on their biographies. In this approach, we first continue to pre-train the existing general English language models with a large amount of unlabelled art-related data. We then fine-tune this new pre-trained model with our biography pair dataset manually annotated by a team of professionals in the art industry. With extensive experiments, we demonstrate that our ArtLM achieves 85.6% accuracy and 84.0% F1 score and outperforms other baseline models. We also provide a visualisation and a qualitative analysis of the artist network built from ArtLM's outputs.  ( 2 min )
    Grammar Based Speaker Role Identification for Air Traffic Control Speech Recognition. (arXiv:2108.12175v2 [cs.CL] UPDATED)
    Automatic Speech Recognition (ASR) for air traffic control is generally trained by pooling Air Traffic Controller (ATCO) and pilot data into one set. This is motivated by the fact that pilots' voice communications are scarcer than ATCOs'. Due to this data imbalance and other reasons (e.g., varying acoustic conditions), the speech from ATCOs is usually recognized more accurately than that from pilots. Automatically identifying the speaker roles is a challenging task, especially in the case of noisy voice recordings collected using Very High Frequency (VHF) receivers or due to the unavailability of the push-to-talk (PTT) signal, i.e., when both audio channels are mixed. In this work, we propose to (1) automatically segment the ATCO and pilot data based on an intuitive approach exploiting ASR transcripts and (2) subsequently treat the automatic recognition of ATCOs' and pilots' voices as two separate tasks. Our work is performed on VHF audio data with high noise levels, i.e., signal-to-noise ratios (SNR) below 15 dB, as this data is recognized to be helpful for various speech-based machine-learning tasks. Specifically, for the speaker role identification task, the module is represented by a simple yet efficient knowledge-based system exploiting a grammar defined by the International Civil Aviation Organization (ICAO). The system accepts text as input, either manually verified annotations or automatically generated transcripts. The developed approach provides an average accuracy in speaker role identification of about 83%. Finally, we show that training an acoustic model for ASR tasks separately (i.e., separate models for ATCOs and pilots), or using a multitask approach, is well suited for the noisy data and outperforms the traditional ASR system where all data is pooled together.  ( 3 min )
    Morphological Network: How Far Can We Go with Morphological Neurons?. (arXiv:1901.00109v4 [cs.LG] UPDATED)
    Morphological neurons, that is, morphological operators such as dilation and erosion with learnable structuring elements, have intrigued researchers for quite some time because of the power these operators bring to the table despite their simplicity. These operators are known to be powerful nonlinear tools, but for a given problem, coming up with a sequence of operations and their structuring elements is a non-trivial task. So, the existing works have mainly focused on this part of the problem without delving deep into their applicability as generic operators. A few works have tried to utilize morphological neurons as a part of classification (and regression) networks when the input is a feature vector. However, these methods mainly focus on a specific problem, without going into generic theoretical analysis. In this work, we have theoretically analyzed morphological neurons and have shown that these are far more powerful than previously anticipated. Our proposed morphological block, containing dilation and erosion followed by their linear combination, represents a sum of hinge functions. Existing works show that hinge functions perform quite well in classification and regression problems. Two morphological blocks can even approximate any continuous function. However, to facilitate the theoretical analysis that we have done in this paper, we have restricted ourselves to the 1D version of the operators, where the structuring element operates on the whole input. Experimental evaluations also indicate the effectiveness of networks built with morphological neurons over similarly structured neural networks.  ( 3 min )
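    To make the dilation-erosion block concrete, here is a minimal NumPy sketch of the forward pass described above (our illustration, not the authors' code); the number of units and the random initialization are assumptions, and training is omitted:

        import numpy as np

        rng = np.random.default_rng(0)

        class MorphBlock1D:
            """1-D morphological block: dilation and erosion with learnable
            structuring elements spanning the whole input, followed by a
            linear combination of their outputs (a sum of hinge functions)."""
            def __init__(self, dim, n_units=4):
                self.s_dil = rng.normal(size=(n_units, dim))  # dilation structuring elements
                self.s_ero = rng.normal(size=(n_units, dim))  # erosion structuring elements
                self.w = rng.normal(size=2 * n_units)         # combination weights

            def forward(self, x):
                dil = np.max(x + self.s_dil, axis=1)  # dilation: max_i (x_i + s_i)
                ero = np.min(x - self.s_ero, axis=1)  # erosion:  min_i (x_i - s_i)
                return self.w @ np.concatenate([dil, ero])

        block = MorphBlock1D(dim=8)
        print(block.forward(rng.normal(size=8)))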
    ZippyPoint: Fast Interest Point Detection, Description, and Matching through Mixed Precision Discretization. (arXiv:2203.03610v2 [cs.CV] UPDATED)
    Efficient detection and description of geometric regions in images is a prerequisite in visual systems for localization and mapping. Such systems still rely on traditional hand-crafted methods for the efficient generation of lightweight descriptors, since the more powerful neural network models come with high compute and specific hardware requirements. In this paper, we focus on the adaptations required by detection and description neural networks to enable their use in computationally limited platforms such as robots, mobile, and augmented reality devices. To that end, we investigate and adapt network quantization techniques to accelerate inference and enable its use on compute limited platforms. In addition, we revisit common practices in descriptor quantization and propose the use of a binary descriptor normalization layer, enabling the generation of distinctive binary descriptors with a constant number of ones. ZippyPoint, our efficient quantized network with binary descriptors, improves the network runtime speed, the descriptor matching speed, and the 3D model size, by at least an order of magnitude when compared to full-precision counterparts. These improvements come at a minor performance degradation as evaluated on the tasks of homography estimation, visual localization, and map-free visual relocalization. Code and trained models will be released upon acceptance.  ( 2 min )
    ShadowNet: A Secure and Efficient On-device Model Inference System for Convolutional Neural Networks. (arXiv:2011.05905v3 [cs.CR] UPDATED)
    With the increased usage of AI accelerators on mobile and edge devices, on-device machine learning (ML) is gaining popularity. Thousands of proprietary ML models are being deployed today on billions of untrusted devices. This raises serious security concerns about model privacy. However, protecting model privacy without losing access to the untrusted AI accelerators is a challenging problem. In this paper, we present a novel on-device model inference system, ShadowNet. ShadowNet protects the model privacy with Trusted Execution Environment (TEE) while securely outsourcing the heavy linear layers of the model to the untrusted hardware accelerators. ShadowNet achieves this by transforming the weights of the linear layers before outsourcing them and restoring the results inside the TEE. The non-linear layers are also kept secure inside the TEE. ShadowNet's design ensures efficient transformation of the weights and the subsequent restoration of the results. We build a ShadowNet prototype based on TensorFlow Lite and evaluate it on five popular CNNs, namely, MobileNet, ResNet-44, MiniVGG, ResNet-404, and YOLOv4-tiny. Our evaluation shows that ShadowNet achieves strong security guarantees with reasonable performance, offering a practical solution for secure on-device model inference.  ( 2 min )
    Hierarchical Strategies for Cooperative Multi-Agent Reinforcement Learning. (arXiv:2212.07397v1 [cs.LG])
    Adequate strategizing of agents' behaviors is essential to solving cooperative MARL problems. One intuitively beneficial yet uncommon method in this domain is predicting agents' future behaviors and planning accordingly. Leveraging this point, we propose a two-level hierarchical architecture that combines a novel information-theoretic objective with a trajectory prediction model to learn a strategy. To this end, we introduce a latent policy that learns two types of latent strategies: individual $z_A$, and relational $z_R$, using a modified Graph Attention Network module to extract interaction features. We encourage each agent to behave according to the strategy by conditioning its local $Q$ functions on $z_A$, and we further equip agents with a shared $Q$ function that conditions on $z_R$. Additionally, we introduce two regularizers to allow predicted trajectories to be accurate and rewarding. Empirical results on Google Research Football (GRF) and StarCraft (SC) II micromanagement tasks show that our method establishes a new state of the art, being, to the best of our knowledge, the first MARL algorithm to solve all super hard SC II scenarios as well as the GRF full game with a win rate higher than $95\%$, thus outperforming all existing methods. Videos and a brief overview of the methods and results are available at: https://sites.google.com/view/hier-strats-marl/home.  ( 2 min )
    CoWs on Pasture: Baselines and Benchmarks for Language-Driven Zero-Shot Object Navigation. (arXiv:2203.10421v2 [cs.CV] UPDATED)
    For robots to be generally useful, they must be able to find arbitrary objects described by people (i.e., be language-driven) even without expensive navigation training on in-domain data (i.e., perform zero-shot inference). We explore these capabilities in a unified setting: language-driven zero-shot object navigation (L-ZSON). Inspired by the recent success of open-vocabulary models for image classification, we investigate a straightforward framework, CLIP on Wheels (CoW), to adapt open-vocabulary models to this task without fine-tuning. To better evaluate L-ZSON, we introduce the Pasture benchmark, which considers finding uncommon objects, objects described by spatial and appearance attributes, and hidden objects described relative to visible objects. We conduct an in-depth empirical study by directly deploying 21 CoW baselines across Habitat, RoboTHOR, and Pasture. In total, we evaluate over 90k navigation episodes and find that (1) CoW baselines often struggle to leverage language descriptions, but are proficient at finding uncommon objects. (2) A simple CoW, with CLIP-based object localization and classical exploration -- and no additional training -- matches the navigation efficiency of a state-of-the-art ZSON method trained for 500M steps on Habitat MP3D data. This same CoW provides a 15.6 percentage point improvement in success over a state-of-the-art RoboTHOR ZSON model.  ( 2 min )
    The Devil is in the GAN: Backdoor Attacks and Defenses in Deep Generative Models. (arXiv:2108.01644v2 [cs.CR] UPDATED)
    Deep Generative Models (DGMs) are a popular class of deep learning models which find widespread use because of their ability to synthesize data from complex, high-dimensional manifolds. However, even with their increasing industrial adoption, they haven't been subject to rigorous security and privacy analysis. In this work we examine one such aspect, namely backdoor attacks on DGMs, which can significantly limit the applicability of pre-trained models within a model supply chain and at the very least cause massive reputation damage for companies outsourcing DGMs from third parties. While similar attack scenarios have been studied in the context of classical prediction models, their manifestation in DGMs hasn't received the same attention. To this end we propose novel training-time attacks which result in corrupted DGMs that synthesize regular data under normal operations and designated target outputs for inputs sampled from a trigger distribution. These attacks are based on an adversarial loss function that combines the dual objectives of attack stealth and fidelity. We systematically analyze these attacks, and show their effectiveness for a variety of approaches like Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs), as well as different data domains including images and audio. Our experiments show that - even for large-scale industry-grade DGMs (like StyleGAN) - our attacks can be mounted with only modest computational effort. We also motivate suitable defenses based on static/dynamic model and output inspections, demonstrate their usefulness, and prescribe a practical and comprehensive defense strategy that paves the way for safe usage of DGMs.  ( 2 min )
    Algorithmic Insurance. (arXiv:2106.00839v2 [cs.LG] UPDATED)
    As machine learning algorithms start to get integrated into the decision-making process of companies and organizations, insurance products are being developed to protect their owners from liability risk. Algorithmic liability differs from human liability since it is based on a single model compared to multiple heterogeneous decision-makers and its performance is known a priori for a given set of data. Traditional actuarial tools for human liability do not take these properties into consideration, primarily focusing on the distribution of historical claims. We propose, for the first time, a quantitative framework to estimate the risk exposure of insurance contracts for machine-driven liability, introducing the concept of algorithmic insurance. Specifically, we present an optimization formulation to estimate the risk exposure of a binary classification model given a pre-defined range of premiums. We adjust the formulation to account for uncertainty in the resulting losses using robust optimization. Our approach outlines how properties of the model, such as accuracy, interpretability, and generalizability, can influence the insurance contract evaluation. To showcase a practical implementation of the proposed framework, we present a case study of medical malpractice in the context of breast cancer detection. Our analysis focuses on measuring the effect of the model parameters on the expected financial loss and identifying the aspects of algorithmic performance that predominantly affect the risk of the contract.  ( 2 min )
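    As a toy illustration of the kind of risk-exposure arithmetic involved (our own simplification, far cruder than the paper's robust-optimization formulation; all figures below are made up):

        def expected_contract_loss(premium, claim_cost, n_policies, p_error):
            """Expected insurer loss when each covered model decision errs
            independently with probability p_error and every error triggers
            a claim of fixed size. Positive values mean the contract loses money."""
            revenue = premium * n_policies
            expected_payout = claim_cost * p_error * n_policies
            return expected_payout - revenue

        # e.g., a misclassification in a screening setting triggering a claim
        print(expected_contract_loss(premium=1_000, claim_cost=250_000,
                                     n_policies=500, p_error=0.002))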
    Learning particle swarming models from data with Gaussian processes. (arXiv:2106.02735v2 [stat.ML] UPDATED)
    Interacting particle or agent systems that display a rich variety of swarming behaviours are ubiquitous in science and engineering. A fundamental and challenging goal is to understand the link between individual interaction rules and swarming. In this paper, we study the data-driven discovery of a second-order particle swarming model that describes the evolution of $N$ particles in $\mathbb{R}^d$ under radial interactions. We propose a learning approach that models the latent radial interaction function as a Gaussian process, which can simultaneously fulfill two inference goals: one is the nonparametric inference of the interaction function with pointwise uncertainty quantification, and the other is the inference of unknown scalar parameters in the non-collective friction forces of the system. We formulate the learning problem as a statistical inverse problem and provide a detailed analysis of recoverability conditions, establishing that a coercivity condition is sufficient for recoverability. Given data collected from $M$ i.i.d. trajectories with independent Gaussian observational noise, we provide a finite-sample analysis, showing that our posterior mean estimator converges in a reproducing kernel Hilbert space norm at an optimal rate in $M$ equal to the one in classical one-dimensional kernel ridge regression. As a byproduct, we show that we can obtain a parametric learning rate in $M$ for the posterior marginal variance using the $L^{\infty}$ norm, and the rate could also involve $N$ and $L$ (the number of observation time instances for each trajectory), depending on the condition number of the inverse problem. Numerical results on systems that exhibit different swarming behaviors demonstrate efficient learning of our approach from scarce noisy trajectory data.  ( 2 min )
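    A minimal sketch of the core regression step, assuming one already has (distance, radial-force) pairs extracted from trajectories; the Morse-type interaction rule below is an assumption for illustration, not one of the paper's systems:

        import numpy as np
        from sklearn.gaussian_process import GaussianProcessRegressor
        from sklearn.gaussian_process.kernels import RBF, WhiteKernel

        rng = np.random.default_rng(0)
        r = rng.uniform(0.1, 5.0, size=(200, 1))            # pairwise distances
        phi = lambda d: np.exp(-d) - 0.5 * np.exp(-2 * d)   # assumed interaction rule
        y = phi(r).ravel() + rng.normal(0, 0.01, 200)       # noisy radial components

        # Nonparametric estimate of phi with pointwise uncertainty
        gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
        gp.fit(r, y)
        mean, std = gp.predict(np.linspace(0.1, 5, 50)[:, None], return_std=True)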
    Towards improving discriminative reconstruction via simultaneous dense and sparse coding. (arXiv:2006.09534v3 [cs.IT] UPDATED)
    Discriminative features extracted from the sparse coding model have been shown to perform well for classification. Recent deep learning architectures have further improved reconstruction in inverse problems by considering new dense priors learned from data. We propose a novel dense and sparse coding model that integrates both representation capability and discriminative features. The model studies the problem of recovering a dense vector $\mathbf{x}$ and a sparse vector $\mathbf{u}$ given measurements of the form $\mathbf{y} = \mathbf{A}\mathbf{x}+\mathbf{B}\mathbf{u}$. Our first analysis proposes a geometric condition based on the minimal angle between spanning subspaces corresponding to the matrices $\mathbf{A}$ and $\mathbf{B}$ that guarantees unique solution to the model. The second analysis shows that, under mild assumptions, a convex program recovers the dense and sparse components. We validate the effectiveness of the model on simulated data and propose a dense and sparse autoencoder (DenSaE) tailored to learning the dictionaries from the dense and sparse model. We demonstrate that (i) DenSaE denoises natural images better than architectures derived from the sparse coding model ($\mathbf{B}\mathbf{u}$), (ii) in the presence of noise, training the biases in the latter amounts to implicitly learning the $\mathbf{A}\mathbf{x} + \mathbf{B}\mathbf{u}$ model, (iii) $\mathbf{A}$ and $\mathbf{B}$ capture low- and high-frequency contents, respectively, and (iv) compared to the sparse coding model, DenSaE offers a balance between discriminative power and representation.  ( 2 min )
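    The recovery problem can be sketched with a simple alternating scheme (our stand-in for the paper's convex program, not the authors' algorithm): least squares for the dense part and an ISTA step for the sparse part, with the penalty $\lambda$ and iteration count chosen by hand:

        import numpy as np

        def soft_threshold(v, t):
            return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

        def recover_dense_sparse(y, A, B, lam=0.1, n_iter=200):
            """Alternating minimization for y = A x + B u with sparse u:
            exact least-squares update in x, proximal-gradient (ISTA) step in u."""
            x = np.zeros(A.shape[1])
            u = np.zeros(B.shape[1])
            L = np.linalg.norm(B, 2) ** 2              # Lipschitz constant for the u-step
            for _ in range(n_iter):
                x, *_ = np.linalg.lstsq(A, y - B @ u, rcond=None)
                grad_u = B.T @ (A @ x + B @ u - y)
                u = soft_threshold(u - grad_u / L, lam / L)
            return x, u

        rng = np.random.default_rng(0)
        A, B = rng.normal(size=(50, 10)), rng.normal(size=(50, 40))
        x_true = rng.normal(size=10)
        u_true = np.zeros(40); u_true[[3, 17]] = 2.0
        x_hat, u_hat = recover_dense_sparse(A @ x_true + B @ u_true, A, B)
        print(np.nonzero(np.abs(u_hat) > 1e-3)[0])     # rough support recovery check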
    Cross-Domain Transfer via Semantic Skill Imitation. (arXiv:2212.07407v1 [cs.LG])
    We propose an approach for semantic imitation, which uses demonstrations from a source domain, e.g. human videos, to accelerate reinforcement learning (RL) in a different target domain, e.g. a robotic manipulator in a simulated kitchen. Instead of imitating low-level actions like joint velocities, our approach imitates the sequence of demonstrated semantic skills like "opening the microwave" or "turning on the stove". This allows us to transfer demonstrations across environments (e.g. real-world to simulated kitchen) and agent embodiments (e.g. bimanual human demonstration to robotic arm). We evaluate on three challenging cross-domain learning problems and match the performance of demonstration-accelerated RL approaches that require in-domain demonstrations. In a simulated kitchen environment, our approach learns long-horizon robot manipulation tasks, using less than 3 minutes of human video demonstrations from a real-world kitchen. This enables scaling robot learning via the reuse of demonstrations, e.g. collected as human videos, for learning in any number of target domains.  ( 2 min )
    Bayesian data fusion with shared priors. (arXiv:2212.07311v1 [cs.LG])
    The integration of data and knowledge from several sources is known as data fusion. When data is available in a distributed fashion or when different sensors are used to infer a quantity of interest, data fusion becomes essential. In Bayesian settings, a priori information of the unknown quantities is available and, possibly, shared among the distributed estimators. When the local estimates are fused, such prior might be overused unless it is accounted for. This paper explores the effects of shared priors in Bayesian data fusion contexts, providing fusion rules and analysis to understand the performance of such fusion as a function of the number of collaborative agents and the uncertainty of the priors. Analytical results are corroborated through experiments in a variety of estimation and classification problems.  ( 2 min )
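    For intuition, consider the scalar Gaussian case (a sketch of the general idea; the paper's fusion rules and analysis are broader): multiplying N posteriors that share one prior counts that prior N times, so a corrected rule removes it N-1 times in information form.

        import numpy as np

        def fuse_gaussian_posteriors(means, variances, prior_mean, prior_var,
                                     correct_prior=True):
            """Fuse N scalar Gaussian posteriors that all used the same prior.
            Naive precision-weighted fusion over-counts the shared prior;
            the corrected rule subtracts it N-1 times (natural parameters)."""
            prec = 1.0 / np.asarray(variances)
            eta = prec * np.asarray(means)          # natural parameters of each posterior
            prec_f, eta_f = prec.sum(), eta.sum()
            if correct_prior:
                n = len(means)
                prec_f -= (n - 1) / prior_var
                eta_f -= (n - 1) * prior_mean / prior_var
            return eta_f / prec_f, 1.0 / prec_f    # fused mean and variance

        print(fuse_gaussian_posteriors([1.1, 0.9, 1.3], [0.2, 0.3, 0.25], 0.0, 1.0))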
    Quantifying Statistical Significance of Neural Network-based Image Segmentation by Selective Inference. (arXiv:2010.01823v3 [stat.ML] UPDATED)
    Although a vast body of literature relates to image segmentation methods that use deep neural networks (DNNs), less attention has been paid to assessing the statistical reliability of segmentation results. In this study, we interpret the segmentation results as hypotheses driven by DNN (called DNN-driven hypotheses) and propose a method by which to quantify the reliability of these hypotheses within a statistical hypothesis testing framework. Specifically, we consider a statistical hypothesis test for the difference between the object and background regions. This problem is challenging, as the difference would be falsely large because of the adaptation of the DNN to the data. To overcome this difficulty, we introduce a conditional selective inference (SI) framework -- a new statistical inference framework for data-driven hypotheses that has recently received considerable attention -- to compute exact (non-asymptotic) valid p-values for the segmentation results. To use the conditional SI framework for DNN-based segmentation, we develop a new SI algorithm based on the homotopy method, which enables us to derive the exact (non-asymptotic) sampling distribution of DNN-driven hypothesis. We conduct experiments on both synthetic and real-world datasets, through which we offer evidence that our proposed method can successfully control the false positive rate, has good performance in terms of computational efficiency, and provides good results when applied to medical image data.  ( 2 min )
    Sequential Kernelized Independence Testing. (arXiv:2212.07383v1 [stat.ML])
    Independence testing is a fundamental and classical statistical problem that has been extensively studied in the batch setting when one fixes the sample size before collecting data. However, practitioners often prefer procedures that adapt to the complexity of a problem at hand instead of setting sample size in advance. Ideally, such procedures should (a) allow stopping earlier on easy tasks (and later on harder tasks), hence making better use of available resources, and (b) continuously monitor the data and efficiently incorporate statistical evidence after collecting new data, while controlling the false alarm rate. It is well known that classical batch tests are not tailored for streaming data settings, since valid inference after data peeking requires correcting for multiple testing, but such corrections generally result in low power. In this paper, we design sequential kernelized independence tests (SKITs) that overcome such shortcomings based on the principle of testing by betting. We exemplify our broad framework using bets inspired by kernelized dependence measures such as the Hilbert-Schmidt independence criterion (HSIC) and the constrained-covariance criterion (COCO). Importantly, we also generalize the framework to non-i.i.d. time-varying settings, for which there exist no batch tests. We demonstrate the power of our approaches on both simulated and real data.  ( 2 min )
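    A minimal betting-style sketch in the spirit of the paper (the exact SKIT payoffs and betting strategies differ; the fixed bet size and Gaussian witness kernel here are assumptions): consecutive observations are paired, and under independence swapping the two y's within a pair is distribution-preserving, which makes the payoff below mean-zero and the wealth process safe to threshold by Ville's inequality.

        import numpy as np

        def sequential_independence_test(xs, ys, alpha=0.05, lam=0.5, bw=1.0):
            """Level-alpha sequential test: reject H0 (independence) the first
            time the betting wealth exceeds 1/alpha; otherwise keep sampling."""
            def k(a, b, c, d):                  # Gaussian kernel between (a,b) and (c,d)
                return np.exp(-((a - c) ** 2 + (b - d) ** 2) / (2 * bw ** 2))
            wealth = 1.0
            for i in range(0, len(xs) - 1, 2):
                x1, x2, y1, y2 = xs[i], xs[i + 1], ys[i], ys[i + 1]
                payoff = k(x1, y1, x2, y2) - k(x1, y2, x2, y1)  # aligned minus crossed
                wealth *= 1.0 + lam * payoff    # |payoff| < 1 keeps wealth positive
                if wealth >= 1.0 / alpha:       # Ville's inequality gives validity
                    return True, wealth
            return False, wealth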
    Directional Direct Feedback Alignment: Estimating Backpropagation Paths for Efficient Learning on Neural Processors. (arXiv:2212.07282v1 [cs.LG])
    The error Backpropagation algorithm (BP) is a key method for training deep neural networks. While performant, it is also resource-demanding in terms of computation, memory usage and energy. This makes it unsuitable for online learning on edge devices that require a high processing rate and low energy consumption. More importantly, BP does not take advantage of the parallelism and local characteristics offered by dedicated neural processors. There is therefore a demand for alternative algorithms to BP that could improve the latency, memory requirements, and energy footprint of neural networks on hardware. In this work, we propose a novel method based on Direct Feedback Alignment (DFA) which uses Forward-Mode Automatic Differentiation to estimate backpropagation paths and learn feedback connections in an online manner. We experimentally show that Directional DFA achieves performance closer to that of BP than other feedback methods on several benchmark datasets and architectures, while benefiting from the locality and parallelization characteristics of DFA. Moreover, we show that, unlike other feedback learning algorithms, our method provides stable learning for convolution layers.  ( 2 min )
    Self-Play and Self-Describe: Policy Adaptation with Vision-Language Foundation Models. (arXiv:2212.07398v1 [cs.LG])
    Recent progress on vision-language foundation models has brought significant advancement to building general-purpose robots. By using the pre-trained models to encode the scene and instructions as inputs for decision making, the instruction-conditioned policy can generalize across different objects and tasks. While this is encouraging, the policy still fails in most cases given an unseen task or environment. To adapt the policy to unseen tasks and environments, we explore a new paradigm of leveraging the pre-trained foundation models with Self-PLAY and Self-Describe (SPLAYD). When deploying the trained policy to a new task or a new environment, we first let the policy self-play with randomly generated instructions to record the demonstrations. While the execution could be wrong, we can use the pre-trained foundation models to accurately self-describe (i.e., re-label or classify) the demonstrations. This automatically provides new pairs of demonstration-instruction data for policy fine-tuning. We evaluate our method on a broad range of experiments with the focus on generalization on unseen objects, unseen tasks, unseen environments, and sim-to-real transfer. We show SPLAYD improves baselines by a large margin in all cases. Our project page is available at https://geyuying.github.io/SPLAYD/  ( 2 min )
    Learning Invariant Subspaces of Koopman Operators--Part 2: Heterogeneous Dictionary Mixing to Approximate Subspace Invariance. (arXiv:2212.07365v1 [eess.SY])
    This work builds on the models and concepts presented in Part 1 to learn approximate dictionary representations of Koopman operators from data. Part 1 of this paper presented a methodology for arguing the subspace invariance of a Koopman dictionary. This methodology was demonstrated on the state-inclusive logistic lifting (SILL) basis, an affine basis augmented with conjunctive logistic functions. The SILL dictionary's nonlinear functions are homogeneous, the norm in data-driven dictionary learning of Koopman operators. In this paper, we discover that structured mixing of heterogeneous dictionary functions drawn from different classes of nonlinear functions achieves the same accuracy and dimensional scaling as the deep-learning-based deepDMD algorithm. We specifically show this by building a heterogeneous dictionary comprised of SILL functions and conjunctive radial basis functions (RBFs). This mixed dictionary achieves the same accuracy and dimensional scaling as deepDMD with an order of magnitude reduction in parameters, while maintaining geometric interpretability. These results strengthen the viability of dictionary-based Koopman models for solving high-dimensional nonlinear learning problems.
    Lorentz Group Equivariant Autoencoders. (arXiv:2212.07347v1 [hep-ex])
    There has been significant work recently in developing machine learning models in high energy physics (HEP), for tasks such as classification, simulation, and anomaly detection. Typically, these models are adapted from those designed for datasets in computer vision or natural language processing without necessarily incorporating inductive biases suited to HEP data, such as respecting its inherent symmetries. Such inductive biases can make the model more performant and interpretable, and reduce the amount of training data needed. To that end, we develop the Lorentz group autoencoder (LGAE), an autoencoder model equivariant with respect to the proper, orthochronous Lorentz group $\mathrm{SO}^+(3,1)$, with a latent space living in the representations of the group. We present our architecture and several experimental results on jets at the LHC and find that it significantly outperforms a non-Lorentz-equivariant graph neural network baseline on compression, reconstruction, and anomaly detection. We also demonstrate the advantage of such an equivariant model in analyzing the latent space of the autoencoder, which can have a significant impact on the explainability of anomalies found by such black-box machine learning models.
    On the Probability of Necessity and Sufficiency of Explaining Graph Neural Networks: A Lower Bound Optimization Approach. (arXiv:2212.07056v1 [cs.LG])
    Explainability of Graph Neural Networks (GNNs) is critical to various GNN applications but remains an open challenge. A convincing explanation should be both necessary and sufficient simultaneously. However, existing GNN explaining approaches focus on only one of the two aspects, necessity or sufficiency, or a trade-off between the two. To search for the most necessary and sufficient explanation, the Probability of Necessity and Sufficiency (PNS) can be applied, since it can mathematically quantify the necessity and sufficiency of an explanation. Nevertheless, the difficulty of obtaining PNS due to non-monotonicity and the challenge of counterfactual estimation limit its wide use. To address the non-identifiability of PNS, we resort to a lower bound of PNS that can be optimized via counterfactual estimation, and propose Necessary and Sufficient Explanation for GNN (NSEG) via optimizing that lower bound. Specifically, we employ nearest neighbor matching to generate counterfactual samples for the features, which differs from random perturbation. In particular, NSEG combines the edges and node features to generate an explanation, where the common edge explanation is a special case of the combined explanation. Empirical study shows that NSEG achieves excellent performance in generating the most necessary and sufficient explanations among a series of state-of-the-art methods.
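    For reference, the classical Tian-Pearl bounds on PNS are what make a lower-bound strategy natural (the paper optimizes a counterfactual-estimable lower bound specialized to GNN explanations):

        \[
          \max\{0,\; P(y \mid do(x)) - P(y \mid do(x'))\}
          \;\le\; \mathrm{PNS} \;\le\;
          \min\{P(y \mid do(x)),\; P(y' \mid do(x'))\}
        \]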
    FLAGS Framework for Comparative Analysis of Federated Learning Algorithms. (arXiv:2212.07179v1 [cs.LG])
    Federated Learning (FL) has become a key choice for distributed machine learning. Initially focused on centralized aggregation, recent works in FL have emphasized greater decentralization to adapt to the highly heterogeneous network edge. Among these, Hierarchical, Device-to-Device and Gossip Federated Learning (HFL, D2DFL & GFL respectively) can be considered as foundational FL algorithms employing fundamental aggregation strategies. A number of FL algorithms were subsequently proposed employing multiple fundamental aggregation schemes jointly. Existing research, however, subjects the FL algorithms to varied conditions and gauges the performance of these algorithms mainly against Federated Averaging (FedAvg) only. This work consolidates the FL landscape and offers an objective analysis of the major FL algorithms through a comprehensive cross-evaluation for a wide range of operating conditions. In addition to the three foundational FL algorithms, this work also analyzes six derived algorithms. To enable a uniform assessment, a multi-FL framework named FLAGS: Federated Learning AlGorithms Simulation has been developed for rapid configuration of multiple FL algorithms. Our experiments indicate that fully decentralized FL algorithms achieve comparable accuracy under multiple operating conditions, including asynchronous aggregation and the presence of stragglers. Furthermore, decentralized FL can also operate in noisy environments and with a comparably higher local update rate. However, the impact of extremely skewed data distributions on decentralized FL is much more adverse than on centralized variants. The results indicate that it may not be necessary to restrict the devices to a single FL algorithm; rather, multi-FL nodes may operate with greater efficiency.
    FedSkip: Combatting Statistical Heterogeneity with Federated Skip Aggregation. (arXiv:2212.07224v1 [cs.LG])
    The statistical heterogeneity of the non-independent and identically distributed (non-IID) data in local clients significantly limits the performance of federated learning. Previous attempts like FedProx, SCAFFOLD, MOON, FedNova and FedDyn resort to an optimization perspective, requiring an auxiliary term or re-weighted local updates to calibrate the learning bias or the objective inconsistency. However, in addition to previous explorations for improvement in federated averaging, our analysis shows that another critical bottleneck is the poorer optima of client models in more heterogeneous conditions. We thus introduce a data-driven approach called FedSkip to improve the client optima by periodically skipping federated averaging and scattering local models across devices. We provide theoretical analysis of the possible benefit from FedSkip and conduct extensive experiments on a range of datasets to demonstrate that FedSkip achieves much higher accuracy, better aggregation efficiency and competitive communication efficiency. Source code is available at: https://github.com/MediaBrain-SJTU/FedSkip.
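    The control flow, as we read the abstract, can be sketched as follows (the skip period and the uniform scattering are assumptions; server_avg stands in for an ordinary FedAvg step):

        import random

        def fedskip_round(client_models, round_idx, server_avg, skip_period=4):
            """One FedSkip-style round: aggregate every `skip_period` rounds,
            otherwise skip averaging and scatter local models across clients
            so each model keeps training on a different local dataset."""
            if round_idx % skip_period == 0:
                avg = server_avg(client_models)     # ordinary federated averaging
                return [avg for _ in client_models]
            shuffled = client_models[:]
            random.shuffle(shuffled)                # scatter models to other clients
            return shuffled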
    An Exploratory Study of AI System Risk Assessment from the Lens of Data Distribution and Uncertainty. (arXiv:2212.06828v1 [cs.LG])
    Deep learning (DL) has become a driving force and has been widely adopted in many domains and applications with competitive performance. In practice, to solve the nontrivial and complicated tasks in real-world applications, DL is often not used standalone but instead contributes as one component of a larger, complex AI system. Although there is a fast-growing trend of studying the quality issues of deep neural networks (DNNs) at the model level, few studies have investigated the quality of DNNs at the unit level and its potential impacts at the system level. More importantly, there is also a lack of systematic investigation into how to perform risk assessment for AI systems from the unit level to the system level. To bridge this gap, this paper initiates an early exploratory study of AI system risk assessment from both the data distribution and uncertainty angles to address these issues. We propose a general framework with an exploratory study for analyzing AI systems. After large-scale (700+ experimental configurations and 5000+ GPU hours) experiments and in-depth investigations, we reached a few key interesting findings that highlight the practical need and opportunities for more in-depth investigations into AI systems.
    Bridging Graph Position Encodings for Transformers with Weighted Graph-Walking Automata. (arXiv:2212.06898v1 [cs.LG])
    A current goal in the graph neural network literature is to enable transformers to operate on graph-structured data, given their success on language and vision tasks. Since the transformer's original sinusoidal positional encodings (PEs) are not applicable to graphs, recent work has focused on developing graph PEs, rooted in spectral graph theory or various spatial features of a graph. In this work, we introduce a new graph PE, Graph Automaton PE (GAPE), based on weighted graph-walking automata (a novel extension of graph-walking automata). We compare the performance of GAPE with other PE schemes on both machine translation and graph-structured tasks, and we show that it generalizes several other PEs. An additional contribution of this study is a theoretical and controlled experimental comparison of many recent PEs in graph transformers, independent of the use of edge features.
    Enabling the Wireless Metaverse via Semantic Multiverse Communication. (arXiv:2212.06908v1 [cs.NI])
    Metaverse over wireless networks is an emerging use case of the sixth generation (6G) wireless systems, posing unprecedented challenges in terms of its multi-modal data transmissions with stringent latency and reliability requirements. Towards enabling this wireless metaverse, in this article we propose a novel semantic communication (SC) framework by decomposing the metaverse into human/machine agent-specific semantic multiverses (SMs). An SM stored at each agent comprises a semantic encoder and a generator, leveraging recent advances in generative artificial intelligence (AI). To improve communication efficiency, the encoder learns the semantic representations (SRs) of multi-modal data, while the generator learns how to manipulate them for locally rendering scenes and interactions in the metaverse. Since these learned SMs are biased towards local environments, their success hinges on synchronizing heterogeneous SMs in the background while communicating SRs in the foreground, turning the wireless metaverse problem into the problem of semantic multiverse communication (SMC). Based on this SMC architecture, we propose several promising algorithmic and analytic tools for modeling and designing SMC, ranging from distributed learning and multi-agent reinforcement learning (MARL) to signaling games and symbolic AI.
    Deep Neural Networks integrating genomics and histopathological images for predicting stages and survival time-to-event in colon cancer. (arXiv:2212.06834v1 [q-bio.QM])
    Unexplained, diverse variation remains within the predefined colon cancer stages when only features from either genomics or histopathological whole-slide images are used as prognostic factors. Unraveling this variation would bring about improvements in staging and treatment outcomes. Motivated by the advancement of deep neural network libraries and the diverse structures and factors within genomic datasets, we aggregate atypical patterns in histopathological images with diverse carcinogenic expression from mRNA, miRNA and DNA methylation as an integrative input source into an ensemble deep neural network for colon cancer stage classification and sample stratification into low- or high-risk survival groups. The results of our ensemble deep convolutional neural network model show improved performance in stage classification on the integrated dataset. The fused input features yield an area under the receiver operating characteristic curve (AUC-ROC) of 0.95, compared with AUC-ROC values of 0.71 and 0.68 obtained when only genomic and image features, respectively, are used for stage classification. The extracted features were also used to split the patients into low- or high-risk survival groups. Among the 2548 fused features, 1695 showed statistically significant differences in survival probability between the two risk groups defined by the extracted features.
    Approximating Optimal Estimation of Time Offset Synchronization with Temperature Variations. (arXiv:2212.07138v1 [cs.NI])
    The paper addresses the problem of time offset synchronization in the presence of temperature variations, which lead to a non-Gaussian environment. In this context, regular Kalman filtering proves to be suboptimal. A functional optimization approach is developed in order to approximate optimal estimation of the clock offset between master and slave. To this aim, a numerical approximation based on standard neural network training is provided. Other heuristics are provided as well, based on spline regression. An extensive performance evaluation highlights the benefits of the proposed techniques, which can be easily generalized to several clock synchronization protocols and operating environments.
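    One of the mentioned heuristics, spline regression, can be sketched as follows (the drift model, noise level, and smoothing parameter below are our assumptions, not the paper's setup):

        import numpy as np
        from scipy.interpolate import UnivariateSpline

        t = np.linspace(0, 100, 400)                 # time, seconds
        true_offset = 5e-6 * np.sin(t / 20)          # temperature-induced drift
        noise = np.random.default_rng(0).normal(0, 1e-6, t.size)
        meas = true_offset + noise                   # noisy offset measurements

        # A smoothing spline tracks the slowly varying offset; s bounds the
        # residual sum of squares and is set to roughly n * noise_variance.
        spline = UnivariateSpline(t, meas, s=t.size * 1e-12)
        print(spline(t[-1]))                         # current offset estimate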
  • Open

    Amortized Inference for Causal Structure Learning. (arXiv:2205.12934v3 [cs.LG] UPDATED)
    Inferring causal structure poses a combinatorial search problem that typically involves evaluating structures with a score or independence test. The resulting search is costly, and designing suitable scores or tests that capture prior knowledge is difficult. In this work, we propose to amortize causal structure learning. Rather than searching over structures, we train a variational inference model to directly predict the causal structure from observational or interventional data. This allows our inference model to acquire domain-specific inductive biases for causal discovery solely from data generated by a simulator, bypassing both the hand-engineering of suitable score functions and the search over graphs. The architecture of our inference model emulates permutation invariances that are crucial for statistical efficiency in structure learning, which facilitates generalization to significantly larger problem instances than seen during training. On synthetic data and semisynthetic gene expression data, our models exhibit robust generalization capabilities when subject to substantial distribution shifts and significantly outperform existing algorithms, especially in the challenging genomics domain. Our code and models are publicly available at: https://github.com/larslorch/avici.
    Policy Evaluation for Temporal and/or Spatial Dependent Experiments in Ride-sourcing Platforms. (arXiv:2202.10887v4 [stat.ME] UPDATED)
    The aim of this paper is to establish the causal relationship between a ride-sourcing platform's policies and outcomes of interest under complex temporal and/or spatial dependent experiments. We propose a temporal/spatio-temporal varying coefficient decision process (VCDP) model to capture the dynamic treatment effects in temporal/spatio-temporal dependent experiments. We characterize the average treatment effect by decomposing it as the sum of the direct effect (DE) and the indirect effect (IE), and develop estimation and inference procedures for both DE and IE. We also establish the statistical properties (e.g., weak convergence and asymptotic power) of our models. We conduct extensive simulations and real data analyses to verify the usefulness of the proposed method.
    Hierarchical Over-the-Air FedGradNorm. (arXiv:2212.07414v1 [cs.LG])
    Multi-task learning (MTL) is a learning paradigm to learn multiple related tasks simultaneously with a single shared network where each task has a distinct personalized header network for fine-tuning. MTL can be integrated into a federated learning (FL) setting if tasks are distributed across clients and clients have a single shared network, leading to personalized federated learning (PFL). To cope with statistical heterogeneity in the federated setting across clients which can significantly degrade the learning performance, we use a distributed dynamic weighting approach. To perform the communication between the remote parameter server (PS) and the clients efficiently over the noisy channel in a power and bandwidth-limited regime, we utilize over-the-air (OTA) aggregation and hierarchical federated learning (HFL). Thus, we propose hierarchical over-the-air (HOTA) PFL with a dynamic weighting strategy which we call HOTA-FedGradNorm. Our algorithm considers the channel conditions during the dynamic weight selection process. We conduct experiments on a wireless communication system dataset (RadComDynamic). The experimental results demonstrate that the training speed with HOTA-FedGradNorm is faster compared to the algorithms with a naive static equal weighting strategy. In addition, HOTA-FedGradNorm provides robustness against the negative channel effects by compensating for the channel conditions during the dynamic weight selection process.
    Toroidal Coordinates: Decorrelating Circular Coordinates With Lattice Reduction. (arXiv:2212.07201v1 [cs.CG])
    The circular coordinates algorithm of de Silva, Morozov, and Vejdemo-Johansson takes as input a dataset together with a cohomology class representing a $1$-dimensional hole in the data; the output is a map from the data into the circle that captures this hole, and that is of minimum energy in a suitable sense. However, when applied to several cohomology classes, the output circle-valued maps can be "geometrically correlated" even if the chosen cohomology classes are linearly independent. It is shown in the original work that less correlated maps can be obtained with suitable integer linear combinations of the cohomology classes, with the linear combinations being chosen by inspection. In this paper, we identify a formal notion of geometric correlation between circle-valued maps which, in the Riemannian manifold case, corresponds to the Dirichlet form, a bilinear form derived from the Dirichlet energy. We describe a systematic procedure for constructing low energy torus-valued maps on data, starting from a set of linearly independent cohomology classes. We showcase our procedure with computational examples. Our main algorithm is based on the Lenstra--Lenstra--Lovász algorithm from computational number theory.
    Maximal Initial Learning Rates in Deep ReLU Networks. (arXiv:2212.07295v1 [stat.ML])
    Training a neural network requires choosing a suitable learning rate, involving a trade-off between speed and effectiveness of convergence. While there has been considerable theoretical and empirical analysis of how large the learning rate can be, most prior work focuses only on late-stage training. In this work, we introduce the maximal initial learning rate $\eta^{\ast}$ - the largest learning rate at which a randomly initialized neural network can successfully begin training and achieve (at least) a given threshold accuracy. Using a simple approach to estimate $\eta^{\ast}$, we observe that in constant-width fully-connected ReLU networks, $\eta^{\ast}$ demonstrates different behavior to the maximum learning rate later in training. Specifically, we find that $\eta^{\ast}$ is well predicted as a power of $(\text{depth} \times \text{width})$, provided that (i) the width of the network is sufficiently large compared to the depth, and (ii) the input layer of the network is trained at a relatively small learning rate. We further analyze the relationship between $\eta^{\ast}$ and the sharpness $\lambda_{1}$ of the network at initialization, indicating that they are closely though not inversely related. We formally prove bounds for $\lambda_{1}$ in terms of $(\text{depth} \times \text{width})$ that align with our empirical results.
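    A simple way to estimate $\eta^{\ast}$ (our sketch of "a simple approach"; the authors' exact procedure may differ, and we assume success is monotone in the learning rate) is a log-space bisection, where train_fn re-initializes the network, trains briefly, and returns the accuracy reached:

        import math

        def maximal_initial_lr(train_fn, threshold, lo=1e-4, hi=1e2, n_iter=20):
            """Bisect in log space for the largest initial learning rate at which
            a freshly initialized network still reaches `threshold` accuracy."""
            for _ in range(n_iter):
                mid = math.sqrt(lo * hi)     # geometric midpoint
                if train_fn(mid) >= threshold:
                    lo = mid                 # success: push the learning rate up
                else:
                    hi = mid                 # failure: back off
            return lo

        # eta_star = maximal_initial_lr(train_fn, threshold=0.5)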
    Sample Complexity of Offline Reinforcement Learning with Deep ReLU Networks. (arXiv:2103.06671v6 [stat.ML] UPDATED)
    Offline reinforcement learning (RL) leverages previously collected data for policy optimization without any further active exploration. Despite the recent interest in this problem, its theoretical results in neural network function approximation settings remain elusive. In this paper, we study the statistical theory of offline RL with deep ReLU network function approximation. In particular, we establish the sample complexity of $n = \tilde{\mathcal{O}}( H^{4 + 4 \frac{d}{\alpha}} \kappa_{\mu}^{1 + \frac{d}{\alpha}} \epsilon^{-2 - 2\frac{d}{\alpha}} )$ for offline RL with deep ReLU networks, where $\kappa_{\mu}$ is a measure of distributional shift, $H = (1-\gamma)^{-1}$ is the effective horizon length, $d$ is the dimension of the state-action space, $\alpha$ is a (possibly fractional) smoothness parameter of the underlying Markov decision process (MDP), and $\epsilon$ is a user-specified error. Notably, our sample complexity holds under two novel considerations: the Besov dynamic closure and the correlated structure. While the Besov dynamic closure subsumes the dynamic conditions for offline RL in the prior works, the correlated structure renders the prior works of offline RL with general/neural network function approximation improper or inefficient in long (effective) horizon problems. To the best of our knowledge, this is the first theoretical characterization of the sample complexity of offline RL with deep neural network function approximation under the general Besov regularity condition that goes beyond the linearity regime in traditional reproducing kernel Hilbert spaces and Neural Tangent Kernels.
    Identifying the latent space geometry of network models through analysis of curvature. (arXiv:2012.10559v5 [stat.ME] UPDATED)
    A common approach to modeling networks assigns each node to a position on a low-dimensional manifold where distance is inversely proportional to connection likelihood. More positive manifold curvature encourages more and tighter communities; negative curvature induces repulsion. We consistently estimate manifold type, dimension, and curvature from simply connected, complete Riemannian manifolds of constant curvature. We represent the graph as a noisy distance matrix based on the ties between cliques, then develop hypothesis tests to determine whether the observed distances could plausibly be embedded isometrically in each of the candidate geometries. We apply our approach to datasets from economics and neuroscience.
    Comparing Sequential Forecasters. (arXiv:2110.00115v4 [stat.ME] UPDATED)
    Consider two forecasters, each making a single prediction for a sequence of events over time. We ask a relatively basic question: how might we compare these forecasters, either online or post-hoc, while avoiding unverifiable assumptions on how the forecasts and outcomes were generated? In this paper, we present a rigorous answer to this question by designing novel sequential inference procedures for estimating the time-varying difference in forecast scores. To do this, we employ confidence sequences (CS), which are sequences of confidence intervals that can be continuously monitored and are valid at arbitrary data-dependent stopping times ("anytime-valid"). The widths of our CSs are adaptive to the underlying variance of the score differences. Underlying their construction is a game-theoretic statistical framework, in which we further identify e-processes and p-processes for sequentially testing a weak null hypothesis -- whether one forecaster outperforms another on average (rather than always). Our methods do not make distributional assumptions on the forecasts or outcomes; our main theorems apply to any bounded scores, and we later provide alternative methods for unbounded scores. We empirically validate our approaches by comparing real-world baseball and weather forecasters.  ( 2 min )
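    A crude anytime-valid construction conveys the flavor (a union-bound Hoeffding confidence sequence for bounded score differences; the paper's empirical-Bernstein, variance-adaptive CSs are much tighter):

        import numpy as np

        def score_diff_cs(deltas, alpha=0.05):
            """Confidence sequence for the running mean of score differences
            assumed to lie in [-1, 1]: spend alpha_t = alpha / (t (t + 1)) at
            time t (these sum to alpha), so coverage holds at all times at once."""
            deltas = np.asarray(deltas, dtype=float)
            t = np.arange(1, len(deltas) + 1)
            mean = np.cumsum(deltas) / t
            width = np.sqrt(2.0 * np.log(2.0 * t * (t + 1) / alpha) / t)
            return mean - width, mean + width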
    Counterfactual Learning of Stochastic Policies with Continuous Actions: from Models to Offline Evaluation. (arXiv:2004.11722v6 [stat.ML] UPDATED)
    Counterfactual reasoning from logged data has become increasingly important for many applications such as web advertising or healthcare. In this paper, we address the problem of learning stochastic policies with continuous actions from the viewpoint of counterfactual risk minimization (CRM). While the CRM framework is appealing and well studied for discrete actions, the continuous action case raises new challenges about modelization, optimization, and offline model selection with real data, which turns out to be particularly challenging. Our paper contributes to these three aspects of the CRM estimation pipeline. First, we introduce a modelling strategy based on a joint kernel embedding of contexts and actions, which overcomes the shortcomings of previous discretization approaches. Second, we empirically show that the optimization aspect of counterfactual learning is important, and we demonstrate the benefits of proximal point algorithms and differentiable estimators. Finally, we propose an evaluation protocol for offline policies in real-world logged systems, which is challenging since policies cannot be replayed on test data, and we release a new large-scale dataset along with multiple synthetic, yet realistic, evaluation setups.  ( 2 min )
    Deep Learning with Functional Inputs. (arXiv:2006.09590v2 [stat.ML] UPDATED)
    We present a methodology for integrating functional data into deep densely connected feed-forward neural networks. The model is defined for scalar responses with multiple functional and scalar covariates. A by-product of the method is a set of dynamic functional weights that can be visualized during the optimization process. This visualization leads to greater interpretability of the relationship between the covariates and the response relative to conventional neural networks. The model is shown to perform well in a number of contexts including prediction of new data and recovery of the true underlying functional weights; these results were confirmed through real applications and simulation studies. A forthcoming R package is developed on top of a popular deep learning library (Keras) allowing for general use of the approach.  ( 2 min )
    Do Not Sleep on Traditional Machine Learning: Simple and Interpretable Techniques Are Competitive to Deep Learning for Sleep Scoring. (arXiv:2207.07753v3 [stat.ML] UPDATED)
    Over the last few years, research in automatic sleep scoring has mainly focused on developing increasingly complex deep learning architectures. However, recently these approaches achieved only marginal improvements, often at the expense of requiring more data and more expensive training procedures. Despite all these efforts and their satisfactory performance, automatic sleep staging solutions are not widely adopted in a clinical context yet. We argue that most deep learning solutions for sleep scoring are limited in their real-world applicability as they are hard to train, deploy, and reproduce. Moreover, these solutions lack interpretability and transparency, which are often key to increase adoption rates. In this work, we revisit the problem of sleep stage classification using classical machine learning. Results show that competitive performance can be achieved with a conventional machine learning pipeline consisting of preprocessing, feature extraction, and a simple machine learning model. In particular, we analyze the performance of a linear model and a non-linear (gradient boosting) model. Our approach surpasses the state of the art (using the same data) on two public datasets: Sleep-EDF SC-20 (MF1 0.810) and Sleep-EDF ST (MF1 0.795), while achieving competitive results on Sleep-EDF SC-78 (MF1 0.775) and MASS SS3 (MF1 0.817). We show that, for the sleep stage scoring task, the expressiveness of an engineered feature vector is on par with the internally learned representations of deep learning models. This observation opens the door to clinical adoption, as a representative feature vector allows one to leverage both the interpretability and successful track record of traditional machine learning models.  ( 3 min )
    Lower Bounds for the Convergence of Tensor Power Iteration on Random Overcomplete Models. (arXiv:2211.03827v2 [cs.LG] UPDATED)
    Tensor decomposition serves as a powerful primitive in statistics and machine learning. In this paper, we focus on using power iteration to decompose an overcomplete random tensor. Past work studying the properties of tensor power iteration either requires a non-trivial data-independent initialization, or is restricted to the undercomplete regime. Moreover, several papers implicitly suggest that logarithmically many iterations (in terms of the input dimension) are sufficient for the power method to recover one of the tensor components. In this paper, we analyze the dynamics of tensor power iteration from random initialization in the overcomplete regime. Surprisingly, we show that polynomially many steps are necessary for convergence of tensor power iteration to any of the true components, which refutes the previous conjecture. On the other hand, our numerical experiments suggest that tensor power iteration successfully recovers tensor components for a broad range of parameters, even though it takes at least polynomially many steps to converge. To further complement our empirical evidence, we prove that a popular objective function for tensor decomposition is strictly increasing along the power iteration path. Our proof is based on the Gaussian conditioning technique, which has been applied to analyze the approximate message passing (AMP) algorithm. The major ingredient of our argument is a conditioning lemma that allows us to generalize AMP-type analysis to the non-proportional limit and polynomially many iterations of the power method.  ( 2 min )
    Reliable amortized variational inference with physics-based latent distribution correction. (arXiv:2207.11640v2 [stat.ML] UPDATED)
    Bayesian inference for high-dimensional inverse problems is computationally costly and requires selecting a suitable prior distribution. Amortized variational inference addresses these challenges via a neural network that acts as a surrogate conditional distribution, matching the posterior distribution not only for one instance of data, but for a distribution of data pertaining to a specific inverse problem. During inference, the neural network -- in our case a conditional normalizing flow -- provides posterior samples with virtually no cost. However, the accuracy of amortized variational inference relies on the availability of high-fidelity training data, which seldom exists in geophysical inverse problems due to the Earth's heterogeneity. In addition, the network is prone to errors if evaluated over out-of-distribution data. As such, we propose to increase the resilience of amortized variational inference in the presence of moderate data distribution shifts. We achieve this via a correction to the latent distribution that improves the posterior distribution approximation for the data at hand. The correction involves relaxing the standard Gaussian assumption on the latent distribution and parameterizing it via a Gaussian distribution with an unknown mean and (diagonal) covariance. These unknowns are then estimated by minimizing the Kullback-Leibler divergence between the corrected and (physics-based) true posterior distributions. While generic and applicable to other inverse problems, by means of a linearized seismic imaging example, we show that our correction step improves the robustness of amortized variational inference with respect to changes in the number of seismic sources, noise variance, and shifts in the prior distribution. This approach provides a seismic image with limited artifacts and an assessment of its uncertainty with approximately the same cost as five reverse-time migrations.  ( 2 min )
    On LASSO for High Dimensional Predictive Regression. (arXiv:2212.07052v1 [econ.EM])
    In a high dimensional linear predictive regression where the number of potential predictors can be larger than the sample size, we consider using LASSO, a popular L1-penalized regression method, to estimate the sparse coefficients when many unit root regressors are present. Consistency of LASSO relies on two building blocks: the deviation bound of the cross product of the regressors and the error term, and the restricted eigenvalue of the Gram matrix of the regressors. In our setting, where unit root regressors are driven by temporally dependent non-Gaussian innovations, we establish original probabilistic bounds for these two building blocks. The bounds imply that the rates of convergence of LASSO are different from those in the familiar cross-sectional case. In practical applications, given a mixture of stationary and nonstationary predictors, the asymptotic guarantee of LASSO is preserved if all predictors are scale-standardized. In an empirical example of forecasting the unemployment rate with many macroeconomic time series, strong performance is delivered by LASSO when the initial specification is guided by macroeconomic domain expertise.  ( 2 min )
    On the Relationship Between Explanation and Prediction: A Causal View. (arXiv:2212.06925v1 [cs.LG])
    Explainability has become a central requirement for the development, deployment, and adoption of machine learning (ML) models, and we are yet to understand what explanation methods can and cannot do. Several factors, such as data, model prediction, hyperparameters used in training the model, and random initialization, can all influence downstream explanations. While previous work empirically hinted that explanations (E) may have little relationship with the prediction (Y), there is a lack of conclusive studies quantifying this relationship. Our work borrows tools from causal inference to systematically assay it. More specifically, we measure the relationship between E and Y by measuring the treatment effect when intervening on their causal ancestors, i.e., the hyperparameters and inputs used to generate saliency-based Es or Ys. We discover that Y's relative direct influence on E follows an odd pattern; the influence is higher in the lowest-performing models than in mid-performing models, and it then decreases in the top-performing models. We believe our work is a promising first step towards providing better guidance for practitioners who can make more informed decisions in utilizing these explanations by knowing what factors are at play and how they relate to their end task.  ( 2 min )
    Asymptotic in a class of network models with an increasing sub-Gamma degree sequence. (arXiv:2111.01301v3 [math.ST] UPDATED)
    Under differential privacy with sub-Gamma noise, we derive the asymptotic properties of a class of network models with binary values and a general link function. In this paper, we release the degree sequences of binary networks under a general noisy mechanism, with the discrete Laplace mechanism as a special case. We establish asymptotic results, including both consistency and asymptotic normality of the parameter estimator, when the number of parameters goes to infinity in a class of network models. Simulations and a real data example are provided to illustrate the asymptotic results.  ( 2 min )
    Cryptocurrency Valuation: An Explainable AI Approach. (arXiv:2201.12893v2 [econ.GN] UPDATED)
    Currently, there are no convincing proxies for the fundamentals of cryptocurrency assets. We propose a new market-to-fundamental ratio, the price-to-utility (PU) ratio, utilizing unique blockchain accounting methods. We then proxy various fundamental-to-market ratios by Bitcoin historical data and find they have little predictive power for short-term bitcoin returns. However, the PU ratio effectively predicts long-term bitcoin returns. We verify the PU ratio valuation using unsupervised and supervised machine learning. The valuation method informs investment returns and predicts bull markets effectively. Finally, we present an automated trading strategy advised by the PU ratio that outperforms the conventional buy-and-hold and market-timing strategies. We distribute the trading algorithms as open-source software via the Python Package Index for future research.  ( 2 min )
    Learning particle swarming models from data with Gaussian processes. (arXiv:2106.02735v2 [stat.ML] UPDATED)
    Interacting particle or agent systems that display a rich variety of swarming behaviours are ubiquitous in science and engineering. A fundamental and challenging goal is to understand the link between individual interaction rules and swarming. In this paper, we study the data-driven discovery of a second-order particle swarming model that describes the evolution of $N$ particles in $\mathbb{R}^d$ under radial interactions. We propose a learning approach that models the latent radial interaction function as Gaussian processes, which can simultaneously fulfill two inference goals: one is the nonparametric inference of the interaction function with pointwise uncertainty quantification, and the other is the inference of unknown scalar parameters in the non-collective friction forces of the system. We formulate the learning problem as a statistical inverse problem and provide a detailed analysis of recoverability conditions, establishing that a coercivity condition is sufficient for recoverability. Given data collected from $M$ i.i.d. trajectories with independent Gaussian observational noise, we provide a finite-sample analysis, showing that our posterior mean estimator converges in a reproducing kernel Hilbert space norm, at an optimal rate in $M$ equal to the one in classical one-dimensional kernel ridge regression. As a byproduct, we show we can obtain a parametric learning rate in $M$ for the posterior marginal variance using the $L^{\infty}$ norm, and the rate could also involve $N$ and $L$ (the number of observation time instances for each trajectory), depending on the condition number of the inverse problem. Numerical results on systems that exhibit different swarming behaviors demonstrate efficient learning of our approach from scarce noisy trajectory data.  ( 2 min )
    FeDXL: Provable Federated Learning for Deep X-Risk Optimization. (arXiv:2210.14396v2 [cs.LG] UPDATED)
    In this paper, we tackle a novel federated learning (FL) problem for optimizing a family of X-risks, to which no existing FL algorithms are applicable. In particular, the objective has the form of $\mathbb E_{z\sim S_1} f(\mathbb E_{z'\sim S_2} \ell(w; z, z'))$, where two sets of data $S_1, S_2$ are distributed over multiple machines, $\ell(\cdot)$ is a pairwise loss that only depends on the prediction outputs of the input data pairs $(z, z')$, and $f(\cdot)$ is possibly a non-linear non-convex function. This problem has important applications in machine learning, e.g., AUROC maximization with a pairwise loss, and partial AUROC maximization with a compositional loss. The challenges for designing an FL algorithm lie in the non-decomposability of the objective over multiple machines and the interdependency between different machines. To address the challenges, we propose an active-passive decomposition framework that decouples the gradient's components into two types, namely active parts and passive parts, where the active parts depend on local data that are computed with the local model and the passive parts depend on other machines that are communicated/computed based on historical models and samples. Under this framework, we develop two provable FL algorithms (FeDXL) for handling linear and nonlinear $f$, respectively, based on federated averaging and merging. We develop a novel theoretical analysis to combat the latency of the passive parts and the interdependency between the local model parameters and the involved data for computing local gradient estimators. We establish both iteration and communication complexities and show that using the historical samples and models for computing the passive parts does not degrade the complexities. We conduct empirical studies of FeDXL for deep AUROC and partial AUROC maximization, and demonstrate their performance compared with several baselines.  ( 3 min )
    Fast Bayesian Inference with Batch Bayesian Quadrature via Kernel Recombination. (arXiv:2206.04734v3 [cs.LG] UPDATED)
    Calculation of Bayesian posteriors and model evidences typically requires numerical integration. Bayesian quadrature (BQ), a surrogate-model-based approach to numerical integration, is capable of superb sample efficiency, but its lack of parallelisation has hindered its practical applications. In this work, we propose a parallelised (batch) BQ method, employing techniques from kernel quadrature, that possesses an empirically exponential convergence rate. Additionally, just as with Nested Sampling, our method permits simultaneous inference of both posteriors and model evidence. Samples from our BQ surrogate model are re-selected to give a sparse set of samples, via a kernel recombination algorithm, requiring negligible additional time to increase the batch size. Empirically, we find that our approach significantly outperforms the sampling efficiency of both state-of-the-art BQ techniques and Nested Sampling in various real-world datasets, including lithium-ion battery analytics.  ( 2 min )
    Conservative SPDEs as fluctuating mean field limits of stochastic gradient descent. (arXiv:2207.05705v2 [math.PR] UPDATED)
    The convergence of stochastic interacting particle systems in the mean-field limit to solutions of conservative stochastic partial differential equations is established, with optimal rate of convergence. As a second main result, a quantitative central limit theorem for such SPDEs is derived, again, with optimal rate of convergence. The results apply, in particular, to the convergence in the mean-field scaling of stochastic gradient descent dynamics in overparametrized, shallow neural networks to solutions of SPDEs. It is shown that the inclusion of fluctuations in the limiting SPDE improves the rate of convergence, and retains information about the fluctuations of stochastic gradient descent in the continuum limit.  ( 2 min )
    Fair Infinitesimal Jackknife: Mitigating the Influence of Biased Training Data Points Without Refitting. (arXiv:2212.06803v1 [cs.LG] CROSS LISTED)
    In consequential decision-making applications, mitigating unwanted biases in machine learning models that yield systematic disadvantage to members of groups delineated by sensitive attributes such as race and gender is one key intervention to strive for equity. Focusing on demographic parity and equality of opportunity, in this paper we propose an algorithm that improves the fairness of a pre-trained classifier by simply dropping carefully selected training data points. We select instances based on their influence on the fairness metric of interest, computed using an infinitesimal jackknife-based approach. The training points are dropped only in principle; in practice, the model never needs to be refit. Crucially, we find that such an intervention does not substantially reduce the predictive performance of the model but drastically improves the fairness metric. Through careful experiments, we evaluate the effectiveness of the proposed approach on diverse tasks and find that it consistently improves upon existing alternatives.  ( 2 min )
    Sparse Interaction Neighborhood Selection for Markov Random Fields via Reversible Jump and Pseudoposteriors. (arXiv:2204.05933v3 [stat.CO] UPDATED)
    We consider the problem of estimating the interacting neighborhood of a Markov Random Field model with finite support and homogeneous pairwise interactions based on relative positions of a two-dimensional lattice. Using a Bayesian framework, we propose a Reversible Jump Markov chain Monte Carlo algorithm that jumps across subsets of a maximal range neighborhood, allowing us to perform model selection based on a marginal pseudoposterior distribution of models. To show the strength of our proposed methodology, we perform a simulation study and apply it to a real dataset from a discrete texture image analysis.  ( 2 min )
    A deep learning approach to data-driven model-free pricing and to martingale optimal transport. (arXiv:2103.11435v3 [q-fin.CP] UPDATED)
    We introduce a novel and highly tractable supervised learning approach based on neural networks that can be applied for the computation of model-free price bounds of, potentially high-dimensional, financial derivatives and for the determination of optimal hedging strategies attaining these bounds. In particular, our methodology allows us to train a single neural network offline and then to use it online for the fast determination of model-free price bounds of a whole class of financial derivatives with current market data. We show the applicability of this approach and highlight its accuracy in several examples involving real market data. Further, we show how a neural network can be trained to solve martingale optimal transport problems involving fixed marginal distributions instead of financial market data.  ( 2 min )

  • Open

    [P] Medical question-answering without hallucinating
    tl;dr I built a site that uses GPT-3.5 to answer natural-language medical questions using peer-reviewed medical studies. Live demo: https://www.glaciermd.com/search Background: I've been working for a while on building a better version of WebMD, and I recently started playing around with LLMs, trying to figure out if there was anything useful there. The problem with the current batch of "predict-next-token" LLMs is that they hallucinate—you can ask ChatGPT to answer medical questions, but it'll either refuse to answer (not great) or give a completely false answer (really super bad). So I spent some time trying to coax these LLMs to give answers based on a very specific set of inputs (peer-reviewed medical research) to see if I could get more accurate answers. And I did! The best part is you can actually trace the final answer back to the original sources, which will hopefully instill some confidence in the result. Here's how it works: (1) the user types in a question; (2) pull the top ~800 studies from Semantic Scholar and PubMed; (3) re-rank using sentence-transformers/multi-qa-MiniLM-L6-cos-v1; (4) ask text-davinci-003 to answer the question based on the top 10 studies (if possible); (5) summarize those answers using text-davinci-003. Would love to hear what people think (and if there's a better/cheaper way to do it!). submitted by /u/tmblweeds [link] [comments]  ( 63 min )
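    As a rough illustration of the re-ranking step (step 3) above, here is a minimal sketch: the model name is the one given in the post, while `question`, `abstracts`, and the glue code are assumptions.

        # Hedged sketch of the re-ranking step; `abstracts` is assumed to hold
        # candidate study texts already fetched from Semantic Scholar / PubMed.
        from sentence_transformers import SentenceTransformer, util

        model = SentenceTransformer("sentence-transformers/multi-qa-MiniLM-L6-cos-v1")

        def rerank(question, abstracts, top_k=10):
            q_emb = model.encode(question, convert_to_tensor=True)
            d_emb = model.encode(abstracts, convert_to_tensor=True)
            scores = util.cos_sim(q_emb, d_emb)[0]      # cosine similarity per abstract
            best = scores.topk(min(top_k, len(abstracts)))
            return [(abstracts[int(i)], float(s)) for s, i in zip(best.values, best.indices)]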
    [R] Any Benchmarks of RTX 3090 vs RTX 3090 Ti for NLP, Computer Vision, Deep Learning
    Does anyone have benchmarks of the RTX 3090 vs RTX 3090 Ti for NLP (such as transformer fine-tuning), computer vision, or other deep learning problems? I am trying to see if the speed difference makes sense for these models, and also whether it's possible to run multiple models in parallel on these machines without frying them. submitted by /u/jaiabh1 [link] [comments]  ( 58 min )
    [D] Taking DNA as input and a person's appearance as output
    Well, I basically said it in my title. Is it possible to take digitized DNA as input for a neural network and generate an output with a face, assuming it has training data (DNA-Photo) from an international database? submitted by /u/st4s1k [link] [comments]  ( 64 min )
    [D] Waveform recognition question
    Hey, I'm new to machine learning and I have a question. I'm currently working on my first project, where I'm trying to use machine learning to recognise different waveforms that occur in my home network. I have a small dataset of about 10 different waveforms (each has around 80 values of measured voltage in a short time frame once a specific event in the network happens). What I need is for the algorithm to recognise once one of these specific events happens and tell me which one it was (or was closest to) compared with the waveforms in the dataset. Which model would you recommend for doing this? I've tried a tree-based k-NN but couldn't get it to work. Any help is greatly appreciated. submitted by /u/Tavallist [link] [comments]  ( 61 min )
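    One plausible starting point for this question, sketched under the assumption that every recording is a fixed-length vector of 80 voltage samples; the `X` and `y` arrays below are placeholders for the poster's real data.

        import numpy as np
        from sklearn.model_selection import train_test_split
        from sklearn.neighbors import KNeighborsClassifier
        from sklearn.pipeline import make_pipeline
        from sklearn.preprocessing import StandardScaler

        X = np.random.randn(200, 80)          # placeholder: recordings of 80 voltage samples
        y = np.random.randint(0, 10, 200)     # placeholder: which of the ~10 event types

        X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
        clf = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
        clf.fit(X_tr, y_tr)
        print("held-out accuracy:", clf.score(X_te, y_te))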
    [D] Small scale grants for human-in-the-loop experiments?
    Hi all, I'm running a small scale ML experiment with human subjects, and the lab ran into funding issues right when we were ready to launch the final version of the experiment. Are you aware of any small scale (<$10k), fast turnaround (<month) grants or scholarships that we can apply for to close this financial gap? If it matters, the lab is part of a US research institution, the researchers have a green card but are not US citizens. Thanks!! submitted by /u/HackZisBotez [link] [comments]  ( 60 min )
    Neural networks and machine learning for data science in business [D]
    I am currently learning tools for data science, in particular in a business-analysis setting for pricing strategy, demand forecasting, etc. I am currently reading Géron's Hands-On ML and I am fascinated by some of the ML concepts, such as regression and random forests, and see the potential of these tools for business data science. But now that I am reading the neural network part of the book, with Keras and TensorFlow, I am slowly realising that these tools are used for really big datasets and feature sets, for tasks such as computer vision, voice recognition, etc., and not for business-analysis applications. Am I right in this feeling that the basic ML tools are enough, or is there real application for neural networks and advanced ML tools in business data analysis? submitted by /u/lordgriefter [link] [comments]  ( 59 min )
    [D] What are the strongest plain baselines for Vision Transformers on ImageNet?
    I am looking for the hyper-parameter settings that could produce the highest accuracies for plain ViT (i.e., without modifying the model architecture) on ImageNet-1K, training from scratch. A lot of people in this sub have experience with ViT so I hope I could get some help here. For ViT-S, we have a recipe that can achieve 80.0% top-1 accuracy from this paper: Better plain ViT baselines for ImageNet-1k. Unfortunately they did not experiment with larger architecture (ViT-B or ViT-L). For ViT-B, ViT-L and ViT-H, the authors of MAE claimed to achieve 82.3%, 82.6% and 83.1%, respectively (see their Table 3). However, I was unable to reproduce these results using their code and their reported hyper-parameters. Any references to strong ViT baselines with reproducible results would be very much appreciated! Thanks. submitted by /u/netw0rkf10w [link] [comments]  ( 60 min )
    [D] Trying to find paper about n-grams in early transformer layers
    I remember reading a paper a while back that showed early attention layers in a transformer could be replaced with a simpler mechanism since most heads only modeled small n-grams. I think they used some kind of pooling? Wondering if anyone knows which paper that was and had any thoughts about it since then. Thanks! submitted by /u/soraki_soladead [link] [comments]  ( 58 min )
    [D] Regarding momentum schedulers and whether they are impactful in the deep learning of neural networks.
    I noticed that some PyTorch training modules use a learning-rate (LR) scheduler together with a momentum scheduler; many momentum schedulers exist, analogous to LR schedulers, ranging from lambda to cosine to cyclic schedules. One article that caught my eye was about something called Demon (paper: link), where the momentum starts at a very high value such as 0.9 and then decays to a very low value towards the end. My idea was to implement a OneCycle LR scheduler with a 15-epoch warmup that starts at 0.0002, rises to 0.1, and then falls back to a very low value of about 0.00002 by the last epoch, combined with the Demon momentum scheduler. My question: is ending momentum at a very low value useful? According to the stochastic gradient update equations, we would initially give weight to the running average of past gradient changes and, towards the end (due to the low Demon momentum), only to the most recent gradient change; and since the learning rate is also low by then, the weight updates would be minimal. Is this useful? Other articles instead suggest using low momentum with a high learning rate and then increasing the momentum while decreasing the learning rate towards the end of training; with Demon it is the opposite: training starts as momentum SGD and gradually becomes close to vanilla SGD towards the end. (Figures: Demon momentum scheduler; OneCycle cosine LR scheduler; SGD momentum formulas.) submitted by /u/skeletons_of_closet [link] [comments]  ( 64 min )
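    For concreteness, a minimal sketch of the schedule described, assuming PyTorch, a placeholder model, a placeholder step count, and the Demon decay rule beta_t = beta_0*(1 - t/T) / ((1 - beta_0) + beta_0*(1 - t/T)) from the cited paper; the div factors are chosen to hit the 0.0002 -> 0.1 -> 0.00002 trajectory, and mapping the 15-epoch warmup to pct_start=0.15 is an assumption.

        import torch

        model = torch.nn.Linear(10, 2)                      # placeholder model
        opt = torch.optim.SGD(model.parameters(), lr=0.0002, momentum=0.9)
        total_steps = 100_000                               # assumption: epochs * steps_per_epoch
        sched = torch.optim.lr_scheduler.OneCycleLR(
            opt, max_lr=0.1, total_steps=total_steps,
            pct_start=0.15,            # warmup fraction, standing in for 15 epochs
            div_factor=500,            # start at 0.1 / 500 = 0.0002
            final_div_factor=10,       # end at 0.0002 / 10 = 0.00002
            cycle_momentum=False)      # momentum handled manually below

        beta0 = 0.9
        for t in range(total_steps):
            # ... forward pass, loss.backward() ...
            opt.step()
            sched.step()
            frac = 1.0 - t / total_steps                    # Demon: decay momentum toward 0
            for g in opt.param_groups:
                g["momentum"] = beta0 * frac / ((1 - beta0) + beta0 * frac)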
    [D] Search Documents Quickly with Extractive Question Answering and Sparse Transformers
    Imagine a situation where you have thousands of documents but need to find an answer inside them, and at the same time identify the document the answer comes from. You could open and search the documents one by one, but that would take forever. Enter extractive question answering with sparse transformers. With extractive question answering, you input a query into the system and in return get the answer to your question and the document containing it, so you can search many records and find the answer. It works by (1) retrieving documents that are relevant to answering the question and (2) returning the text span that answers it. Language models make this possible: for example, the retriever can be a masked language model and the reader a question-answering model. The challenge of these language models is that they are quite large, which makes it hard to deploy them for real-time inference; deploying big models is not possible on mobile devices, and inference time, latency, and throughput are also critical. The solution is to reduce the model's size while maintaining its accuracy; making the model small is easy, but maintaining accuracy is challenging. Both can be achieved by pruning and quantizing the model: pruning removes some weight connections from an otherwise over-precise and over-parameterized model, and reducing the precision of the floating points makes the model smaller still. In today's article I cover this in more detail, including: 💡 document retrieval with DeepSparse and an arXiv dataset 💡 document retrieval with a dense and a sparse model 💡 comparing the performance of dense and sparse models. Read the full article: https://neuralmagic.com/blog/search-documents-quickly-with-extractive-question-answering-and-sparse-transformers/ submitted by /u/mwitiderrick [link] [comments]  ( 61 min )
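    To make the reader step concrete, here is a minimal dense extractive-QA sketch with the Hugging Face transformers pipeline; the article itself uses DeepSparse with sparse models, so this is only a stand-in to show the interface, and the example context is made up.

        from transformers import pipeline

        qa = pipeline("question-answering",
                      model="distilbert-base-cased-distilled-squad")

        context = ("Pruning removes some weight connections from an overparameterized "
                   "model, and quantization lowers the precision of its weights.")
        result = qa(question="What does pruning remove?", context=context)
        print(result["answer"], result["score"])    # extracted span + confidence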
    [P] Image search with localization and open-vocabulary reranking.
    TL;DR Image search with open-vocabulary localization using both index-time and search-time methods. Article (no paywall): https://medium.com/@jesse_894/image-search-with-localization-and-open-vocabulary-reranking-using-marqo-yolox-clip-and-owl-vit-9c636350bf66?source=friends_link&sk=b4e94d9d4095a2b8b60c5d1904a60825 Markdown: https://github.com/marqo-ai/marqo/blob/mainline/examples/ImageSearchLocalization/article.md Code: https://github.com/marqo-ai/marqo/blob/mainline/examples/ImageSearchLocalization/index_all_data.py I wanted to have a few choices for getting localization into image search (at index and search time). I immediately thought of using a region proposal network (RPN) from Mask R-CNN to create patches that can also be indexed and searched (and add the localisation). I figured it might …  ( 59 min )
    [D] Can we use decoders on pretrained embeddings to qualitatively assess what information the embedding contains?
    Say I have some trained model that learns an embedding, e.g., a classifier. I freeze it and chop off all layers past the embedding that I want. Now, using this embedding, I train a decoder to reconstruct the original encoded object. Would the reconstruction give me a qualitative insight into what sort of representation the frozen model has learned? If so, why? An example with images: I observe that human eyes are reconstructed quite well, but everything else is unclear, giving me the indication that my learned embedding successfully encodes eyes. submitted by /u/SwiftLynx [link] [comments]  ( 65 min )
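    A minimal sketch of the probing setup described in the question, assuming flattened image inputs; the encoder here is a placeholder standing in for the real trained model.

        import torch
        import torch.nn as nn

        encoder = nn.Sequential(nn.Flatten(), nn.Linear(784, 64))  # placeholder "trained" encoder
        for p in encoder.parameters():
            p.requires_grad = False                                # freeze the embedding

        decoder = nn.Sequential(nn.Linear(64, 784), nn.Sigmoid())
        opt = torch.optim.Adam(decoder.parameters(), lr=1e-3)
        loss_fn = nn.MSELoss()

        def train_step(x):                                         # x: batch of images
            with torch.no_grad():
                z = encoder(x)                                     # frozen embedding
            x_hat = decoder(z)                                     # reconstruction to inspect
            loss = loss_fn(x_hat, x.flatten(1))
            opt.zero_grad(); loss.backward(); opt.step()
            return loss.item()

        print(train_step(torch.rand(32, 1, 28, 28)))               # smoke test with fake images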
    Using multiple OCR engines for better recognition [D]
    How do I set up multiple OCR engines in such a way that the final recognition is better, keeping in mind that it should work well in production as well? What methods can be used to get a better output for each piece of text from multiple OCR engines? How can we determine which OCR engine produces the better result in a multiple-OCR setup? submitted by /u/fountainhop [link] [comments]  ( 61 min )
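    One simple fusion baseline, sketched under the assumption that the engines' outputs for the same text region are roughly character-aligned; production systems usually align candidates with edit distance and weight votes by engine confidence.

        from collections import Counter
        from itertools import zip_longest

        def majority_vote(candidates):
            """candidates: list of strings, one per OCR engine, for the same text region."""
            out = []
            for chars in zip_longest(*candidates, fillvalue=""):
                winner, _ = Counter(chars).most_common(1)[0]  # per-position majority
                out.append(winner)
            return "".join(out)

        print(majority_vote(["he1lo", "hello", "hell0"]))     # -> "hello"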
    [D] Is "natural" text always maximally likely according to language models?
    I've been experimenting with language models a lot lately and am wondering whether human-generated (i.e. "natural") text is really supposed to be maximally likely according to language models, even after training. For example, has anyone compared the likelihood of human-translated text with the likelihood of machine-translated text according to a language model like GPT-3? Are there any works that do this already? Does this idea even make sense to begin with? submitted by /u/Emergency_Apricot_77 [link] [comments]  ( 63 min )
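    The measurement the question asks about is easy to run with an open model as a stand-in for GPT-3; a hedged sketch scoring average per-token log-likelihood with GPT-2 (the example sentences are arbitrary).

        import torch
        from transformers import GPT2LMHeadModel, GPT2TokenizerFast

        tok = GPT2TokenizerFast.from_pretrained("gpt2")
        lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

        def avg_log_likelihood(text):
            ids = tok(text, return_tensors="pt").input_ids
            with torch.no_grad():
                out = lm(ids, labels=ids)    # labels=ids yields mean cross-entropy
            return -out.loss.item()          # higher = more likely under the model

        print(avg_log_likelihood("The cat sat on the mat."))
        print(avg_log_likelihood("Mat the on sat cat the."))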
    [D] Dealing with extremely imbalanced dataset
    I am working on a problem where the ratio of negative/0 labels to positive/1 labels is 180MM/10MM. The data size is around 25GB and I have >500 features. Certainly, I don't want to use all 180MM rows of the majority class to train my model, due to computational limitations. Currently, I simply under-sample the majority class. However, I have been reading that this may cause loss of useful information or make it harder to determine the decision boundary between the classes (see https://machinelearningmastery.com/undersampling-algorithms-for-imbalanced-classification/). When I do the under-sampling, I try to make sure that the distribution of my data stays the same. I am wondering if there is a better way to handle this? submitted by /u/hopedallas [link] [comments]  ( 66 min )
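    Two common alternatives to fully balanced undersampling, sketched with imbalanced-learn and scikit-learn; `X` and `y` below are small placeholders for the poster's data, and the 4:1 target ratio is an assumption.

        import numpy as np
        from imblearn.under_sampling import RandomUnderSampler
        from sklearn.linear_model import LogisticRegression

        X = np.random.randn(19000, 20)                      # placeholder features
        y = np.r_[np.zeros(18000), np.ones(1000)]           # placeholder 18:1 imbalance

        # Option 1: keep every minority row, subsample the majority to a gentler
        # ratio (here minority/majority = 0.25, i.e. 4:1) instead of fully balancing.
        rus = RandomUnderSampler(sampling_strategy=0.25, random_state=0)
        X_res, y_res = rus.fit_resample(X, y)

        # Option 2: reweight the loss instead of resampling, so no rows are discarded.
        clf = LogisticRegression(class_weight="balanced", max_iter=1000)
        clf.fit(X_res, y_res)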
    [P] A CUDA-free instant NGP renderer: Support real-time rendering and camera interaction and consume less than 1GB of VRAM.
    Project repo: https://github.com/Linyou/taichi-ngp-renderer Instant NGP is a novel view synthesis framework that reduces the model training for a single scene from hours to a few seconds. This project is a CUDA-free instant NGP renderer implemented in Taichi. Supported by Taichi's built-in GUI system, the project supports real-time rendering and camera interaction while consuming less than 1GB of VRAM. It also contains a fully fused multilayer perceptron (MLP) enabled by the SharedArray feature. (Screenshots of pre-trained NeRF synthesis scenes omitted.) submitted by /u/TaichiOfficial [link] [comments]  ( 66 min )
    [D] Tensorflow vs. PyTorch Memory Usage
    Why is it that when I create a CNN with 4 layers (output channels: 64, 32, 16, 16), I can do this in PyTorch, but in TensorFlow I get errors saying I don't have enough resources? For reference, I am using a stock NVIDIA RTX 3080. Also, now that I am experimenting with larger models, would I benefit from renting a TPU? Does that make models train faster, and would it help with larger batches? submitted by /u/Oceanboi [link] [comments]  ( 6 min )
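    One thing worth checking before blaming model size: by default TensorFlow reserves nearly all GPU memory up front, while PyTorch allocates lazily, so the same model can hit resource errors only on the TF side. Enabling memory growth is a common first fix; this uses a real TF API, though whether it resolves this particular error is an assumption.

        import tensorflow as tf

        # Let TF allocate GPU memory on demand instead of grabbing it all at startup.
        for gpu in tf.config.list_physical_devices("GPU"):
            tf.config.experimental.set_memory_growth(gpu, True)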
  • Open

    Looking for a good implementation of ucbvi
    I'm doing a research project for which I need to run UCB-VI on a tabular gym environment. I've been searching for Python packages that implement it, but they are surprisingly scarce! The only one I've found so far is rlberry, but it seems to be quite buggy. Anyone have any suggestions? submitted by /u/the_bovine_life [link] [comments]  ( 54 min )
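    Given how scarce packaged implementations are, a minimal tabular UCB-VI planning step is short enough to sketch directly; `P_hat`, `R_hat`, and `N` are the empirical transition model, empirical rewards, and visit counts the caller maintains, and the bonus is a simplified Hoeffding-style term rather than the paper's exact constants.

        import numpy as np

        def ucbvi_plan(P_hat, R_hat, N, H, delta=0.1):
            """P_hat: (S,A,S) empirical transitions; R_hat, N: (S,A) rewards / counts."""
            S, A, _ = P_hat.shape
            L = np.log(S * A * H / delta)
            Q = np.zeros((H, S, A))
            V = np.zeros((H + 1, S))
            for h in reversed(range(H)):                       # backward induction
                bonus = H * np.sqrt(L / np.maximum(N, 1))      # simplified exploration bonus
                Q[h] = np.minimum(H, R_hat + bonus + P_hat @ V[h + 1])
                V[h] = Q[h].max(axis=1)
            return Q                                           # act greedily: a = Q[h, s].argmax()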
    Why would an Actor / Critic Reinforcement Learning algorithm start outputting zeros after about 20k steps?
    I have a very large algorithm written in C++ for LibTorch that outputs zeros after about 20k steps. I have included the code below, but there is quite a lot of code here, so maybe I can get a more general answer or some ideas to test from the community, because you likely will not want to run this code. I had to delete a good portion of it to be below the character limit for StackOverflow. But, be my guest. This is the Maximum a Posteriori Policy Optimisation algorithm. This algorithm controls agents in the MuJoCo physics simulator. The algorithm uses a Markov Decision Process, and a reward is set for the agent to learn to maximize. I tried the very simple "agent" of an inverted pendulum, and it seemed to maximize the reward and balance the pendulum after a few thousand steps. When I try it on a …  ( 59 min )
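    Without the full code it is hard to diagnose, but a generic first test for sudden zero outputs is to log gradient norms just before the collapse; exploding or NaN gradients and saturated output activations are common culprits. A hedged PyTorch sketch of such a diagnostic; the LibTorch C++ equivalent is analogous.

        def grad_report(model):
            """Return the global gradient norm plus per-parameter norms."""
            norms = {name: p.grad.norm().item()
                     for name, p in model.named_parameters() if p.grad is not None}
            total = sum(v ** 2 for v in norms.values()) ** 0.5
            return total, norms

        # after loss.backward(), before optimizer.step():
        # total, norms = grad_report(actor)
        # if total == 0.0 or total != total:   # zero or NaN gradients
        #     print(norms)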
    Deep RL for a custom problem.
    I'm trying to use RL to optimize a game that I'm currently playing. I feel it's a supply chain problem which can be optimized using RL. Below is the chart depicting all the items that have to be created to complete a level (figure: game items flow). These items have to be crafted in different quantities, each with individual lead times. They are crafted using coins, and the unit price increases while the time to craft remains the same. The unit cost increases as follows: the 1st unit is crafted with, say, x + (x/2)*0 coins; the 2nd unit with x + (x/2)*1 coins; the 3rd unit with x + (x/2)*2 coins. But the lead time for a batch of items remains the same. Say I craft a 10-unit batch of needles: I have to make sure that 10 units of ribbon and 40 units of metal are available at the time of crafting, and the price of the batch will be 10x + (x/2)*(0 + 1 + 2 + ... + 9). The time to craft the batch is constant at y. I want to create a policy such that the wait times of the higher-order items are minimized along with the number of coins used. Need help in creating this environment, running simulations, and optimizing with DRL. PS: Please feel free to ask questions if anything is unclear. submitted by /u/haldarankit [link] [comments]  ( 54 min )
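    To pin down the cost model before building the environment: under the stated rule the k-th unit costs x + (x/2)(k-1), so a batch of n units costs nx + (x/2)n(n-1)/2, with a constant lead time y. A tiny helper to check this against the post's own example:

        def batch_cost(x, n):
            # n*x base cost plus the (x/2)*(0 + 1 + ... + (n-1)) escalation term
            return n * x + (x / 2) * n * (n - 1) / 2

        assert batch_cost(10, 10) == 10 * 10 + 5 * (0 + 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9)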
    [Discussion] Catching up with SOTA and innovations from 2022?
    Hey all! I've been exploring new areas of ML over 2022 so I've missed a decent amount in terms of RL innovations over this year. I was wondering if anyone had good paper recommendations for me to catch up on? What were your "wow, this is big" papers of this year? submitted by /u/iamquah [link] [comments]  ( 56 min )
    Best Books to Learn Reinforcement Learning in 2022 -
    submitted by /u/Lakshmireddys [link] [comments]  ( 51 min )
    Learning to Throw with a Handful of Samples using Decision Transformers
    Decision Transformers (DT) are relatively new in the world of Reinforcement Learning. In a new work, recently accepted for publication in IEEE Robotics and Automation Letters, we explore the use of DT for robotic object throwing. Object throwing, without prior knowledge of the required motion, can be very dangerous to learn on a real robot. Training in simulation, in addition, provides an out-of-distribution and far-from-reality policy. We show that, with DT and after training a policy with simulated throws, sim2real can be done with only a handful of real throws. The DT can extrapolate and accurately throw to goals that are out-of-distribution to the training data. Video: https://m.youtube.com/watch?v=5_G6o_H3HeE&feature=youtu.be Paper: https://ieeexplore.ieee.org/document/9984828 submitted by /u/maxorpaxor [link] [comments]  ( 55 min )
    TurtleBot3 Deep Reinforcement Learning Experimentation Platform (ROS2, PyTorch)
    https://github.com/tomasvr/turtlebot3_drlnav Hi all! I created this platform based on the existing TurtleBot3 platform in order to make it easier for people to experiment with deep reinforcement learning for mobile robot navigation. Currently, the platform includes PyTorch implementations for DQN, DDPG, and TD3. The platform is based on ROS2 and provides multiple facilities such as storing/loading models, recording training output, and visualizing neural network activity. The system has also been validated on a low-cost physical robot, videos are included in the GitHub readme. I wanted to share the platform here in the hope that it could be helpful for anyone wanting to experiment with deep reinforcement learning or even implement their own algorithms. Thanks! submitted by /u/FlutteringReeds [link] [comments]  ( 55 min )
    Can't hit a moving target
    Hello! I am trying to learn reinforcement learning for a game that I'm working on, and, to start small, I have the following environment: a 2D 50x50 grid with a target. The AI starts in a random location, and is given a reward of 1.0 if it hits the target and -1.0 if it hits the edge of the grid. Its actions are to move by 1.0 in one of the four cardinal directions, or to do nothing. The observation space is simply the x,y coordinates of the AI and of the target. If the target is always in the center, then it learns quickly to move toward the target. If the target spawns at a random location each time, it does not, though it does seem to learn to avoid the walls, and eventually just kind of jiggles in place. I started with a Rust port of CleanRL's PPO implementation, here. Thinking that I may have messed up the port, I have also created it in Python here, using the actual CleanRL PPO implementation, but the behavior appears to be the same. Is there something fundamental I could be missing? It seems to me that this should be a very simple environment to learn. submitted by /u/paholg [link] [comments]  ( 57 min )
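    One common cause of exactly this symptom is that the policy has to learn the mapping from two absolute coordinate pairs to a direction, which generalizes poorly; feeding the relative offset to the target usually helps. A hedged sketch of a gym observation wrapper, assuming the observation is laid out as [agent_x, agent_y, target_x, target_y]:

        import numpy as np
        import gym

        class RelativeObs(gym.ObservationWrapper):
            """Append the normalized offset to the target to the observation."""
            def observation(self, obs):
                agent, target = obs[:2], obs[2:4]
                # (for a full env you would also widen self.observation_space)
                return np.concatenate([obs, (target - agent) / 50.0])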
    "Embedding Synthetic Off-Policy Experience for Autonomous Driving via Zero-Shot Curricula", Bronstein et al 2022
    submitted by /u/gwern [link] [comments]  ( 55 min )
  • Open

    AI Dream 130 - EPIC Trippy Dream Feat. DJ Wizard69
    submitted by /u/LordPewPew777 [link] [comments]  ( 53 min )
    I made a video about the "new" GPT-3 chatbot by OpenAI. There's a lot about this AI people don't know. Give it a watch if you have a few minutes. Thanks!
    https://youtu.be/0fEG0ClkIdo submitted by /u/dondomigo [link] [comments]  ( 53 min )
    Hardware for genetic algorithms?
    I'm just getting into GAs and am interested in them for mathematics work. I have some funding to build a machine and am planning to also use it for training some deep learning models. So I'm currently considering an Intel i9 13900k, 128GB RAM, and an RTX 4090 GPU. But would I benefit from a Threadripper pro CPU for GAs? It wouldn't do much for deep learning but if it would benefit GA work then I could justify it. Thanks! Note: Apparently I'm restricted from posting at /r/genetic_algorithms as it's only for "trusted members". I hope it's ok here. submitted by /u/computing_professor [link] [comments]  ( 53 min )
    Is there any text-to-image AI better than Midjourney? If not, does anyone know of one with the same quality but a bigger limit, or unlimited if available?
    submitted by /u/Shady_Shama [link] [comments]  ( 53 min )
    AI Dream 130 - Psychedelic Daydream Wizard69 Remix
    submitted by /u/LordPewPew777 [link] [comments]  ( 53 min )
    Making an AI picture of a character in a photo?
    This is probably asking a lot, and apologies if it is not a good fit for this subreddit, but are there any existing AI tools that would allow me to take a still image of a character that I drew myself, and create a new picture with that character doing something else? Like, I have the character standing in a neutral pose, and I want it to be riding a motorcycle? submitted by /u/darnbirch [link] [comments]  ( 52 min )
    I asked AI to make a Music Video… the results are trippy
    submitted by /u/Prior_Appearance_44 [link] [comments]  ( 51 min )
    Is AI smarter than an infant? Not even close.
    submitted by /u/estasfuera [link] [comments]  ( 55 min )
    AI tool developed in Israel can predict heart failure weeks in advance
    Researchers in Israel have come up with an artificial intelligence tool capable of analysing ECG tests and predicting heart failure with an unprecedented accuracy rate. According to the Times of Israel, the technology is currently being used for patients who suffer from myositis — a condition that significantly increases the risk of heart failure. The AI model was trained by feeding it the data from ECG scans and medical records of 89 patients suffering from myositis between 2000 and 2020. The report claimed that the AI can detect subtle patterns in the ECGs and predict possible heart failures well ahead of time. "We are running ECG tests through the AI model, which sees details that doctors can't normally detect and then predicts who is at risk of heart failure," said head researcher Dr. Shahar Shelly of Rambam Healthcare Campus. "Given that it's these cardiac dysfunctions that often end up killing people, this can save lives." Another game-changing development from AI, and this one is absolutely MASSIVE. This is from the AI With Vibes Newsletter; read the full issue here: https://aiwithvibes.beehiiv.com/p/ai-porn-billie-eilish-goes-viral-tiktok-chatgpt-brutally-destroyed-pun-competition submitted by /u/Mk_Makanaki [link] [comments]  ( 49 min )
    Google AI-Chat Technology: Moving Too Fast Could Damage Reputation
    submitted by /u/liquidocelotYT [link] [comments]  ( 51 min )
    What's so special about ChatGPT?
    From what I gathered, it's a tool developed by OpenAI for chatting with an AI. I don't find that fascinating enough to explain why everyone is talking about it, so maybe I'm wrong and someone could enlighten me? Thanks. submitted by /u/hisdudeness_ishere [link] [comments]  ( 51 min )
    AlterEgo - Generate AI images of you in millions of styles
    submitted by /u/magenta_placenta [link] [comments]  ( 50 min )
    Can I give ChatGPT three books on the lives of Vikings and have it write a new one that uses the information from the three books?
    This would change my life. submitted by /u/Thesmallcookie [link] [comments]  ( 50 min )
    Looking for a remote job in AI/ML?
    ai-jobs.net has you covered - it currently lists over 1K open remote roles globally: https://ai-jobs.net/remote-jobs/ Is there anything you miss in terms of job search? Would also be interesting to hear what your take is on the current job market in this space. submitted by /u/ai_jobs [link] [comments]  ( 56 min )
    FAQ: What is ChatGPT?
    submitted by /u/yudiz [link] [comments]  ( 53 min )
    Welcome Home but Every Lyric is an AI Generated Animation!
    submitted by /u/Available_Tadpole829 [link] [comments]  ( 51 min )
  • Open

    New and Improved Embedding Model
    We are excited to announce a new embedding model which is significantly more capable, more cost-effective, and simpler to use. The new model, text-embedding-ada-002, replaces five separate models for text search, text similarity, and code search, and outperforms our previous most capable model, Davinci, at most tasks, while being priced  ( 4 min )
  • Open

    LightOn Lyra-fr model is now available on Amazon SageMaker
    We are thrilled to announce the availability of the LightOn Lyra-fr foundation model for customers using Amazon SageMaker. LightOn is a leader in building foundation models specializing in European languages. Lyra-fr is a state-of-the-art French language model that can be used to build conversational AI, copywriting tools, text classifiers, semantic search, and more. You can […]  ( 7 min )
  • Open

    Have a Holly, Jolly Holiday Streaming Top Titles on GeForce NOW
    While the weather outside may or may not be frightful this holiday season, new games on GeForce NOW each week make every GFN Thursday delightful. It doesn’t matter whether you’re on the naughty or nice list. With over 1,400 titles streaming from the cloud, there’s something for everyone to play across nearly all of their Read article > The post Have a Holly, Jolly Holiday Streaming Top Titles on GeForce NOW appeared first on NVIDIA Blog.  ( 6 min )
  • Open

    Sphere of influence
    Suppose a spaceship is headed from the earth to the moon. At some point we say that the ship has left the earth's sphere of influence and is now in the moon's sphere of influence (SOI). What does that mean exactly? Wrong explanation #1: One way you'll hear it described is that the moon's sphere of […] Sphere of influence first appeared on John D. Cook.
    Lagrange’s quintic and Descartes’ rule
    Do fifth degree polynomial equations come up in applications? Yes, and this post will give an example. In general the three-body problem, describing the motion of three objects interacting under gravity, does not have a closed-form solution. However, Euler and Lagrange discovered a few special cases that do have closed-form solutions. We will look at […] Lagrange’s quintic and Descartes’ rule first appeared on John D. Cook.  ( 5 min )
  • Open

    AP: Selective Activation for De-sparsifying Pruned Neural Networks. (arXiv:2212.06145v1 [cs.LG])
    The rectified linear unit (ReLU) is a highly successful activation function in neural networks as it allows networks to easily obtain sparse representations, which reduces overfitting in overparameterized networks. However, in network pruning, we find that the sparsity introduced by ReLU, which we quantify by a term called dynamic dead neuron rate (DNR), is not beneficial for the pruned network. Interestingly, the more the network is pruned, the smaller the dynamic DNR becomes during optimization. This motivates us to propose a method to explicitly reduce the dynamic DNR for the pruned network, i.e., de-sparsify the network. We refer to our method as Activating-while-Pruning (AP). We note that AP does not function as a stand-alone method, as it does not evaluate the importance of weights. Instead, it works in tandem with existing pruning methods and aims to improve their performance by selective activation of nodes to reduce the dynamic DNR. We conduct extensive experiments using popular networks (e.g., ResNet, VGG) via two classical and three state-of-the-art pruning methods. The experimental results on public datasets (e.g., CIFAR-10/100) suggest that AP works well with existing pruning methods and improves the performance by 3% - 4%. For larger scale datasets (e.g., ImageNet) and state-of-the-art networks (e.g., vision transformer), we observe an improvement of 2% - 3% with AP as opposed to without. Lastly, we conduct an ablation study to examine the effectiveness of the components comprising AP.
    Multivariate Powered Dirichlet Hawkes Process. (arXiv:2212.05995v2 [cs.LG] UPDATED)
    The publication time of a document carries relevant information about its semantic content. The Dirichlet-Hawkes process has been proposed to jointly model textual information and publication dynamics. This approach has been used with success in several recent works, and extended to tackle specific challenging problems -- typically for short texts or entangled publication dynamics. However, the prior in its current form does not allow for complex publication dynamics. In particular, inferred topics are independent from each other -- a publication about finance is assumed to have no influence on publications about politics, for instance. In this work, we develop the Multivariate Powered Dirichlet-Hawkes Process (MPDHP), which alleviates this assumption. Publications about various topics can now influence each other. We detail and overcome the technical challenges that arise from considering interacting topics. We conduct a systematic evaluation of MPDHP on a range of synthetic datasets to define its application domain and limitations. Finally, we develop a use case of the MPDHP on Reddit data. At the end of this article, the interested reader will know how and when to use MPDHP, and when not to.
    Multi-Agent Path Finding via Tree LSTM. (arXiv:2210.12933v2 [cs.AI] UPDATED)
    In recent years, Multi-Agent Path Finding (MAPF) has attracted attention from the fields of both Operations Research (OR) and Reinforcement Learning (RL). However, in the 2021 Flatland3 Challenge, a competition on MAPF, the best RL method scored only 27.9, far less than the best OR method. This paper proposes a new RL solution to Flatland3 Challenge, which scores 125.3, several times higher than the best RL solution before. We creatively apply a novel network architecture, TreeLSTM, to MAPF in our solution. Together with several other RL techniques, including reward shaping, multiple-phase training, and centralized control, our solution is comparable to the top 2-3 OR methods.
    Regression modelling of spatiotemporal extreme U.S. wildfires via partially-interpretable neural networks. (arXiv:2208.07581v3 [stat.ML] UPDATED)
    Risk management in many environmental settings requires an understanding of the mechanisms that drive extreme events. Useful metrics for quantifying such risk are extreme quantiles of response variables conditioned on predictor variables that describe, e.g., climate, biosphere and environmental states. Typically these quantiles lie outside the range of observable data and so, for estimation, require specification of parametric extreme value models within a regression framework. Classical approaches in this context utilise linear or additive relationships between predictor and response variables and suffer in either their predictive capabilities or computational efficiency; moreover, their simplicity is unlikely to capture the truly complex structures that lead to the creation of extreme wildfires. In this paper, we propose a new methodological framework for performing extreme quantile regression using artificial neural networks, which are able to capture complex non-linear relationships and scale well to high-dimensional data. The "black box" nature of neural networks means that they lack the desirable trait of interpretability often favoured by practitioners; thus, we unify linear, and additive, regression methodology with deep learning to create partially-interpretable neural networks that can be used for statistical inference but retain high prediction accuracy. To complement this methodology, we further propose a novel point process model for extreme values which overcomes the finite lower-endpoint problem associated with the generalised extreme value class of distributions. Efficacy of our unified framework is illustrated on U.S. wildfire data with a high-dimensional predictor set and we illustrate vast improvements in predictive performance over linear and spline-based regression techniques.
    Achieving and Understanding Out-of-Distribution Generalization in Systematic Reasoning in Small-Scale Transformers. (arXiv:2210.03275v2 [cs.LG] UPDATED)
    Out-of-distribution generalization (OODG) is a longstanding challenge for neural networks. This challenge is quite apparent in tasks with well-defined variables and rules, where explicit use of the rules could solve problems independently of the particular values of the variables, but networks tend to be tied to the range of values sampled in their training data. Large transformer-based language models have pushed the boundaries on how well neural networks can solve previously unseen problems, but their complexity and lack of clarity about the relevant content in their training data obfuscates how they achieve such robustness. As a step toward understanding how transformer-based systems generalize, we explore the question of OODG in small scale transformers trained with examples from a known distribution. Using a reasoning task based on the puzzle Sudoku, we show that OODG can occur on a complex problem if the training set includes examples sampled from the whole distribution of simpler component tasks. Successful generalization depends on carefully managing positional alignment when absolute position encoding is used, but we find that suppressing sensitivity to absolute positions overcomes this limitation. Taken together our results represent a small step toward understanding and promoting systematic generalization in transformers.
    Barlow Graph Auto-Encoder for Unsupervised Network Embedding. (arXiv:2110.15742v3 [cs.LG] UPDATED)
    Network embedding has emerged as a promising research field for network analysis. Recently, an approach named Barlow Twins has been proposed for self-supervised learning in computer vision by applying the redundancy-reduction principle to the embedding vectors corresponding to two distorted versions of the image samples. Motivated by this, we propose the Barlow Graph Auto-Encoder, a simple yet effective architecture for learning network embedding. It aims to maximize the similarity between the embedding vectors of immediate and larger neighborhoods of a node, while minimizing the redundancy between the components of these projections. In addition, we also present its variational counterpart, named the Barlow Variational Graph Auto-Encoder. Our approach yields promising results for inductive link prediction and is also on par with the state of the art for clustering and downstream node classification, as demonstrated by extensive comparisons with several well-known techniques on three benchmark citation datasets.
    Multi-armed Bandit Learning on a Graph. (arXiv:2209.09419v2 [cs.LG] UPDATED)
    The multi-armed bandit (MAB) problem is a simple yet powerful framework that has been extensively studied in the context of decision-making under uncertainty. In many real-world applications, such as robotic applications, selecting an arm corresponds to a physical action that constrains the choices of the next available arms (actions). Motivated by this, we study an extension of MAB called the graph bandit, where an agent travels over a graph to maximize the reward collected from different nodes. The graph defines the agent's freedom in selecting the next available nodes at each step. We assume the graph structure is fully available, but the reward distributions are unknown. Built upon an offline graph-based planning algorithm and the principle of optimism, we design a learning algorithm, \texttt{G-UCB}, that balances long-term exploration and exploitation. We show that our proposed algorithm achieves $O(\sqrt{|S|T\log(T)}+D|S|\log T)$ learning regret, where $|S|$ is the number of nodes and $D$ is the diameter of the graph, which matches the theoretical lower bound $\Omega(\sqrt{|S|T})$ up to logarithmic factors. To our knowledge, this result is among the first tight regret bounds in non-episodic, un-discounted learning problems with known deterministic transitions. Numerical experiments confirm that our algorithm outperforms several benchmarks.
    A Framework for Benchmarking Clustering Algorithms. (arXiv:2209.09493v2 [cs.LG] UPDATED)
    The evaluation of clustering algorithms can involve running them on a variety of benchmark problems, and comparing their outputs to the reference, ground-truth groupings provided by experts. Unfortunately, many research papers and graduate theses consider only a small number of datasets. Also, the fact that there can be many equally valid ways to cluster a given problem set is rarely taken into account. In order to overcome these limitations, we have developed a framework whose aim is to introduce a consistent methodology for testing clustering algorithms. Furthermore, we have aggregated, polished, and standardised many clustering benchmark dataset collections referred to across the machine learning and data mining literature, and included new datasets of different dimensionalities, sizes, and cluster types. An interactive datasets explorer, the documentation of the Python API, a description of the ways to interact with the framework from other programming languages such as R or MATLAB, and other details are all provided at .
    Textless Speech Emotion Conversion using Discrete and Decomposed Representations. (arXiv:2111.07402v3 [cs.CL] UPDATED)
    Speech emotion conversion is the task of modifying the perceived emotion of a speech utterance while preserving the lexical content and speaker identity. In this study, we cast the problem of emotion conversion as a spoken language translation task. We use a decomposition of the speech signal into discrete learned representations, consisting of phonetic-content units, prosodic features, speaker, and emotion. First, we modify the speech content by translating the phonetic-content units to a target emotion, and then predict the prosodic features based on these units. Finally, the speech waveform is generated by feeding the predicted representations into a neural vocoder. Such a paradigm allows us to go beyond spectral and parametric changes of the signal, and model non-verbal vocalizations, such as laughter insertion, yawning removal, etc. We demonstrate objectively and subjectively that the proposed method is vastly superior to current approaches and even beats text-based systems in terms of perceived emotion and audio quality. We rigorously evaluate all components of such a complex system and conclude with an extensive model analysis and ablation study to better emphasize the architectural choices, strengths and weaknesses of the proposed method. Samples are available under the following link: https://speechbot.github.io/emotion.
    Localized Latent Updates for Fine-Tuning Vision-Language Models. (arXiv:2212.06556v1 [cs.CV])
    Although massive pre-trained vision-language models like CLIP show impressive generalization capabilities for many tasks, it often remains necessary to fine-tune them for improved performance on specific datasets. When doing so, it is desirable that updating the model is fast and that the model does not lose its capabilities on data outside of the dataset, as is often the case with classical fine-tuning approaches. In this work we suggest a lightweight adapter that only updates the model's predictions close to seen datapoints. We demonstrate the effectiveness and speed of this relatively simple approach in the context of few-shot learning, where our results on classes both seen and unseen during training are comparable with or improve on the state of the art.
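    A minimal sketch of the stated idea, corrections that fade to zero away from seen datapoints, might look as follows; the Gaussian kernel, bandwidth, and class names are assumptions for illustration, not the paper's actual adapter.

        import numpy as np

        class LocalLatentAdapter:
            """Adds a kernel-weighted correction to frozen base-model logits."""

            def __init__(self, bandwidth: float = 0.2):
                self.bandwidth = bandwidth
                self.keys = np.empty((0, 0))     # features of seen datapoints
                self.deltas = np.empty((0, 0))   # logit corrections learned for them

            def fit(self, feats: np.ndarray, residuals: np.ndarray):
                self.keys, self.deltas = feats, residuals

            def __call__(self, feat: np.ndarray, base_logits: np.ndarray) -> np.ndarray:
                if self.keys.size == 0:
                    return base_logits
                d2 = ((self.keys - feat) ** 2).sum(axis=1)
                w = np.exp(-d2 / (2 * self.bandwidth ** 2))   # ~0 far from seen data
                # the +1.0 makes the correction vanish smoothly far from the data,
                # leaving the pre-trained model's zero-shot behavior untouched
                return base_logits + (w @ self.deltas) / (w.sum() + 1.0)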
    High-Frequency Space Diffusion Models for Accelerated MRI. (arXiv:2208.05481v3 [eess.IV] UPDATED)
    Denoising diffusion probabilistic models (DDPMs) have been shown to achieve superior performance in MRI reconstruction. From the perspective of continuous stochastic differential equations (SDEs), the reverse process of DDPM can be seen as maximizing the energy of the reconstructed MR image, leading to SDE sequence divergence. For this reason, a modified high-frequency DDPM model is proposed for MRI reconstruction. From its continuous SDE viewpoint, termed high-frequency space SDE (HFS-SDE), the energy-concentrated low-frequency part of the MR image is no longer amplified, and the diffusion process focuses more on acquiring high-frequency prior information. This not only improves the stability of the diffusion model but also offers the possibility of better recovery of high-frequency details. Experiments on the public fastMRI dataset show that our proposed HFS-SDE outperforms the DDPM-driven VP-SDE, supervised deep learning methods, and traditional parallel imaging methods in terms of stability and reconstruction accuracy.
    Behavioral Experiments for Understanding Catastrophic Forgetting. (arXiv:2110.10570v3 [cs.LG] UPDATED)
    In this paper we explore whether the fundamental tool of experimental psychology, the behavioral experiment, has the power to generate insight not only into humans and animals, but into artificial systems too. We apply the techniques of experimental psychology to investigating catastrophic forgetting in neural networks. We present a series of controlled experiments with two-layer ReLU networks, and exploratory results revealing a new understanding of the behavior of catastrophic forgetting. Alongside our empirical findings, we demonstrate an alternative, behavior-first approach to investigating neural network phenomena.
    Unsupervised visualization of image datasets using contrastive learning. (arXiv:2210.09879v2 [cs.LG] UPDATED)
    Visualization methods based on the nearest neighbor graph, such as t-SNE or UMAP, are widely used for visualizing high-dimensional data. Yet, these approaches only produce meaningful results if the nearest neighbors themselves are meaningful. For images represented in pixel space this is not the case, as distances in pixel space often do not capture our sense of similarity, and therefore neighbors are not semantically close. This problem can be circumvented by self-supervised approaches based on contrastive learning, such as SimCLR, which rely on data augmentation to generate implicit neighbors, but these methods do not produce two-dimensional embeddings suitable for visualization. Here, we present a new method, called t-SimCNE, for unsupervised visualization of image data. t-SimCNE combines ideas from contrastive learning and neighbor embeddings, and trains a parametric mapping from the high-dimensional pixel space into two dimensions. We show that the resulting 2D embeddings achieve classification accuracy comparable to the state-of-the-art high-dimensional SimCLR representations, thus faithfully capturing semantic relationships. Using t-SimCNE, we obtain informative visualizations of the CIFAR-10 and CIFAR-100 datasets, showing rich cluster structure and highlighting artifacts and outliers.
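    One plausible way to realize a contrastive loss directly on 2D outputs, in the spirit of t-SimCNE, is InfoNCE with a heavy-tailed Cauchy similarity (as in t-SNE) instead of cosine similarity. The exact loss, batch layout, and constants here are assumptions, not necessarily the paper's.

        import torch

        def cauchy_infonce(z: torch.Tensor) -> torch.Tensor:
            # z: (2N, 2) two-dimensional embeddings; rows i and i+N are assumed
            # to be two augmentations of the same image.
            n = z.shape[0] // 2
            sim = 1.0 / (1.0 + torch.cdist(z, z).pow(2))     # Cauchy kernel, as in t-SNE
            sim.fill_diagonal_(0.0)                          # exclude self-similarity
            pos = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
            log_p = torch.log(sim[torch.arange(2 * n), pos]) - torch.log(sim.sum(1))
            return -log_p.mean()                             # average InfoNCE loss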
    Coarse-to-Fine Contrastive Learning on Graphs. (arXiv:2212.06423v1 [cs.LG])
    Inspired by the impressive success of contrastive learning (CL), a variety of graph augmentation strategies have been employed to learn node representations in a self-supervised manner. Existing methods construct the contrastive samples by adding perturbations to the graph structure or node attributes. Although impressive results are achieved, such methods are rather blind to a wealth of prior information: as the degree of perturbation applied to the original graph increases, 1) the similarity between the original graph and the generated augmented graph gradually decreases; and 2) the discrimination between all nodes within each augmented view gradually increases. In this paper, we argue that both kinds of prior information can be incorporated (differently) into the contrastive learning paradigm following our general ranking framework. In particular, we first interpret CL as a special case of learning to rank (L2R), which inspires us to leverage the ranking order among positive augmented views. Meanwhile, we introduce a self-ranking paradigm to ensure that the discriminative information among different nodes is maintained and is less altered by perturbations of different degrees. Experimental results on various benchmark datasets verify the effectiveness of our algorithm compared with supervised and unsupervised models.
    Reinforced Approximate Exploratory Data Analysis. (arXiv:2212.06225v1 [cs.LG])
    Exploratory data analysis (EDA) is a sequential decision-making process in which analysts choose subsequent queries that might lead to some interesting insights based on the previous queries and corresponding results. Data processing systems often execute the queries on samples to produce results with low latency. Different downsampling strategies preserve different statistics of the data and yield different magnitudes of latency reduction. The optimum choice of sampling strategy often depends on the particular context of the analysis flow and the hidden intent of the analyst. In this paper, we are the first to consider the impact of sampling in interactive data exploration settings, where sampling introduces approximation errors. We propose a Deep Reinforcement Learning (DRL) based framework which can optimize the sample selection in order to keep the analysis and insight generation flow intact. Evaluations with 3 real datasets show that our technique can preserve the original insight generation flow while improving the interaction latency, compared to baseline methods.
    The Importance of Image Interpretation: Patterns of Semantic Misclassification in Real-World Adversarial Images. (arXiv:2206.01467v2 [cs.CV] UPDATED)
    Adversarial images are created with the intention of causing an image classifier to produce a misclassification. In this paper, we propose that adversarial images should be evaluated based on semantic mismatch, rather than label mismatch, as used in current work. In other words, we propose that an image of a "mug" would be considered adversarial if classified as "turnip", but not as "cup", as current systems would assume. Our novel idea of taking semantic misclassification into account in the evaluation of adversarial images offers two benefits. First, it is a more realistic conceptualization of what makes an image adversarial, which is important in order to fully understand the implications of adversarial images for security and privacy. Second, it makes it possible to evaluate the transferability of adversarial images to a real-world classifier, without requiring the classifier's label set to have been available during the creation of the images. The paper carries out an evaluation of a transfer attack on a real-world image classifier that is made possible by our semantic misclassification approach. The attack reveals patterns in the semantics of adversarial misclassifications that could not be investigated using conventional label mismatch.
    MAntRA: A framework for model agnostic reliability analysis. (arXiv:2212.06303v1 [stat.ME])
    We propose a novel model-agnostic data-driven framework for time-dependent reliability analysis. The proposed approach -- referred to as MAntRA -- combines interpretable machine learning, Bayesian statistics, and stochastic dynamic equation identification to evaluate the reliability of stochastically excited dynamical systems for which the governing physics is \textit{a priori} unknown. A two-stage approach is adopted: in the first stage, an efficient variational Bayesian equation discovery algorithm is developed to determine the governing physics of an underlying stochastic differential equation (SDE) from measured output data. The developed algorithm is efficient and accounts for epistemic uncertainty due to limited and noisy data, and for aleatoric uncertainty due to environmental effects and external excitation. In the second stage, the discovered SDE is solved using a stochastic integration scheme and the probability of failure is computed. The efficacy of the proposed approach is illustrated on three numerical examples. The results obtained indicate the possible application of the proposed approach to reliability analysis of in-situ and heritage structures from on-site measurements.
    NeuGuard: Lightweight Neuron-Guided Defense against Membership Inference Attacks. (arXiv:2206.05565v2 [cs.CR] UPDATED)
    Membership inference attacks (MIAs) against machine learning models can lead to serious privacy risks for the training dataset used in model training. In this paper, we propose a novel and effective neuron-guided defense method named NeuGuard against MIAs. We identify a key weakness in existing defense mechanisms: they cannot simultaneously defend against two commonly used neural-network-based MIAs, indicating that these two attacks should be evaluated separately to assure defense effectiveness. We propose NeuGuard, a new defense approach that jointly controls the output and inner neurons' activation, with the objective of guiding the model outputs on the training set and the testing set to have close distributions. NeuGuard consists of class-wise variance minimization, which restricts the final output neurons, and layer-wise balanced output control, which constrains the inner neurons in each layer. We evaluate NeuGuard and compare it with state-of-the-art defenses against two neural-network-based MIAs and five of the strongest metric-based MIAs, including the newly proposed label-only MIA, on three benchmark datasets. Results show that NeuGuard outperforms the state-of-the-art defenses by offering a much improved utility-privacy trade-off, generality, and lower overhead.
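    As a rough illustration of the class-wise variance minimization component (the exact NeuGuard losses may differ; this sketch only conveys the idea), one can penalize the within-class variance of the output vectors so that training and test outputs concentrate to similar per-class distributions:

        import torch

        def classwise_variance_penalty(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
            # logits: (B, C) model outputs; labels: (B,) ground-truth classes
            penalty = logits.new_zeros(())
            for c in labels.unique():
                group = logits[labels == c]
                if group.shape[0] > 1:
                    # sum of per-dimension variances of the outputs within class c
                    penalty = penalty + group.var(dim=0, unbiased=False).sum()
            return penalty  # added to the task loss with a tunable weight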
    Which Explanation Should I Choose? A Function Approximation Perspective to Characterizing Post Hoc Explanations. (arXiv:2206.01254v2 [cs.LG] UPDATED)
    A critical problem in post hoc explainability is the lack of a common foundational goal among methods. For example, some methods are motivated by function approximation, some by game theoretic notions, and some by obtaining clean visualizations. This fragmentation of goals causes not only an inconsistent conceptual understanding of explanations but also the practical challenge of not knowing which method to use when. In this work, we begin to address these challenges by unifying eight popular post hoc explanation methods (LIME, C-LIME, SHAP, Occlusion, Vanilla Gradients, Gradients x Input, SmoothGrad, and Integrated Gradients). We show that these methods all perform local function approximation of the black-box model, differing only in the neighbourhood and loss function used to perform the approximation. This unification enables us to (1) state a no free lunch theorem for explanation methods which demonstrates that no single method can perform optimally across all neighbourhoods, and (2) provide a guiding principle to choose among methods based on faithfulness to the black-box model. We empirically validate these theoretical results using various real-world datasets, model classes, and prediction tasks. By bringing diverse explanation methods into a common framework, this work (1) advances the conceptual understanding of these methods, revealing their shared local function approximation objective, properties, and relation to one another, and (2) guides the use of these methods in practice, providing a principled approach to choose among methods and paving the way for the creation of new ones.
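    The unifying view, that every method performs local function approximation and differs only in the neighbourhood and loss, can be made concrete with a LIME-flavoured sketch (the sampling scheme, kernel, and names below are illustrative assumptions):

        import numpy as np

        def local_linear_explanation(f, x, n_samples=500, sigma=0.5, seed=0):
            # f: black-box model mapping (n, d) arrays to (n,) outputs; x: (d,) input
            rng = np.random.default_rng(seed)
            X = x + sigma * rng.standard_normal((n_samples, x.shape[0]))  # neighbourhood
            y = f(X)                                                      # black-box queries
            w = np.exp(-((X - x) ** 2).sum(1) / (2 * sigma ** 2))         # proximity weights
            Xb = np.hstack([X, np.ones((n_samples, 1))])                  # add intercept
            sw = np.sqrt(w)
            # weighted least squares: the choice of weights and loss is exactly
            # what distinguishes one explanation method from another
            coef, *_ = np.linalg.lstsq(sw[:, None] * Xb, sw * y, rcond=None)
            return coef[:-1]                                              # feature attributions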
    ALRt: An Active Learning Framework for Irregularly Sampled Temporal Data. (arXiv:2212.06364v1 [cs.LG])
    Sepsis is a deadly condition affecting many patients in the hospital. Recent studies have shown that patients diagnosed with sepsis have significant mortality and morbidity, resulting from the body's dysfunctional host response to infection. Clinicians often rely on the use of Sequential Organ Failure Assessment (SOFA), Systemic Inflammatory Response Syndrome (SIRS), and the Modified Early Warning Score (MEWS) to identify early signs of clinical deterioration requiring further work-up and treatment. However, many of these tools are manually computed and were not designed for automated computation. There have been different methods used for developing sepsis onset models, but many of these models must be trained on a sufficient number of patient observations in order to form accurate sepsis predictions. Additionally, the accurate annotation of patients with sepsis is a major ongoing challenge. In this paper, we propose the use of Active Learning Recurrent Neural Networks (ALRts) for short temporal horizons to improve the prediction of irregularly sampled temporal events such as sepsis. We show that an active learning RNN model trained on limited data can form robust sepsis predictions comparable to models using the entire training dataset.
    Securing the Spike: On the Transferability and Security of Spiking Neural Networks to Adversarial Examples. (arXiv:2209.03358v2 [cs.NE] UPDATED)
    Spiking neural networks (SNNs) have attracted much attention for their high energy efficiency and for recent advances in their classification performance. However, unlike traditional deep learning approaches, the analysis and study of the robustness of SNNs to adversarial examples remain relatively underdeveloped. In this work we focus on advancing the adversarial attack side of SNNs and make three major contributions. First, we show that successful white-box adversarial attacks on SNNs are highly dependent on the underlying surrogate gradient technique. Second, using the best surrogate gradient technique, we analyze the transferability of adversarial attacks on SNNs and other state-of-the-art architectures like Vision Transformers (ViTs) and Big Transfer Convolutional Neural Networks (CNNs). We demonstrate that SNNs are not often deceived by adversarial examples generated by Vision Transformers and certain types of CNNs. Third, due to the lack of a ubiquitous white-box attack that is effective across both the SNN and CNN/ViT domains, we develop a new white-box attack, the Auto Self-Attention Gradient Attack (Auto SAGA). Our novel attack generates adversarial examples capable of fooling both SNN models and non-SNN models simultaneously. Auto SAGA is as much as $87.9\%$ more effective on SNN/ViT model ensembles than conventional white-box attacks like PGD. Our experiments and analyses are broad and rigorous, covering three datasets (CIFAR-10, CIFAR-100 and ImageNet), five different white-box attacks, and nineteen different classifier models (seven for each CIFAR dataset and five different models for ImageNet).
    Matrix Profile XXVII: A Novel Distance Measure for Comparing Long Time Series. (arXiv:2212.06146v1 [cs.LG])
    The most useful data mining primitives are distance measures. With an effective distance measure, it is possible to perform classification, clustering, anomaly detection, segmentation, etc. For single-event time series, Euclidean Distance and Dynamic Time Warping distance are known to be extremely effective. However, for time series containing cyclical behaviors, the semantic meaningfulness of such comparisons is less clear. For example, on two separate days the telemetry from an athlete's workout routine might be very similar. The second day may change the order of performing push-ups and squats, add repetitions of pull-ups, or completely omit dumbbell curls. Any of these minor changes would defeat existing time series distance measures. Some bag-of-features methods have been proposed to address this problem, but we argue that in many cases, similarity is intimately tied to the shapes of subsequences within these longer time series. In such cases, summative features will lack discrimination ability. In this work we introduce PRCIS, which stands for Pattern Representation Comparison in Series. PRCIS is a distance measure for long time series, which exploits recent progress in our ability to summarize time series with dictionaries. We will demonstrate the utility of our ideas on diverse tasks and datasets.
    Learning Dynamical Systems via Koopman Operator Regression in Reproducing Kernel Hilbert Spaces. (arXiv:2205.14027v4 [cs.LG] UPDATED)
    We study a class of dynamical systems modelled as Markov chains that admit an invariant distribution via the corresponding transfer, or Koopman, operator. While data-driven algorithms to reconstruct such operators are well known, their relationship with statistical learning is largely unexplored. We formalize a framework to learn the Koopman operator from finite data trajectories of the dynamical system. We consider the restriction of this operator to a reproducing kernel Hilbert space and introduce a notion of risk, from which different estimators naturally arise. We link the risk with the estimation of the spectral decomposition of the Koopman operator. These observations motivate a reduced-rank operator regression (RRR) estimator. We derive learning bounds for the proposed estimator, holding both in i.i.d. and non-i.i.d. settings, the latter in terms of mixing coefficients. Our results suggest RRR might be beneficial over other widely used estimators, as confirmed in numerical experiments both for forecasting and mode decomposition.
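    A finite-dimensional sketch of the reduced-rank idea (the paper works in an RKHS with regularization; this numpy version, with assumed names, only conveys the skeleton): fit the least-squares transition operator between feature snapshots, then project it onto the top principal directions of the fitted values, which is the classical reduced-rank regression solution.

        import numpy as np

        def reduced_rank_koopman(X, Y, rank, reg=1e-6):
            # X, Y: (n, d) feature snapshots at times t and t+1
            d = X.shape[1]
            B_ols = np.linalg.solve(X.T @ X + reg * np.eye(d), X.T @ Y)  # ridge operator
            fitted = X @ B_ols
            _, _, Vt = np.linalg.svd(fitted, full_matrices=False)
            P = Vt[:rank].T @ Vt[:rank]          # projector onto top-`rank` directions
            return B_ols @ P                     # reduced-rank estimate of the operator

        # Spectral analysis (Koopman eigenvalues/modes) then proceeds from the
        # eigendecomposition of the returned matrix.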
    A Sandbox Tool to Bias(Stress)-Test Fairness Algorithms. (arXiv:2204.10233v2 [cs.LG] UPDATED)
    Motivated by the growing importance of reducing unfairness in ML predictions, Fair-ML researchers have presented an extensive suite of algorithmic 'fairness-enhancing' remedies. Most existing algorithms, however, are agnostic to the sources of the observed unfairness. As a result, the literature currently lacks guiding frameworks to specify conditions under which each algorithmic intervention can potentially alleviate the underpinning cause of unfairness. To close this gap, we scrutinize the underlying biases (e.g., in the training data or design choices) that cause observational unfairness. We present the conceptual idea and a first implementation of a bias-injection sandbox tool to investigate fairness consequences of various biases and assess the effectiveness of algorithmic remedies in the presence of specific types of bias. We call this process the bias(stress)-testing of algorithmic interventions. Unlike existing toolkits, ours provides a controlled environment to counterfactually inject biases in the ML pipeline. This stylized setup offers the distinct capability of testing fairness interventions beyond observational data and against an unbiased benchmark. In particular, we can test whether a given remedy can alleviate the injected bias by comparing the predictions resulting after the intervention in the biased setting with true labels in the unbiased regime, that is, before any bias injection. We illustrate the utility of our toolkit via a proof-of-concept case study on synthetic data. Our empirical analysis showcases the type of insights that can be obtained through our simulations.
    Multi-instrument Music Synthesis with Spectrogram Diffusion. (arXiv:2206.05408v3 [cs.SD] UPDATED)
    An ideal music synthesizer should be both interactive and expressive, generating high-fidelity audio in realtime for arbitrary combinations of instruments and notes. Recent neural synthesizers have exhibited a tradeoff between domain-specific models that offer detailed control of only specific instruments, or raw waveform models that can train on any music but with minimal control and slow generation. In this work, we focus on a middle ground of neural synthesizers that can generate audio from MIDI sequences with arbitrary combinations of instruments in realtime. This enables training on a wide range of transcription datasets with a single model, which in turn offers note-level control of composition and instrumentation across a wide range of instruments. We use a simple two-stage process: MIDI to spectrograms with an encoder-decoder Transformer, then spectrograms to audio with a generative adversarial network (GAN) spectrogram inverter. We compare training the decoder as an autoregressive model and as a Denoising Diffusion Probabilistic Model (DDPM) and find that the DDPM approach is superior both qualitatively and as measured by audio reconstruction and Fr\'echet distance metrics. Given the interactivity and generality of this approach, we find this to be a promising first step towards interactive and expressive neural synthesis for arbitrary combinations of instruments and notes.
    Improving Diversity with Adversarially Learned Transformations for Domain Generalization. (arXiv:2206.07736v2 [cs.LG] UPDATED)
    To be successful in single-source domain generalization, maximizing diversity of synthesized domains has emerged as one of the most effective strategies. Many of the recent successes have come from methods that pre-specify the types of diversity that a model is exposed to during training, so that it can ultimately generalize well to new domains. However, na\"ive diversity-based augmentations do not work effectively for domain generalization, either because they cannot model large domain shift, or because the span of transforms that are pre-specified does not cover the types of shift commonly occurring in domain generalization. To address this issue, we present a novel framework that uses adversarially learned transformations (ALT) using a neural network to model plausible, yet hard, image transformations that fool the classifier. This network is randomly initialized for each batch and trained for a fixed number of steps to maximize classification error. Further, we enforce consistency between the classifier's predictions on the clean and transformed images. With extensive empirical analysis, we find that this new form of adversarial transformation achieves both objectives of diversity and hardness simultaneously, outperforming all existing techniques on competitive benchmarks for single-source domain generalization. We also show that ALT can naturally work with existing diversity modules to produce highly distinct, and large, transformations of the source domain, leading to state-of-the-art performance.
    Formal limitations of sample-wise information-theoretic generalization bounds. (arXiv:2205.06915v2 [cs.LG] UPDATED)
    Some of the tightest information-theoretic generalization bounds depend on the average information between the learned hypothesis and a single training example. However, these sample-wise bounds were derived only for the expected generalization gap. We show that even for the expected squared generalization gap, no such sample-wise information-theoretic bounds exist. The same is true for PAC-Bayes and single-draw bounds. Remarkably, PAC-Bayes, single-draw, and expected squared generalization gap bounds that depend on information in pairs of examples do exist.
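    For context, a representative sample-wise bound of the kind at issue, stated here for $\sigma$-subgaussian losses with notation assumed rather than quoted from the paper, controls the expected generalization gap of a hypothesis $W$ learned from $n$ examples $Z_1,\dots,Z_n$ as

        \[
          \mathbb{E}\bigl[\operatorname{gen}(W)\bigr]
          \;\le\; \frac{1}{n}\sum_{i=1}^{n} \sqrt{2\sigma^{2}\, I(W; Z_i)},
        \]

    and the paper's negative result says that no bound of this per-example form can control the expected squared generalization gap.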
    The Unreasonable Effectiveness of Deep Evidential Regression. (arXiv:2205.10060v2 [cs.LG] UPDATED)
    There is a significant need for principled uncertainty reasoning in machine learning systems as they are increasingly deployed in safety-critical domains. A new approach with uncertainty-aware regression-based neural networks (NNs), based on learning evidential distributions for aleatoric and epistemic uncertainties, shows promise over traditional deterministic methods and typical Bayesian NNs, notably with the capabilities to disentangle aleatoric and epistemic uncertainties. Despite some empirical success of Deep Evidential Regression (DER), there are important gaps in the mathematical foundation that raise the question of why the proposed technique seemingly works. We detail the theoretical shortcomings and analyze the performance on synthetic and real-world data sets, showing that Deep Evidential Regression is a heuristic rather than an exact uncertainty quantification. We go on to propose corrections and redefinitions of how aleatoric and epistemic uncertainties should be extracted from NNs.
    Rethinking the Unpretentious U-net for Medical Ultrasound Image Segmentation. (arXiv:2209.07193v4 [eess.IV] UPDATED)
    Breast tumor segmentation is one of the key steps that help us characterize and localize tumor regions. However, variable tumor morphology, blurred boundaries, and similar intensity distributions bring challenges for accurate segmentation of breast tumors. Recently, many U-net variants have been proposed and widely used for breast tumor segmentation. However, these architectures suffer from two limitations: (1) ignoring the characterization ability of the benchmark networks, and (2) introducing extra complex operations that increase the difficulty of understanding and reproducing the network. To alleviate these challenges, this paper proposes a simple yet powerful nested U-net (NU-net) for accurate segmentation of breast tumors. The key idea is to utilize U-Nets with different depths and shared weights to achieve robust characterization of breast tumors. NU-net mainly has the following advantages: (1) it improves network adaptability and robustness to breast tumors of different scales, (2) it is easy to reproduce and execute, and (3) the extra operations increase network parameters without significantly increasing computational cost. Extensive experimental results with twelve state-of-the-art segmentation methods on three public breast ultrasound datasets demonstrate that NU-net has more competitive segmentation performance on breast tumors. Furthermore, the robustness of NU-net is further illustrated on the segmentation of renal ultrasound images. The source code is publicly available on https://github.com/CGPzy/NU-net.
    Sampling Through the Lens of Sequential Decision Making. (arXiv:2208.08056v3 [cs.LG] UPDATED)
    Sampling is ubiquitous in machine learning methodologies. Due to the growth of large datasets and model complexity, we want to learn and adapt the sampling process while training a representation. Towards achieving this grand goal, a variety of sampling techniques have been proposed. However, most of them either use a fixed sampling scheme or adjust the sampling scheme based on simple heuristics. They cannot choose the best sample for model training in different stages. Inspired by "Thinking, Fast and Slow" (System 1 and System 2) in cognitive science, we propose a reward-guided sampling strategy called Adaptive Sample with Reward (ASR) to tackle this challenge. To the best of our knowledge, this is the first work utilizing reinforcement learning (RL) to address the sampling problem in representation learning. Our approach optimally adjusts the sampling process to achieve optimal performance. We explore geographical relationships among samples by distance-based sampling to maximize overall cumulative reward. We apply ASR to the long-standing sampling problems in similarity-based loss functions. Empirical results in information retrieval and clustering demonstrate ASR's superb performance across different datasets. We also discuss an engrossing phenomenon which we name the "ASR gravity well" in experiments.
    Earthquake Impact Analysis Based on Text Mining and Social Media Analytics. (arXiv:2212.06765v1 [cs.CL])
    Earthquakes have a deep impact on wide areas, and emergency rescue operations may benefit from social media information about the scope and extent of the disaster. Therefore, this work presents a text mining-based approach to collect and analyze social media data for early earthquake impact analysis. First, disaster-related microblogs are collected from the Sina microblog platform using crawler technology. Then, after data cleaning, a series of analyses is conducted, including (1) hot words analysis, (2) the trend of the number of microblogs, (3) the trend of public opinion sentiment, and (4) a keyword- and rule-based text classification for earthquake impact analysis (a toy illustration of this last step is sketched below). Finally, two recent earthquakes with the same magnitude and focal depth in China are analyzed to compare their impacts. The results show that the trends in microblog volume and public opinion sentiment can help estimate an earthquake's social impact at an early stage, which is helpful for decision-making and rescue management.
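    A toy illustration of keyword- and rule-based classification of posts; the labels and keyword lists are invented placeholders, not the paper's rules:

        RULES = {
            "casualty": ["injured", "trapped", "death"],
            "infrastructure": ["collapsed", "road blocked", "power outage"],
            "request_for_help": ["help", "rescue", "urgent"],
        }

        def classify(post: str) -> list[str]:
            """Return every impact category whose keywords occur in the post."""
            text = post.lower()
            return [label for label, kws in RULES.items() if any(k in text for k in kws)]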
    Learning Representations for New Sound Classes With Continual Self-Supervised Learning. (arXiv:2205.07390v2 [eess.AS] UPDATED)
    In this paper, we work on a sound recognition system that continually incorporates new sound classes. Our main goal is to develop a framework where the model can be updated without relying on labeled data. For this purpose, we propose adopting representation learning, where an encoder is trained using unlabeled data. This learning framework enables the study and implementation of a practically relevant use case where only a small number of labels is available in a continual learning context. We also make the empirical observation that a similarity-based representation learning method within this framework is robust to forgetting, even if no explicit mechanism against forgetting is employed. We show that this approach obtains performance similar to several distillation-based continual learning methods when employed on self-supervised representation learning methods.
    Wassmap: Wasserstein Isometric Mapping for Image Manifold Learning. (arXiv:2204.06645v2 [cs.LG] UPDATED)
    In this paper, we propose Wasserstein Isometric Mapping (Wassmap), a nonlinear dimensionality reduction technique that provides solutions to some drawbacks in existing global nonlinear dimensionality reduction algorithms in imaging applications. Wassmap represents images via probability measures in Wasserstein space, then uses pairwise Wasserstein distances between the associated measures to produce a low-dimensional, approximately isometric embedding. We show that the algorithm is able to exactly recover parameters of some image manifolds, including those generated by translations or dilations of a fixed generating measure. Additionally, we show that a discrete version of the algorithm retrieves parameters from manifolds generated from discrete measures, by providing a theoretical bridge to transfer recovery results from functional data to discrete data. Testing of the proposed algorithms on various image data manifolds shows that Wassmap yields good embeddings compared with other global and local techniques.
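    A compact sketch of the Wassmap pipeline under simplifying assumptions: images are flattened to 1D histograms so SciPy's one-dimensional Wasserstein distance applies, and metric MDS stands in for the paper's isometric embedding step.

        import numpy as np
        from scipy.stats import wasserstein_distance
        from sklearn.manifold import MDS

        def wassmap_1d(images, n_components=2):
            # normalize each image into a probability vector over pixel indices
            flat = [img.ravel() / img.sum() for img in images]
            support = np.arange(flat[0].shape[0])
            n = len(flat)
            D = np.zeros((n, n))
            for i in range(n):
                for j in range(i + 1, n):
                    D[i, j] = D[j, i] = wasserstein_distance(support, support, flat[i], flat[j])
            # embed the pairwise-distance matrix into n_components dimensions
            return MDS(n_components=n_components, dissimilarity="precomputed").fit_transform(D)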
    Ship Performance Monitoring using Machine-learning. (arXiv:2110.03594v2 [stat.ML] UPDATED)
    The hydrodynamic performance of a sea-going ship varies over its lifespan due to factors like marine fouling and the condition of the anti-fouling paint system. In order to accurately estimate the power demand and fuel consumption for a planned voyage, it is important to assess the hydrodynamic performance of the ship. The current work uses machine-learning (ML) methods to estimate the hydrodynamic performance of a ship using onboard recorded in-service data. Three ML methods, NL-PCR, NL-PLSR, and probabilistic ANN, are calibrated using data from two sister ships. The calibrated models are used to extract the varying trend in the ships' hydrodynamic performance over time and to predict the change in performance through several propeller and hull cleaning events. The predicted change in performance is compared with the corresponding values estimated using the fouling friction coefficient ($\Delta C_F$). The ML methods are found to perform well in modelling the hydrodynamic state variables of the ships, with the probabilistic ANN model performing best, but the results from NL-PCR and NL-PLSR are not far behind, indicating that it may be possible to solve such problems with simpler methods and the help of domain knowledge.
    Fairness and Ethics Under Model Multiplicity in Machine Learning. (arXiv:2203.07139v2 [cs.LG] UPDATED)
    While data-driven predictive models are a strictly technological construct, they may operate within a social context in which benign engineering choices entail implicit, indirect and unexpected real-life consequences. Fairness of such systems -- pertaining both to individuals and groups -- is one relevant consideration in this space; it surfaces when data capture protected characteristics upon which people may be discriminated. To date, this notion has predominantly been studied for a fixed predictive model, often under different classification thresholds, striving to identify and eradicate undesirable, and possibly unlawful, aspects of its operation. Here, we backtrack on this assumption to propose and explore a novel definition of fairness where individuals can be harmed when one predictor is chosen ad hoc from a group of equally-well performing models, i.e., in view of utility-based model multiplicity. Since a person may be classified differently across models that are otherwise considered equivalent, this individual could argue for a predictor with the most favourable outcome, employing which may have adverse effects on others. We introduce this scenario with a two-dimensional example based on linear classification; then, we investigate its analytical properties in a broader context; and, finally, we present experimental results on data sets that are popular in fairness studies. Our findings suggest that such unfairness can be found in real-life situations and may be difficult to mitigate by technical means alone, as doing so degrades certain metrics of predictive performance.
    Learning Robotic Navigation from Experience: Principles, Methods, and Recent Results. (arXiv:2212.06759v1 [cs.RO])
    Navigation is one of the most heavily studied problems in robotics, and is conventionally approached as a geometric mapping and planning problem. However, real-world navigation presents a complex set of physical challenges that defies simple geometric abstractions. Machine learning offers a promising way to go beyond geometry and conventional planning, allowing for navigational systems that make decisions based on actual prior experience. Such systems can reason about traversability in ways that go beyond geometry, accounting for the physical outcomes of their actions and exploiting patterns in real-world environments. They can also improve as more data is collected, potentially providing a powerful network effect. In this article, we present a general toolkit for experiential learning of robotic navigation skills that unifies several recent approaches, describe the underlying design principles, summarize experimental results from several of our recent papers, and discuss open problems and directions for future work.
    DiffStack: A Differentiable and Modular Control Stack for Autonomous Vehicles. (arXiv:2212.06437v1 [cs.RO])
    Autonomous vehicle (AV) stacks are typically built in a modular fashion, with explicit components performing detection, tracking, prediction, planning, control, etc. While modularity improves reusability, interpretability, and generalizability, it also suffers from compounding errors, information bottlenecks, and integration challenges. To overcome these challenges, a prominent approach is to convert the AV stack into an end-to-end neural network and train it with data. While such approaches have achieved impressive results, they typically lack interpretability and reusability, and they eschew principled analytical components, such as planning and control, in favor of deep neural networks. To enable the joint optimization of AV stacks while retaining modularity, we present DiffStack, a differentiable and modular stack for prediction, planning, and control. Crucially, our model-based planning and control algorithms leverage recent advancements in differentiable optimization to produce gradients, enabling optimization of upstream components, such as prediction, via backpropagation through planning and control. Our results on the nuScenes dataset indicate that end-to-end training with DiffStack yields substantial improvements in open-loop and closed-loop planning metrics by, e.g., learning to make fewer prediction errors that would affect planning. Beyond these immediate benefits, DiffStack opens up new opportunities for fully data-driven yet modular and interpretable AV architectures. Project website: https://sites.google.com/view/diffstack
    Jointly Learning Visual and Auditory Speech Representations from Raw Data. (arXiv:2212.06246v1 [cs.LG])
    We present RAVEn, a self-supervised multi-modal approach to jointly learn visual and auditory speech representations. Our pre-training objective involves encoding masked inputs, and then predicting contextualised targets generated by slowly-evolving momentum encoders. Driven by the inherent differences between video and audio, our design is asymmetric w.r.t. the two modalities' pretext tasks: Whereas the auditory stream predicts both the visual and auditory targets, the visual one predicts only the auditory targets. We observe strong results in low- and high-resource labelled data settings when fine-tuning the visual and auditory encoders resulting from a single pre-training stage, in which the encoders are jointly trained. Notably, RAVEn surpasses all self-supervised methods on visual speech recognition (VSR) on LRS3, and combining RAVEn with self-training using only 30 hours of labelled data even outperforms a recent semi-supervised method trained on 90,000 hours of non-public data. At the same time, we achieve state-of-the-art results in the LRS3 low-resource setting for auditory speech recognition (as well as for VSR). Our findings point to the viability of learning powerful speech representations entirely from raw video and audio, i.e., without relying on handcrafted features. Code and models will be made public.
    Decentralized Stochastic Multi-Player Multi-Armed Walking Bandits. (arXiv:2212.06279v1 [cs.LG])
    The multi-player multi-armed bandit is an increasingly relevant decision-making problem, motivated by applications to cognitive radio systems. Most research on this problem focuses exclusively on settings where players have \textit{full access} to all arms and receive no reward when pulling the same arm. Hence all players solve the same bandit problem with the goal of maximizing their cumulative reward. However, these settings neglect several important factors in many real-world applications, where players have \textit{limited access} to \textit{a dynamic local subset of arms} (i.e., an arm could sometimes be ``walking'' and not accessible to the player). To this end, this paper proposes a \textit{multi-player multi-armed walking bandits} model, aiming to address the aforementioned modeling issues. The goal is still to maximize the reward; however, players can only pull arms from the local subset and only collect a full reward if no other player pulls the same arm. We adopt the Upper Confidence Bound (UCB) principle to deal with the exploration-exploitation tradeoff and employ distributed optimization techniques to properly handle collisions. By carefully integrating these two techniques, we propose a decentralized algorithm with a near-optimal guarantee on the regret, which can be easily implemented to obtain competitive empirical performance.
    Doubly Robust Interval Estimation for Optimal Policy Evaluation in Online Learning. (arXiv:2110.15501v2 [stat.ML] UPDATED)
    Evaluating the performance of an ongoing policy plays a vital role in many areas such as medicine and economics, providing crucial guidance on early-stopping of the online experiment and timely feedback from the environment. Policy evaluation in online learning, which infers the mean outcome of the optimal policy (i.e., the value) in real time, thus attracts increasing attention. Yet, such a problem is particularly challenging due to the dependent data generated in the online environment, the unknown optimal policy, and the complex exploration-exploitation trade-off in the adaptive experiment. In this paper, we aim to overcome these difficulties in policy evaluation for online learning. We explicitly derive the probability of exploration, which quantifies the probability of exploring non-optimal actions under commonly used bandit algorithms. We use this probability to conduct valid inference on the online conditional mean estimator under each action and develop the doubly robust interval estimation (DREAM) method to infer the value under the estimated optimal policy in online learning. The proposed value estimator provides double protection for consistency and is asymptotically normal, with a Wald-type confidence interval provided. Extensive simulations and real data applications are conducted to demonstrate the empirical validity of the proposed DREAM method.
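    Schematically, a doubly robust value estimate combines an outcome model with an importance-weighted correction, as in the sketch below (generic names; DREAM additionally accounts for the exploration probability and the dependence in adaptively collected data):

        import numpy as np

        def doubly_robust_value(actions, rewards, pi_hat, prop, Q):
            # actions, rewards: (T,) logged actions and rewards
            # pi_hat: (T,) actions chosen by the estimated optimal policy
            # prop:   (T,) behaviour propensities of the logged actions
            # Q:      (T, K) estimated mean reward for every action
            t = np.arange(len(actions))
            direct = Q[t, pi_hat]                                  # outcome-model term
            match = (actions == pi_hat).astype(float)
            correction = match / prop * (rewards - Q[t, actions])  # IPW residual term
            return float(np.mean(direct + correction))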
    PACE: A Parallelizable Computation Encoder for Directed Acyclic Graphs. (arXiv:2203.10304v3 [cs.LG] UPDATED)
    Optimization of directed acyclic graph (DAG) structures has many applications, such as neural architecture search (NAS) and probabilistic graphical model learning. Encoding DAGs into real vectors is a dominant component in most neural-network-based DAG optimization frameworks. Currently, most DAG encoders use an asynchronous message passing scheme which sequentially processes nodes according to the dependency between nodes in a DAG. That is, a node must not be processed until all its predecessors are processed. As a result, they are inherently not parallelizable. In this work, we propose a Parallelizable Attention-based Computation structure Encoder (PACE) that processes nodes simultaneously and encodes DAGs in parallel. We demonstrate the superiority of PACE through encoder-dependent optimization subroutines that search the optimal DAG structure based on the learned DAG embeddings. Experiments show that PACE not only improves the effectiveness over previous sequential DAG encoders with a significantly boosted training and inference speed, but also generates smooth latent (DAG encoding) spaces that are beneficial to downstream optimization subroutines. Our source code is available at \url{https://github.com/zehao-dong/PACE}
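    The core trick, replacing sequential message passing with masked self-attention so that all nodes are processed in parallel, can be sketched as building an attention mask from the DAG's transitive closure (the real PACE encoder adds structural encodings and a specific attention design; this only shows the mask):

        import torch

        def dag_attention_mask(adj: torch.Tensor) -> torch.Tensor:
            # adj: (n, n) boolean adjacency, adj[i, j] = True iff edge i -> j.
            # Returns allowed[q, k] = True iff node q may attend to node k,
            # i.e. k reaches q in the DAG (or k == q).
            n = adj.shape[0]
            reach = adj.bool() | torch.eye(n, dtype=torch.bool)
            for _ in range(n):  # iterate matrix "squaring" until the closure is stable
                new = reach | ((reach.float() @ reach.float()) > 0)
                if torch.equal(new, reach):
                    break
                reach = new
            return reach.T

    Because the mask is computed once, every node's representation can be updated simultaneously by a standard attention layer, removing the sequential dependency of asynchronous message passing.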
    HardVis: Visual Analytics to Handle Instance Hardness Using Undersampling and Oversampling Techniques. (arXiv:2203.15753v3 [cs.LG] UPDATED)
    Despite the tremendous advances in machine learning (ML), training with imbalanced data still poses challenges in many real-world applications. Among a series of diverse techniques to solve this problem, sampling algorithms are regarded as an efficient solution. However, the problem is more fundamental, with many works emphasizing the importance of instance hardness. This issue refers to the significance of managing unsafe or potentially noisy instances that are more likely to be misclassified and serve as the root cause of poor classification performance. This paper introduces HardVis, a visual analytics system designed to handle instance hardness mainly in imbalanced classification scenarios. Our proposed system assists users in visually comparing different distributions of data types, selecting types of instances based on local characteristics that will later be affected by the active sampling method, and validating which suggestions from undersampling or oversampling techniques are beneficial for the ML model. Additionally, rather than uniformly undersampling/oversampling a specific class, we allow users to find and sample easy- and difficult-to-classify training instances from all classes. Users can explore subsets of data from different perspectives to decide all those parameters, while HardVis keeps track of their steps and evaluates the model's predictive performance on a separate test set. The end result is a well-balanced data set that boosts the predictive power of the ML model. The efficacy and effectiveness of HardVis are demonstrated with a hypothetical usage scenario and a use case. Finally, we also discuss the usefulness of our system based on feedback we received from ML experts.
    Physics Guided Deep Learning for Generative Design of Crystal Materials with Symmetry Constraints. (arXiv:2203.14352v3 [cond-mat.mtrl-sci] UPDATED)
    Discovering new materials is a challenging task in materials science, crucial to the progress of human society. Conventional approaches based on experiments and simulations are labor-intensive or costly, with success heavily depending on experts' heuristic knowledge. Here, we propose a deep learning based Physics Guided Crystal Generative Model (PGCGM) for efficient crystal material design with high structural diversity and symmetry. Our model increases the generation validity by more than 700\% compared to FTCP, one of the latest structure generators, and by more than 45\% compared to our previous CubicGAN model. Density Functional Theory (DFT) calculations are used to validate the generated structures, with 1,869 out of 2,000 materials successfully optimized and deposited into the Carolina Materials Database \url{www.carolinamatdb.org}, of which 39.6\% have negative formation energy and 5.3\% have energy-above-hull less than 0.25 eV/atom, indicating their thermodynamic stability and potential synthesizability.
    CausalEGM: a general causal inference framework by encoding generative modeling. (arXiv:2212.05925v2 [stat.ML] UPDATED)
    Although understanding and characterizing causal effects have become essential in observational studies, it is challenging when the confounders are high-dimensional. In this article, we develop a general framework $\textit{CausalEGM}$ for estimating causal effects by encoding generative modeling, which can be applied in both binary and continuous treatment settings. Under the potential outcome framework with unconfoundedness, we establish a bidirectional transformation between the high-dimensional confounders space and a low-dimensional latent space where the density is known (e.g., multivariate normal distribution). Through this, CausalEGM simultaneously decouples the dependencies of confounders on both treatment and outcome and maps the confounders to the low-dimensional latent space. By conditioning on the low-dimensional latent features, CausalEGM can estimate the causal effect for each individual or the average causal effect within a population. Our theoretical analysis shows that the excess risk for CausalEGM can be bounded through empirical process theory. Under an assumption on encoder-decoder networks, the consistency of the estimate can be guaranteed. In a series of experiments, CausalEGM demonstrates superior performance over existing methods for both binary and continuous treatments. Specifically, we find CausalEGM to be substantially more powerful than competing methods in the presence of large sample sizes and high-dimensional confounders. The software of CausalEGM is freely available at https://github.com/SUwonglab/CausalEGM.
    Failing with Grace: Learning Neural Network Controllers that are Boundedly Unsafe. (arXiv:2106.11881v2 [eess.SY] UPDATED)
    In this work, we consider the problem of learning a feed-forward neural network controller to safely steer an arbitrarily shaped planar robot in a compact and obstacle-occluded workspace. Unlike existing methods that depend strongly on the density of data points close to the boundary of the safe state space to train neural network controllers with closed-loop safety guarantees, here we propose an alternative approach that lifts such strong assumptions on the data that are hard to satisfy in practice and instead allows for graceful safety violations, i.e., of a bounded magnitude that can be spatially controlled. To do so, we employ reachability analysis techniques to encapsulate safety constraints in the training process. Specifically, to obtain a computationally efficient over-approximation of the forward reachable set of the closed-loop system, we partition the robot's state space into cells and adaptively subdivide the cells that contain states which may escape the safe set under the trained control law. Then, using the overlap between each cell's forward reachable set and the set of infeasible robot configurations as a measure for safety violations, we introduce appropriate terms into the loss function that penalize this overlap in the training process. As a result, our method can learn a safe vector field for the closed-loop system and, at the same time, provide worst-case bounds on safety violation over the whole configuration space, defined by the overlap between the over-approximation of the forward reachable set of the closed-loop system and the set of unsafe states. Moreover, it can control the tradeoff between computational complexity and tightness of these bounds. Our proposed method is supported by both theoretical results and simulation studies.
    Online Bidding Algorithms for Return-on-Spend Constrained Advertisers. (arXiv:2208.13713v2 [cs.LG] UPDATED)
    Online advertising has recently grown into a highly competitive and complex multi-billion-dollar industry, with advertisers bidding for ad slots at large scales and high frequencies. This has resulted in a growing need for efficient "auto-bidding" algorithms that determine the bids for incoming queries to maximize advertisers' targets subject to their specified constraints. This work explores efficient online algorithms for a single value-maximizing advertiser under an increasingly popular constraint: Return-on-Spend (RoS). We quantify efficiency in terms of regret relative to the optimal algorithm, which knows all queries a priori. We contribute a simple online algorithm that achieves near-optimal regret in expectation while always respecting the specified RoS constraint when the input sequence of queries consists of i.i.d. samples from some distribution. We also integrate our results with the previous work of Balseiro, Lu, and Mirrokni [BLM20] to achieve near-optimal regret while respecting both RoS and fixed budget constraints. Our algorithm follows the primal-dual framework and uses online mirror descent (OMD) for the dual updates. However, we need to use a non-canonical setup of OMD, and therefore the classic low-regret guarantee of OMD, which is for the adversarial setting in online learning, no longer holds. Nonetheless, in our case and more generally where low-regret dynamics are applied in algorithm design, the gradients encountered by OMD can be far from adversarial but influenced by our algorithmic choices. We exploit this key insight to show our OMD setup achieves low regret in the realm of our algorithm.
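    As a heavily hedged sketch of primal-dual pacing for an RoS constraint (total value at least target times total spend): maintain a dual variable, derive the bid from the Lagrangian, and update the dual multiplicatively on the realized constraint slack. The exact bidding rule and the non-canonical OMD setup in the paper differ; every name and constant below is illustrative.

        import math

        def ros_pacing(auctions, target_ros, eta=0.05, lam=1.0):
            # auctions: iterable of (value, price) pairs in a second-price auction
            total_value = total_spend = 0.0
            for v, price in auctions:
                bid = v * (1.0 + lam) / (target_ros * lam)   # Lagrangian-derived bid
                if bid >= price:                             # win, pay the second price
                    total_value += v
                    total_spend += price
                    slack = v - target_ros * price           # constraint slack this round
                    lam = min(max(lam * math.exp(-eta * slack), 1e-3), 1e3)
            return total_value, total_spend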
    Multi-objective Tree-structured Parzen Estimator Meets Meta-learning. (arXiv:2212.06751v1 [cs.LG])
    Hyperparameter optimization (HPO) is essential for the better performance of deep learning, and practitioners often need to consider the trade-off between multiple metrics, such as error rate, latency, memory requirements, robustness, and algorithmic fairness. Due to this demand and the heavy computation of deep learning, the acceleration of multi-objective (MO) optimization becomes ever more important. Although meta-learning has been extensively studied to speed up HPO, existing methods are not applicable to the MO tree-structured Parzen estimator (MO-TPE), a simple yet powerful MO-HPO algorithm. In this paper, we extend TPE's acquisition function to the meta-learning setting, using a task similarity defined by the overlap in promising domains of each task. In a comprehensive set of experiments, we demonstrate that our method accelerates MO-TPE on tabular HPO benchmarks and yields state-of-the-art performance. Our method was also validated externally by winning the AutoML 2022 competition on "Multiobjective Hyperparameter Optimization for Transformers".
    Formulating Event-based Image Reconstruction as a Linear Inverse Problem with Deep Regularization using Optical Flow. (arXiv:2112.06242v3 [cs.CV] UPDATED)
    Event cameras are novel bio-inspired sensors that measure per-pixel brightness differences asynchronously. Recovering brightness from events is appealing since the reconstructed images inherit the high dynamic range (HDR) and high-speed properties of events; hence they can be used in many robotic vision applications and to generate slow-motion HDR videos. However, state-of-the-art methods tackle this problem by training an event-to-image Recurrent Neural Network (RNN), which lacks explainability and is difficult to tune. In this work we show, for the first time, how tackling the combined problem of motion and brightness estimation leads us to formulate event-based image reconstruction as a linear inverse problem that can be solved without training an image reconstruction RNN. Instead, classical and learning-based regularizers are used to solve the problem and remove artifacts from the reconstructed images. The experiments show that the proposed approach generates images with visual quality on par with state-of-the-art methods despite only using data from a short time interval. State-of-the-art results are achieved using an image denoising Convolutional Neural Network (CNN) as the regularization function. The proposed regularized formulation and solvers have a unifying character because they can be applied also to reconstruct brightness from the second derivative. Additionally, the formulation is attractive because it can be naturally combined with super-resolution, motion-segmentation and color demosaicing. Code is available at https://github.com/tub-rip/event_based_image_rec_inverse_problem
    AutoGMap: Learning to Map Large-scale Sparse Graphs on Memristive Crossbars. (arXiv:2111.07684v2 [cs.LG] UPDATED)
    The sparse representation of graphs has shown great potential for accelerating the computation of graph applications (e.g., social networks, knowledge graphs) on traditional computing architectures (CPU, GPU, or TPU). But the exploration of large-scale sparse graph computing on processing-in-memory (PIM) platforms (typically with memristive crossbars) is still in its infancy. To implement the computation or storage of large-scale or batch graphs on memristive crossbars, a natural assumption is that a large-scale crossbar is required, albeit with low utilization. Some recent works question this assumption; to avoid wasting storage and computational resources, they propose fixed-size or progressively scheduled ``block partition'' schemes. But these methods are coarse-grained or static, and not effectively sparsity-aware. This work proposes a dynamic sparsity-aware mapping scheme generation method that models the task as a sequential decision-making problem optimized by a reinforcement learning (RL) algorithm (REINFORCE). Our generation model (an LSTM combined with a dynamic-fill mechanism) achieves remarkable mapping performance on small-scale typical graph/matrix data (complete mapping costs 43% of the original matrix area) and on two large-scale matrices (22.5% area on qh882 and 17.1% area on qh1484). Our method can be extended to sparse graph computing on other PIM architectures, not limited to memristive device-based platforms.
    HighMMT: Quantifying Modality & Interaction Heterogeneity for High-Modality Representation Learning. (arXiv:2203.01311v3 [cs.LG] UPDATED)
    Many real-world problems are inherently multimodal, from the communicative modalities humans use to express social and emotional states to the force, proprioception, and visual sensors ubiquitous on robots. While there has been an explosion of interest in multimodal representation learning, these methods are still largely focused on a small set of modalities, primarily in the language, vision, and audio space. In order to accelerate generalization towards diverse and understudied modalities, this paper studies efficient representation learning for high-modality scenarios. Since adding new models for every new modality or task becomes prohibitively expensive, a critical technical challenge is heterogeneity quantification: how can we measure which modalities encode similar information and interactions in order to permit parameter sharing with previous modalities? We propose two new information-theoretic metrics for heterogeneity quantification: (1) modality heterogeneity studies how similar 2 modalities $\{X_1,X_2\}$ are by measuring how much information can be transferred from $X_1$ to $X_2$, while (2) interaction heterogeneity studies how similarly pairs of modalities $\{X_1,X_2\}, \{X_3,X_4\}$ interact by measuring how much interaction information can be transferred from $\{X_1,X_2\}$ to $\{X_3,X_4\}$. We show the importance of these proposed metrics in high-modality scenarios as a way to automatically prioritize the fusion of modalities that contain unique information or interactions. The result is a single model, HighMMT, that scales up to $10$ modalities and $15$ tasks from $5$ different research areas. Not only does HighMMT outperform prior methods on the tradeoff between performance and efficiency, it also demonstrates a crucial scaling behavior: performance continues to improve with each modality added, and transfers to entirely new modalities and tasks during fine-tuning.
    Gradient flow in the gaussian covariate model: exact solution of learning curves and multiple descent structures. (arXiv:2212.06757v1 [stat.ML])
    A recent line of work has shown remarkable behaviors of the generalization error curves in simple learning models. Even least-squares regression exhibits atypical features such as model-wise double descent, and further works have observed triple or multiple descents. Another important characteristic is the epoch-wise descent structure that emerges during training. The observations of model-wise and epoch-wise descents have been analytically derived in limited theoretical settings (such as the random feature model) and are otherwise experimental. In this work, we provide a full and unified analysis of the whole time-evolution of the generalization curve, in the asymptotic large-dimensional regime and under gradient flow, within a wider theoretical setting stemming from a Gaussian covariate model. In particular, we cover most cases already disparately observed in the literature, and also provide examples of the existence of multiple descent structures as a function of a model parameter or time. Furthermore, we show that our theoretical predictions adequately match the learning curves obtained by gradient descent over realistic datasets. Technically, we compute averages of rational expressions involving random matrices using recent developments in random matrix theory based on "linear pencils". Another contribution, also of independent interest in random matrix theory, is a new derivation of the related fixed point equations (and an extension thereof) using Dyson Brownian motions.
    QA-GNN: Reasoning with Language Models and Knowledge Graphs for Question Answering. (arXiv:2104.06378v5 [cs.CL] UPDATED)
    The problem of answering questions using knowledge from pre-trained language models (LMs) and knowledge graphs (KGs) presents two challenges: given a QA context (question and answer choice), methods need to (i) identify relevant knowledge from large KGs, and (ii) perform joint reasoning over the QA context and KG. In this work, we propose a new model, QA-GNN, which addresses the above challenges through two key innovations: (i) relevance scoring, where we use LMs to estimate the importance of KG nodes relative to the given QA context, and (ii) joint reasoning, where we connect the QA context and KG to form a joint graph, and mutually update their representations through graph neural networks. We evaluate our model on QA benchmarks in the commonsense (CommonsenseQA, OpenBookQA) and biomedical (MedQA-USMLE) domains. QA-GNN outperforms existing LM and LM+KG models, and exhibits capabilities to perform interpretable and structured reasoning, e.g., correctly handling negation in questions.
    FairRoad: Achieving Fairness for Recommender Systems with Optimized Antidote Data. (arXiv:2212.06750v1 [cs.IR])
    Today, recommender systems have played an increasingly important role in shaping our experiences of digital environments and social interactions. However, as recommender systems become ubiquitous in our society, recent years have also witnessed significant fairness concerns. Specifically, studies have shown that recommender systems may inherit or even amplify biases from historical data and, as a result, provide unfair recommendations. To address fairness risks in recommender systems, most previous approaches focus on modifying either the existing training data samples or the deployed recommender algorithms, but unfortunately with limited degrees of success. In this paper, we propose a new approach called fair recommendation with optimized antidote data (FairRoad), which aims to improve the fairness performance of recommender systems through the construction of a small and carefully crafted antidote dataset. Toward this end, we formulate antidote data generation as a mathematical optimization problem that minimizes the unfairness of the targeted recommender system while not disrupting the deployed recommendation algorithms. Extensive experiments show that our proposed antidote data generation algorithm significantly improves the fairness of recommender systems with a small amount of antidote data.
    MCMC-Interactive Variational Inference. (arXiv:2010.02029v2 [cs.LG] UPDATED)
    Leveraging well-established MCMC strategies, we propose MCMC-interactive variational inference (MIVI) to not only estimate the posterior in a time constrained manner, but also facilitate the design of MCMC transitions. Constructing a variational distribution followed by a short Markov chain that has parameters to learn, MIVI takes advantage of the complementary properties of variational inference and MCMC to encourage mutual improvement. On one hand, with the variational distribution locating high posterior density regions, the Markov chain is optimized within the variational inference framework to efficiently target the posterior despite a small number of transitions. On the other hand, the optimized Markov chain with considerable flexibility guides the variational distribution towards the posterior and alleviates its underestimation of uncertainty. Furthermore, we prove the optimized Markov chain in MIVI admits extrapolation, which means its marginal distribution gets closer to the true posterior as the chain grows. Therefore, the Markov chain can be used separately as an efficient MCMC scheme. Experiments show that MIVI not only accurately and efficiently approximates the posteriors but also facilitates designs of stochastic gradient MCMC and Gibbs sampling transitions.
    A Simple But Powerful Graph Encoder for Temporal Knowledge Graph Completion. (arXiv:2112.07791v2 [cs.LG] UPDATED)
    Knowledge graphs contain rich knowledge about various entities and the relational information among them, while temporal knowledge graphs (TKGs) describe and model the interactions of the entities over time. In this context, automatic temporal knowledge graph completion (TKGC) has gained great interest. Recent TKGC methods integrate advanced deep learning techniques, e.g., Transformers, and achieve superior model performance. However, this also introduces a large number of excessive parameters, which imposes a heavier burden on parameter optimization. In this paper, we propose a simple but powerful graph encoder for TKGC, called TARGCN. TARGCN is parameter-efficient, and it extensively explores every entity's temporal context for learning contextualized representations. We find that, instead of adopting various kinds of complex modules, it is more beneficial to efficiently capture the temporal contexts of entities. We evaluate TARGCN on three benchmark datasets. Our model achieves a more than 46% relative improvement on the GDELT dataset compared with state-of-the-art TKGC models. Meanwhile, it outperforms the strongest baseline on the ICEWS05-15 dataset with around 18% fewer parameters.
    Factorizer: A Scalable Interpretable Approach to Context Modeling for Medical Image Segmentation. (arXiv:2202.12295v3 [eess.IV] UPDATED)
    Convolutional Neural Networks (CNNs) with U-shaped architectures have dominated medical image segmentation, which is crucial for various clinical purposes. However, the inherent locality of convolution makes CNNs fail to fully exploit global context, essential for better recognition of some structures, e.g., brain lesions. Transformers have recently shown promising performance on vision tasks, including semantic segmentation, mainly due to their capability of modeling long-range dependencies. Nevertheless, the quadratic complexity of attention makes existing Transformer-based models use self-attention layers only after somehow reducing the image resolution, which limits the ability to capture global contexts present at higher resolutions. Therefore, this work introduces a family of models, dubbed Factorizer, which leverages the power of low-rank matrix factorization for constructing an end-to-end segmentation model. Specifically, we propose a linearly scalable approach to context modeling, formulating Nonnegative Matrix Factorization (NMF) as a differentiable layer integrated into a U-shaped architecture. The shifted window technique is also utilized in combination with NMF to effectively aggregate local information. Factorizers compete favorably with CNNs and Transformers in terms of accuracy, scalability, and interpretability, achieving state-of-the-art results on the BraTS dataset for brain tumor segmentation and ISLES'22 dataset for stroke lesion segmentation. Highly meaningful NMF components give an additional interpretability advantage to Factorizers over CNNs and Transformers. Moreover, our ablation studies reveal a distinctive feature of Factorizers that enables a significant speed-up in inference for a trained Factorizer without any extra steps and without sacrificing much accuracy. The code and models are publicly available at https://github.com/pashtari/factorizer.
    Fair Infinitesimal Jackknife: Mitigating the Influence of Biased Training Data Points Without Refitting. (arXiv:2212.06803v1 [cs.LG])
    In consequential decision-making applications, mitigating unwanted biases in machine learning models that yield systematic disadvantage to members of groups delineated by sensitive attributes such as race and gender is one key intervention to strive for equity. Focusing on demographic parity and equality of opportunity, in this paper we propose an algorithm that improves the fairness of a pre-trained classifier by simply dropping carefully selected training data points. We select instances based on their influence on the fairness metric of interest, computed using an infinitesimal jackknife-based approach. The dropping of training points is done in principle, but in practice does not require the model to be refit. Crucially, we find that such an intervention does not substantially reduce the predictive performance of the model but drastically improves the fairness metric. Through careful experiments, we evaluate the effectiveness of the proposed approach on diverse tasks and find that it consistently improves upon existing alternatives.
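    The following toy sketch illustrates the underlying mechanism with a first-order influence-function approximation (in the spirit of the infinitesimal jackknife, though not the authors' exact estimator): for a small logistic-regression classifier, it estimates how removing each training point would shift a demographic-parity gap, then selects points to drop without refitting. The data, model, and fairness metric here are all illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy stand-in for a pre-trained classifier: ridge-regularized logistic
# regression with a binary sensitive attribute g (all values illustrative).
rng = np.random.default_rng(0)
n, d, lam = 500, 5, 1e-2
X = rng.standard_normal((n, d))
g = rng.integers(0, 2, n)
y = (X[:, 0] + 0.8 * g + rng.standard_normal(n) > 0).astype(float)

theta = np.zeros(d)
for _ in range(25):  # Newton's method; the small ridge keeps the fit stable
    p = sigmoid(X @ theta)
    grad = X.T @ (p - y) / n + lam * theta
    H = (X * (p * (1 - p))[:, None]).T @ X / n + lam * np.eye(d)
    theta -= np.linalg.solve(H, grad)

# Fairness metric: demographic-parity gap F = E[p | g=1] - E[p | g=0].
p = sigmoid(X @ theta)
F = p[g == 1].mean() - p[g == 0].mean()
w = p * (1 - p)  # derivative of the sigmoid at each point
grad_F = (X[g == 1] * w[g == 1, None]).mean(0) - (X[g == 0] * w[g == 0, None]).mean(0)

# First-order influence of removing point i on F (Koh & Liang style):
#   dF_i ~= (1/n) * grad_F^T H^{-1} grad_l_i,  with grad_l_i = (p_i - y_i) x_i.
H_inv_gF = np.linalg.solve(H, grad_F)
influence = (X @ H_inv_gF) * (p - y) / n

# Drop the k points whose removal most shrinks |F| -- no refit required.
# (The predicted gap is a first-order estimate that ignores interactions.)
k = 20
drop = np.argsort(np.sign(F) * influence)[:k]
print(f"gap before: {F:.4f}, predicted gap after: {F + influence[drop].sum():.4f}")
```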
    RT-1: Robotics Transformer for Real-World Control at Scale. (arXiv:2212.06817v1 [cs.RO])
    By transferring knowledge from large, diverse, task-agnostic datasets, modern machine learning models can solve specific downstream tasks either zero-shot or with small task-specific datasets to a high level of performance. While this capability has been demonstrated in other fields such as computer vision, natural language processing or speech recognition, it remains to be shown in robotics, where the generalization capabilities of the models are particularly critical due to the difficulty of collecting real-world robotic data. We argue that one of the keys to the success of such general robotic models lies with open-ended task-agnostic training, combined with high-capacity architectures that can absorb all of the diverse, robotic data. In this paper, we present a model class, dubbed Robotics Transformer, that exhibits promising scalable model properties. We verify our conclusions in a study of different model classes and their ability to generalize as a function of the data size, model size, and data diversity based on a large-scale data collection on real robots performing real-world tasks. The project's website and videos can be found at robotics-transformer.github.io
    ERNIE-Code: Beyond English-Centric Cross-lingual Pretraining for Programming Languages. (arXiv:2212.06742v1 [cs.CL])
    Software engineers working with the same programming language (PL) may speak different natural languages (NLs) and vice versa, erecting huge barriers to communication and working efficiency. Recent studies have demonstrated the effectiveness of generative pre-training in computer programs, yet they are always English-centric. In this work, we step towards bridging the gap between multilingual NLs and multilingual PLs for large language models (LLMs). We release ERNIE-Code, a unified pre-trained language model for 116 NLs and 6 PLs. We employ two methods for universal cross-lingual pre-training: span-corruption language modeling that learns patterns from monolingual NL or PL; and pivot-based translation language modeling that relies on parallel data of many NLs and PLs. Extensive results show that ERNIE-Code outperforms previous multilingual LLMs for PL or NL across a wide range of end tasks of code intelligence, including multilingual code-to-text, text-to-code, code-to-code, and text-to-text generation. We further show its advantage of zero-shot prompting on multilingual code summarization and text-to-text translation. We will make our code and pre-trained models publicly available.
    POPNASv3: a Pareto-Optimal Neural Architecture Search Solution for Image and Time Series Classification. (arXiv:2212.06735v1 [cs.LG])
    The automated machine learning (AutoML) field has become increasingly relevant in recent years. These algorithms can develop models without the need for expert knowledge, facilitating the application of machine learning techniques in the industry. Neural Architecture Search (NAS) exploits deep learning techniques to autonomously produce neural network architectures whose results rival the state-of-the-art models hand-crafted by AI experts. However, this approach requires significant computational resources and hardware investments, making it less appealing for real-usage applications. This article presents the third version of Pareto-Optimal Progressive Neural Architecture Search (POPNASv3), a new sequential model-based optimization NAS algorithm targeting different hardware environments and multiple classification tasks. Our method is able to find competitive architectures within large search spaces, while keeping a flexible structure and data processing pipeline to adapt to different tasks. The algorithm employs Pareto optimality to reduce the number of architectures sampled during the search, drastically improving the time efficiency without loss in accuracy. The experiments performed on images and time series classification datasets provide evidence that POPNASv3 can explore a large set of assorted operators and converge to optimal architectures suited for the type of data provided under different scenarios.
    Zero-Shot Motor Health Monitoring by Blind Domain Transition. (arXiv:2212.06154v1 [cs.LG])
    Continuous long-term monitoring of motor health is crucial for the early detection of abnormalities such as bearing faults (up to 51% of motor failures are attributed to bearing faults). Despite numerous methodologies proposed for bearing fault detection, most of them require normal (healthy) and abnormal (faulty) data for training. Even with recent deep learning (DL) methodologies trained on labeled data from the same machine, classification accuracy deteriorates significantly when one or a few conditions are altered. Furthermore, their performance suffers significantly, or they may fail entirely, when tested on another machine with entirely different healthy and faulty signal patterns. To address this need, in this pilot study we propose a zero-shot bearing fault detection method that can detect any fault on a new (target) machine regardless of the working conditions, sensor parameters, or fault characteristics. To accomplish this objective, a 1D Operational Generative Adversarial Network (Op-GAN) first characterizes the transition between normal and faulty vibration signals of (a) source machine(s) under various conditions, sensor parameters, and fault types. Then, for a target machine, potential faulty signals can be generated, and over its actual healthy and synthesized faulty signals, a compact and lightweight 1D Self-ONN fault detector can be trained to detect a real faulty condition in real time whenever it occurs. To validate the proposed approach, a new benchmark dataset is created using two different motors working under different conditions and sensor locations. Experimental results demonstrate that this novel approach can accurately detect any bearing fault, achieving average recall rates of around 89% and 95% on the two target machines regardless of fault type, severity, and location.
    AutoPV: Automated photovoltaic forecasts with limited information using an ensemble of pre-trained models. (arXiv:2212.06797v1 [cs.LG])
    Accurate PhotoVoltaic (PV) power generation forecasting is vital for the efficient operation of Smart Grids. The automated design of such accurate forecasting models for individual PV plants includes two challenges: First, information about the PV mounting configuration (i.e. inclination and azimuth angles) is often missing. Second, for new PV plants, the amount of historical data available to train a forecasting model is limited (cold-start problem). We address these two challenges by proposing a new method for day-ahead PV power generation forecasts called AutoPV. AutoPV is a weighted ensemble of forecasting models that represent different PV mounting configurations. This representation is achieved by pre-training each forecasting model on a separate PV plant and by scaling the model's output with the peak power rating of the corresponding PV plant. To tackle the cold-start problem, we initially weight each forecasting model in the ensemble equally. To tackle the problem of missing information about the PV mounting configuration, we use new data that become available during operation to adapt the ensemble weights to minimize the forecasting error. AutoPV is advantageous as the unknown PV mounting configuration is implicitly reflected in the ensemble weights, and only the PV plant's peak power rating is required to re-scale the ensemble's output. AutoPV also makes it possible to represent PV plants with panels distributed on different roofs with varying alignments, as these mounting configurations can be reflected proportionally in the weighting. Additionally, the required computing memory is decoupled when scaling AutoPV to hundreds of PV plants, which is beneficial in Smart Grids with limited computing capabilities. For a real-world data set with 11 PV plants, the accuracy of AutoPV is comparable to a model trained on two years of data and outperforms an incrementally trained model.
    Fine-Tuning Transformers: Vocabulary Transfer. (arXiv:2112.14569v2 [cs.CL] UPDATED)
    Transformers are responsible for the vast majority of recent advances in natural language processing. The majority of practical natural language processing applications of these models are typically enabled through transfer learning. This paper studies if corpus-specific tokenization used for fine-tuning improves the resulting performance of the model. Through a series of experiments, we demonstrate that such tokenization combined with the initialization and fine-tuning strategy for the vocabulary tokens speeds up the transfer and boosts the performance of the fine-tuned model. We call this aspect of transfer facilitation vocabulary transfer.
    Improving Accuracy Without Losing Interpretability: A ML Approach for Time Series Forecasting. (arXiv:2212.06620v1 [cs.LG])
    In time series forecasting, decomposition-based algorithms break aggregate data into meaningful components and are therefore appreciated for their particular advantages in interpretability. Recent algorithms often combine machine learning (hereafter ML) methodology with decomposition to improve prediction accuracy. However, incorporating ML is generally considered to sacrifice interpretability inevitably. In addition, existing hybrid algorithms usually rely on theoretical models with statistical assumptions and focus only on the accuracy of aggregate predictions, and thus suffer from accuracy problems, especially in component estimates. In response to the above issues, this research explores the possibility of improving accuracy without losing interpretability in time series forecasting. We first quantitatively define interpretability for data-driven forecasts and systematically review the existing forecasting algorithms from the perspective of interpretability. Accordingly, we propose the W-R algorithm, a hybrid algorithm that combines decomposition and ML from a novel perspective. Specifically, the W-R algorithm replaces the standard additive combination function with a weighted variant and uses ML to modify the estimates of all components simultaneously. We mathematically analyze the theoretical basis of the algorithm and validate its performance through extensive numerical experiments. In general, the W-R algorithm outperforms all decomposition-based and ML benchmarks. Measured by P50_QL, the algorithm achieves a relative accuracy improvement of 8.76% on practical sales forecasts from JD.com and of 77.99% on a public electricity-load dataset. This research offers an innovative perspective on combining statistical and ML algorithms, and JD.com has implemented the W-R algorithm to make accurate sales predictions and guide its marketing activities.
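    A minimal sketch of the weighted-combination idea follows, with ridge regression standing in as the ML component; the synthetic series and the deliberately imperfect trend/seasonal estimates are illustrative assumptions, and the actual W-R algorithm additionally modifies the component estimates themselves.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic series and (deliberately biased) component estimates, e.g. from STL.
rng = np.random.default_rng(0)
t = np.arange(300)
y = 0.05 * t + 2.0 * np.sin(2 * np.pi * t / 12) + 0.3 * rng.standard_normal(300)

trend_est = 0.04 * t + 0.5                      # biased trend estimate
season_est = 1.6 * np.sin(2 * np.pi * t / 12)   # under-scaled seasonal estimate
C = np.stack([trend_est, season_est], axis=1)

y_additive = C.sum(axis=1)                      # standard additive combination
w = Ridge(alpha=1e-3, fit_intercept=False).fit(C, y).coef_
y_weighted = C @ w                              # weighted combination, learned w

for name, pred in [("additive", y_additive), ("weighted", y_weighted)]:
    print(name, "RMSE:", np.sqrt(np.mean((y - pred) ** 2)))
```

    The learned weights partially correct systematic errors in the component estimates while each component remains individually inspectable, which is the interpretability argument made above.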
    The Hateful Memes Challenge Next Move. (arXiv:2212.06655v1 [cs.LG])
    State-of-the-art image and text classification models, such as Convolutional Neural Networks and Transformers, have long performed their respective unimodal classification tasks satisfactorily, with accuracy close to or exceeding that of humans. However, images embedded with text, such as hateful memes, are hard to classify using unimodal reasoning when difficult examples, such as benign confounders, are incorporated into the data set. We attempt to generate more labeled memes in addition to the Hateful Memes data set from Facebook AI, based on the framework of a winning team from the Hateful Meme Challenge. To increase the number of labeled memes, we explore semi-supervised learning using pseudo-labels for newly introduced, unlabeled memes gathered from the Memotion Dataset 7K. We find that the semi-supervised learning task on unlabeled data required human intervention and filtering, and that adding a limited amount of new data yields no extra classification performance.
    OAMixer: Object-aware Mixing Layer for Vision Transformers. (arXiv:2212.06595v1 [cs.CV])
    Patch-based models, e.g., Vision Transformers (ViTs) and Mixers, have shown impressive results on various visual recognition tasks, as alternatives to classic convolutional networks. While the initial patch-based models (ViTs) treated all patches equally, recent studies reveal that incorporating inductive bias like spatiality benefits the representations. However, most prior works solely focused on the location of patches, overlooking the scene structure of images. Thus, we aim to further guide the interaction of patches using the object information. Specifically, we propose OAMixer (object-aware mixing layer), which calibrates the patch mixing layers of patch-based models based on the object labels. Here, we obtain the object labels in unsupervised or weakly-supervised manners, i.e., no additional human-annotating cost is necessary. Using the object labels, OAMixer computes a reweighting mask with a learnable scale parameter that intensifies the interaction of patches containing similar objects and applies the mask to the patch mixing layers. By learning an object-centric representation, we demonstrate that OAMixer improves the classification accuracy and background robustness of various patch-based models, including ViTs, MLP-Mixers, and ConvMixers. Moreover, we show that OAMixer enhances various downstream tasks, including large-scale classification, self-supervised learning, and multi-object recognition, verifying the generic applicability of OAMixer.
    Selected Trends in Artificial Intelligence for Space Applications. (arXiv:2212.06662v1 [cs.LG])
    The development and adoption of artificial intelligence (AI) technologies in space applications is growing quickly as the consensus increases on the potential benefits introduced. As more and more aerospace engineers are becoming aware of new trends in AI, traditional approaches are revisited to consider the applications of emerging AI technologies. Already at the time of writing, the scope of AI-related activities across academia, the aerospace industry and space agencies is so wide that an in-depth review would not fit in these pages. In this chapter we focus instead on two main emerging trends we believe capture the most relevant and exciting activities in the field: differentiable intelligence and on-board machine learning. Differentiable intelligence, in a nutshell, refers to works making extensive use of automatic differentiation frameworks to learn the parameters of machine learning or related models. Onboard machine learning considers the problem of moving inference, as well as learning, onboard. Within these fields, we discuss a few selected projects originating from the European Space Agency's (ESA) Advanced Concepts Team (ACT), giving priority to advanced topics going beyond the transposition of established AI techniques and practices to the space domain.
    Spatiotemporal Residual Regularization with Dynamic Mixtures for Traffic Forecasting. (arXiv:2212.06653v1 [cs.LG])
    Existing deep learning-based traffic forecasting models are mainly trained with MSE (or MAE) as the loss function, assuming that residuals/errors follow independent and isotropic Gaussian (or Laplacian) distribution for simplicity. However, this assumption rarely holds for real-world traffic forecasting tasks, where the unexplained residuals are often correlated in both space and time. In this study, we propose Spatiotemporal Residual Regularization by modeling residuals with a dynamic (i.e., time-varying) mixture of zero-mean multivariate Gaussian distributions with learnable spatiotemporal covariance matrices. This approach allows us to directly capture spatiotemporally correlated residuals. For scalability, we model the spatiotemporal covariance for each mixture component using a Kronecker product structure, which significantly reduces the number of parameters and the computational complexity. We evaluate the performance of the proposed method on a traffic speed forecasting task. Our results show that, by properly modeling the residual distribution, the proposed method not only improves the model performance but also provides interpretable structures.
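    A brief numerical sketch of the Kronecker trick (the dimensions and covariance factors below are illustrative assumptions): a spatiotemporal Gaussian over an N x T residual field is parameterized by two small factor matrices rather than one (NT) x (NT) matrix, and can be sampled without ever forming the full covariance.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 20, 12   # spatial locations x time steps (illustrative sizes)

def random_spd(n):
    """Random symmetric positive-definite matrix as a covariance factor."""
    A = rng.standard_normal((n, n))
    return A @ A.T / n + np.eye(n)

Sigma_s, Sigma_t = random_spd(N), random_spd(T)   # the two small factors

# Sampling without forming the (N*T) x (N*T) Kronecker covariance: for
# Z ~ N(0, I), the field L_s @ Z @ L_t.T has covariance equal to the
# Kronecker product of Sigma_s and Sigma_t (up to vectorization order).
L_s, L_t = np.linalg.cholesky(Sigma_s), np.linalg.cholesky(Sigma_t)
Z = rng.standard_normal((N, T))
residual_field = L_s @ Z @ L_t.T   # one draw of spatiotemporal residuals

print("parameters:", N ** 2 + T ** 2, "vs unstructured:", (N * T) ** 2)
```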
    TIER: Text-Image Entropy Regularization for CLIP-style models. (arXiv:2212.06710v1 [cs.LG])
    In this paper, we study the effect of a novel regularization scheme on contrastive language-image pre-trained (CLIP) models. Our approach is based on the observation that, in many domains, text tokens should only describe a small number of image regions and, likewise, each image region should correspond to only a few text tokens. In CLIP-style models, this implies that text-token embeddings should have high similarity to only a small number of image-patch embeddings for a given image-text pair. We formalize this observation using a novel regularization scheme that penalizes the entropy of the text-token to image-patch similarity scores. We qualitatively and quantitatively demonstrate that the proposed regularization scheme shrinks the text-token and image-patch similarity scores towards zero, thus achieving the desired effect. We demonstrate the promise of our approach in an important medical context where this underlying hypothesis naturally arises. Using our proposed approach, we achieve state of the art (SOTA) zero-shot performance on all tasks from the CheXpert chest x-ray dataset, outperforming an unregularized version of the model and several recently published self-supervised models.
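    The regularizer itself is simple to express. The PyTorch sketch below (with an illustrative temperature and a hypothetical loss weight lambda_ent; the paper may normalize or symmetrize the scores differently) penalizes the entropy of each text token's softmax distribution over image patches, pushing every token to attend to only a few patches.

```python
import torch
import torch.nn.functional as F

def token_patch_entropy(text_emb: torch.Tensor, patch_emb: torch.Tensor,
                        tau: float = 0.07) -> torch.Tensor:
    """Mean entropy of each text token's distribution over image patches.

    text_emb:  (n_tokens, d) L2-normalized text-token embeddings
    patch_emb: (n_patches, d) L2-normalized image-patch embeddings
    """
    sim = text_emb @ patch_emb.T / tau              # (n_tokens, n_patches)
    p = F.softmax(sim, dim=-1)
    ent = -(p * torch.log(p.clamp_min(1e-12))).sum(dim=-1)
    return ent.mean()

# Sketch of use inside a CLIP-style training step (lambda_ent is hypothetical):
#   loss = clip_contrastive_loss + lambda_ent * token_patch_entropy(txt, img)
```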
    On the Evolution of (Hateful) Memes by Means of Multimodal Contrastive Learning. (arXiv:2212.06573v1 [cs.SI])
    The dissemination of hateful memes online has adverse effects on social media platforms and the real world. Detecting hateful memes is challenging, one of the reasons being the evolutionary nature of memes; new hateful memes can emerge by fusing hateful connotations with other cultural ideas or symbols. In this paper, we propose a framework that leverages multimodal contrastive learning models, in particular OpenAI's CLIP, to identify targets of hateful content and systematically investigate the evolution of hateful memes. We find that semantic regularities exist in CLIP-generated embeddings that describe semantic relationships within the same modality (images) or across modalities (images and text). Leveraging this property, we study how hateful memes are created by combining visual elements from multiple images or fusing textual information with a hateful image. We demonstrate the capabilities of our framework for analyzing the evolution of hateful memes by focusing on antisemitic memes, particularly the Happy Merchant meme. Using our framework on a dataset extracted from 4chan, we find 3.3K variants of the Happy Merchant meme, with some linked to specific countries, persons, or organizations. We envision that our framework can be used to aid human moderators by flagging new variants of hateful memes so that moderators can manually verify them and mitigate the problem of hateful content online.
    AWT -- Clustering Meteorological Time Series Using an Aggregated Wavelet Tree. (arXiv:2212.06642v1 [cs.LG])
    Both clustering and outlier detection play an important role for meteorological measurements. We present the AWT algorithm, a clustering algorithm for time series data that also performs implicit outlier detection during the clustering. AWT integrates ideas of several well-known K-Means clustering algorithms. It chooses the number of clusters automatically based on a user-defined threshold parameter, and it can be used for heterogeneous meteorological input data as well as for data sets that exceed the available memory size. We apply AWT to crowd-sourced 2-m temperature data with an hourly resolution from the city of Vienna to detect outliers and to investigate whether the final clusters show general similarities and similarities with urban land-use characteristics. It is shown that both the outlier detection and the implicit mapping to land-use characteristics are possible with AWT, which opens new possible fields of application, specifically in the rapidly evolving field of urban climate and urban weather.
    The Turing Deception. (arXiv:2212.06721v1 [cs.LG])
    This research revisits the classic Turing test and compares recent large language models such as ChatGPT for their abilities to reproduce human-level comprehension and compelling text generation. Two task challenges -- summarization, and question answering -- prompt ChatGPT to produce original content (98-99%) from a single text entry and also sequential questions originally posed by Turing in 1950. The question of a machine fooling a human judge recedes in this work relative to the question of "how would one prove it?" The original contribution of the work presents a metric and simple grammatical set for understanding the writing mechanics of chatbots in evaluating their readability and statistical clarity, engagement, delivery, and overall quality. While Turing's original prose scores at least 14% below the machine-generated output, the question of whether an algorithm displays hints of Turing's truly original thoughts (the "Lovelace 2.0" test) remains unanswered and potentially unanswerable for now.
    A Machine Learning Enhanced Approach for Automated Sunquake Detection in Acoustic Emission Maps. (arXiv:2212.06717v1 [astro-ph.SR])
    Sunquakes are seismic emissions visible on the solar surface, associated with some solar flares. Although discovered in 1998, they have only recently become a more commonly detected phenomenon. Despite the availability of several manual detection guidelines, to our knowledge, the astrophysical data produced for sunquakes is new to the field of Machine Learning. Detecting sunquakes is a daunting task for human operators and this work aims to ease and, if possible, to improve their detection. Thus, we introduce a dataset constructed from acoustic egression-power maps of solar active regions obtained for Solar Cycles 23 and 24 using the holography method. We then present a pedagogical approach to the application of machine learning representation methods for sunquake detection using AutoEncoders, Contrastive Learning, Object Detection and recurrent techniques, which we enhance by introducing several custom domain-specific data augmentation transformations. We address the main challenges of the automated sunquake detection task, namely the very high noise patterns in and outside the active region shadow and the extreme class imbalance given by the limited number of frames that present sunquake signatures. With our trained models, we find temporal and spatial locations of peculiar acoustic emission and qualitatively associate them to eruptive and high energy emission. While noting that these models are still in a prototype stage and there is much room for improvement in metrics and bias levels, we hypothesize that their agreement on example use cases has the potential to enable detection of weak solar acoustic manifestations.
    Edge Computing for Semantic Communication Enabled Metaverse: An Incentive Mechanism Design. (arXiv:2212.06463v1 [cs.GT])
    Semantic communication (SemCom) and edge computing are two disruptive solutions to address emerging requirements of huge data communication, bandwidth efficiency and low latency data processing in the Metaverse. However, edge computing resources are often provided by computing service providers, so it is essential to design appealing incentive mechanisms for the provision of limited resources. Deep learning (DL)-based auctions have recently been proposed as incentive mechanisms that maximize revenue while holding important economic properties, i.e., individual rationality and incentive compatibility. Therefore, in this work, we introduce the design of a DL-based auction for computing resource allocation in the SemCom-enabled Metaverse. First, we briefly introduce the fundamentals and challenges of the Metaverse. Second, we present the preliminaries of SemCom and edge computing. Third, we review various incentive mechanisms for edge computing resource trading. Fourth, we present the design of the DL-based auction for edge resource allocation in the SemCom-enabled Metaverse. Simulation results demonstrate that the DL-based auction improves the revenue while nearly satisfying the individual rationality and incentive compatibility constraints.
    Forecasting Soil Moisture Using Domain Inspired Temporal Graph Convolution Neural Networks To Guide Sustainable Crop Management. (arXiv:2212.06565v1 [cs.LG])
    Climate change, population growth, and water scarcity present unprecedented challenges for agriculture. This project aims to forecast soil moisture using domain knowledge and machine learning for crop management decisions that enable sustainable farming. Traditional methods for predicting hydrological response features require significant computational time and expertise. Recent work has implemented machine learning models as a tool for forecasting hydrological response features, but these models neglect a crucial component of traditional hydrological modeling that spatially close units can have vastly different hydrological responses. In traditional hydrological modeling, units with similar hydrological properties are grouped together and share model parameters regardless of their spatial proximity. Inspired by this domain knowledge, we have constructed a novel domain-inspired temporal graph convolution neural network. Our approach involves clustering units based on time-varying hydrological properties, constructing graph topologies for each cluster, and forecasting soil moisture using graph convolutions and a gated recurrent neural network. We have trained, validated, and tested our method on field-scale time series data consisting of approximately 99,000 hydrological response units spanning 40 years in a case study in the northeastern United States. Comparison with existing models illustrates the effectiveness of using domain-inspired clustering with time series graph neural networks. The framework is being deployed as part of a pro bono social impact program. The trained models are being deployed on small-holding farms in central Texas.
    Universal Paralinguistic Speech Representations Using Self-Supervised Conformers. (arXiv:2110.04621v4 [cs.SD] UPDATED)
    Many speech applications require understanding aspects beyond the words being spoken, such as recognizing emotion, detecting whether the speaker is wearing a mask, or distinguishing real from synthetic speech. In this work, we introduce a new state-of-the-art paralinguistic representation derived from large-scale, fully self-supervised training of a 600M+ parameter Conformer-based architecture. We benchmark on a diverse set of speech tasks and demonstrate that simple linear classifiers trained on top of our time-averaged representation outperform nearly all previous results, in some cases by large margins. Our analyses of context window size demonstrate that, surprisingly, 2-second context windows achieve 96% of the performance of the Conformers that use the full long-term context on 7 out of 9 tasks. Furthermore, while the best per-task representations are extracted internally in the network, stable performance across several layers allows a single universal representation to reach near optimal performance on all tasks.
    DiffMD: A Geometric Diffusion Model for Molecular Dynamics Simulations. (arXiv:2204.08672v2 [cs.CE] UPDATED)
    Molecular dynamics (MD) has long been the de facto choice for simulating complex atomistic systems from first principles. Recently deep learning models become a popular way to accelerate MD. Notwithstanding, existing models depend on intermediate variables such as the potential energy or force fields to update atomic positions, which requires additional computations to perform back-propagation. To waive this requirement, we propose a novel model called DiffMD by directly estimating the gradient of the log density of molecular conformations. DiffMD relies on a score-based denoising diffusion generative model that perturbs the molecular structure with a conditional noise depending on atomic accelerations and treats conformations at previous timeframes as the prior distribution for sampling. Another challenge of modeling such a conformation generation process is that a molecule is kinetic instead of static, which no prior works have strictly studied. To solve this challenge, we propose an equivariant geometric Transformer as the score function in the diffusion process to calculate corresponding gradients. It incorporates the directions and velocities of atomic motions via 3D spherical Fourier-Bessel representations. With multiple architectural improvements, we outperform state-of-the-art baselines on MD17 and isomers of C7O2H10 datasets. This work contributes to accelerating material and drug discovery.
    Wind power predictions from nowcasts to 4-hour forecasts: a learning approach with variable selection. (arXiv:2204.09362v2 [cs.LG] UPDATED)
    We study short-term prediction of wind speed and wind power (every 10 minutes up to 4 hours ahead). Accurate forecasts for these quantities are crucial to mitigate the negative effects of wind farms' intermittent production on energy systems and markets. We use machine learning to combine outputs from numerical weather prediction models with local observations. The former provide valuable information on larger-scale dynamics, while the latter give the model fresher and location-specific data. So as to make the results usable for practitioners, we focus on well-known methods which can handle a high volume of data. We first study variable selection using both a linear technique and a nonlinear one. Then we exploit these results to forecast wind speed and wind power, still with an emphasis on linear models versus nonlinear ones. For wind power prediction, we also compare the indirect approach (wind speed predictions passed through a power curve) with the direct one (predicting wind power directly).
    Generating Contextual Load Profiles Using a Conditional Variational Autoencoder. (arXiv:2209.04056v2 [cs.LG] UPDATED)
    Generating power system states that have similar distribution and dependency to the historical ones is essential for the tasks of system planning and security assessment, especially when the historical data is insufficient. In this paper, we describe a generative model for load profiles of industrial and commercial customers, based on the conditional variational autoencoder (CVAE) neural network architecture; the task is challenging due to the highly variable nature of such profiles. Generated contextual load profiles are conditioned on the month of the year and the typical power exchange with the grid. Moreover, the quality of the generated profiles is evaluated both visually and statistically. The experimental results demonstrate that our proposed CVAE model can capture temporal features of historical load profiles and generate `realistic' data with satisfying univariate distributions and multivariate dependencies.
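    To make the conditioning scheme concrete, here is a minimal PyTorch sketch of a CVAE decoder that concatenates the latent code with the conditions (a one-hot month and a scalar power-exchange level). The profile length, layer sizes, and concatenation scheme are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class CVAEDecoder(nn.Module):
    """Decoder of a conditional VAE for daily load profiles.

    Conditions: one-hot month (12) plus a scalar 'typical power exchange'.
    The 96-step output (15-minute resolution) is an assumed profile length.
    """
    def __init__(self, latent_dim: int = 8, cond_dim: int = 13, out_len: int = 96):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + cond_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, out_len),
        )

    def forward(self, z: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # Conditioning by concatenation: the decoder sees [z, cond].
        return self.net(torch.cat([z, cond], dim=-1))

# Generating a profile for March at a given power-exchange level:
dec = CVAEDecoder()
month = torch.zeros(1, 12); month[0, 2] = 1.0
cond = torch.cat([month, torch.tensor([[0.4]])], dim=-1)
profile = dec(torch.randn(1, 8), cond)   # (1, 96) synthetic load profile
```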
    Proximal Policy Optimization Based Reinforcement Learning for Joint Bidding in Energy and Frequency Regulation Markets. (arXiv:2212.06551v1 [eess.SY])
    Driven by the global decarbonization effort, the rapid integration of renewable energy into the conventional electricity grid presents new challenges and opportunities for the battery energy storage system (BESS) participating in the energy market. Energy arbitrage can be a significant source of revenue for the BESS due to the increasing price volatility in the spot market caused by the mismatch between renewable generation and electricity demand. In addition, the Frequency Control Ancillary Services (FCAS) markets established to stabilize the grid can offer higher returns for the BESS due to their capability to respond within milliseconds. Therefore, it is crucial for the BESS to carefully decide how much capacity to assign to each market to maximize the total profit under uncertain market conditions. This paper formulates the bidding problem of the BESS as a Markov Decision Process, which enables the BESS to participate in both the spot market and the FCAS market to maximize profit. Then, Proximal Policy Optimization, a model-free deep reinforcement learning algorithm, is employed to learn the optimal bidding strategy from the dynamic environment of the energy market under a continuous bidding scale. The proposed model is trained and validated using real-world historical data of the Australian National Electricity Market. The results demonstrate that our developed joint bidding strategy in both markets is significantly profitable compared to individual markets.
    How Does Independence Help Generalization? Sample Complexity of ERM on Product Distributions. (arXiv:2212.06422v1 [cs.LG])
    While many classical notions of learnability (e.g., PAC learnability) are distribution-free, utilizing the specific structures of an input distribution may improve learning performance. For example, a product distribution on a multi-dimensional input space has a much simpler structure than a correlated distribution. A recent paper [GHTZ21] shows that the sample complexity of a general learning problem on product distributions is polynomial in the input dimension, which is exponentially smaller than that on correlated distributions. However, the learning algorithm they use is not the standard Empirical Risk Minimization (ERM) algorithm. In this note, we characterize the sample complexity of ERM in a general learning problem on product distributions. We show that, even though product distributions are simpler than correlated distributions, ERM still needs an exponential number of samples to learn on product distributions, instead of a polynomial number. This leads to the conclusion that a product distribution by itself does not make a learning problem easier -- an algorithm designed specifically for product distributions is needed.
    Adversarial Attacks and Defences for Skin Cancer Classification. (arXiv:2212.06822v1 [cs.CV])
    There has been a concurrent significant improvement in the medical images used to facilitate diagnosis and the performance of machine learning techniques to perform tasks such as classification, detection, and segmentation in recent years. As a result, a rapid increase in the usage of such systems can be observed in the healthcare industry, for instance in the form of medical image classification systems, where these models have achieved diagnostic parity with human physicians. One such application where this can be observed is in computer vision tasks such as the classification of skin lesions in dermatoscopic images. However, as stakeholders in the healthcare industry, such as insurance companies, continue to invest extensively in machine learning infrastructure, it becomes increasingly important to understand the vulnerabilities in such systems. Due to the highly critical nature of the tasks being carried out by these machine learning models, it is necessary to analyze techniques that could be used to take advantage of these vulnerabilities and methods to defend against them. This paper explores common adversarial attack techniques. The Fast Gradient Sign Method (FGSM) and Projected Gradient Descent (PGD) are used against a Convolutional Neural Network trained to classify dermatoscopic images of skin lesions. Following that, it also discusses one of the most popular adversarial defense techniques, adversarial training. The performance of the model trained on adversarial examples is then tested against the aforementioned attacks, and recommendations to improve the robustness of neural networks are provided based on the results of the experiment.
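    Both attacks are standard and compact enough to sketch. The PyTorch snippet below gives the textbook formulations; the [0, 1] pixel range and L-infinity projection are common conventions, and the paper's exact hyperparameters for dermatoscopic images are not reproduced here.

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps):
    """Fast Gradient Sign Method: one signed-gradient step of size eps."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    return (x + eps * x.grad.sign()).clamp(0, 1).detach()

def pgd(model, x, y, eps, alpha, steps):
    """Projected Gradient Descent: iterated FGSM, projected to the eps-ball."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = x + (x_adv - x).clamp(-eps, eps)   # project to the L_inf ball
        x_adv = x_adv.clamp(0, 1)                  # stay in valid pixel range
    return x_adv.detach()
```

    Adversarial training, the defense discussed above, then amounts to generating such examples on the fly during training and minimizing the loss on them instead of (or alongside) the clean inputs.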
    Self-adaptive algorithms for quasiconvex programming and applications to machine learning. (arXiv:2212.06379v1 [math.OC])
    For solving a broad class of nonconvex programming problems on an unbounded constraint set, we provide a self-adaptive step-size strategy that does not include line-search techniques and establishes the convergence of a generic approach under mild assumptions. Specifically, the objective function may not satisfy the convexity condition. Unlike descent line-search algorithms, it does not need a known Lipschitz constant to figure out how big the first step should be. The crucial feature of this process is the steady reduction of the step size until a certain condition is fulfilled. In particular, it can provide a new gradient projection approach to optimization problems with an unbounded constrained set. The correctness of the proposed method is verified by preliminary results from some computational examples. To demonstrate the effectiveness of the proposed technique for large-scale problems, we apply it to some experiments on machine learning, such as supervised feature selection, multi-variable logistic regressions and neural networks for classification.
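    As one concrete instance of a line-search-free, Lipschitz-free self-adaptive step size (in the spirit of the strategy described above, though not necessarily the authors' exact rule; the update below follows Malitsky and Mishchenko's adaptive gradient descent), consider projected gradient descent where the step may grow geometrically but is capped by a local curvature estimate.

```python
import numpy as np

def adaptive_projected_gradient(grad, proj, x0, iters=500, lam0=1e-3):
    """Projected gradient with a self-adaptive step size: no line search and
    no global Lipschitz constant required. Sketch after Malitsky & Mishchenko
    (2020); shown here only as an illustration of the idea."""
    x_prev, g_prev = x0, grad(x0)
    lam, theta = lam0, 1e9
    x = proj(x_prev - lam * g_prev)
    for _ in range(iters):
        g = grad(x)
        num = np.linalg.norm(x - x_prev)
        den = 2.0 * np.linalg.norm(g - g_prev) + 1e-12
        lam_new = min(np.sqrt(1.0 + theta) * lam, num / den)  # curvature cap
        theta = lam_new / lam
        x_prev, g_prev, lam = x, g, lam_new
        x = proj(x - lam * g)
    return x

# Example: logistic loss minimized over the (unbounded) nonnegative orthant.
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 10))
y = rng.integers(0, 2, 100) * 2 - 1          # labels in {-1, +1}

def logistic_grad(w):
    s = 1.0 / (1.0 + np.exp(y * (A @ w)))    # sigmoid(-y * <a_i, w>)
    return -(A.T @ (y * s)) / len(y)

w = adaptive_projected_gradient(logistic_grad, lambda v: np.maximum(v, 0.0),
                                np.zeros(10))
print("gradient norm at the returned point:", np.linalg.norm(logistic_grad(w)))
```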
    GPViT: A High Resolution Non-Hierarchical Vision Transformer with Group Propagation. (arXiv:2212.06795v1 [cs.CV])
    We present the Group Propagation Vision Transformer (GPViT): a novel nonhierarchical (i.e. non-pyramidal) transformer model designed for general visual recognition with high-resolution features. High-resolution features (or tokens) are a natural fit for tasks that involve perceiving fine-grained details such as detection and segmentation, but exchanging global information between these features is expensive in memory and computation because of the way self-attention scales. We provide a highly efficient alternative Group Propagation Block (GP Block) to exchange global information. In each GP Block, features are first grouped together by a fixed number of learnable group tokens; we then perform Group Propagation where global information is exchanged between the grouped features; finally, global information in the updated grouped features is returned back to the image features through a transformer decoder. We evaluate GPViT on a variety of visual recognition tasks including image classification, semantic segmentation, object detection, and instance segmentation. Our method achieves significant performance gains over previous works across all tasks, especially on tasks that require high-resolution outputs, for example, our GPViT-L3 outperforms Swin Transformer-B by 2.0 mIoU on ADE20K semantic segmentation with only half as many parameters. Code and pre-trained models are available at https://github.com/ChenhongyiYang/GPViT .
    Model-Free Approach to Fair Solar PV Curtailment Using Reinforcement Learning. (arXiv:2212.06542v1 [eess.SY])
    The rapid adoption of residential solar photovoltaics (PV) has resulted in regular overvoltage events, due to correlated reverse power flows. Currently, PV inverters prevent damage to electronics by curtailing energy production in response to overvoltage. However, this disproportionately affects households at the far end of the feeder, leading to an unfair allocation of the potential value of energy produced. Globally optimizing for fair curtailment requires accurate feeder parameters, which are often unknown. This paper investigates reinforcement learning, which gradually optimizes a fair PV curtailment strategy by interacting with the system. We evaluate six fairness metrics on how well they can be learned compared to an optimal solution oracle. We show that all definitions permit efficient learning, suggesting that reinforcement learning is a promising approach to achieving both safe and fair PV coordination.
    AI Model Utilization Measurements For Finding Class Encoding Patterns. (arXiv:2212.06576v1 [cs.LG])
    This work addresses the problems of (a) designing utilization measurements of trained artificial intelligence (AI) models and (b) explaining how training data are encoded in AI models based on those measurements. The problems are motivated by the lack of explainability of AI models in security and safety critical applications, such as the use of AI models for classification of traffic signs in self-driving cars. We approach the problems by introducing theoretical underpinnings of AI model utilization measurement and understanding patterns in utilization-based class encodings of traffic signs at the level of computation graphs (AI models), subgraphs, and graph nodes. Conceptually, utilization is defined at each graph node (computation unit) of an AI model based on the number and distribution of unique outputs in the space of all possible outputs (tensor-states). In this work, utilization measurements are extracted from AI models, which include poisoned and clean AI models. In contrast to clean AI models, the poisoned AI models were trained with traffic sign images containing systematic, physically realizable, traffic sign modifications (i.e., triggers) to change a correct class label to another label in a presence of such a trigger. We analyze class encodings of such clean and poisoned AI models, and conclude with implications for trojan injection and detection.
    On Mini-Batch Training with Varying Length Time Series. (arXiv:2212.06536v1 [cs.LG])
    In real-world time series recognition applications, it is possible to have data with varying length patterns. However, when using artificial neural networks (ANN), it is standard practice to use fixed-sized mini-batches. To do this, time series data with varying lengths are typically normalized so that all the patterns are the same length. Normally, this is done using zero padding or truncation without much consideration. We propose a novel method of normalizing the lengths of the time series in a dataset by exploiting the dynamic matching ability of Dynamic Time Warping (DTW). In this way, the time series lengths in a dataset can be set to a fixed size while maintaining features typical to the dataset. In the experiments, all 11 datasets with varying length time series from the 2018 UCR Time Series Archive are used. We evaluate the proposed method by comparing it with 18 other length normalization methods on a Convolutional Neural Network (CNN), a Long-Short Term Memory network (LSTM), and a Bidirectional LSTM (BLSTM).
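    A minimal sketch of the idea follows: a plain O(nm) DTW alignment plus a simple aggregation rule that warps each series onto a reference of the target length. The averaging rule and the choice of reference are illustrative assumptions; the paper's exact normalization procedure may differ.

```python
import numpy as np

def dtw_path(a, b):
    """Plain O(n*m) DTW; returns the optimal alignment path as (i, j) pairs."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    i, j, path = n, m, [(n - 1, m - 1)]
    while (i, j) != (1, 1):      # backtrack along the cheapest predecessors
        moves = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        i, j = min(moves, key=lambda t: D[t])
        path.append((i - 1, j - 1))
    return path[::-1]

def normalize_length(series, reference):
    """Resize `series` to len(reference) steps by averaging the points that
    DTW aligns to each reference index, preserving dataset-typical features."""
    out = np.zeros(len(reference))
    cnt = np.zeros(len(reference))
    for ri, si in dtw_path(reference, series):
        out[ri] += series[si]
        cnt[ri] += 1
    return out / cnt

rng = np.random.default_rng(0)
ref = np.sin(np.linspace(0, 2 * np.pi, 100))
short = np.sin(np.linspace(0, 2 * np.pi, 63)) + 0.05 * rng.standard_normal(63)
print(normalize_length(short, ref).shape)   # (100,): fixed length for batching
```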
    One-shot Machine Teaching: Cost Very Few Examples to Converge Faster. (arXiv:2212.06416v1 [cs.LG])
    Artificial intelligence aims to teach machines to take actions like humans. To achieve intelligent teaching, the machine learning community has begun to consider a promising topic named machine teaching, where the teacher must design the optimal (usually minimal) teaching set given a target model and a specific learner. However, previous works usually require numerous teaching examples along with many iterations to guide learners to convergence, which is costly. In this paper, we consider a more intelligent teaching paradigm named one-shot machine teaching, which needs fewer examples to converge faster. Different from typical teaching, this advanced paradigm establishes a tractable mapping from the teaching set to the model parameters. Theoretically, we prove that this mapping is surjective, which serves as an existence guarantee for the optimal teaching set. Then, relying on this surjective mapping, we develop a design strategy for the optimal teaching set under appropriate settings, assessed with two popular efficiency metrics: the teaching dimension and the iterative teaching dimension. Extensive experiments verify the efficiency of our strategy and further demonstrate the intelligence of this new teaching paradigm.  ( 2 min )
    Can recurrent neural networks learn process model structure? (arXiv:2212.06430v1 [cs.LG])
    Various methods using machine and deep learning have been proposed to tackle different tasks in predictive process monitoring, forecasting, for an ongoing case, e.g., the most likely next event or suffix, its remaining time, or an outcome-related variable. Recurrent neural networks (RNNs), and more specifically long short-term memory nets (LSTMs), stand out in terms of popularity. In this work, we investigate the capabilities of such an LSTM to actually learn the underlying process model structure of an event log. We introduce an evaluation framework that combines variant-based resampling and custom metrics for fitness, precision and generalization. We evaluate 4 hypotheses concerning the learning capabilities of LSTMs, the effect of overfitting countermeasures, the level of incompleteness in the training set, and the level of parallelism in the underlying process model. We confirm that LSTMs can struggle to learn process model structure, even with simplistic process data and in a very lenient setup. Taking the correct anti-overfitting measures can alleviate the problem. However, these measures did not turn out to be optimal when selecting hyperparameters purely on prediction accuracy. We also found that decreasing the amount of information seen by the LSTM during training causes a sharp drop in generalization and precision scores. In our experiments, we could not identify a relationship between the extent of parallelism in the model and the generalization capability, but the results do indicate that the process's complexity might have an impact.  ( 2 min )
    Score-based Generative Modeling Secretly Minimizes the Wasserstein Distance. (arXiv:2212.06359v1 [cs.LG])
    Score-based generative models are shown to achieve remarkable empirical performance in various applications such as image generation and audio synthesis. However, a theoretical understanding of score-based diffusion models is still incomplete. Recently, Song et al. showed that the training objective of score-based generative models is equivalent to minimizing the Kullback-Leibler divergence of the generated distribution from the data distribution. In this work, we show that score-based models also minimize the Wasserstein distance between the generated and data distributions under suitable assumptions on the model. Specifically, we prove that the Wasserstein distance is upper bounded by the square root of the objective function up to multiplicative constants and a fixed constant offset. Our proof is based on a novel application of the theory of optimal transport, which can be of independent interest to the community. Our numerical experiments support our findings. By analyzing our upper bounds, we provide a few techniques to obtain tighter upper bounds.  ( 2 min )
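    In our notation (the constants are left unspecified here), the stated bound has the schematic form $W_2(p_{\mathrm{data}}, p_{\theta}) \le C_1 \sqrt{J(\theta)} + C_2$, where $W_2$ is the 2-Wasserstein distance, $p_{\theta}$ the generated distribution, and $J(\theta)$ the score-matching training objective.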
    Over-The-Air Federated Learning Over Scalable Cell-free Massive MIMO. (arXiv:2212.06482v1 [eess.SP])
    Cell-free massive MIMO is emerging as a promising technology for future wireless communication systems, which is expected to offer uniform coverage and high spectral efficiency compared to classical cellular systems. We study in this paper how cell-free massive MIMO can support federated edge learning. Taking advantage of the additive nature of the wireless multiple access channel, over-the-air computation is exploited, where the clients send their local updates simultaneously over the same communication resource. This approach, known as over-the-air federated learning (OTA-FL), is proven to alleviate the communication overhead of federated learning over wireless networks. Considering channel correlation and only imperfect channel state information available at the central server, we propose a practical implementation of OTA-FL over cell-free massive MIMO. The convergence of the proposed implementation is studied analytically and experimentally, confirming the benefits of cell-free massive MIMO for OTA-FL.  ( 2 min )
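    The over-the-air computation idea can be pictured with a toy sketch (our illustration only; the paper's system model additionally involves fading channels, precoding, and imperfect CSI): because the clients transmit simultaneously, the server observes a noisy superposition of the local updates rather than each update individually.

```python
# Toy sketch of over-the-air aggregation (our illustration; the paper's
# system model adds fading, precoding, and imperfect CSI). All clients
# transmit at once, so the server receives a noisy sum of local updates
# in a single channel use.
import numpy as np

rng = np.random.default_rng(0)
local_updates = rng.standard_normal((10, 5))     # 10 clients, 5 model parameters
channel_noise = 0.01 * rng.standard_normal(5)    # additive noise at the server

received = local_updates.sum(axis=0) + channel_noise   # one channel use, not 10
global_update = received / len(local_updates)          # server-side averaging
print(global_update)
```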
    Dual Accuracy-Quality-Driven Neural Network for Prediction Interval Generation. (arXiv:2212.06370v1 [cs.LG])
    Accurate uncertainty quantification is necessary to enhance the reliability of deep learning models in real-world applications. In the case of regression tasks, prediction intervals (PIs) should be provided along with the deterministic predictions of deep learning models. Such PIs are useful or "high-quality" as long as they are sufficiently narrow and capture most of the probability density. In this paper, we present a method to learn prediction intervals for regression-based neural networks automatically in addition to the conventional target predictions. In particular, we train two companion neural networks: one that uses one output, the target estimate, and another that uses two outputs, the upper and lower bounds of the corresponding PI. Our main contribution is the design of a loss function for the PI-generation network that takes into account the output of the target-estimation network and has two optimization objectives: minimizing the mean prediction interval width and ensuring the PI integrity using constraints that maximize the prediction interval probability coverage implicitly. Both objectives are balanced within the loss function using a self-adaptive coefficient. Furthermore, we apply a Monte Carlo-based approach that evaluates the model uncertainty in the learned PIs. Experiments using a synthetic dataset, six benchmark datasets, and a real-world crop yield prediction dataset showed that our method was able to maintain a nominal probability coverage and produce narrower PIs without detriment to its target estimation accuracy when compared to those PIs generated by three state-of-the-art neural-network-based methods.  ( 2 min )
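    A drastically simplified rendering of such a dual-objective loss (our own sketch, not the paper's formulation; the adaptive-coefficient form and the penalty weight `eta` are assumptions) might look like:

```python
# Sketch of a dual-objective prediction-interval loss (our simplification,
# not the paper's exact formulation): mean PI width plus a coverage penalty,
# balanced by a self-adaptive coefficient. For gradient training, the hard
# coverage indicator would be replaced by a smooth surrogate (e.g., sigmoid).
import numpy as np

def pi_loss(y, lower, upper, nominal=0.95, eta=10.0):
    width = np.mean(upper - lower)              # mean PI width (to minimize)
    covered = (y >= lower) & (y <= upper)
    coverage = covered.mean()                   # empirical coverage probability
    gap = max(nominal - coverage, 0.0)          # shortfall below nominal level
    lam = 1.0 + eta * gap                       # self-adaptive coefficient (assumed form)
    return width + lam * gap

y = np.array([1.0, 2.0, 3.0, 4.0])
lo, hi = y - 0.5, y + 0.5
print(pi_loss(y, lo, hi))   # full coverage -> loss equals the mean width (1.0)
```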
    Leave Graphs Alone: Addressing Over-Squashing without Rewiring. (arXiv:2212.06538v1 [cs.LG])
    Recent works have investigated the role of graph bottlenecks in preventing long-range information propagation in message-passing graph neural networks, causing the so-called `over-squashing' phenomenon. As a remedy, graph rewiring mechanisms have been proposed as preprocessing steps. Graph Echo State Networks (GESNs) are a reservoir computing model for graphs, where node embeddings are recursively computed by an untrained message-passing function. In this paper, we show that GESNs can achieve significantly better accuracy on six heterophilic node classification tasks without altering the graph connectivity, thus suggesting a different route for addressing the over-squashing problem.  ( 2 min )
    Regularized Optimal Transport Layers for Generalized Global Pooling Operations. (arXiv:2212.06339v1 [cs.LG])
    Global pooling is one of the most significant operations in many machine learning models and tasks, serving both information fusion and the representation of structured data (like sets and graphs). However, without solid mathematical foundations, its practical implementations often depend on empirical mechanisms and thus lead to sub-optimal, even unsatisfactory performance. In this work, we develop a novel and generalized global pooling framework through the lens of optimal transport. The proposed framework is interpretable from the perspective of expectation-maximization. Essentially, it aims at learning an optimal transport across sample indices and feature dimensions, making the corresponding pooling operation maximize the conditional expectation of input data. We demonstrate that most existing pooling methods are equivalent to solving a regularized optimal transport (ROT) problem with different specializations, and more sophisticated pooling operations can be implemented by hierarchically solving multiple ROT problems. Making the parameters of the ROT problem learnable, we develop a family of regularized optimal transport pooling (ROTP) layers. We implement the ROTP layers as a new kind of deep implicit layer. Their model architectures correspond to different optimization algorithms. We test our ROTP layers in several representative set-level machine learning scenarios, including multi-instance learning (MIL), graph classification, graph set representation, and image classification. Experimental results show that applying our ROTP layers can reduce the difficulty of the design and selection of global pooling -- our ROTP layers may either imitate some existing global pooling methods or lead to new pooling layers that fit the data better. The code is available at \url{https://github.com/SDS-Lab/ROT-Pooling}.  ( 2 min )
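    At the core of such layers is an entropic-regularized OT problem, which is classically solved with Sinkhorn iterations; below is a minimal sketch of OT-weighted pooling (our illustration only; the actual ROTP layers are in the linked repository):

```python
# Sketch of pooling via entropic-regularized optimal transport, solved with
# Sinkhorn iterations (our illustration; see the released ROTP code for the
# actual layers). X has shape (n_samples, d_features); the transport plan P
# reweights samples per feature dimension before pooling.
import numpy as np

def sinkhorn(C, a, b, eps=0.1, iters=200):
    """Entropic OT plan between marginals a (n,) and b (m,) for cost C (n, m)."""
    K = np.exp(-C / eps)
    u = np.ones_like(a)
    for _ in range(iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

def rot_pool(X, eps=0.1):
    n, d = X.shape
    C = -X                                   # high activations -> low cost
    a = np.full(n, 1.0 / n)                  # uniform marginal over sample indices
    b = np.full(d, 1.0 / d)                  # uniform marginal over feature dims
    P = sinkhorn(C, a, b, eps=eps)           # transport plan, shape (n, d)
    w = P / P.sum(axis=0, keepdims=True)     # per-feature weights over samples
    return (w * X).sum(axis=0)               # weighted pooling per feature

X = np.random.default_rng(0).standard_normal((6, 4))
print(rot_pool(X))        # one pooled value per feature dimension
```

    Intuitively, a small regularization `eps` concentrates the plan on large activations (max-like pooling), while a large `eps` spreads it out (mean-like pooling), which is one way to see how a single parametrized layer can interpolate between existing pooling methods.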
    CropCat: Data Augmentation for Smoothing the Feature Distribution of EEG Signals. (arXiv:2212.06413v1 [cs.LG])
    Brain-computer interface (BCI) is a communication system between humans and computers that reflects human intention without using a physical control device. Since deep learning is robust in extracting features from data, research on decoding electroencephalograms with deep learning has progressed in the BCI domain. However, the application of deep learning in the BCI domain suffers from a lack of data and overconfidence. To solve these issues, we propose a novel data augmentation method, CropCat. CropCat comes in two versions, CropCat-spatial and CropCat-temporal. We design our method by cropping two data samples that have different labels along the spatial or temporal axis and concatenating the cropped segments. In addition, we adjust the label based on the ratio of the cropped lengths. As a result, the data generated by our proposed method helps revise the ambiguous decision boundary, caused by a lack of data, into an apparent one. Owing to the effectiveness of the proposed method, the performance of four EEG signal decoding models improves on two public motor imagery datasets compared to when the proposed method is not applied. Hence, we demonstrate that data generated by CropCat smooths the feature distribution of EEG signals when training the model.  ( 2 min )
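    A sketch of the temporal variant as we read the description (the crop point and the soft-label encoding are our assumptions):

```python
# Sketch of CropCat-temporal as described in the abstract (crop point choice
# and label encoding are our assumptions). EEG trials have shape
# (channels, time); the mixed label is weighted by the cropped-length ratio.
import numpy as np

def cropcat_temporal(x1, y1, x2, y2, cut):
    """Concatenate x1[:, :cut] with x2[:, cut:]; mix labels by length ratio."""
    T = x1.shape[1]
    x_aug = np.concatenate([x1[:, :cut], x2[:, cut:]], axis=1)
    ratio = cut / T
    y_aug = ratio * y1 + (1.0 - ratio) * y2   # soft label from one-hot inputs
    return x_aug, y_aug

rng = np.random.default_rng(0)
a, b = rng.standard_normal((22, 1000)), rng.standard_normal((22, 1000))
x_aug, y_aug = cropcat_temporal(a, np.array([1., 0.]), b, np.array([0., 1.]), cut=400)
print(x_aug.shape, y_aug)   # (22, 1000) [0.4 0.6]
```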
    FNDaaS: Content-agnostic Detection of Fake News sites. (arXiv:2212.06492v1 [cs.CY])
    Automatic fake news detection is a challenging problem in misinformation spreading, and it has tremendous real-world political and social impacts. Past studies have proposed machine learning-based methods for detecting such fake news, focusing on different properties of the published news articles, such as linguistic characteristics of the actual content, which, however, have limitations due to language barriers. Departing from such efforts, we propose FNDaaS, the first automatic, content-agnostic fake news detection method, which considers new and unstudied features such as network and structural characteristics per news website. This method can be deployed as-a-Service, either at the ISP side for easier scalability and maintenance, or at the user side for better end-user privacy. We demonstrate the efficacy of our method using data crawled from existing lists of 637 fake and 1183 real news websites, and by building and testing a proof-of-concept system that materializes our proposal. Our analysis of data collected from these websites shows that the vast majority of fake news domains are very young and appear to keep an IP address associated with their domain for shorter periods than real news sites. By conducting various experiments with machine learning classifiers, we demonstrate that FNDaaS can achieve an AUC score of up to 0.967 on past sites, and up to 77-92% accuracy on newly-flagged ones.  ( 2 min )
    A Review of Off-Policy Evaluation in Reinforcement Learning. (arXiv:2212.06355v1 [stat.ML])
    Reinforcement learning (RL) is one of the most vibrant research frontiers in machine learning and has been recently applied to solve a number of challenging problems. In this paper, we primarily focus on off-policy evaluation (OPE), one of the most fundamental topics in RL. In recent years, a number of OPE methods have been developed in the statistics and computer science literature. We provide a discussion on the efficiency bound of OPE, some of the existing state-of-the-art OPE methods, their statistical properties and some other related research directions that are currently actively explored.  ( 2 min )
    Simplicity Bias Leads to Amplified Performance Disparities. (arXiv:2212.06641v1 [cs.LG])
    The simple idea that not all things are equally difficult has surprising implications when applied in a fairness context. In this work we explore how "difficulty" is model-specific, such that different models find different parts of a dataset challenging. When difficulty correlates with group information, we term this difficulty disparity. Drawing a connection with recent work exploring the inductive bias towards simplicity of SGD-trained models, we show that when such a disparity exists, it is further amplified by commonly-used models. We quantify this amplification factor across a range of settings aiming towards a fuller understanding of the role of model bias. We also present a challenge to the simplifying assumption that "fixing" a dataset is sufficient to ensure unbiased performance.  ( 2 min )
    Scalable and Sample Efficient Distributed Policy Gradient Algorithms in Multi-Agent Networked Systems. (arXiv:2212.06357v1 [cs.MA])
    This paper studies a class of multi-agent reinforcement learning (MARL) problems where the reward that an agent receives depends on the states of other agents, but the next state only depends on the agent's own current state and action. We name it REC-MARL, standing for REward-Coupled Multi-Agent Reinforcement Learning. REC-MARL has a range of important applications such as real-time access control and distributed power control in wireless networks. This paper presents a distributed and optimal policy gradient algorithm for REC-MARL. The proposed algorithm is distributed in two aspects: (i) the learned policy is a distributed policy that maps a local state of an agent to its local action and (ii) the learning/training is distributed, during which each agent updates its policy based on its own and neighbors' information. The learned policy is provably optimal among all local policies and its regret bounds depend on the dimension of local states and actions. This distinguishes our result from most existing results on MARL, which often obtain stationary-point policies. The experimental results of our algorithm for real-time access control and power control in wireless networks show that our policy significantly outperforms the state-of-the-art algorithms and well-known benchmarks.
    Graph Convolutional Networks for Traffic Forecasting with Missing Values. (arXiv:2212.06419v1 [cs.LG])
    Traffic forecasting has attracted widespread attention recently. In reality, traffic data usually contain missing values due to sensor or communication errors. The spatio-temporal structure of traffic data brings more challenges for processing such missing values, for which classic techniques (e.g., data imputation) are limited: 1) on the temporal axis, the values can be randomly or consecutively missing; 2) on the spatial axis, the missing values can happen on one single sensor or on multiple sensors simultaneously. Recent models powered by Graph Neural Networks have achieved satisfying performance on traffic forecasting tasks. However, few of them are applicable to such a complex missing-value context. To this end, we propose GCN-M, a Graph Convolutional Network model with the ability to handle the complex missing values in the spatio-temporal context. Particularly, we jointly model the missing value processing and traffic forecasting tasks, considering both local spatio-temporal features and global historical patterns in an attention-based memory network. We also propose a dynamic graph learning module based on the learned local-global features. The experimental results on real-life datasets show the reliability of our proposed method.  ( 2 min )
    Improving generalization in reinforcement learning through forked agents. (arXiv:2212.06451v1 [cs.AI])
    An ecosystem of agents, each having its own policy with some, but limited, generalizability, has proven to be a reliable approach to increase generalization across procedurally generated environments. In such an approach, new agents are regularly added to the ecosystem when encountering a new environment that is outside of the scope of the ecosystem. The speed of adaptation and general effectiveness of the ecosystem approach highly depend on the initialization of new agents. In this paper we propose different techniques for such initialization and study their impact. We then rework the ecosystem setup to use forked agents, which brings better results than the initial ecosystem approach with a drastically reduced number of training cycles.  ( 2 min )
    Numerical Stability of DeepGOPlus Inference. (arXiv:2212.06361v1 [cs.LG])
    Convolutional neural networks (CNNs) are currently among the most widely-used neural networks and achieve state-of-the-art performance for many problems. While originally applied to computer vision tasks, CNNs work well with any data with a spatial relationship, besides images, and have been applied to different fields. However, recent works have highlighted how CNNs, like other deep learning models, are sensitive to noise injection, which can jeopardise their performance. This paper quantifies the numerical uncertainty due to floating point arithmetic inaccuracies in the inference stage of DeepGOPlus, a CNN that predicts protein function, in order to determine its numerical stability. In addition, this paper investigates the possibility of using reduced-precision floating point formats for DeepGOPlus inference to reduce memory consumption and latency. This is achieved with Monte Carlo Arithmetic, a technique that experimentally quantifies floating point operation errors, and VPREC, a tool that emulates results with customizable floating point precision formats. Focus is placed on the inference stage as it is the main deliverable of the DeepGOPlus model that will be used across environments and therefore most likely be subjected to the most noise. Furthermore, studies have shown that the inference stage is the part of the model that is most amenable to reduced precision. All in all, we find that the numerical uncertainty of the DeepGOPlus CNN is very low at its current numerical precision format, but that the model cannot currently be reduced to a lower precision that might render it more lightweight.  ( 2 min )
    Dual adaptive training of photonic neural networks. (arXiv:2212.06141v1 [cs.LG])
    A photonic neural network (PNN) is a remarkable analog artificial intelligence (AI) accelerator that computes with photons instead of electrons to feature low latency, high energy efficiency, and high parallelism. However, existing training approaches cannot address the extensive accumulation of systematic errors in large-scale PNNs, resulting in a significant decrease in model performance in physical systems. Here, we propose dual adaptive training (DAT) that allows the PNN model to adapt to substantial systematic errors and preserves its performance during deployment. By introducing systematic error prediction networks with task-similarity joint optimization, DAT achieves a high-similarity mapping between the PNN numerical models and physical systems, and highly accurate gradient calculations during the dual backpropagation training. We validated the effectiveness of DAT using diffractive PNNs and interference-based PNNs on image classification tasks. DAT successfully trained large-scale PNNs under major systematic errors and preserved model classification accuracies comparable to error-free systems. The results further demonstrate its superior performance over state-of-the-art in situ training approaches. DAT provides critical support for constructing large-scale PNNs to achieve advanced architectures and can be generalized to other types of AI systems with analog computing errors.  ( 2 min )
    PPO-UE: Proximal Policy Optimization via Uncertainty-Aware Exploration. (arXiv:2212.06343v1 [cs.LG])
    Proximal Policy Optimization (PPO) is a highly popular policy-based deep reinforcement learning (DRL) approach. However, we observe that the homogeneous exploration process in PPO can cause an unexpected stability issue in the training phase. To address this issue, we propose PPO-UE, a PPO variant equipped with self-adaptive uncertainty-aware exploration (UE) based on a ratio uncertainty level. The proposed PPO-UE is designed to improve convergence speed and performance with an optimized ratio uncertainty level. An extensive sensitivity analysis varying the ratio uncertainty level shows that our proposed PPO-UE considerably outperforms the baseline PPO in Roboschool continuous control tasks.  ( 2 min )
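    For context, the clipped surrogate objective that standard PPO optimizes, and on which PPO-UE builds, is shown below; the uncertainty-aware exploration term itself is not specified in the abstract and is therefore omitted.

```python
# The standard PPO clipped surrogate that PPO-UE builds on (the
# uncertainty-aware exploration term is not specified in the abstract and is
# omitted). ratio = pi_new(a|s) / pi_old(a|s); adv = advantage estimates.
import numpy as np

def ppo_clip_loss(ratio, adv, clip_eps=0.2):
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    return -np.mean(np.minimum(unclipped, clipped))   # negated for minimization

ratio = np.array([0.9, 1.3, 1.05])
adv = np.array([1.0, 2.0, -0.5])
print(ppo_clip_loss(ratio, adv))
```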
    Reliable extrapolation of deep neural operators informed by physics or sparse observations. (arXiv:2212.06347v1 [cs.LG])
    Deep neural operators can learn nonlinear mappings between infinite-dimensional function spaces via deep neural networks. As promising surrogate solvers of partial differential equations (PDEs) for real-time prediction, deep neural operators such as deep operator networks (DeepONets) provide a new simulation paradigm in science and engineering. Pure data-driven neural operators and deep learning models, in general, are usually limited to interpolation scenarios, where new predictions utilize inputs within the support of the training set. However, in the inference stage of real-world applications, the input may lie outside the support, i.e., extrapolation is required, which may result in large errors and unavoidable failure of deep learning models. Here, we address this challenge of extrapolation for deep neural operators. First, we systematically investigate the extrapolation behavior of DeepONets by quantifying the extrapolation complexity via the 2-Wasserstein distance between two function spaces, and we identify a new bias-variance trade-off behavior for extrapolation with respect to model capacity. Subsequently, we develop a complete workflow, including extrapolation determination, and we propose five reliable learning methods that guarantee a safe prediction under extrapolation by requiring additional information -- the governing PDEs of the system or sparse new observations. The proposed methods are based on either fine-tuning a pre-trained DeepONet or multifidelity learning. We demonstrate the effectiveness of the proposed framework for various types of parametric PDEs. Our systematic comparisons provide practical guidelines for selecting a proper extrapolation method depending on the available information, desired accuracy, and required inference speed.  ( 2 min )
    Considerations for Differentially Private Learning with Large-Scale Public Pretraining. (arXiv:2212.06470v1 [cs.LG])
    The performance of differentially private machine learning can be boosted significantly by leveraging the transfer learning capabilities of non-private models pretrained on large public datasets. We critically review this approach. We primarily question whether the use of large Web-scraped datasets should be viewed as differential-privacy-preserving. We caution that publicizing these models pretrained on Web data as "private" could lead to harm and erode the public's trust in differential privacy as a meaningful definition of privacy. Beyond the privacy considerations of using public data, we further question the utility of this paradigm. We scrutinize whether existing machine learning benchmarks are appropriate for measuring the ability of pretrained models to generalize to sensitive domains, which may be poorly represented in public Web data. Finally, we notice that pretraining has been especially impactful for the largest available models -- models sufficiently large to prevent end users from running them on their own devices. Thus, deploying such models today could be a net loss for privacy, as it would require (private) data to be outsourced to a more compute-powerful third party. We conclude by discussing potential paths forward for the field of private learning, as public pretraining becomes more popular and powerful.  ( 2 min )
    A Statistical Model for Predicting Generalization in Few-Shot Classification. (arXiv:2212.06461v1 [cs.LG])
    The estimation of the generalization error of classifiers often relies on a validation set. Such a set is hardly available in few-shot learning scenarios, a highly disregarded shortcoming in the field. In these scenarios, it is common to rely on features extracted from pre-trained neural networks combined with distance-based classifiers such as nearest class mean. In this work, we introduce a Gaussian model of the feature distribution. By estimating the parameters of this model, we are able to predict the generalization error on new classification tasks with few samples. We observe that accurate distance estimates between class-conditional densities are the key to accurate estimates of the generalization performance. Therefore, we propose an unbiased estimator for these distances and integrate it in our numerical analysis. We show that our approach outperforms alternatives such as the leave-one-out cross-validation strategy in few-shot settings.  ( 2 min )
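    The nearest-class-mean setup the abstract builds on is simple to sketch (the paper's Gaussian model and unbiased distance estimator for predicting generalization are not reproduced here):

```python
# Sketch of the nearest-class-mean (NCM) classifier the abstract builds on,
# applied to pretrained features; the paper's unbiased distance estimator for
# predicting generalization is not reproduced here.
import numpy as np

def ncm_fit(features, labels):
    classes = np.unique(labels)
    means = np.stack([features[labels == c].mean(axis=0) for c in classes])
    return classes, means

def ncm_predict(features, classes, means):
    # Assign each sample to the class with the nearest mean (Euclidean).
    dists = np.linalg.norm(features[:, None, :] - means[None, :, :], axis=-1)
    return classes[dists.argmin(axis=1)]

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0, 1, (5, 64)), rng.normal(3, 1, (5, 64))])
y = np.array([0] * 5 + [1] * 5)
classes, means = ncm_fit(X, y)
print(ncm_predict(X, classes, means))
```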
    Auto-labelling of Bug Report using Natural Language Processing. (arXiv:2212.06334v1 [cs.SE])
    The exercise of detecting similar bug reports in bug tracking systems is known as duplicate bug report detection. Having prior knowledge of a bug report's existence reduces the effort put into debugging problems and identifying the root cause. Rule- and query-based solutions recommend a long list of potentially similar bug reports with no clear ranking. In addition, triage engineers are less motivated to spend time going through an extensive list. Consequently, this deters the use of duplicate bug report retrieval solutions. In this paper, we propose a solution using a combination of NLP techniques. Our approach considers unstructured and structured attributes of a bug report, like summary, description, severity, impacted products, platforms, categories, etc. It uses a custom data transformer, a deep neural network, and a non-generalizing machine learning method to retrieve existing identical bug reports. We performed numerous experiments with significant data sources containing thousands of bug reports and showcased that the proposed solution achieves a recall@5 of 70%.  ( 2 min )
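    A minimal sketch of retrieval on the unstructured text fields alone (our illustration; the paper additionally combines a custom data transformer and a deep network, and nearest neighbors stands in here for the unspecified non-generalizing method):

```python
# Minimal sketch of duplicate-report retrieval on unstructured text fields
# (our illustration only). TF-IDF vectors plus nearest-neighbor search give
# a ranked list of candidate duplicates for a new report.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

reports = [
    "App crashes on startup after update",
    "Crash when launching the application post-update",
    "Login button unresponsive on mobile",
]
vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(reports)

nn = NearestNeighbors(n_neighbors=2, metric="cosine").fit(X)
dist, idx = nn.kneighbors(vec.transform(["application crashes at launch"]))
print(idx[0], dist[0])   # indices of candidate duplicates, ranked by distance
```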
    How to select an objective function using information theory. (arXiv:2212.06566v1 [cs.LG])
    Science tests competing theories or models by evaluating the similarity of their predictions against observational experience. Thus, how we measure similarity fundamentally determines what we learn. In machine learning and scientific modeling, similarity metrics are used as objective functions. A classic example is mean squared error, which is the optimal measure of similarity when errors are normally distributed and independent and identically distributed (iid). In many cases, however, the error distribution is neither normal nor iid, so it is left to the scientist to determine an appropriate objective. Here, we review how information theory can guide that selection, then demonstrate the approach with a simple hydrologic model.
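    To make the mean-squared-error example concrete (a standard derivation, in our notation): if the errors $\epsilon_i = y_i - \hat{y}_i$ are iid $\mathcal{N}(0, \sigma^2)$, the negative log-likelihood of the observations is $\frac{n}{2}\log(2\pi\sigma^2) + \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$, so maximizing the likelihood over the model parameters is exactly minimizing the sum of squared errors; a different error distribution changes the matching objective accordingly.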
    Nonparametric Independent Component Analysis for the Sources with Mixed Spectra. (arXiv:2212.06327v1 [stat.ML])
    Independent component analysis (ICA) is a blind source separation method to recover source signals of interest from their mixtures. Most existing ICA procedures assume independent sampling. Second-order-statistics-based source separation methods have been developed based on parametric time series models for mixtures of autocorrelated sources. However, the second-order-statistics-based methods cannot separate the sources accurately when the sources have temporal autocorrelations with mixed spectra. To address this issue, we propose a new ICA method by estimating spectral density functions and line spectra of the source signals using cubic splines and indicator functions, respectively. The mixed spectra and the mixing matrix are estimated by maximizing the Whittle likelihood function. We illustrate the performance of the proposed method through simulation experiments and an EEG data application. The numerical results indicate that our approach outperforms existing ICA methods, including SOBI algorithms. In addition, we investigate the asymptotic behavior of the proposed method.  ( 2 min )
    Minimax Optimal Estimation of Stability Under Distribution Shift. (arXiv:2212.06338v1 [stat.ML])
    The performance of decision policies and prediction models often deteriorates when applied to environments different from the ones seen during training. To ensure reliable operation, we propose and analyze the stability of a system under distribution shift, which is defined as the smallest change in the underlying environment that causes the system's performance to deteriorate beyond a permissible threshold. In contrast to standard tail risk measures and distributionally robust losses that require the specification of a plausible magnitude of distribution shift, the stability measure is defined in terms of a more intuitive quantity: the level of acceptable performance degradation. We develop a minimax optimal estimator of stability and analyze its convergence rate, which exhibits a fundamental phase shift behavior. Our characterization of the minimax convergence rate shows that evaluating stability against large performance degradation incurs a statistical cost. Empirically, we demonstrate the practical utility of our stability framework by using it to compare system designs on problems where robustness to distribution shift is critical.  ( 2 min )
    Agnostic Learning for Packing Machine Stoppage Prediction in Smart Factories. (arXiv:2212.06288v1 [cs.LG])
    The cyber-physical convergence is opening up new business opportunities for industrial operators. The need for deep integration of the cyber and the physical worlds establishes a rich business agenda towards consolidating new system and network engineering approaches. This revolution would not be possible without rich and heterogeneous sources of data and the ability to exploit them intelligently, mainly because data will serve as a fundamental resource to promote Industry 4.0. One of the most fruitful research and practice areas emerging from this data-rich, cyber-physical, smart factory environment is data-driven process monitoring, which applies machine learning methodologies to enable predictive maintenance applications. In this paper, we examine popular time series forecasting techniques as well as supervised machine learning algorithms in the applied context of Industry 4.0, by transforming and preprocessing the historical industrial dataset of a packing machine's operational state recordings (real data coming from the production line of a manufacturing plant in the food and beverage domain). In our methodology, we use only a single signal concerning the machine's operational status to make our predictions, without considering other operational variables or fault and warning signals, hence its characterization as ``agnostic''. In this respect, the results demonstrate that the adopted methods achieve quite promising performance on three targeted use cases.  ( 2 min )
    Policy learning for many outcomes of interest: Combining optimal policy trees with multi-objective Bayesian optimisation. (arXiv:2212.06312v1 [cs.LG])
    Methods for learning optimal policies use causal machine learning models to create human-interpretable rules for making choices around the allocation of different policy interventions. However, in realistic policy-making contexts, decision-makers often care about trade-offs between outcomes, not just single-mindedly maximising utility for one outcome. This paper proposes an approach termed Multi-Objective Policy Learning (MOPoL) which combines optimal decision trees for policy learning with a multi-objective Bayesian optimisation approach to explore the trade-off between multiple outcomes. It does this by building a Pareto frontier of non-dominated models for different hyperparameter settings. The key here is that a low-cost surrogate function can be an accurate proxy for the very computationally costly optimal tree in terms of expected regret. This surrogate can be fit many times with different hyperparameter values to proxy the performance of the optimal model. The method is applied to a real-world case-study of conditional cash transfers in Morocco where hybrid (partially optimal, partially greedy) policy trees provide good performance as a surrogate for optimal trees while being computationally cheap enough to feasibly fit a Pareto frontier.  ( 2 min )
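    Extracting the non-dominated models that form such a Pareto frontier is straightforward to sketch (our illustration; MOPoL's surrogate fitting and Bayesian optimisation loop are not shown):

```python
# Sketch of building the Pareto frontier of non-dominated models (our
# illustration). Each row of `scores` holds the (to-be-maximized) outcome
# values for one fitted model / hyperparameter setting.
import numpy as np

def pareto_frontier(scores):
    n = scores.shape[0]
    keep = np.ones(n, dtype=bool)
    for i in range(n):
        for j in range(n):
            # j dominates i if it is >= on every outcome and > on at least one.
            if i != j and np.all(scores[j] >= scores[i]) and np.any(scores[j] > scores[i]):
                keep[i] = False
                break
    return np.where(keep)[0]

scores = np.array([[0.8, 0.2], [0.6, 0.6], [0.5, 0.5], [0.3, 0.9]])
print(pareto_frontier(scores))   # [0 1 3]; model 2 is dominated by model 1
```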
    Interpretable Diabetic Retinopathy Diagnosis based on Biomarker Activation Map. (arXiv:2212.06299v1 [eess.IV])
    Deep learning classifiers provide the most accurate means of automatically diagnosing diabetic retinopathy (DR) based on optical coherence tomography (OCT) and its angiography (OCTA). The power of these models is attributable in part to the inclusion of hidden layers that provide the complexity required to achieve a desired task. However, hidden layers also render algorithm outputs difficult to interpret. Here we introduce a novel biomarker activation map (BAM) framework based on generative adversarial learning that allows clinicians to verify and understand classifiers' decision-making. A data set including 456 macular scans was graded as non-referable or referable DR based on current clinical standards. A DR classifier used to evaluate our BAM was first trained on this data set. The BAM generation framework was designed by combining two U-shaped generators to provide meaningful interpretability to this classifier. The main generator was trained to take referable scans as input and produce an output that would be classified by the classifier as non-referable. The BAM is then constructed as the difference image between the output and input of the main generator. To ensure that the BAM only highlights classifier-utilized biomarkers, an assistant generator was trained to do the opposite, producing scans that would be classified as referable by the classifier from non-referable scans. The generated BAMs highlighted known pathologic features including nonperfusion area and retinal fluid. A fully interpretable classifier based on these highlights could help clinicians better utilize and verify automated DR diagnosis.  ( 2 min )
    Mixed Supervision of Histopathology Improves Prostate Cancer Classification from MRI. (arXiv:2212.06336v1 [eess.IV])
    Non-invasive prostate cancer detection from MRI has the potential to revolutionize patient care by providing early detection of clinically-significant disease (ISUP grade group >= 2), but has thus far shown limited positive predictive value. To address this, we present an MRI-based deep learning method for predicting clinically significant prostate cancer applicable to a patient population with subsequent ground truth biopsy results ranging from benign pathology to ISUP grade group 5. Specifically, we demonstrate that mixed supervision via diverse histopathological ground truth improves classification performance despite the cost of reduced concordance with image-based segmentation. That is, where prior approaches have utilized pathology results as ground truth derived from targeted biopsies and whole-mount prostatectomy to strongly supervise the localization of clinically significant cancer, our approach also utilizes weak supervision signals extracted from nontargeted systematic biopsies with regional localization to improve overall performance. Our key innovation is performing regression by distribution rather than simply by value, enabling use of additional pathology findings traditionally ignored by deep learning strategies. We evaluated our model on a dataset of 973 (testing n=160) multi-parametric prostate MRI exams collected at UCSF from 2015-2018 followed by MRI/ultrasound fusion (targeted) biopsy and systematic (nontargeted) biopsy of the prostate gland, demonstrating that deep networks trained with mixed supervision of histopathology can significantly exceed the performance of the Prostate Imaging-Reporting and Data System (PI-RADS) clinical standard for prostate MRI interpretation.  ( 2 min )
    AFLGuard: Byzantine-robust Asynchronous Federated Learning. (arXiv:2212.06325v1 [cs.CR])
    Federated learning (FL) is an emerging machine learning paradigm, in which clients jointly learn a model with the help of a cloud server. A fundamental challenge of FL is that the clients are often heterogeneous, e.g., they have different computing powers, and thus the clients may send model updates to the server with substantially different delays. Asynchronous FL aims to address this challenge by enabling the server to update the model once any client's model update reaches it without waiting for other clients' model updates. However, like synchronous FL, asynchronous FL is also vulnerable to poisoning attacks, in which malicious clients manipulate the model via poisoning their local data and/or model updates sent to the server. Byzantine-robust FL aims to defend against poisoning attacks. In particular, Byzantine-robust FL can learn an accurate model even if some clients are malicious and have Byzantine behaviors. However, most existing studies on Byzantine-robust FL focused on synchronous FL, leaving asynchronous FL largely unexplored. In this work, we bridge this gap by proposing AFLGuard, a Byzantine-robust asynchronous FL method. We show that, both theoretically and empirically, AFLGuard is robust against various existing and adaptive poisoning attacks (both untargeted and targeted). Moreover, AFLGuard outperforms existing Byzantine-robust asynchronous FL methods.  ( 2 min )
    Variance-Reduced Conservative Policy Iteration. (arXiv:2212.06283v1 [cs.LG])
    We study the sample complexity of reducing reinforcement learning to a sequence of empirical risk minimization problems over the policy space. Such reductions-based algorithms exhibit local convergence in the function space, as opposed to the parameter space for policy gradient algorithms, and thus are unaffected by the possibly non-linear or discontinuous parameterization of the policy class. We propose a variance-reduced variant of Conservative Policy Iteration that improves the sample complexity of producing an $\varepsilon$-functional local optimum from $O(\varepsilon^{-4})$ to $O(\varepsilon^{-3})$. Under state-coverage and policy-completeness assumptions, the algorithm enjoys $\varepsilon$-global optimality after sampling $O(\varepsilon^{-2})$ times, improving upon the previously established $O(\varepsilon^{-3})$ sample requirement.  ( 2 min )
    Linear Convergence of ISTA and FISTA. (arXiv:2212.06319v1 [math.OC])
    In this paper, we revisit the class of iterative shrinkage-thresholding algorithms (ISTA) for solving the linear inverse problem with sparse representation, which arises in signal and image processing. Numerical experiments on image deblurring show that the convergence behavior on a logarithmic-scale ordinate tends to be linear rather than flattening out, indicating linear (geometric) convergence. On closer inspection, we find that the usual assumption that the smooth part is merely convex understates the structure of the least-squares model. Specifically, assuming the smooth part to be strongly convex is more reasonable for the least-squares model, even though the image matrix is probably ill-conditioned. Furthermore, we tighten the pivotal inequality for composite optimization with the smooth part strongly convex instead of generally convex, which was first found in [Li et al., 2022]. Based on this pivotal inequality, we generalize the linear convergence to composite optimization in both the objective value and the squared proximal subgradient norm. Meanwhile, we use a simple ill-conditioned matrix, whose singular values are easy to compute, instead of the original blur matrix. The new numerical experiment shows that the proximal generalization of Nesterov's accelerated gradient descent (NAG) for strongly convex functions has a faster linear convergence rate than ISTA. Based on the tighter pivotal inequality, we also generalize the faster linear convergence rate to composite optimization, in both the objective value and the squared proximal subgradient norm, by taking advantage of a well-constructed Lyapunov function with a slight modification and the phase-space representation based on the high-resolution differential equation framework from the implicit-velocity scheme.  ( 2 min )
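    For reference, textbook ISTA for the sparse least-squares (LASSO) problem $\min_x \frac{1}{2}\|Ax-b\|^2 + \lambda\|x\|_1$ is a gradient step followed by soft thresholding (the standard algorithm, shown for context; the problem sizes below are arbitrary):

```python
# Textbook ISTA for min_x 0.5*||Ax - b||^2 + lam*||x||_1 (standard
# algorithm, shown for context). Each iteration is a gradient step on the
# smooth part followed by the soft-thresholding proximal operator.
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(A, b, lam=0.1, iters=500):
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        grad = A.T @ (A @ x - b)
        x = soft_threshold(x - grad / L, lam / L)
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 100))
x_true = np.zeros(100); x_true[:5] = 1.0
b = A @ x_true
print(np.nonzero(ista(A, b) > 0.1)[0])     # indices with large coefficients
```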
    Privacy-Preserving Collaborative Learning through Feature Extraction. (arXiv:2212.06322v1 [cs.LG])
    We propose a framework in which multiple entities collaborate to build a machine learning model while preserving privacy of their data. The approach utilizes feature embeddings from shared/per-entity feature extractors transforming data into a feature space for cooperation between entities. We propose two specific methods and compare them with a baseline method. In Shared Feature Extractor (SFE) Learning, the entities use a shared feature extractor to compute feature embeddings of samples. In Locally Trained Feature Extractor (LTFE) Learning, each entity uses a separate feature extractor and models are trained using concatenated features from all entities. As a baseline, in Cooperatively Trained Feature Extractor (CTFE) Learning, the entities train models by sharing raw data. Secure multi-party algorithms are utilized to train models without revealing data or features in plain text. We investigate the trade-offs among SFE, LTFE, and CTFE in regard to performance, privacy leakage (using an off-the-shelf membership inference attack), and computational cost. LTFE provides the most privacy, followed by SFE, and then CTFE. Computational cost is lowest for SFE and the relative speed of CTFE and LTFE depends on network architecture. CTFE and LTFE provide the best accuracy. We use MNIST, a synthetic dataset, and a credit card fraud detection dataset for evaluations.  ( 2 min )
    Data Leakage via Access Patterns of Sparse Features in Deep Learning-based Recommendation Systems. (arXiv:2212.06264v1 [cs.CE])
    Online personalized recommendation services are generally hosted in the cloud where users query the cloud-based model to receive recommended input such as merchandise of interest or news feed. State-of-the-art recommendation models rely on sparse and dense features to represent users' profile information and the items they interact with. Although sparse features account for 99% of the total model size, not enough attention has been paid to the potential information leakage through sparse features. These sparse features are employed to track users' behavior, e.g., their click history, object interactions, etc., potentially carrying each user's private information. Sparse features are represented as learned embedding vectors that are stored in large tables, and personalized recommendation is performed by using a specific user's sparse feature to index into the tables. Even with recently proposed methods that hide the computation happening in the cloud, an attacker in the cloud may still be able to track the access patterns to the embedding tables. This paper explores the private information that may be learned by tracking a recommendation model's sparse feature access patterns. We first characterize the types of attacks that can be carried out on sparse features in recommendation models in an untrusted cloud, followed by a demonstration of how each of these attacks leads to extracting users' private information or tracking users by their behavior over time.  ( 2 min )
    Test-time Adaptation vs. Training-time Generalization: A Case Study in Human Instance Segmentation using Keypoints Estimation. (arXiv:2212.06242v1 [cs.CV])
    We consider the problem of improving the human instance segmentation mask quality for a given test image using keypoints estimation. We compare two alternative approaches. The first approach is a test-time adaptation (TTA) method, where we allow test-time modification of the segmentation network's weights using a single unlabeled test image. In this approach, we do not assume test-time access to the labeled source dataset. More specifically, our TTA method consists of using the keypoints estimates as pseudo labels and backpropagating them to adjust the backbone weights. The second approach is a training-time generalization (TTG) method, where we permit offline access to the labeled source dataset but not the test-time modification of weights. Furthermore, we do not assume the availability of any images from or knowledge about the target domain. Our TTG method consists of augmenting the backbone features with those generated by the keypoints head and feeding the aggregate vector to the mask head. Through a comprehensive set of ablations, we evaluate both approaches and identify several factors limiting the TTA gains. In particular, we show that in the absence of a significant domain shift, TTA may hurt performance and TTG shows only a small gain, whereas for a large domain shift, TTA gains are smaller and dependent on the heuristics used, while TTG gains are larger and robust to architectural choices.  ( 2 min )
    An adaptive human-in-the-loop approach to emission detection of Additive Manufacturing processes and active learning with computer vision. (arXiv:2212.06153v1 [cs.LG])
    Recent developments in in-situ monitoring and process control in Additive Manufacturing (AM), also known as 3D-printing, allow the collection of large amounts of emission data during the build process of the parts being manufactured. This data can be used as input into 3D and 2D representations of the 3D-printed parts. However, the analysis, use, and characterization of this data remain a manual process. The aim of this paper is to propose an adaptive human-in-the-loop approach using machine learning techniques that automatically inspect and annotate the emissions data generated during the AM process. More specifically, this paper looks at two scenarios: first, using convolutional neural networks (CNNs) to automatically inspect and classify emission data collected by in-situ monitoring; and second, applying active learning techniques to the developed classification model to construct a human-in-the-loop mechanism that accelerates the labeling process of the emission data. The CNN-based approach relies on transfer learning and fine-tuning, which makes the approach applicable to other industrial image patterns. The adaptive nature of the approach is enabled by an uncertainty-sampling strategy that automatically selects the samples to be presented to human experts for annotation.  ( 2 min )
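    The uncertainty-sampling step is the standard active-learning recipe: rank the unlabeled pool by predictive uncertainty and route the top of the ranking to the experts. A minimal sketch follows (entropy as the uncertainty score is our assumption; margin- or confidence-based scores are equally common):

```python
# Sketch of the uncertainty-sampling step (standard active-learning recipe):
# rank unlabeled samples by predictive entropy and send the top-k to experts.
import numpy as np

def select_for_annotation(proba, k=5):
    """proba: (n_samples, n_classes) softmax outputs of the current model."""
    entropy = -np.sum(proba * np.log(proba + 1e-12), axis=1)
    return np.argsort(entropy)[::-1][:k]    # most uncertain first

proba = np.array([[0.98, 0.02], [0.55, 0.45], [0.70, 0.30]])
print(select_for_annotation(proba, k=1))    # -> [1], the least confident sample
```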
    Learning Disturbances Online for Risk-Aware Control: Risk-Aware Flight with Less Than One Minute of Data. (arXiv:2212.06253v1 [eess.SY])
    Recent advances in safety-critical risk-aware control are predicated on a priori knowledge of the disturbances a system might face. This paper proposes a method to efficiently learn these disturbances online, in a risk-aware context. First, we introduce the concept of a Surface-at-Risk, a risk measure for stochastic processes that extends Value-at-Risk -- a commonly utilized risk measure in the risk-aware controls community. Second, we model the norm of the state discrepancy between the model and the true system evolution as a scalar-valued stochastic process and determine an upper bound to its Surface-at-Risk via Gaussian Process Regression. Third, we provide theoretical results on the accuracy of our fitted surface subject to mild assumptions that are verifiable with respect to the data sets collected during system operation. Finally, we experimentally verify our procedure by augmenting a drone's controller and highlight performance increases achieved via our risk-aware approach after collecting less than a minute of operating data.  ( 2 min )
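    For reference, the Value-at-Risk that Surface-at-Risk extends is, at level $\alpha$, the quantile $\mathrm{VaR}_\alpha(X) = \inf\{x \in \mathbb{R} : P(X \le x) \ge \alpha\}$ (the standard definition); Surface-at-Risk lifts this notion from a random variable to a stochastic process, and we refer to the paper for the precise construction.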
    Autoregressive Bandits. (arXiv:2212.06251v1 [cs.LG])
    Autoregressive processes naturally arise in a large variety of real-world scenarios, including, e.g., stock markets, sales forecasting, weather prediction, advertising, and pricing. When addressing a sequential decision-making problem in such a context, the temporal dependence between consecutive observations should be properly accounted for in order to converge to the optimal decision policy. In this work, we propose a novel online learning setting, named Autoregressive Bandits (ARBs), in which the observed reward follows an autoregressive process of order $k$, whose parameters depend on the action the agent chooses, within a finite set of $n$ actions. Then, we devise an optimistic regret minimization algorithm, AutoRegressive Upper Confidence Bounds (AR-UCB), that suffers regret of order $\widetilde{\mathcal{O}} \left( \frac{(k+1)^{3/2}\sqrt{nT}}{(1-\Gamma)^2} \right)$, where $T$ is the optimization horizon and $\Gamma < 1$ is an index of the stability of the system. Finally, we present a numerical validation in several synthetic and one real-world setting, in comparison with general and specific-purpose bandit baselines, showing the advantages of the proposed approach.  ( 2 min )
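    The reward model is easy to simulate (our toy sketch, not the authors' code; the coefficient ranges are arbitrary and kept small so that a stability condition like $\Gamma < 1$ holds):

```python
# Toy simulation of action-dependent AR(k) rewards as we read the setting
# (our sketch, not the authors' code). A real agent would pick actions with
# AR-UCB; here actions are chosen uniformly at random.
import numpy as np

rng = np.random.default_rng(0)
k, n_actions, T = 2, 3, 10
bias = rng.uniform(0.0, 1.0, n_actions)        # per-action intercepts (illustrative)
coef = rng.uniform(0.0, 0.4, (n_actions, k))   # small AR coefficients for stability

past = np.zeros(k)                              # last k rewards, most recent first
for t in range(T):
    a = rng.integers(n_actions)                 # placeholder for the bandit policy
    r = bias[a] + coef[a] @ past + 0.1 * rng.standard_normal()
    past = np.concatenate(([r], past[:-1]))
    print(f"t={t} action={a} reward={r:.3f}")
```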
    Quantum Phase Recognition using Quantum Tensor Networks. (arXiv:2212.06207v1 [quant-ph])
    Machine learning (ML) has recently facilitated many advances in solving problems related to many-body physical systems. Given the intrinsic quantum nature of these problems, it is natural to speculate that quantum-enhanced machine learning will enable us to unveil even greater details than we currently have. With this motivation, this paper examines a quantum machine learning approach based on shallow variational ansatz inspired by tensor networks for supervised learning tasks. In particular, we first look at the standard image classification tasks using the Fashion-MNIST dataset and study the effect of repeating tensor network layers on ansatz's expressibility and performance. Finally, we use this strategy to tackle the problem of quantum phase recognition for the transverse-field Ising and Heisenberg spin models in one and two dimensions, where we were able to reach $\geq 98\%$ test-set accuracies with both multi-scale entanglement renormalization ansatz (MERA) and tree tensor network (TTN) inspired parametrized quantum circuits.  ( 2 min )
    Mortality Prediction Models with Clinical Notes Using Sparse Attention at the Word and Sentence Levels. (arXiv:2212.06267v1 [cs.CL])
    Intensive Care in-hospital mortality prediction has various clinical applications. Neural prediction models, especially when capitalising on clinical notes, have been put forward as an improvement on currently existing models. However, to be acceptable these models should be performant and transparent. This work studies different attention mechanisms for clinical neural prediction models in terms of their discrimination and calibration. Specifically, we investigate sparse attention as an alternative to dense attention weights in the task of in-hospital mortality prediction from clinical notes. We evaluate the attention mechanisms based on: i) local self-attention over words in a sentence, and ii) global self-attention with a transformer architecture across sentences. We demonstrate that the sparse mechanism outperforms the dense one for local self-attention in terms of predictive performance on a publicly available dataset, and assigns higher attention to prespecified relevant directive words. The performance at the sentence level, however, deteriorates, as sentences including the influential directive words tend to be dropped altogether.  ( 2 min )
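    One common way to obtain sparse attention weights is sparsemax (Martins & Astudillo, 2016), shown below for illustration; the paper may use a different sparse mechanism.

```python
# One common instantiation of sparse attention weights: sparsemax
# (Martins & Astudillo, 2016). Illustrative only -- the paper may use a
# different sparse mechanism. Unlike softmax, sparsemax assigns exact
# zeros to weak scores, so uninformative words receive no attention.
import numpy as np

def sparsemax(z):
    """Euclidean projection of scores z onto the probability simplex."""
    z_sorted = np.sort(z)[::-1]
    cssv = np.cumsum(z_sorted)
    ks = np.arange(1, len(z) + 1)
    support = ks[1.0 + ks * z_sorted > cssv]    # prefix of sorted scores kept
    k = support[-1]
    tau = (cssv[k - 1] - 1.0) / k
    return np.maximum(z - tau, 0.0)

scores = np.array([2.0, 1.0, 0.1, -1.0])
print(sparsemax(scores))   # sums to 1, with exact zeros on weak scores
```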
    Synthetic Image Data for Deep Learning. (arXiv:2212.06232v1 [cs.CV])
    Realistic synthetic image data rendered from 3D models can be used to augment image sets and train image classification and semantic segmentation models. In this work, we explore how high-quality physically-based rendering and domain randomization can efficiently create a large synthetic dataset based on production 3D CAD models of a real vehicle. We use this dataset to quantify the effectiveness of synthetic augmentation using U-net and Double-U-net models. We found that, for this domain, synthetic images were an effective technique for augmenting limited sets of real training data. We observed that models trained on purely synthetic images had a very low mean prediction IoU on real validation images. We also observed that adding even very small amounts of real images to a synthetic dataset greatly improved accuracy, and that models trained on datasets augmented with synthetic images were more accurate than those trained on real images alone. Finally, we found that in use cases that benefit from incremental training or model specialization, pretraining a base model on synthetic images provided a sizeable reduction in the training cost of transfer learning, allowing up to 90\% of the model training to be front-loaded.  ( 2 min )
    You Only Need a Good Embeddings Extractor to Fix Spurious Correlations. (arXiv:2212.06254v1 [cs.CV])
    Spurious correlations in training data often lead to robustness issues since models learn to use them as shortcuts. For example, when predicting whether an object is a cow, a model might learn to rely on its green background, so it would do poorly on a cow on a sandy background. A standard dataset for benchmarking methods that mitigate this problem is Waterbirds. The best method (Group Distributionally Robust Optimization - GroupDRO) currently achieves 89\% worst-group accuracy, while standard training from scratch on raw images only gets 72\%. GroupDRO requires training a model in an end-to-end manner with subgroup labels. In this paper, we show that we can achieve up to 90\% accuracy without using any subgroup information in the training set by simply taking embeddings from a large pre-trained vision model and training a linear classifier on top of them. With experiments on a wide range of pre-trained models and pre-training datasets, we show that the capacity of the pre-training model and the size of the pre-training dataset matter. Our experiments reveal that high-capacity vision transformers perform better than high-capacity convolutional neural networks, and that a larger pre-training dataset leads to better worst-group accuracy on the spurious correlation dataset.  ( 2 min )
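    The paper's recipe reduces to a linear probe on frozen embeddings; a minimal sketch follows (the extractor output is faked with random features here, and the shapes are placeholders):

```python
# The recipe in miniature: freeze a large pretrained extractor, embed the
# training images, and fit only a linear classifier on top. Extractor choice
# and shapes below are placeholders; random arrays stand in for embeddings.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
emb_train = rng.standard_normal((200, 768))   # stand-in for frozen-model embeddings
y_train = rng.integers(0, 2, 200)             # waterbird vs. landbird labels
emb_test = rng.standard_normal((50, 768))

clf = LogisticRegression(max_iter=1000).fit(emb_train, y_train)
pred = clf.predict(emb_test)                  # no subgroup labels used anywhere
print(pred[:10])
```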
    Forecasting formation of a Tropical Cyclone Using Reanalysis Data. (arXiv:2212.06149v1 [physics.ao-ph])
    The tropical cyclone formation process is one of the most complex natural phenomena, governed by various atmospheric, oceanographic, and geographic factors that vary with time and space. Despite several years of research, accurately predicting tropical cyclone formation remains a challenging task. While the existing numerical models have inherent limitations, machine learning models fail to capture the spatial and temporal dimensions of the causal factors behind TC formation. In this study, a deep learning model is proposed that can forecast the formation of a tropical cyclone with a lead time of up to 60 hours with high accuracy. The model uses the high-resolution reanalysis data ERA5 (ECMWF reanalysis 5th generation) and best track data IBTrACS (International Best Track Archive for Climate Stewardship) to forecast tropical cyclone formation in six ocean basins of the world. For a 60-hour lead time, the models achieve accuracies in the range of 86.9%-92.9% across the six ocean basins. The model takes about 5-15 minutes to train, depending on the ocean basin and the amount of data used, and can predict within seconds, thereby making it suitable for real-life usage.  ( 2 min )
    Utilizing Mutations to Evaluate Interpretability of Neural Networks on Genomic Data. (arXiv:2212.06151v1 [q-bio.GN])
    Even though deep neural networks (DNNs) achieve state-of-the-art results for a number of problems involving genomic data, getting DNNs to explain their decision-making process has been a major challenge due to their black-box nature. One way to get DNNs to explain their reasoning for prediction is via attribution methods which are assumed to highlight the parts of the input that contribute to the prediction the most. Given the existence of numerous attribution methods and a lack of quantitative results on the fidelity of those methods, selection of an attribution method for sequence-based tasks has been mostly done qualitatively. In this work, we take a step towards identifying the most faithful attribution method by proposing a computational approach that utilizes point mutations. Providing quantitative results on seven popular attribution methods, we find Layerwise Relevance Propagation (LRP) to be the most appropriate one for translation initiation, with LRP identifying two important biological features for translation: the integrity of Kozak sequence as well as the detrimental effects of premature stop codons.  ( 2 min )
    Fairify: Fairness Verification of Neural Networks. (arXiv:2212.06140v1 [cs.LG])
    Fairness of machine learning (ML) software has become a major concern in the recent past. Although recent research on testing and improving fairness has demonstrated impact on real-world software, providing fairness guarantees in practice is still lacking. Certification of ML models is challenging because of the complex decision-making process of the models. In this paper, we propose Fairify, the first SMT-based approach to verify the individual fairness property of neural network (NN) models. Individual fairness ensures that any two similar individuals receive similar treatment irrespective of their protected attributes, e.g., race, sex, or age. Verifying this fairness property is hard because of its global nature and the presence of non-linear computation nodes in the NN. We propose a sound approach that makes individual fairness verification tractable for developers. The key idea is that many neurons in the NN always remain inactive when a smaller part of the input domain is considered. Fairify therefore leverages white-box access to the models in production and applies formal-analysis-based pruning: our approach partitions the input domain and prunes the NN for each partition to provide a fairness certification or a counterexample. We leverage interval arithmetic and activation heuristics of the neurons to perform the pruning as necessary. We evaluated Fairify on 25 real-world neural networks collected from four different sources and demonstrated its effectiveness, scalability, and performance over baselines and closely related work. Fairify is also configurable based on the domain and size of the NN. Our novel formulation of the problem can answer targeted verification queries with relaxations and counterexamples, which has practical implications.  ( 2 min )
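    A minimal sketch of the pruning idea underlying this kind of verification, assuming a single fully connected ReLU layer: propagate an input box through the affine map with interval arithmetic, and mark neurons whose pre-activation upper bound is non-positive as provably inactive on that input partition. This is a generic illustration, not Fairify's implementation.

        import numpy as np

        def interval_affine(lb, ub, W, b):
            """Tight interval bounds of W @ x + b for x in the elementwise box [lb, ub]."""
            W_pos, W_neg = np.maximum(W, 0.0), np.minimum(W, 0.0)
            out_lb = W_pos @ lb + W_neg @ ub + b
            out_ub = W_pos @ ub + W_neg @ lb + b
            return out_lb, out_ub

        rng = np.random.default_rng(0)
        W, b = rng.normal(size=(64, 10)), rng.normal(size=64)
        lb, ub = np.zeros(10), 0.1 * np.ones(10)    # one small input partition

        pre_lb, pre_ub = interval_affine(lb, ub, W, b)
        inactive = pre_ub <= 0.0                    # ReLU output is identically 0 here
        print(f"{inactive.sum()} of 64 neurons provably inactive on this partition")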
    Towards Better Long-range Time Series Forecasting using Generative Forecasting. (arXiv:2212.06142v1 [cs.LG])
    Long-range time series forecasting is usually based on one of two existing strategies: direct forecasting and iterative forecasting, where the former provides low-bias, high-variance forecasts and the latter leads to low-variance, high-bias forecasts. In this paper, we propose a new forecasting strategy called Generative Forecasting (GenF), which generates synthetic data for the next few time steps and then makes long-range forecasts based on the generated and observed data. We theoretically prove that GenF is able to better balance the forecasting variance and bias, leading to a much smaller forecasting error. We implement GenF via three components: (i) a novel conditional Wasserstein Generative Adversarial Network (GAN) based generator for synthetic time series generation, called CWGAN-TS; (ii) a transformer-based predictor, which makes long-range predictions using both generated and observed data; and (iii) an information-theoretic clustering algorithm to improve the training of both the CWGAN-TS and the transformer-based predictor. The experimental results on five public datasets demonstrate that GenF significantly outperforms a diverse range of state-of-the-art benchmarks and classical approaches. Specifically, we find a 5% - 11% improvement in predictive performance (mean absolute error) with a 15% - 50% reduction in parameters compared to the benchmarks. Lastly, we conduct an ablation study to further explore and demonstrate the effectiveness of the components comprising GenF.  ( 2 min )
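    For context on the two baseline strategies GenF sits between, here is a toy sketch of direct versus iterative forecasting on a synthetic scalar series with lag features; GenF's CWGAN-TS generator and transformer predictor are not reproduced here.

        import numpy as np
        from sklearn.linear_model import LinearRegression

        rng = np.random.default_rng(1)
        y = np.sin(np.arange(500) * 0.1) + 0.1 * rng.normal(size=500)
        k, H = 5, 10    # lag order, forecast horizon

        X = np.stack([y[i:i + k] for i in range(len(y) - k - H)])
        direct = LinearRegression().fit(X, y[k + H - 1: len(y) - 1])   # predicts t+H in one shot
        onestep = LinearRegression().fit(X, y[k: len(y) - H])          # predicts t+1

        # Iterative forecasting: roll the one-step model forward H times.
        window = y[-k:].copy()
        for _ in range(H):
            nxt = onestep.predict(window[-k:].reshape(1, -1))[0]
            window = np.append(window, nxt)

        print("direct t+H forecast:   ", direct.predict(y[-k:].reshape(1, -1))[0])
        print("iterative t+H forecast:", window[-1])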
    Accelerating Dataset Distillation via Model Augmentation. (arXiv:2212.06152v1 [cs.LG])
    Dataset Distillation (DD), a newly emerging field, aims at generating much smaller and high-quality synthetic datasets from large ones. Existing DD methods based on gradient matching achieve leading performance; however, they are extremely computationally intensive as they require continuously optimizing a dataset among thousands of randomly initialized models. In this paper, we assume that training the synthetic data with diverse models leads to better generalization performance. Thus we propose two \textbf{model augmentation} techniques, i.e., using \textbf{early-stage models} and \textbf{weight perturbation}, to learn an informative synthetic set with significantly reduced training cost. Extensive experiments demonstrate that our method achieves up to a 20$\times$ speedup with performance on par with state-of-the-art baseline methods.  ( 2 min )
    Optimizing Learning Rate Schedules for Iterative Pruning of Deep Neural Networks. (arXiv:2212.06144v1 [cs.LG])
    The importance of learning rate (LR) schedules in network pruning has been observed in a few recent works. As an example, Frankle and Carbin (2019) highlighted that winning tickets (i.e., accuracy-preserving subnetworks) cannot be found without applying an LR warmup schedule, and Renda, Frankle and Carbin (2020) demonstrated that rewinding the LR to its initial state at the end of each pruning cycle improves performance. In this paper, we go one step further by first providing a theoretical justification for this surprising effect of LR schedules. Next, we propose an LR schedule for network pruning called SILO, which stands for S-shaped Improved Learning rate Optimization. The advantages of SILO over existing state-of-the-art (SOTA) LR schedules are two-fold: (i) SILO has a strong theoretical motivation and dynamically adjusts the LR during pruning to improve generalization. Specifically, SILO increases the LR upper bound (max_lr) in an S-shape. This leads to an improvement of 2% - 4% in extensive experiments with various types of networks (e.g., Vision Transformers, ResNet) on popular datasets such as ImageNet and CIFAR-10/100. (ii) In addition to the strong theoretical motivation, SILO is empirically optimal in the sense of matching an Oracle, which exhaustively searches for the optimal value of max_lr via grid search. We find that SILO is able to precisely adjust the value of max_lr to be within the Oracle-optimized interval, resulting in performance competitive with the Oracle at significantly lower complexity.  ( 2 min )
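    A hedged sketch of what an S-shaped max_lr trajectory across pruning cycles could look like; the logistic form and all constants below are illustrative assumptions, not the schedule actually used by SILO.

        import math

        def s_shaped_max_lr(cycle, n_cycles, lr_low=0.05, lr_high=0.5, steepness=8.0):
            """Logistic ramp of the LR upper bound from lr_low to lr_high over pruning cycles."""
            t = cycle / max(n_cycles - 1, 1)                   # progress in [0, 1]
            s = 1.0 / (1.0 + math.exp(-steepness * (t - 0.5))) # S-shaped transition
            return lr_low + (lr_high - lr_low) * s

        for c in range(10):
            print(f"pruning cycle {c}: max_lr = {s_shaped_max_lr(c, 10):.3f}")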
    CPMLHO:Hyperparameter Tuning via Cutting Plane and Mixed-Level Optimization. (arXiv:2212.06150v1 [cs.LG])
    The hyperparameter optimization of a neural network can be expressed as a bilevel optimization problem, in which the bilevel structure is used to automatically update the hyperparameters and the hypergradient is approximated via the best-response function. Finding the best-response function is very time-consuming. In this paper we propose CPMLHO, a new hyperparameter optimization method using the cutting-plane method and a mixed-level objective function. The cutting plane is added to the inner level to constrain the space of the response function. To obtain a more accurate hypergradient, the mixed-level objective can flexibly adjust the loss function by using the losses on both the training set and the validation set. Compared to existing methods, experimental results show that our method can automatically update the hyperparameters during training, and can find superior hyperparameters with higher accuracy and faster convergence.  ( 2 min )
    Improving Mutual Information based Feature Selection by Boosting Unique Relevance. (arXiv:2212.06143v1 [cs.LG])
    Mutual Information (MI) based feature selection makes use of MI to evaluate each feature and eventually shortlists a relevant feature subset, in order to address issues associated with high-dimensional datasets. Despite the effectiveness of MI in feature selection, we notice that many state-of-the-art algorithms disregard the so-called unique relevance (UR) of features and arrive at a suboptimal selected feature subset that contains a non-negligible number of redundant features. We point out that the heart of the problem is that all these MIBFS algorithms follow the criterion of Maximize Relevance with Minimum Redundancy (MRwMR), which does not explicitly target UR. This motivates us to augment the existing criterion with the objective of boosting unique relevance (BUR), leading to a new criterion called MRwMR-BUR. Depending on the task being addressed, MRwMR-BUR has two variants, termed MRwMR-BUR-KSG and MRwMR-BUR-CLF, which estimate UR differently. MRwMR-BUR-KSG estimates UR via a nearest-neighbor based approach called the KSG estimator and is designed for three major tasks: (i) classification performance, (ii) feature interpretability, and (iii) classifier generalization. MRwMR-BUR-CLF estimates UR via a classifier based approach. It adapts UR to different classifiers, further improving the competitiveness of MRwMR-BUR for classification performance oriented tasks. The performance of both MRwMR-BUR-KSG and MRwMR-BUR-CLF is validated via experiments using six public datasets and three popular classifiers. Specifically, compared to MRwMR, the proposed MRwMR-BUR-KSG improves the test accuracy by 2% - 3% with 25% - 30% fewer features selected, without increasing the algorithm complexity. MRwMR-BUR-CLF further improves the classification performance by 3.8% - 5.5% (relative to MRwMR), and it also outperforms three popular classifier-dependent feature selection methods.  ( 2 min )
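    For reference, a minimal sketch of the baseline MRwMR-style greedy selection that the paper augments, using scikit-learn's MI estimators; the unique-relevance (BUR) term itself requires a KSG or classifier-based estimator and is omitted here.

        import numpy as np
        from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

        def mrwmr_select(X, y, n_select):
            """Greedy maximize-relevance / minimize-redundancy feature selection."""
            relevance = mutual_info_classif(X, y, random_state=0)
            selected, remaining = [], list(range(X.shape[1]))
            while len(selected) < n_select:
                best_j, best_score = None, -np.inf
                for j in remaining:
                    # Average MI between candidate j and already-selected features.
                    redundancy = np.mean([
                        mutual_info_regression(X[:, [s]], X[:, j], random_state=0)[0]
                        for s in selected]) if selected else 0.0
                    score = relevance[j] - redundancy
                    if score > best_score:
                        best_j, best_score = j, score
                selected.append(best_j)
                remaining.remove(best_j)
            return selected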
  • Open

    HardVis: Visual Analytics to Handle Instance Hardness Using Undersampling and Oversampling Techniques. (arXiv:2203.15753v3 [cs.LG] UPDATED)
    Despite the tremendous advances in machine learning (ML), training with imbalanced data still poses challenges in many real-world applications. Among a series of diverse techniques to solve this problem, sampling algorithms are regarded as an efficient solution. However, the problem is more fundamental, with many works emphasizing the importance of instance hardness. This issue refers to the significance of managing unsafe or potentially noisy instances that are more likely to be misclassified and serve as the root cause of poor classification performance. This paper introduces HardVis, a visual analytics system designed to handle instance hardness mainly in imbalanced classification scenarios. Our proposed system assists users in visually comparing different distributions of data types, selecting types of instances based on local characteristics that will later be affected by the active sampling method, and validating which suggestions from undersampling or oversampling techniques are beneficial for the ML model. Additionally, rather than uniformly undersampling/oversampling a specific class, we allow users to find and sample easy- and difficult-to-classify training instances from all classes. Users can explore subsets of data from different perspectives to decide all those parameters, while HardVis keeps track of their steps and evaluates the model's predictive performance on a separate test set. The end result is a well-balanced dataset that boosts the predictive power of the ML model. The efficacy and effectiveness of HardVis are demonstrated with a hypothetical usage scenario and a use case. Finally, we also assess how useful our system is based on feedback received from ML experts.  ( 3 min )
    Wassmap: Wasserstein Isometric Mapping for Image Manifold Learning. (arXiv:2204.06645v2 [cs.LG] UPDATED)
    In this paper, we propose Wasserstein Isometric Mapping (Wassmap), a nonlinear dimensionality reduction technique that provides solutions to some drawbacks in existing global nonlinear dimensionality reduction algorithms in imaging applications. Wassmap represents images via probability measures in Wasserstein space, then uses pairwise Wasserstein distances between the associated measures to produce a low-dimensional, approximately isometric embedding. We show that the algorithm is able to exactly recover parameters of some image manifolds including those generated by translations or dilations of a fixed generating measure. Additionally, we show that a discrete version of the algorithm retrieves parameters from manifolds generated from discrete measures by providing a theoretical bridge to transfer recovery results from functional data to discrete data. Testing of the proposed algorithms on various image data manifolds show that Wassmap yields good embeddings compared with other global and local techniques.  ( 2 min )
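    A rough sketch of the pipeline for small grayscale images, assuming the POT library for the transport solves: each image is treated as a probability measure on the pixel grid, pairwise squared 2-Wasserstein distances are computed, and a classical MDS step produces the embedding. The `images` input (a list of nonnegative arrays of equal shape) is a placeholder.

        import numpy as np
        import ot  # Python Optimal Transport

        def wassmap_embed(images, embed_dim=2):
            h, w = images[0].shape
            grid = np.stack(np.meshgrid(np.arange(h), np.arange(w), indexing="ij"), -1)
            coords = grid.reshape(-1, 2).astype(float)
            C = ot.dist(coords, coords)                  # squared Euclidean ground cost
            masses = [im.reshape(-1) / im.sum() for im in images]
            n = len(images)
            D2 = np.zeros((n, n))
            for i in range(n):
                for j in range(i + 1, n):
                    D2[i, j] = D2[j, i] = ot.emd2(masses[i], masses[j], C)  # squared W2
            # Classical MDS on the matrix of squared distances.
            J = np.eye(n) - np.ones((n, n)) / n
            B = -0.5 * J @ D2 @ J
            vals, vecs = np.linalg.eigh(B)
            idx = np.argsort(vals)[::-1][:embed_dim]
            return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0.0))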
    Multi-armed Bandit Learning on a Graph. (arXiv:2209.09419v2 [cs.LG] UPDATED)
    The multi-armed bandit (MAB) problem is a simple yet powerful framework that has been extensively studied in the context of decision-making under uncertainty. In many real-world applications, such as robotic applications, selecting an arm corresponds to a physical action that constrains the choices of the next available arms (actions). Motivated by this, we study an extension of MAB called the graph bandit, where an agent travels over a graph to maximize the reward collected from different nodes. The graph defines the agent's freedom in selecting the next available nodes at each step. We assume the graph structure is fully available, but the reward distributions are unknown. Building upon an offline graph-based planning algorithm and the principle of optimism, we design a learning algorithm, \texttt{G-UCB}, that balances long-term exploration and exploitation. We show that our proposed algorithm achieves $O(\sqrt{|S|T\log(T)}+D|S|\log T)$ learning regret, where $|S|$ is the number of nodes and $D$ is the diameter of the graph, which matches the theoretical lower bound $\Omega(\sqrt{|S|T})$ up to logarithmic factors. To our knowledge, this result is among the first tight regret bounds in non-episodic, un-discounted learning problems with known deterministic transitions. Numerical experiments confirm that our algorithm outperforms several benchmarks.  ( 2 min )
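    A heavily simplified, myopic variant of this setting, for intuition only: per-node UCB indices with Gaussian rewards, and moves restricted to neighbors of the current node. The paper's G-UCB additionally plans paths to optimistic targets, which this one-step sketch does not do.

        import numpy as np

        def greedy_ucb_walk(neighbors, true_means, T, rng):
            n = len(true_means)
            counts = np.ones(n)                               # one warm-up pull per node
            sums = np.array([rng.normal(m) for m in true_means])
            node, total = 0, 0.0
            for t in range(2, T + 2):
                ucb = sums / counts + np.sqrt(2 * np.log(t) / counts)
                node = max(neighbors[node], key=lambda v: ucb[v])  # graph-constrained move
                r = rng.normal(true_means[node])
                counts[node] += 1; sums[node] += r; total += r
            return total

        rng = np.random.default_rng(0)
        ring = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}   # 6-node cycle graph
        print(greedy_ucb_walk(ring, [0.1, 0.2, 0.9, 0.3, 0.5, 0.4], 2000, rng))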
    A Framework for Benchmarking Clustering Algorithms. (arXiv:2209.09493v2 [cs.LG] UPDATED)
    The evaluation of clustering algorithms can involve running them on a variety of benchmark problems, and comparing their outputs to the reference, ground-truth groupings provided by experts. Unfortunately, many research papers and graduate theses consider only a small number of datasets. Also, the fact that there can be many equally valid ways to cluster a given problem set is rarely taken into account. In order to overcome these limitations, we have developed a framework whose aim is to introduce a consistent methodology for testing clustering algorithms. Furthermore, we have aggregated, polished, and standardised many clustering benchmark dataset collections referred to across the machine learning and data mining literature, and included new datasets of different dimensionalities, sizes, and cluster types. An interactive datasets explorer, the documentation of the Python API, a description of the ways to interact with the framework from other programming languages such as R or MATLAB, and other details are all provided at .  ( 2 min )
    The Unreasonable Effectiveness of Deep Evidential Regression. (arXiv:2205.10060v2 [cs.LG] UPDATED)
    There is a significant need for principled uncertainty reasoning in machine learning systems as they are increasingly deployed in safety-critical domains. A new approach with uncertainty-aware regression-based neural networks (NNs), based on learning evidential distributions for aleatoric and epistemic uncertainties, shows promise over traditional deterministic methods and typical Bayesian NNs, notably with the capabilities to disentangle aleatoric and epistemic uncertainties. Despite some empirical success of Deep Evidential Regression (DER), there are important gaps in the mathematical foundation that raise the question of why the proposed technique seemingly works. We detail the theoretical shortcomings and analyze the performance on synthetic and real-world data sets, showing that Deep Evidential Regression is a heuristic rather than an exact uncertainty quantification. We go on to propose corrections and redefinitions of how aleatoric and epistemic uncertainties should be extracted from NNs.  ( 2 min )
    Near-optimal fitting of ellipsoids to random points. (arXiv:2208.09493v3 [cs.DS] UPDATED)
    Given independent standard Gaussian points $v_1, \ldots, v_n$ in dimension $d$, for what values of $(n, d)$ does there exist with high probability an origin-symmetric ellipsoid that simultaneously passes through all of the points? This basic problem of fitting an ellipsoid to random points has connections to low-rank matrix decompositions, independent component analysis, and principal component analysis. Based on strong numerical evidence, Saunderson, Parrilo, and Willsky [Proc. of Conference on Decision and Control, pp. 6031-6036, 2013] conjecture that the ellipsoid fitting problem transitions from feasible to infeasible as the number of points $n$ increases, with a sharp threshold at $n \sim d^2/4$. We resolve this conjecture up to logarithmic factors by constructing a fitting ellipsoid for some $n = \Omega( \, d^2/\mathrm{polylog}(d) \,)$, improving prior work of Ghosh et al. [Proc. of Symposium on Foundations of Computer Science, pp. 954-965, 2020] that requires $n = o(d^{3/2})$. Our proof demonstrates feasibility of the least squares construction of Saunderson et al. using a convenient decomposition of a certain non-standard random matrix and a careful analysis of its Neumann expansion via the theory of graph matrices.  ( 2 min )
    Regression modelling of spatiotemporal extreme U.S. wildfires via partially-interpretable neural networks. (arXiv:2208.07581v3 [stat.ML] UPDATED)
    Risk management in many environmental settings requires an understanding of the mechanisms that drive extreme events. Useful metrics for quantifying such risk are extreme quantiles of response variables conditioned on predictor variables that describe, e.g., climate, biosphere and environmental states. Typically these quantiles lie outside the range of observable data and so, for estimation, require specification of parametric extreme value models within a regression framework. Classical approaches in this context utilise linear or additive relationships between predictor and response variables and suffer in either their predictive capabilities or computational efficiency; moreover, their simplicity is unlikely to capture the truly complex structures that lead to the creation of extreme wildfires. In this paper, we propose a new methodological framework for performing extreme quantile regression using artificial neural networks, which are able to capture complex non-linear relationships and scale well to high-dimensional data. The ``black box" nature of neural networks means that they lack the desirable trait of interpretability often favoured by practitioners; thus, we unify linear, and additive, regression methodology with deep learning to create partially-interpretable neural networks that can be used for statistical inference but retain high prediction accuracy. To complement this methodology, we further propose a novel point process model for extreme values which overcomes the finite lower-endpoint problem associated with the generalised extreme value class of distributions. Efficacy of our unified framework is illustrated on U.S. wildfire data with a high-dimensional predictor set and we illustrate vast improvements in predictive performance over linear and spline-based regression techniques.  ( 2 min )
    Formal limitations of sample-wise information-theoretic generalization bounds. (arXiv:2205.06915v2 [cs.LG] UPDATED)
    Some of the tightest information-theoretic generalization bounds depend on the average information between the learned hypothesis and a single training example. However, these sample-wise bounds were derived only for expected generalization gap. We show that even for expected squared generalization gap no such sample-wise information-theoretic bounds exist. The same is true for PAC-Bayes and single-draw bounds. Remarkably, PAC-Bayes, single-draw and expected squared generalization gap bounds that depend on information in pairs of examples exist.  ( 2 min )
    Nonparametric Independent Component Analysis for the Sources with Mixed Spectra. (arXiv:2212.06327v1 [stat.ML])
    Independent component analysis (ICA) is a blind source separation method to recover source signals of interest from their mixtures. Most existing ICA procedures assume independent sampling. Second-order-statistics-based source separation methods have been developed based on parametric time series models for the mixtures from the autocorrelated sources. However, the second-order-statistics-based methods cannot separate the sources accurately when the sources have temporal autocorrelations with mixed spectra. To address this issue, we propose a new ICA method by estimating spectral density functions and line spectra of the source signals using cubic splines and indicator functions, respectively. The mixed spectra and the mixing matrix are estimated by maximizing the Whittle likelihood function. We illustrate the performance of the proposed method through simulation experiments and an EEG data application. The numerical results indicate that our approach outperforms existing ICA methods, including SOBI algorithms. In addition, we investigate the asymptotic behavior of the proposed method.  ( 2 min )
    Ship Performance Monitoring using Machine-learning. (arXiv:2110.03594v2 [stat.ML] UPDATED)
    The hydrodynamic performance of a sea-going ship varies over its lifespan due to factors like marine fouling and the condition of the anti-fouling paint system. In order to accurately estimate the power demand and fuel consumption for a planned voyage, it is important to assess the hydrodynamic performance of the ship. The current work uses machine-learning (ML) methods to estimate the hydrodynamic performance of a ship using the onboard recorded in-service data. Three ML methods, NL-PCR, NL-PLSR and probabilistic ANN, are calibrated using the data from two sister ships. The calibrated models are used to extract the varying trend in the ship's hydrodynamic performance over time and predict the change in performance through several propeller and hull cleaning events. The predicted change in performance is compared with the corresponding values estimated using the fouling friction coefficient ($\Delta C_F$). The ML methods are found to perform well in modelling the hydrodynamic state variables of the ships, with the probabilistic ANN model performing best; the results from NL-PCR and NL-PLSR are not far behind, indicating that it may be possible to solve such problems with simple methods and the help of domain knowledge.  ( 2 min )
    Linear Convergence of ISTA and FISTA. (arXiv:2212.06319v1 [math.OC])
    In this paper, we revisit the class of iterative shrinkage-thresholding algorithms (ISTA) for solving the linear inverse problem with sparse representation, which arises in signal and image processing. Numerical experiments on image deblurring show that the convergence behavior on a logarithmic-scale ordinate tends to be linear rather than flattening out. On closer observation, we find that the usual assumption that the smooth part is merely convex understates the least-squares model. Specifically, assuming the smooth part to be strongly convex is more reasonable for the least-squares model, even though the image matrix is probably ill-conditioned. Furthermore, we tighten the pivotal inequality for composite optimization with a strongly convex (instead of generally convex) smooth part, which was first found in [Li et al., 2022]. Based on this pivotal inequality, we generalize the linear convergence to composite optimization in both the objective value and the squared proximal subgradient norm. Meanwhile, we use a simple ill-conditioned matrix, whose singular values are easy to compute, in place of the original blur matrix. The new numerical experiments show that the proximal generalization of Nesterov's accelerated gradient descent (NAG) for strongly convex functions has a faster linear convergence rate than ISTA. Based on the tighter pivotal inequality, we also generalize this faster linear convergence rate to composite optimization, in both the objective value and the squared proximal subgradient norm, by taking advantage of a well-constructed Lyapunov function with a slight modification and the phase-space representation based on the high-resolution differential equation framework from the implicit-velocity scheme.  ( 2 min )
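    For readers who want the baseline algorithm in front of them, here is a plain ISTA iteration for the sparse least-squares model, minimizing 0.5*||Ax - b||^2 + lam*||x||_1 by proximal gradient with soft-thresholding; step size and iteration count are illustrative.

        import numpy as np

        def ista(A, b, lam, iters=500):
            L = np.linalg.norm(A, 2) ** 2        # Lipschitz constant of the smooth gradient
            x = np.zeros(A.shape[1])
            for _ in range(iters):
                g = A.T @ (A @ x - b)            # gradient of the least-squares term
                z = x - g / L                    # gradient step
                x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # prox of lam*||.||_1
            return x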
    A Statistical Model for Predicting Generalization in Few-Shot Classification. (arXiv:2212.06461v1 [cs.LG])
    The estimation of the generalization error of classifiers often relies on a validation set. Such a set is hardly available in few-shot learning scenarios, a highly disregarded shortcoming in the field. In these scenarios, it is common to rely on features extracted from pre-trained neural networks combined with distance-based classifiers such as nearest class mean. In this work, we introduce a Gaussian model of the feature distribution. By estimating the parameters of this model, we are able to predict the generalization error on new classification tasks with few samples. We observe that accurate distance estimates between class-conditional densities are the key to accurate estimates of the generalization performance. Therefore, we propose an unbiased estimator for these distances and integrate it in our numerical analysis. We show that our approach outperforms alternatives such as the leave-one-out cross-validation strategy in few-shot settings.  ( 2 min )
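    A minimal sketch of the nearest-class-mean classifier this line of work builds on, with hypothetical support/query feature arrays; the paper's Gaussian model of the feature distribution and its unbiased distance estimator are not reproduced here.

        import numpy as np

        def nearest_class_mean(support_x, support_y, query_x):
            """Assign each query to the class whose support-set mean (prototype) is closest."""
            classes = np.unique(support_y)
            protos = np.stack([support_x[support_y == c].mean(axis=0) for c in classes])
            d = ((query_x[:, None, :] - protos[None, :, :]) ** 2).sum(-1)  # squared distances
            return classes[d.argmin(axis=1)]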
    Multi-objective robust optimization using adaptive surrogate models for problems with mixed continuous-categorical parameters. (arXiv:2203.01996v2 [stat.ME] UPDATED)
    Explicitly accounting for uncertainties is paramount to the safety of engineering structures. Optimization which is often carried out at the early stage of the structural design offers an ideal framework for this task. When the uncertainties are mainly affecting the objective function, robust design optimization is traditionally considered. This work further assumes the existence of multiple and competing objective functions that need to be dealt with simultaneously. The optimization problem is formulated by considering quantiles of the objective functions which allows for the combination of both optimality and robustness in a single metric. By introducing the concept of common random numbers, the resulting nested optimization problem may be solved using a general-purpose solver, herein the non-dominated sorting genetic algorithm (NSGA-II). The computational cost of such an approach is however a serious hurdle to its application in real-world problems. We therefore propose a surrogate-assisted approach using Kriging as an inexpensive approximation of the associated computational model. The proposed approach consists of sequentially carrying out NSGA-II while using an adaptively built Kriging model to estimate the quantiles. Finally, the methodology is adapted to account for mixed categorical-continuous parameters as the applications involve the selection of qualitative design parameters as well. The methodology is first applied to two analytical examples showing its efficiency. The third application relates to the selection of optimal renovation scenarios of a building considering both its life cycle cost and environmental impact. It shows that when it comes to renovation, the heating system replacement should be the priority.  ( 2 min )
    Gradient flow in the gaussian covariate model: exact solution of learning curves and multiple descent structures. (arXiv:2212.06757v1 [stat.ML])
    A recent line of work has shown remarkable behaviors of the generalization error curves in simple learning models. Even least-squares regression has shown atypical features such as model-wise double descent, and further works have observed triple or multiple descents. Other important characteristics are the epoch-wise descent structures which emerge during training. The observations of model-wise and epoch-wise descents have been analytically derived in limited theoretical settings (such as the random feature model) and are otherwise experimental. In this work, we provide a full and unified analysis of the whole time-evolution of the generalization curve, in the asymptotic large-dimensional regime and under gradient flow, within a wider theoretical setting stemming from a Gaussian covariate model. In particular, we cover most cases already disparately observed in the literature, and also provide examples of the existence of multiple descent structures as a function of a model parameter or time. Furthermore, we show that our theoretical predictions adequately match the learning curves obtained by gradient descent over realistic datasets. Technically we compute averages of rational expressions involving random matrices using recent developments in random matrix theory based on "linear pencils". Another contribution, which is also of independent interest in random matrix theory, is a new derivation of related fixed point equations (and an extension thereof) using Dyson Brownian motions.  ( 2 min )
    Wind power predictions from nowcasts to 4-hour forecasts: a learning approach with variable selection. (arXiv:2204.09362v2 [cs.LG] UPDATED)
    We study short-term prediction of wind speed and wind power (every 10 minutes up to 4 hours ahead). Accurate forecasts for these quantities are crucial to mitigate the negative effects of wind farms' intermittent production on energy systems and markets. We use machine learning to combine outputs from numerical weather prediction models with local observations. The former provide valuable information on higher-scale dynamics while the latter give the model fresher and location-specific data. So as to make the results usable for practitioners, we focus on well-known methods which can handle a high volume of data. We first study variable selection using both a linear technique and a nonlinear one. Then we exploit these results to forecast wind speed and wind power, still with an emphasis on linear models versus nonlinear ones. For the wind power prediction, we also compare the indirect approach (wind speed predictions passed through a power curve) and the direct one (predicting wind power directly).  ( 2 min )
    Decentralized Stochastic Multi-Player Multi-Armed Walking Bandits. (arXiv:2212.06279v1 [cs.LG])
    Multi-player multi-armed bandit is an increasingly relevant decision-making problem, motivated by applications to cognitive radio systems. Most research on this problem focuses exclusively on settings where players have \textit{full access} to all arms and receive no reward when pulling the same arm. Hence all players solve the same bandit problem with the goal of maximizing their cumulative reward. However, these settings neglect several important factors in many real-world applications, where players have \textit{limited access} to \textit{a dynamic local subset of arms} (i.e., an arm could sometimes be ``walking'' and not accessible to the player). To this end, this paper proposes a \textit{multi-player multi-armed walking bandits} model, aiming to address the aforementioned modeling issues. The goal is still to maximize the reward; however, players can only pull arms from the local subset and only collect a full reward if no other player pulls the same arm. We adopt the Upper Confidence Bound (UCB) approach to deal with the exploration-exploitation tradeoff and employ distributed optimization techniques to properly handle collisions. By carefully integrating these two techniques, we propose a decentralized algorithm with a near-optimal guarantee on the regret, which can be easily implemented to obtain competitive empirical performance.  ( 2 min )
    A Review of Off-Policy Evaluation in Reinforcement Learning. (arXiv:2212.06355v1 [stat.ML])
    Reinforcement learning (RL) is one of the most vibrant research frontiers in machine learning and has been recently applied to solve a number of challenging problems. In this paper, we primarily focus on off-policy evaluation (OPE), one of the most fundamental topics in RL. In recent years, a number of OPE methods have been developed in the statistics and computer science literature. We provide a discussion on the efficiency bound of OPE, some of the existing state-of-the-art OPE methods, their statistical properties and some other related research directions that are currently actively explored.  ( 2 min )
    MAntRA: A framework for model agnostic reliability analysis. (arXiv:2212.06303v1 [stat.ME])
    We propose a novel model-agnostic, data-driven framework for time-dependent reliability analysis. The proposed approach, referred to as MAntRA, combines interpretable machine learning, Bayesian statistics, and stochastic dynamic equation identification to evaluate the reliability of stochastically excited dynamical systems for which the governing physics is \textit{a priori} unknown. A two-stage approach is adopted: in the first stage, an efficient variational Bayesian equation discovery algorithm is developed to determine the governing physics of an underlying stochastic differential equation (SDE) from measured output data. The developed algorithm is efficient and accounts for epistemic uncertainty due to limited and noisy data, and for aleatoric uncertainty due to environmental effects and external excitation. In the second stage, the discovered SDE is solved using a stochastic integration scheme and the probability of failure is computed. The efficacy of the proposed approach is illustrated on three numerical examples. The results obtained indicate the possible application of the proposed approach for reliability analysis of in-situ and heritage structures from on-site measurements.  ( 2 min )
    Considerations for Differentially Private Learning with Large-Scale Public Pretraining. (arXiv:2212.06470v1 [cs.LG])
    The performance of differentially private machine learning can be boosted significantly by leveraging the transfer learning capabilities of non-private models pretrained on large public datasets. We critically review this approach. We primarily question whether the use of large Web-scraped datasets should be viewed as differential-privacy-preserving. We caution that publicizing these models pretrained on Web data as "private" could lead to harm and erode the public's trust in differential privacy as a meaningful definition of privacy. Beyond the privacy considerations of using public data, we further question the utility of this paradigm. We scrutinize whether existing machine learning benchmarks are appropriate for measuring the ability of pretrained models to generalize to sensitive domains, which may be poorly represented in public Web data. Finally, we notice that pretraining has been especially impactful for the largest available models -- models sufficiently large to prohibit end users running them on their own devices. Thus, deploying such models today could be a net loss for privacy, as it would require (private) data to be outsourced to a more compute-powerful third party. We conclude by discussing potential paths forward for the field of private learning, as public pretraining becomes more popular and powerful.  ( 2 min )
    MCMC-Interactive Variational Inference. (arXiv:2010.02029v2 [cs.LG] UPDATED)
    Leveraging well-established MCMC strategies, we propose MCMC-interactive variational inference (MIVI) to not only estimate the posterior in a time constrained manner, but also facilitate the design of MCMC transitions. Constructing a variational distribution followed by a short Markov chain that has parameters to learn, MIVI takes advantage of the complementary properties of variational inference and MCMC to encourage mutual improvement. On one hand, with the variational distribution locating high posterior density regions, the Markov chain is optimized within the variational inference framework to efficiently target the posterior despite a small number of transitions. On the other hand, the optimized Markov chain with considerable flexibility guides the variational distribution towards the posterior and alleviates its underestimation of uncertainty. Furthermore, we prove the optimized Markov chain in MIVI admits extrapolation, which means its marginal distribution gets closer to the true posterior as the chain grows. Therefore, the Markov chain can be used separately as an efficient MCMC scheme. Experiments show that MIVI not only accurately and efficiently approximates the posteriors but also facilitates designs of stochastic gradient MCMC and Gibbs sampling transitions.  ( 2 min )
    Regularized Optimal Transport Layers for Generalized Global Pooling Operations. (arXiv:2212.06339v1 [cs.LG])
    Global pooling is one of the most significant operations in many machine learning models and tasks, serving information fusion and the representation of structured data (like sets and graphs). However, without solid mathematical foundations, its practical implementations often depend on empirical mechanisms and thus lead to sub-optimal, even unsatisfactory performance. In this work, we develop a novel and generalized global pooling framework through the lens of optimal transport. The proposed framework is interpretable from the perspective of expectation-maximization. Essentially, it aims at learning an optimal transport across sample indices and feature dimensions, making the corresponding pooling operation maximize the conditional expectation of input data. We demonstrate that most existing pooling methods are equivalent to solving a regularized optimal transport (ROT) problem with different specializations, and more sophisticated pooling operations can be implemented by hierarchically solving multiple ROT problems. Making the parameters of the ROT problem learnable, we develop a family of regularized optimal transport pooling (ROTP) layers. We implement the ROTP layers as a new kind of deep implicit layer. Their model architectures correspond to different optimization algorithms. We test our ROTP layers in several representative set-level machine learning scenarios, including multi-instance learning (MIL), graph classification, graph set representation, and image classification. Experimental results show that applying our ROTP layers can reduce the difficulty of the design and selection of global pooling: our ROTP layers may either imitate some existing global pooling methods or lead to new pooling layers that fit the data better. The code is available at \url{https://github.com/SDS-Lab/ROT-Pooling}.  ( 2 min )
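    A generic sketch of pooling as regularized optimal transport, assuming entropic regularization solved by Sinkhorn iterations; the slot-affinity scores below are a random-projection placeholder, not the learnable parameterization of the paper's ROTP layers.

        import numpy as np

        def sinkhorn_pool(X, m=1, eps=0.1, iters=100):
            """Pool n input rows of X into m output vectors via an entropic OT plan."""
            n, d = X.shape
            scores = X @ np.random.default_rng(0).normal(size=(d, m))  # placeholder affinities
            K = np.exp(scores / eps)
            a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)            # uniform marginals
            u, v = np.ones(n), np.ones(m)
            for _ in range(iters):                   # Sinkhorn fixed-point updates
                u = a / (K @ v)
                v = b / (K.T @ u)
            P = u[:, None] * K * v[None, :]          # transport plan with marginals (a, b)
            return (P / P.sum(axis=0, keepdims=True)).T @ X   # m transport-weighted averages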
    Minimax Optimal Estimation of Stability Under Distribution Shift. (arXiv:2212.06338v1 [stat.ML])
    The performance of decision policies and prediction models often deteriorates when applied to environments different from the ones seen during training. To ensure reliable operation, we propose and analyze the stability of a system under distribution shift, which is defined as the smallest change in the underlying environment that causes the system's performance to deteriorate beyond a permissible threshold. In contrast to standard tail risk measures and distributionally robust losses that require the specification of a plausible magnitude of distribution shift, the stability measure is defined in terms of a more intuitive quantity: the level of acceptable performance degradation. We develop a minimax optimal estimator of stability and analyze its convergence rate, which exhibits a fundamental phase shift behavior. Our characterization of the minimax convergence rate shows that evaluating stability against large performance degradation incurs a statistical cost. Empirically, we demonstrate the practical utility of our stability framework by using it to compare system designs on problems where robustness to distribution shift is critical.  ( 2 min )
    Doubly Robust Interval Estimation for Optimal Policy Evaluation in Online Learning. (arXiv:2110.15501v2 [stat.ML] UPDATED)
    Evaluating the performance of an ongoing policy plays a vital role in many areas such as medicine and economics, providing crucial guidance on the early stopping of an online experiment and timely feedback from the environment. Policy evaluation in online learning thus attracts increasing attention, aiming to infer the mean outcome of the optimal policy (i.e., the value) in real time. Yet, such a problem is particularly challenging due to the dependent data generated in the online environment, the unknown optimal policy, and the complex exploration-exploitation trade-off in the adaptive experiment. In this paper, we aim to overcome these difficulties in policy evaluation for online learning. We explicitly derive the probability of exploration, which quantifies the probability of exploring non-optimal actions under commonly used bandit algorithms. We use this probability to conduct valid inference on the online conditional mean estimator under each action and develop the doubly robust interval estimation (DREAM) method to infer the value under the estimated optimal policy in online learning. The proposed value estimator provides double protection for consistency and is asymptotically normal, with a Wald-type confidence interval provided. Extensive simulations and real data applications are conducted to demonstrate the empirical validity of the proposed DREAM method.  ( 2 min )
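    The generic doubly robust value estimator that DREAM-style inference builds on can be written in a few lines; the arrays below stand for hypothetical logged bandit data, and the online-learning corrections of the actual method are omitted.

        import numpy as np

        def doubly_robust_value(actions, rewards, behavior_probs, target_probs, q_hat):
            """
            actions:        (n,) integer logged actions
            rewards:        (n,) observed rewards
            behavior_probs: (n,) probability the logging policy gave the logged action
            target_probs:   (n, A) target-policy probabilities over all actions
            q_hat:          (n, A) estimated outcome model q(x_i, a)
            """
            n = len(actions)
            direct = (target_probs * q_hat).sum(axis=1)                 # model-based term
            w = target_probs[np.arange(n), actions] / behavior_probs    # importance weights
            correction = w * (rewards - q_hat[np.arange(n), actions])   # residual correction
            return (direct + correction).mean()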
    Autoregressive Bandits. (arXiv:2212.06251v1 [cs.LG])
    Autoregressive processes naturally arise in a large variety of real-world scenarios, including, e.g., stock markets, sales forecasting, weather prediction, advertising, and pricing. When addressing a sequential decision-making problem in such a context, the temporal dependence between consecutive observations should be properly accounted for in order to converge to the optimal decision policy. In this work, we propose a novel online learning setting, named Autoregressive Bandits (ARBs), in which the observed reward follows an autoregressive process of order $k$, whose parameters depend on the action the agent chooses, within a finite set of $n$ actions. Then, we devise an optimistic regret minimization algorithm, AutoRegressive Upper Confidence Bounds (AR-UCB), that suffers regret of order $\widetilde{\mathcal{O}} \left( \frac{(k+1)^{3/2}\sqrt{nT}}{(1-\Gamma)^2} \right)$, where $T$ is the optimization horizon and $\Gamma < 1$ is an index of the stability of the system. Finally, we present a numerical validation in several synthetic settings and one real-world setting, in comparison with general- and special-purpose bandit baselines, showing the advantages of the proposed approach.  ( 2 min )
    Accelerated structured matrix factorization. (arXiv:2212.06504v1 [stat.ME])
    Matrix factorization exploits the idea that, in complex high-dimensional data, the actual signal typically lies in lower-dimensional structures. These lower-dimensional objects provide useful insight, with interpretability favored by sparse structures. Sparsity, in addition, is beneficial in terms of regularization and, thus, helps avoid over-fitting. By exploiting Bayesian shrinkage priors, we devise a computationally convenient approach for high-dimensional matrix factorization. The dependence between row and column entities is modeled by inducing flexible sparse patterns within factors. The availability of external information is accounted for in such a way that structures are allowed but not imposed. Inspired by boosting algorithms, we pair the proposed approach with a numerical strategy relying on the sequential inclusion and estimation of low-rank contributions, with a data-driven stopping rule. Practical advantages of the proposed approach are demonstrated by means of a simulation study and the analysis of soccer heatmaps obtained from new-generation tracking data.  ( 2 min )
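    A bare-bones sketch of the boosting-inspired numerical strategy, assuming plain SVD rank-one deflation with a simplified data-driven stopping rule; the Bayesian shrinkage priors and external information of the actual approach are omitted.

        import numpy as np

        def sequential_low_rank(Y, max_rank=20, tol=1e-3):
            """Add rank-1 contributions one at a time; stop when the fit stops improving."""
            R, factors = Y.copy(), []
            prev = np.linalg.norm(R)
            for _ in range(max_rank):
                U, s, Vt = np.linalg.svd(R, full_matrices=False)
                u, v = s[0] * U[:, 0], Vt[0]       # best rank-1 piece of the residual
                R = R - np.outer(u, v)
                factors.append((u, v))
                cur = np.linalg.norm(R)
                if (prev - cur) / prev < tol:      # simplified data-driven stopping rule
                    break
                prev = cur
            return factors, R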

  • Open

    Doug Finke - Quantum Computing Industry Trends
    submitted by /u/timothy-ventura [link] [comments]  ( 50 min )
    Luma Labs just came out with a text-to-3D model website, and I made a small video covering it. So many uses for gamedev as someone who doesn't know Blender!!
    submitted by /u/AnonTopat [link] [comments]  ( 51 min )
    No A.I ART - A Protest Against Generated Art!
    submitted by /u/anselemnkoro [link] [comments]  ( 51 min )
    A true "fireside" chat on Generative AI
    submitted by /u/Repeat-or [link] [comments]  ( 51 min )
    I’ve been dipping into AI art generation and want to get better at image creation/manipulation, any suggestions for software?
    I recently made art for a project that someone bought off me and I want to use the money I earned to get better software. What is high quality software that I can invest in? I know about Dalle-2 and nvidia Canvas, but what else is there? submitted by /u/Giham [link] [comments]  ( 51 min )
    ChatGPT is awesome
    My take on ChatGPT from the perspective of a software engineer. It’s amazing, but it won’t replace our jobs (yet). submitted by /u/dhines5 [link] [comments]  ( 54 min )
    re:ChatGPT
    The magic of human language blinds us to its humble sociobiological origins. We probably spoke for tens of thousands of years before leaving evidence, probably only slightly more sophisticated than dogs initially, and it was just brute-force trial and error that led to Shakespeare. submitted by /u/Emergency_Address_51 [link] [comments]  ( 47 min )
    Google won’t launch ChatGPT rival because of ‘reputational risk’
    submitted by /u/Mk_Makanaki [link] [comments]  ( 54 min )
    AI Dream 126 - AI Manifestation (3/6)
    submitted by /u/LordPewPew777 [link] [comments]  ( 47 min )
    I created Digital Humans tech
    Hey guys! I made an app where you can chat with the AI-powered digital twin of a real person; it's called "Get Cheezy With Dr. Aaron Ozee". Aaron Ozee is a celebrity writer and children's book author, and also a friend of mine, and it was cool to work together to create something so fundamentally new. The avatar is super realistic: he moves, sounds, and talks like Aaron himself, all thanks to AI models such as TTS, lip sync, and a conversational engine based on an LLM. Here is the app, if you want to try: https://apps.apple.com/app/get-cheezy-with-dr-aaron-ozee/id1642331303 Next, I want to make a tool where everyone can create their own character and share it with friends and followers. I believe this will revolutionize 1-to-many direct communication. I would love to hear your feedback; do you think this technology can scale our most valuable resource, time? submitted by /u/mynameisJura [link] [comments]  ( 48 min )
    AI won't replace your job. It will save you a lot of time. Do you agree?
    Need a teacher? Use ChatGPT. Need a designer? Use Midjourney. Need a voice actor? Use Synthesis. Need a copywriter? Use Copy AI. Need an assistant? Use Alexa. You can use AI and save thousands of hours. submitted by /u/TheVellerShow [link] [comments]  ( 57 min )
    Free Inpainting Tool With Stable Diffusion! LAMA-CLEANER!
    submitted by /u/PuppetHere [link] [comments]  ( 53 min )
    Will/should AI be regulated?
    Seeing AI do so many wonderful things in no time is beautiful but also scary. It is cool to see it replace a profession when you are not from that field, but if you are, it will leave you without one. It can and will do wonderful things, but at the rate it is advancing (which will only get faster), most jobs will stop existing. What will people even do for a living when AI can do everything, faster and better? (INB4 universal basic income: the governments don't want to gift you money, they live off you, and if they did give you money, chances are it would be under some kind of strict social-credit slavery that will barely allow you to live, if they don't outright start killing people.) Communist utopias always end with everybody owning nothing or being disposable tools of the state. submitted by /u/Absolutelynobody54 [link] [comments]  ( 51 min )
    Google’s Monopoly On Search Could Be Coming To An End
    submitted by /u/liquidocelotYT [link] [comments]  ( 51 min )
    Are you a researcher, programmer, artist, physicist, or just tinkering with AI tools? Come join us; we are a Discord Community called Learn AI Together with just over 30'000 amazing members! Ask questions, find colleagues, share your projects, learn together, and much more!
    Programming is way more fun when you learn/work with someone. Help each other, ask questions, brainstorm, etc. There is just so much benefit to joining a community when you are in this field, especially when you cannot find the question you are looking for on Stack Overflow! 😉 The same goes for AI, which is why I created a Discord server two years ago, where anyone learning or working in the field could come and share their projects, learn together, work together, and much more. The community is now over 30'000 members, which is unbelievable! Likewise, if you are just tinkering with ChatGPT, DALLE or MidJourney, come join us and share your creations and the projects/companies/products you build (or find your next co-founder)! So glad to see it growing and see everyone so active. We have partnered with Towards AI to provide qualitative events, live streams, a community newsletter, free courses following recent developments, job opportunities, and more! p.s. we are always looking for contributors to our different projects (answer questions, moderation, help with open-source resources, podcast hosts...). Please reach out to me if interested! We also have some budget or cool merch we can send out :) Come join us if you are in the AI field! https://discord.gg/learnaitogether submitted by /u/OnlyProggingForFun [link] [comments]  ( 53 min )
    How can AI contribute to art historical analysis and research?
    submitted by /u/Effective-Divide-828 [link] [comments]  ( 54 min )
    I used Stable Diffusion to Draw One Piece Characters and This Happened...
    submitted by /u/Ziinxx [link] [comments]  ( 47 min )
    The Landscape of AI Tools
    submitted by /u/arnolds112 [link] [comments]  ( 51 min )
    What would you suggest on this? I'm working on a project for creating a trivia game with crowdsourced trivia questions (a question with multiple answers, of which only one is correct).
    I would like to implement machine learning services for the following purposes: verify the correctness of answers; verify and correct grammar; look for outdated questions (e.g., what team is Verstappen racing for?). Desirable: block offensive questions. I'm open to comments and suggestions! submitted by /u/WillPatagonia [link] [comments]  ( 51 min )
    AI Art Galleries
    Hey guys, I'm doing a research project on AI art and I need a lot of artwork from different years, all the way from the 1970s to today. Does anyone know how I can find galleries of AI art with the year each piece was created? Thanks! submitted by /u/Airikiskul [link] [comments]  ( 55 min )
    What are the best ai writers for long content?
    I use copy.ai to create full articles in minutes and, for now, it is my favourite. However, I would like to explore other options and I would love to read suggestions. submitted by /u/Luisvzoa [link] [comments]  ( 48 min )
    Best Artificial Intelligence books for beginners to experts to read in 2022
    submitted by /u/Lakshmireddys [link] [comments]  ( 47 min )
    How I became Supreme Leader of North Korea
    submitted by /u/J4k3zz [link] [comments]  ( 54 min )
    The problem isn’t AI, it’s requiring us to work to live
    submitted by /u/jamesj [link] [comments]  ( 79 min )
    dam got rejected in 4s 😥
    submitted by /u/TXR_TUBE [link] [comments]  ( 53 min )
    Is there a competent chat AI that will run on an older lower power machine, dedicated only to running the bot?
    I have several old Dell OptiPlex 7010 machines (Core i3, with as much as 8GB RAM each) that I can use, but ZERO knowledge of where to even start. TIA submitted by /u/copycat042 [link] [comments]  ( 49 min )
  • Open

    [D] Why are there no good generative music AIs?
    My theory: no good datasets. As opposed to image datasets like LAION, it is harder/illegal to get music datasets; shady methods are usually required to assemble large ones (like torrenting). The only music datasets I've found are classical, and even then very limited, as performances of classical music are still copyrighted. Therefore, large companies like OpenAI/Google are unable to take the risk of making a good generative music AI for legal reasons. Startups have a better chance because they have less to lose and can better hide the fact that they trained their model on copyrighted material. Other than that, I don't believe audio is harder to process than images, because the complete audio file can be reduced to its spectrogram, which is just a 2D image. TLDR: No good datasets submitted by /u/happyhammy [link] [comments]  ( 70 min )
    [D] Understanding batch sizes at larger learning rates (Myrtle AI)
    I'm trying to understand the following breakdown of batch sizes in the realm of high learning rates and fast model training from this post: https://myrtle.ai/learn/how-to-train-your-resnet-2-mini-batches/ Specifically, this bit: The results above suggest that if one wishes to train a neural network at high learning rates then there are two regimes to consider. For the current model and dataset, at batch size 128 we are safely in the regime where forgetfulness dominates and we should either focus on methods to reduce this (e.g. using larger models with sparse updates or perhaps natural gradient descent), or we should push batch sizes higher. At batch size 512 we enter the regime where curvature effects dominate and the focus should shift to mitigating these. In combination with the …  ( 66 min )
    [R] Talking About Large Language Models - Murray Shanahan 2022
    Paper: https://arxiv.org/abs/2212.03551 Twitter expanation: https://twitter.com/mpshanahan/status/1601641313933221888 Reddit discussion: https://www.reddit.com/r/agi/comments/zi0ks0/talking_about_large_language_models/ Abstract: Thanks to rapid progress in artificial intelligence, we have entered an era when technology and philosophy intersect in interesting ways. Sitting squarely at the centre of this intersection are large language models (LLMs). The more adept LLMs become at mimicking human language, the more vulnerable we become to anthropomorphism, to seeing the systems in which they are embedded as more human-like than they really are.This trend is amplified by the natural tendency to use philosophically loaded terms, such as "knows", "believes", and "thinks", when describin…  ( 64 min )
    [P] Inseq: A Toolkit for Interpreting Language Generation Models
    We recently open-sourced Inseq, a Python library built on top of 🤗 transformers and Pytorch, aimed at democratizing and commoditizing post-hoc interpretability analysis for sequence generation models. https://github.com/inseq-team/inseq Inseq supports thousands of 🤗 decoder-only and seq2seq models, with various attribution methods already baked in and many more to come. Attributing MetaAI's Galactica writing LaTeX formulas or GoogleAI Flan-T5 doing commonsense reasoning now takes only 3 lines of code! The Inseq CLI improves the user experience when conducting global analyses by enabling batched attribution of examples and even entire datasets from the 🤗 Hub directly from the console. Inseq is beginner-friendly but also fully extensible for advanced use cases, supporting attribution of custom functions and the extraction of step scores during generation. With Inseq, we aim to centralize and standardize some practices of the interpretability community working on NLG and NMT, to enable fair and reproducible evaluation. The project is still in its infancy, and feedback/contributions are very much appreciated! submitted by /u/SubstantialDig6663 [link] [comments]  ( 60 min )
    [Discussion] Using RL to create sensor networks
    So I'm working on a project where multiple sensors relay data to a central node in a synchronous fashion for real-time data capture, the main aim being to ensure the data is in sync, all done via BLE. I was wondering how I could use RL to create a control algorithm. Maybe a cooperative MARL network where each sensor is an agent, or something along those lines, or maybe even some other learning algorithm. I would appreciate advice from y'all; previous articles and works are welcome. Thanks!! submitted by /u/TittyMcSwag619 [link] [comments]  ( 63 min )
    [P] Release of lightly 1.2.39 - A python library for self-supervised learning
    Another year has passed, and we've seen exciting progress in research around self-supervised learning in computer vision. We're very excited that some of the recent models such as Masked Autoencoders (MAE) or Masked Siamese Networks (MSN) have been added to our OSS framework. The framework is also used more and more in research, ranging from medical imaging labs to big tech companies. Although we only have limited resources, we're happy to make at least a small contribution to the community. The framework is built on top of PyTorch and is compatible with frameworks such as PyTorch Lightning for scaling across multiple GPUs. We are curious to hear your feedback. submitted by /u/igorsusmelj [link] [comments]  ( 70 min )
    [R] Trying to recover recent paper about activity flow
    I am trying to recall a recent paper about deep learning activity flow where the authors introduced a penalty term which helps activity flow in the network. The authors show that this helps avoid the vanishing gradient problem and also show that even with poor initialization, they can train well because of their proposed method. The paper proposed an auxiliary loss at the activation level which was able to overcome poor weight initialization and use of sigmoid or tanh activation functions. I have been searching all day and can't find it. I think I originally found it on https://papers.labml.ai/papers/weekly/ submitted by /u/Ok-Teacher-22 [link] [comments]  ( 67 min )
    [P] Implemented Vision Transformers 🚀 from scratch using TensorFlow 2.x
    Hello Everyone 👋, I just implemented the paper named AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE popularly known as the vision transformer paper. This paper uses a Transformer encoder for image recognition. It achieves state-of-the-art performance without using convolutional layers given that we have a huge dataset and enough computational resources. Below I am sharing my implementation of this paper, please have a look and give it a 🌟 if you like it. This implementation provides easy-to-read code for understanding how the model works internally. My implementation: GitHub Link Thanks for your attention. 😀 submitted by /u/TensorDudee [link] [comments]  ( 69 min )
    [Research] Graph Embeddings for Graph shape?
    I am solving a graph-level problem. I want to fit graph embeddings to a learn-to-rank NN to rank the graphs by their "quality". The "quality" of the graphs is determined by whether they have a certain shape or structure, say self-loops and no loose ends, many split nodes and merging nodes, etc. The node and edge features are not under consideration. To my understanding, graph embeddings are best suited for graph similarity comparison; are there any techniques that fit my use case? submitted by /u/J00Nnn [link] [comments]  ( 58 min )
    [D] What would happen if you normalize each sample on its own before sending it to the neural net?
    The standard method is to normalize the entire dataset (the training part) then send it to the model to train on. However I’ve noticed that in this manner the model doesn’t really work well when dealing with values outside the range it was trained on. So how about normalizing each sample between a fixed range, say 0 to 1 and then sending them in. Of course the testing data and the values to predict on would also be normalized in the same way. Would it change the neural network for the better or worse? submitted by /u/xylont [link] [comments]  ( 63 min )
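    What the question proposes, as a minimal NumPy sketch (per-sample min-max scaling, computed independently of the rest of the dataset):

        import numpy as np

        def per_sample_minmax(X, eps=1e-8):
            """Scale each row of X to [0, 1] independently of all other rows."""
            lo = X.min(axis=1, keepdims=True)
            hi = X.max(axis=1, keepdims=True)
            return (X - lo) / (hi - lo + eps)

        X_train = np.random.randn(100, 16)
        X_test = 5.0 * np.random.randn(20, 16)   # well outside the training range
        # Every sample lands in [0, 1] regardless of its original range, but each
        # sample's absolute shift and scale information is discarded.
        Xn_train, Xn_test = per_sample_minmax(X_train), per_sample_minmax(X_test)

    The trade-off is that the network becomes invariant to each sample's own shift and scale, which helps when only the shape of a sample matters and hurts when its magnitude carries signal.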
    [D] Looking for a lightweight, simple network that can ingest unorganized pointclouds and produce 6dof poses
    As above. I know there is a deluge of papers out there, but I am looking for a modern but lightweight network that can consume unorganized point clouds, ideally in batch form (though I am not sure how this will work if the clouds have different numbers of points?), and produce a 6-DoF pose, plus optionally 3-DoF dimensions. I assume it's going to be some kind of lightweight PointNet++-type architecture, but it would be great if I could be pointed to some resources. submitted by /u/soulslicer0 [link] [comments]  ( 66 min )
  • Open

    Automatically identify languages in multi-lingual audio using Amazon Transcribe
    If you operate in a country with multiple official languages or across multiple regions, your audio files can contain different languages. Participants may be speaking entirely different languages or may switch between languages. Consider a customer service call to report a problem in an area with a substantial multi-lingual population. Although the conversation could begin […]  ( 6 min )
    Translate multiple source language documents to multiple target languages using Amazon Translate
    Enterprises need to translate business-critical content such as marketing materials, instruction manuals, and product catalogs across multiple languages to communicate with a global audience of customers, partners, and stakeholders. Identifying the source language in each document before calling a translate job creates complexities and adds another step to your workflow. For example, an international product […]  ( 5 min )
  • Open

    Who Said What? Recorder's On-device Solution for Labeling Speakers
    Posted by Quan Wang, Senior Staff Software Engineer, and Fan Zhang, Staff Software Engineer, Google In 2019 we launched Recorder, an audio recording app for Pixel phones that helps users create, manage, and edit audio recordings. It leverages recent developments in on-device machine learning to transcribe speech, recognize audio events, suggest tags for titles, and help users navigate transcripts. Nonetheless, some Recorder users found it difficult to navigate long recordings that have multiple speakers because it's not clear who said what. During the Made By Google event this year, we announced the "speaker labels" feature for the Recorder app. This opt-in feature annotates a recording transcript with unique and anonymous labels for each speaker (e.g., "Speaker 1", "Speaker 2", etc.)…  ( 91 min )
  • Open

    Machine learning and the arts: A creative continuum
    CAST Visiting Artist Andreas Refsgaard engages the MIT community in the ethics and play of creative coding.  ( 9 min )
  • Open

    Poisson distribution tail bounds
    Yesterday Terence Tao published a blog post on bounds for the Poisson probability distribution. Specifically, he wrote about Bennett’s inequalities and a refinement that he developed or at least made explicit. Tao writes: This observation is not difficult and is implicitly in the literature … I was not able to find a clean version of […] Poisson distribution tail bounds first appeared on John D. Cook.  ( 5 min )
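    For reference, the standard Chernoff/Bennett-type upper tail bound for the Poisson distribution that this discussion builds on can be stated as:

        % For X ~ Poisson(lambda) and t > 0, with u = t / lambda:
        P(X \ge \lambda + t) \le \exp\big(-\lambda\, h(u)\big),
        \qquad h(u) = (1 + u)\log(1 + u) - u.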
    Mentally calculating the day of the week in 2023
    Mentally calculating the day of the week will be especially easy in 2023. The five-step process discussed here reduces to three steps in 2023. One of the steps involves leap years, and 2023 is not a leap year. Another step involves calculating and adding in the “year share,” and the year share for 2023 is […] Mentally calculating the day of the week in 2023 first appeared on John D. Cook.  ( 6 min )
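    As a worked example of why 2023 is easy, using the common doomsday-style year-share formula (Cook's exact recipe may differ in its details):

        % Year share for a year 20YY, here with YY = 23:
        \Big( YY + \Big\lfloor \tfrac{YY}{4} \Big\rfloor \Big) \bmod 7
        = (23 + 5) \bmod 7 = 28 \bmod 7 = 0.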
  • Open

    I used Stable Diffusion to Draw One Piece Characters and This Happened...
    submitted by /u/Ziinxx [link] [comments]  ( 49 min )
  • Open

    Using RL to create sensor networks
    Hey, so I'm working on a project where multiple sensors relay data to a central node in a synchronous fashion for real-time data capture, the main aim being to ensure the data is in sync, all done via BLE. I was wondering how I could use RL to create a control algorithm, maybe a cooperative MARL network where each sensor is an agent, or something along these lines. Would appreciate advice from y'all; previous articles and works are welcome. Thanks!! submitted by /u/TittyMcSwag619 [link] [comments]  ( 50 min )
    Question on custom environment setup [openai-gym]
    Hi, I'm new to reinforcement learning (currently reading the Sutton book while trying to create a few things). I'm trying to design a custom environment using OpenAI Gym. Due to the lack of courses, etc., I'm reading the documentation to get a deeper understanding of how to design such environments. I came across an example, the so-called gym-any-trade environment, and saw how the developer created it using a Pandas dataframe to hold the needed information. It seems that each row of the dataframe used in this environment contains a time point with stock prices. I wasn't able, however, to see any place where this code iterates over the dataframe rows (like a loop "for each row in DF..." or something like that). It only creates a window of observation rows but does not iterate over the rows of the original DF explicitly. So my question is: does the gym-any-trade environment iterate over the dataframe rows by default? submitted by /u/tuliosarmento [link] [comments]  ( 56 min )
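    Not the actual gym-any-trade code, but a sketch of the usual pattern in dataframe-backed environments: the "loop over rows" is implicit, with step() advancing an internal row index each time the agent acts (class and reward logic here are hypothetical placeholders):

        import gym
        import numpy as np
        import pandas as pd

        class DataFrameEnv(gym.Env):
            """Each step() advances one row of the dataframe, so row iteration
            is implicit in the agent/environment interaction loop."""

            def __init__(self, df: pd.DataFrame, window: int = 10):
                self.df, self.window = df, window
                self.action_space = gym.spaces.Discrete(2)
                self.observation_space = gym.spaces.Box(
                    low=-np.inf, high=np.inf,
                    shape=(window, df.shape[1]), dtype=np.float32)

            def reset(self):
                self._tick = self.window          # start once a full window exists
                return self._obs()

            def step(self, action):
                self._tick += 1                   # the implicit row iteration
                done = self._tick >= len(self.df) - 1
                reward = 0.0                      # placeholder reward logic
                return self._obs(), reward, done, {}

            def _obs(self):
                w = self.df.iloc[self._tick - self.window:self._tick]
                return w.values.astype(np.float32)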
  • Open

    Tensor-based Sequential Learning via Hankel Matrix Representation for Next Item Recommendations. (arXiv:2212.05720v1 [cs.LG])
    Self-attentive transformer models have recently been shown to solve the next item recommendation task very efficiently. The learned attention weights capture sequential dynamics in user behavior and generalize well. Motivated by the special structure of learned parameter space, we question if it is possible to mimic it with an alternative and more lightweight approach. We develop a new tensor factorization-based model that ingrains the structural knowledge about sequential data within the learning process. We demonstrate how certain properties of a self-attention network can be reproduced with our approach based on special Hankel matrix representation. The resulting model has a shallow linear architecture and compares competitively to its neural counterpart.  ( 2 min )
    SchNetPack 2.0: A neural network toolbox for atomistic machine learning. (arXiv:2212.05517v1 [physics.chem-ph])
    SchNetPack is a versatile neural network toolbox that addresses both the requirements of method development and the application of atomistic machine learning. Version 2.0 comes with an improved data pipeline, modules for equivariant neural networks, as well as a PyTorch implementation of molecular dynamics. An optional integration with PyTorch Lightning and the Hydra configuration framework powers a flexible command-line interface. This makes SchNetPack 2.0 easily extendable with custom code and ready for complex training tasks such as the generation of 3D molecular structures.  ( 2 min )
    Retire: Robust Expectile Regression in High Dimensions. (arXiv:2212.05562v1 [stat.ME])
    High-dimensional data can often display heterogeneity due to heteroscedastic variance or inhomogeneous covariate effects. Penalized quantile and expectile regression methods offer useful tools to detect heteroscedasticity in high-dimensional data. The former is computationally challenging due to the non-smooth nature of the check loss, and the latter is sensitive to heavy-tailed error distributions. In this paper, we propose and study (penalized) robust expectile regression (retire), with a focus on iteratively reweighted $\ell_1$-penalization which reduces the estimation bias from $\ell_1$-penalization and leads to oracle properties. Theoretically, we establish the statistical properties of the retire estimator under two regimes: (i) low-dimensional regime in which $d \ll n$; (ii) high-dimensional regime in which $s\ll n\ll d$ with $s$ denoting the number of significant predictors. In the high-dimensional setting, we carefully characterize the solution path of the iteratively reweighted $\ell_1$-penalized retire estimation, adapted from the local linear approximation algorithm for folded-concave regularization. Under a mild minimum signal strength condition, we show that after as many as $\log(\log d)$ iterations the final iterate enjoys the oracle convergence rate. At each iteration, the weighted $\ell_1$-penalized convex program can be efficiently solved by a semismooth Newton coordinate descent algorithm. Numerical studies demonstrate the competitive performance of the proposed procedure compared with either non-robust or quantile regression based alternatives.  ( 2 min )
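    A toy sketch of the expectile (asymmetric squared) loss the abstract builds on, minimized here by plain subgradient descent with an $\ell_1$ penalty; this is not the paper's iteratively reweighted algorithm, just the underlying objective:

        import numpy as np

        def expectile_loss_grad(beta, X, y, tau, lam):
            """Subgradient of the asymmetric squared loss plus an l1 penalty."""
            r = y - X @ beta
            w = np.where(r < 0, 1 - tau, tau)               # asymmetric weights
            return -2 * X.T @ (w * r) / len(y) + lam * np.sign(beta)

        rng = np.random.default_rng(0)
        X = rng.normal(size=(500, 20))
        y = X[:, 0] + 0.1 * rng.standard_t(df=3, size=500)  # heavier-tailed noise
        beta = np.zeros(20)
        for _ in range(2000):                               # plain subgradient descent
            beta -= 0.01 * expectile_loss_grad(beta, X, y, tau=0.9, lam=0.01)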
    On an Interpretation of ResNets via Solution Constructions. (arXiv:2212.05663v1 [cs.LG])
    This paper first constructs a typical solution of ResNets for multi-category classifications by the principle of gate-network controls and deep-layer classifications, from which a general interpretation of the ResNet architecture is given and the performance mechanism is explained. We then use more solutions to further demonstrate the generality of that interpretation. The universal-approximation capability of ResNets is proved.  ( 2 min )
    GWRBoost: A geographically weighted gradient boosting method for explainable quantification of spatially-varying relationships. (arXiv:2212.05814v1 [cs.LG])
    The geographically weighted regression (GWR) is an essential tool for estimating the spatial variation of relationships between dependent and independent variables in geographical contexts. However, GWR suffers from the problem that the classical linear regressions which compose it are prone to underfitting, especially for large volumes of complex nonlinear data, causing inferior comparative performance. Some advanced models, such as the decision tree and the support vector machine, can learn features from complex data more effectively, but they cannot provide explainable quantification of the spatial variation of localized relationships. To address the above issues, we propose a geographically weighted gradient boosting regression model, GWRBoost, that applies the localized additive model and gradient boosting optimization method to alleviate the underfitting problem and retains explainable quantification capability for spatially-varying relationships between geographically located variables. Furthermore, we formulate the computation method of the Akaike information score for the proposed model to conduct comparative analysis with the classic GWR algorithm. Simulation experiments and an empirical case study are applied to prove the efficient performance and practical value of GWRBoost. The results show that our proposed model can reduce the RMSE by 18.3% in parameter estimation accuracy and AICc by 67.3% in goodness of fit.  ( 2 min )
    Hybrid Censored Quantile Regression Forest to Assess the Heterogeneous Effects. (arXiv:2212.05672v1 [stat.ME])
    In many applications, heterogeneous treatment effects on a censored response variable are of primary interest, and it is natural to evaluate the effects at different quantiles (e.g., median). The large number of potential effect modifiers, the unknown structure of the treatment effects, and the presence of right censoring pose significant challenges. In this paper, we develop a hybrid forest approach called Hybrid Censored Quantile Regression Forest (HCQRF) to assess the heterogeneous effects varying with high-dimensional variables. The hybrid estimation approach takes advantage of the random forests and the censored quantile regression. We propose a doubly-weighted estimation procedure that consists of a redistribution-of-mass weight to handle censoring and an adaptive nearest neighbor weight derived from the forest to handle high-dimensional effect functions. We propose a variable importance decomposition to measure the impact of a variable on the treatment effect function. Extensive simulation studies demonstrate the efficacy and stability of HCQRF. The result of the simulation study also convinces us of the effectiveness of the variable importance decomposition. We apply HCQRF to a clinical trial of colorectal cancer. We achieve insightful estimations of the treatment effect and meaningful variable importance results. The result of the variable importance also confirms the necessity of the decomposition.  ( 2 min )
    Explainable Performance. (arXiv:2212.05866v1 [stat.ML])
    We introduce the XPER (eXplainable PERformance) methodology to measure the specific contribution of the input features to the predictive or economic performance of a model. Our methodology offers several advantages. First, it is both model-agnostic and performance metric-agnostic. Second, XPER is theoretically founded as it is based on Shapley values. Third, the interpretation of the benchmark, which is inherent in any Shapley value decomposition, is meaningful in our context. Fourth, XPER is not plagued by model specification error, as it does not require re-estimating the model. Fifth, it can be implemented either at the model level or at the individual level. In an application based on auto loans, we find that performance can be explained by a surprisingly small number of features. XPER decompositions are rather stable across metrics, yet some feature contributions switch sign across metrics. Our analysis also shows that explaining model forecasts and model performance are two distinct tasks.  ( 2 min )
    Neural Continuous-Time Markov Models. (arXiv:2212.05378v1 [stat.ML])
    Continuous-time Markov chains are used to model stochastic systems where transitions can occur at irregular times, e.g., birth-death processes, chemical reaction networks, population dynamics, and gene regulatory networks. We develop a method to learn a continuous-time Markov chain's transition rate functions from fully observed time series. In contrast with existing methods, our method allows for transition rates to depend nonlinearly on both state variables and external covariates. The Gillespie algorithm is used to generate trajectories of stochastic systems where propensity functions (reaction rates) are known. Our method can be viewed as the inverse: given trajectories of a stochastic reaction network, we generate estimates of the propensity functions. While previous methods used linear or log-linear methods to link transition rates to covariates, we use neural networks, increasing the capacity and potential accuracy of learned models. In the chemical context, this enables the method to learn propensity functions from non-mass-action kinetics. We test our method with synthetic data generated from a variety of systems with known transition rates. We show that our method learns these transition rates with considerably more accuracy than log-linear methods, in terms of mean absolute error between ground truth and predicted transition rates. We also demonstrate an application of our methods to open-loop control of a continuous-time Markov chain.  ( 2 min )
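    The Gillespie algorithm mentioned in the abstract is easy to state; a standard reference implementation (not the paper's code) looks like this:

        import numpy as np

        def gillespie(x0, stoich, propensity, t_max, rng=np.random.default_rng()):
            """Simulate one trajectory of a continuous-time Markov chain.
            stoich: (n_reactions, n_species) state-change matrix.
            propensity: function state -> nonnegative rate per reaction."""
            t, x = 0.0, np.array(x0, dtype=float)
            times, states = [t], [x.copy()]
            while t < t_max:
                a = propensity(x)
                a0 = a.sum()
                if a0 == 0:                      # absorbing state reached
                    break
                t += rng.exponential(1.0 / a0)   # waiting time to the next event
                j = rng.choice(len(a), p=a / a0) # which reaction fires
                x += stoich[j]
                times.append(t); states.append(x.copy())
            return np.array(times), np.array(states)

        # Example: birth-death process with birth rate 2.0, per-capita death rate 0.1
        stoich = np.array([[+1], [-1]])
        rates = lambda x: np.array([2.0, 0.1 * x[0]])
        ts, xs = gillespie([0], stoich, rates, t_max=50.0)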
    Industry-Scale Orchestrated Federated Learning for Drug Discovery. (arXiv:2210.08871v2 [cs.LG] UPDATED)
    To apply federated learning to drug discovery, we developed a novel platform in the context of the European Innovative Medicines Initiative (IMI) project MELLODDY (grant n°831472), which comprised 10 pharmaceutical companies, academic research labs, large industrial companies and startups. The MELLODDY platform was the first industry-scale platform to enable the creation of a global federated model for drug discovery without sharing the confidential data sets of the individual partners. The federated model was trained on the platform by aggregating the gradients of all contributing partners in a cryptographic, secure way following each training iteration. The platform was deployed on an Amazon Web Services (AWS) multi-account architecture running Kubernetes clusters in private subnets. Organisationally, the roles of the different partners were codified as different rights and permissions on the platform and administrated in a decentralized way. The MELLODDY platform generated new scientific discoveries which are described in a companion paper.
    Extra-Newton: A First Approach to Noise-Adaptive Accelerated Second-Order Methods. (arXiv:2211.01832v2 [math.OC] UPDATED)
    This work proposes a universal and adaptive second-order method for minimizing second-order smooth, convex functions. Our algorithm achieves $O(\sigma / \sqrt{T})$ convergence when the oracle feedback is stochastic with variance $\sigma^2$, and improves its convergence to $O( 1 / T^3)$ with deterministic oracles, where $T$ is the number of iterations. Our method also interpolates these rates without knowing the nature of the oracle apriori, which is enabled by a parameter-free adaptive step-size that is oblivious to the knowledge of smoothness modulus, variance bounds and the diameter of the constrained set. To our knowledge, this is the first universal algorithm with such global guarantees within the second-order optimization literature.
    Quasi Black-Box Variational Inference with Natural Gradients for Bayesian Learning. (arXiv:2205.11568v3 [stat.ML] UPDATED)
    We develop an optimization algorithm suitable for Bayesian learning in complex models. Our approach relies on natural gradient updates within a general black-box framework for efficient training with limited model-specific derivations. It applies within the class of exponential-family variational posterior distributions; we extensively discuss the Gaussian case, for which the updates have a rather simple form. Our Quasi Black-box Variational Inference (QBVI) framework is readily applicable to a wide class of Bayesian inference problems and is simple to implement, as the updates of the variational posterior do not involve gradients with respect to the model parameters, nor the prescription of the Fisher information matrix. We develop QBVI under different hypotheses for the posterior covariance matrix, discuss details of its robust and feasible implementation, and provide a number of real-world applications to demonstrate its effectiveness.
    Optimal Learning Rates for Regularized Least-Squares with a Fourier Capacity Condition. (arXiv:2204.07856v3 [math.ST] UPDATED)
    We derive minimax adaptive rates for a new, broad class of Tikhonov-regularized learning problems in Hilbert scales under general source conditions. Our analysis does not require the regression function to be contained in the hypothesis class, and most notably does not employ the conventional a priori assumptions on kernel eigendecay. Using the theory of interpolation, we demonstrate that the spectrum of the Mercer operator can be inferred in the presence of "tight" $L^{\infty}$ embeddings of suitable Hilbert scales. Our analysis utilizes a new Fourier capacity condition, which characterizes the optimal Lorentz range space of a modified Mercer operator in certain parameter regimes.
    A law of adversarial risk, interpolation, and label noise. (arXiv:2207.03933v2 [stat.ML] UPDATED)
    In supervised learning, it has been shown that label noise in the data can be interpolated without penalties on test accuracy. We show that interpolating label noise induces adversarial vulnerability, and prove the first theorem showing the relationship between label noise and adversarial risk for any data distribution. Our results are almost tight if we do not make any assumptions on the inductive bias of the learning algorithm. We then investigate how different components of this problem affect this result including properties of the distribution. We also discuss non-uniform label noise distributions; and prove a new theorem showing uniform label noise induces nearly as large an adversarial risk as the worst poisoning with the same noise rate. Then, we provide theoretical and empirical evidence that uniform label noise is more harmful than typical real-world label noise. Finally, we show how inductive biases amplify the effect of label noise and argue the need for future work in this direction.
    Distributional neural networks for electricity price forecasting. (arXiv:2207.02832v2 [q-fin.ST] UPDATED)
    We present a novel approach to probabilistic electricity price forecasting which utilizes distributional neural networks. The model structure is based on a deep neural network that contains a so-called probability layer. The network's output is a parametric distribution with 2 (normal) or 4 (Johnson's SU) parameters. In a forecasting study involving day-ahead electricity prices in the German market, our approach significantly outperforms state-of-the-art benchmarks, including LASSO-estimated regressions and deep neural networks combined with Quantile Regression Averaging. The obtained results not only emphasize the importance of higher moments when modeling volatile electricity prices, but also -- given that probabilistic forecasting is the essence of risk management -- provide important implications for managing portfolios in the power sector.
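    A sketch of the "probability layer" idea in PyTorch for the 2-parameter (normal) case, trained by negative log-likelihood; the 4-parameter Johnson's SU variant follows the same pattern (layer sizes here are illustrative, not the paper's):

        import torch
        import torch.nn as nn

        class NormalHead(nn.Module):
            """A 'probability layer': the network outputs the parameters of a
            Normal distribution instead of a point forecast."""
            def __init__(self, d_in, d_hidden=64):
                super().__init__()
                self.body = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU())
                self.mu = nn.Linear(d_hidden, 1)
                self.log_sigma = nn.Linear(d_hidden, 1)

            def forward(self, x):
                h = self.body(x)
                return torch.distributions.Normal(self.mu(h), self.log_sigma(h).exp())

        model = NormalHead(d_in=16)
        opt = torch.optim.Adam(model.parameters(), lr=1e-3)
        x, y = torch.randn(32, 16), torch.randn(32, 1)
        opt.zero_grad()
        loss = -model(x).log_prob(y).mean()   # train by negative log-likelihood
        loss.backward()
        opt.step()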
    The universal approximation theorem for complex-valued neural networks. (arXiv:2012.03351v2 [math.FA] UPDATED)
    We generalize the classical universal approximation theorem for neural networks to the case of complex-valued neural networks. Precisely, we consider feedforward networks with a complex activation function $\sigma : \mathbb{C} \to \mathbb{C}$ in which each neuron performs the operation $\mathbb{C}^N \to \mathbb{C}, z \mapsto \sigma(b + w^T z)$ with weights $w \in \mathbb{C}^N$ and a bias $b \in \mathbb{C}$, and with $\sigma$ applied componentwise. We completely characterize those activation functions $\sigma$ for which the associated complex networks have the universal approximation property, meaning that they can uniformly approximate any continuous function on any compact subset of $\mathbb{C}^d$ arbitrarily well. Unlike the classical case of real networks, the set of "good activation functions" which give rise to networks with the universal approximation property differs significantly depending on whether one considers deep networks or shallow networks: For deep networks with at least two hidden layers, the universal approximation property holds as long as $\sigma$ is neither a polynomial, nor a holomorphic function, nor an antiholomorphic function. Shallow networks, on the other hand, are universal if and only if the real part or the imaginary part of $\sigma$ is not a polyharmonic function.
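    The neuron operation from the abstract, as a small NumPy sketch; modReLU is an illustrative activation that (for negative bias) is continuous but neither a polynomial, holomorphic, nor antiholomorphic, so by the theorem it qualifies for deep-network universality:

        import numpy as np

        def complex_neuron(z, w, b, sigma):
            """One neuron of a complex-valued network: C^N -> C, z -> sigma(b + w^T z)."""
            return sigma(b + w @ z)

        def modrelu(z, bias=-0.5):
            """modReLU: ReLU(|z| + bias) * z/|z| (illustrative activation choice)."""
            mag = np.abs(z)
            return np.maximum(mag + bias, 0.0) * z / np.maximum(mag, 1e-12)

        z = np.array([1.0 + 2.0j, 0.5 - 1.0j])
        w = np.array([0.3 - 0.1j, 0.2 + 0.4j])
        print(complex_neuron(z, w, b=0.1 + 0.0j, sigma=modrelu))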
    Nonparametric Learning of Two-Layer ReLU Residual Units. (arXiv:2008.07648v3 [cs.LG] UPDATED)
    We describe an algorithm that learns two-layer residual units using rectified linear unit (ReLU) activation: suppose the input $\mathbf{x}$ is from a distribution with support space $\mathbb{R}^d$ and the ground-truth generative model is a residual unit of this type, given by $\mathbf{y} = \boldsymbol{B}^\ast\left[\left(\boldsymbol{A}^\ast\mathbf{x}\right)^+ + \mathbf{x}\right]$, where ground-truth network parameters $\boldsymbol{A}^\ast \in \mathbb{R}^{d\times d}$ represent a full-rank matrix with nonnegative entries and $\boldsymbol{B}^\ast \in \mathbb{R}^{m\times d}$ is full-rank with $m \geq d$ and for $\boldsymbol{c} \in \mathbb{R}^d$, $[\boldsymbol{c}^{+}]_i = \max\{0, c_i\}$. We design layer-wise objectives as functionals whose analytic minimizers express the exact ground-truth network in terms of its parameters and nonlinearities. Following this objective landscape, learning residual units from finite samples can be formulated using convex optimization of a nonparametric function: for each layer, we first formulate the corresponding empirical risk minimization (ERM) as a positive semi-definite quadratic program (QP), then we show the solution space of the QP can be equivalently determined by a set of linear inequalities, which can then be efficiently solved by linear programming (LP). We further prove the strong statistical consistency of our algorithm, and demonstrate its robustness and sample efficiency through experimental results on synthetic data and a set of benchmark regression datasets.
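    Sampling from the ground-truth residual-unit model in the abstract is straightforward; a NumPy sketch (dimensions are arbitrary illustrative choices):

        import numpy as np

        rng = np.random.default_rng(0)
        d, m, n = 4, 6, 1000
        A = np.abs(rng.normal(size=(d, d)))       # nonnegative entries, full rank w.h.p.
        B = rng.normal(size=(m, d))               # m >= d, full rank w.h.p.
        X = rng.normal(size=(n, d))
        Y = (np.maximum(X @ A.T, 0.0) + X) @ B.T  # y = B[(Ax)^+ + x], applied row-wise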
    Resource-Efficient Neural Networks for Embedded Systems. (arXiv:2001.03048v2 [stat.ML] UPDATED)
    While machine learning is traditionally a resource intensive task, embedded systems, autonomous navigation, and the vision of the Internet of Things fuel the interest in resource-efficient approaches. These approaches aim for a carefully chosen trade-off between performance and resource consumption in terms of computation and energy. The development of such approaches is among the major challenges in current machine learning research and key to ensuring a smooth transition of machine learning technology from a scientific environment with virtually unlimited computing resources into everyday applications. In this article, we provide an overview of the current state of the art of machine learning techniques facilitating these real-world requirements. In particular, we focus on deep neural networks (DNNs), the predominant machine learning models of the past decade. We give a comprehensive overview of the vast literature that can be mainly split into three non-mutually exclusive categories: (i) quantized neural networks, (ii) network pruning, and (iii) structural efficiency. These techniques can be applied during training or as post-processing, and they are widely used to reduce the computational demands in terms of memory footprint, inference speed, and energy efficiency. We also briefly discuss different concepts of embedded hardware for DNNs and their compatibility with machine learning techniques as well as potential for energy and latency reduction. We substantiate our discussion with experiments on well-known benchmark datasets using compression techniques (quantization, pruning) for a set of resource-constrained embedded systems, such as CPUs, GPUs and FPGAs. The obtained results highlight the difficulty of finding good trade-offs between resource efficiency and predictive performance.
    VO$Q$L: Towards Optimal Regret in Model-free RL with Nonlinear Function Approximation. (arXiv:2212.06069v1 [cs.LG])
    We study time-inhomogeneous episodic reinforcement learning (RL) under general function approximation and sparse rewards. We design a new algorithm, Variance-weighted Optimistic $Q$-Learning (VO$Q$L), based on $Q$-learning and bound its regret assuming completeness and bounded Eluder dimension for the regression function class. As a special case, VO$Q$L achieves $\tilde{O}(d\sqrt{HT}+d^6H^{5})$ regret over $T$ episodes for a horizon $H$ MDP under ($d$-dimensional) linear function approximation, which is asymptotically optimal. Our algorithm incorporates weighted regression-based upper and lower bounds on the optimal value function to obtain this improved regret. The algorithm is computationally efficient given a regression oracle over the function class, making this the first computationally tractable and statistically optimal approach for linear MDPs.
    Isotropic Gaussian Processes on Finite Spaces of Graphs. (arXiv:2211.01689v2 [stat.ML] UPDATED)
    We propose a principled way to define Gaussian process priors on various sets of unweighted graphs: directed or undirected, with or without loops. We endow each of these sets with a geometric structure, inducing the notions of closeness and symmetries, by turning them into a vertex set of an appropriate metagraph. Building on this, we describe the class of priors that respect this structure and are analogous to the Euclidean isotropic processes, like squared exponential or Matérn. We propose an efficient computational technique for the ostensibly intractable problem of evaluating these priors' kernels, making such Gaussian processes usable within the usual toolboxes and downstream applications. We go further to consider sets of equivalence classes of unweighted graphs and define the appropriate versions of priors thereon. We prove a hardness result, showing that in this case, exact kernel computation cannot be performed efficiently. However, we propose a simple Monte Carlo approximation for handling moderately sized cases. Inspired by applications in chemistry, we illustrate the proposed techniques on a real molecular property prediction task in the small data regime.
    Auto-Encoding Variational Bayes. (arXiv:1312.6114v11 [stat.ML] UPDATED)
    How can we perform efficient inference and learning in directed probabilistic models, in the presence of continuous latent variables with intractable posterior distributions, and large datasets? We introduce a stochastic variational inference and learning algorithm that scales to large datasets and, under some mild differentiability conditions, even works in the intractable case. Our contributions are two-fold. First, we show that a reparameterization of the variational lower bound yields a lower bound estimator that can be straightforwardly optimized using standard stochastic gradient methods. Second, we show that for i.i.d. datasets with continuous latent variables per datapoint, posterior inference can be made especially efficient by fitting an approximate inference model (also called a recognition model) to the intractable posterior using the proposed lower bound estimator. Theoretical advantages are reflected in experimental results.
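    The core of the paper's first contribution is the reparameterization trick; a minimal PyTorch sketch, with the closed-form Gaussian KL term commonly used in the resulting lower bound:

        import torch

        def reparameterized_sample(mu, log_var):
            """z = mu + sigma * eps keeps sampling differentiable w.r.t. (mu, log_var),
            which is what makes the lower bound trainable by standard SGD."""
            eps = torch.randn_like(mu)
            return mu + (0.5 * log_var).exp() * eps

        def gaussian_elbo(recon_log_prob, mu, log_var):
            """Lower bound with the closed-form KL(N(mu, sigma^2) || N(0, I)) term."""
            kl = -0.5 * (1 + log_var - mu.pow(2) - log_var.exp()).sum(dim=1)
            return (recon_log_prob - kl).mean()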
    Continuous Conditional Generative Adversarial Networks: Novel Empirical Losses and Label Input Mechanisms. (arXiv:2011.07466v8 [cs.CV] UPDATED)
    This work proposes the continuous conditional generative adversarial network (CcGAN), the first generative model for image generation conditional on continuous, scalar conditions (termed regression labels). Existing conditional GANs (cGANs) are mainly designed for categorical conditions (e.g., class labels); conditioning on regression labels is mathematically distinct and raises two fundamental problems: (P1) Since there may be very few (even zero) real images for some regression labels, minimizing existing empirical versions of cGAN losses (aka empirical cGAN losses) often fails in practice; (P2) Since regression labels are scalar and infinitely many, conventional label input methods are not applicable. The proposed CcGAN solves the above problems, respectively, by (S1) reformulating existing empirical cGAN losses to be appropriate for the continuous scenario; and (S2) proposing a naive label input (NLI) method and an improved label input (ILI) method to incorporate regression labels into the generator and the discriminator. The reformulation in (S1) leads to two novel empirical discriminator losses, termed the hard vicinal discriminator loss (HVDL) and the soft vicinal discriminator loss (SVDL) respectively, and a novel empirical generator loss. The error bounds of a discriminator trained with HVDL and SVDL are derived under mild assumptions in this work. Two new benchmark datasets (RC-49 and Cell-200) and a novel evaluation metric (Sliding Fréchet Inception Distance) are also proposed for this continuous scenario. Our experiments on the Circular 2-D Gaussians, RC-49, UTKFace, Cell-200, and Steering Angle datasets show that CcGAN is able to generate diverse, high-quality samples from the image distribution conditional on a given regression label. Moreover, in these experiments, CcGAN substantially outperforms cGAN both visually and quantitatively.
    Bivariate Causal Discovery for Categorical Data via Classification with Optimal Label Permutation. (arXiv:2209.08579v2 [stat.ML] UPDATED)
    Causal discovery for quantitative data has been extensively studied but less is known for categorical data. We propose a novel causal model for categorical data based on a new classification model, termed classification with optimal label permutation (COLP). By design, COLP is a parsimonious classifier, which gives rise to a provably identifiable causal model. A simple learning algorithm via comparing likelihood functions of causal and anti-causal models suffices to learn the causal direction. Through experiments with synthetic and real data, we demonstrate the favorable performance of the proposed COLP-based causal model compared to state-of-the-art methods. We also make available an accompanying R package COLP, which contains the proposed causal discovery algorithm and a benchmark dataset of categorical cause-effect pairs.
    Moving Metric Detection and Alerting System at eBay. (arXiv:2004.02360v2 [cs.CY] UPDATED)
    At eBay, there are thousands of product health metrics for different domain teams to monitor. We built a two-phase alerting system to notify users with actionable alerts based on anomaly detection and alert retrieval. In the first phase, we developed an efficient anomaly detection algorithm, called Moving Metric Detector (MMD), to identify potential alerts among metrics with distribution-agnostic criteria. In the second phase (alert retrieval), we built additional logic with feedback to select valid actionable alerts using a point-wise ranking model and business rules. Compared with other trend and seasonality decomposition methods, our decomposer is faster and better at detecting anomalies in unsupervised cases. Our two-phase approach dramatically improves alert precision and avoids alert spamming in eBay production.  ( 2 min )
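    MMD itself is not public, but a generic decompose-then-threshold detector of the kind described would look roughly like this (illustrative only, not eBay's algorithm):

        import numpy as np

        def decompose_and_flag(y, window=24, z_thresh=4.0):
            """Remove a moving-average trend, then flag residuals whose robust
            z-score is extreme (edge effects of the moving average are ignored)."""
            trend = np.convolve(y, np.ones(window) / window, mode="same")
            resid = y - trend
            med = np.median(resid)
            mad = np.median(np.abs(resid - med)) + 1e-9
            robust_z = 0.6745 * (resid - med) / mad
            return np.where(np.abs(robust_z) > z_thresh)[0]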
    CausalEGM: a general causal inference framework by encoding generative modeling. (arXiv:2212.05925v1 [stat.ML])
    Although understanding and characterizing causal effects have become essential in observational studies, it is challenging when the confounders are high-dimensional. In this article, we develop a general framework $\textit{CausalEGM}$ for estimating causal effects by encoding generative modeling, which can be applied in both binary and continuous treatment settings. Under the potential outcome framework with unconfoundedness, we establish a bidirectional transformation between the high-dimensional confounders space and a low-dimensional latent space where the density is known (e.g., multivariate normal distribution). Through this, CausalEGM simultaneously decouples the dependencies of confounders on both treatment and outcome and maps the confounders to the low-dimensional latent space. By conditioning on the low-dimensional latent features, CausalEGM can estimate the causal effect for each individual or the average causal effect within a population. Our theoretical analysis shows that the excess risk for CausalEGM can be bounded through empirical process theory. Under an assumption on encoder-decoder networks, the consistency of the estimate can be guaranteed. In a series of experiments, CausalEGM demonstrates superior performance over existing methods for both binary and continuous treatments. Specifically, we find CausalEGM to be substantially more powerful than competing methods in the presence of large sample sizes and high dimensional confounders. The software of CausalEGM is freely available at https://github.com/SUwonglab/CausalEGM.
    Corruption-Robust Algorithms with Uncertainty Weighting for Nonlinear Contextual Bandits and Markov Decision Processes. (arXiv:2212.05949v1 [stat.ML])
    Despite the significant interest and progress in reinforcement learning (RL) problems with adversarial corruption, current works are either confined to the linear setting or lead to an undesired $\tilde{O}(\sqrt{T}\zeta)$ regret bound, where $T$ is the number of rounds and $\zeta$ is the total amount of corruption. In this paper, we consider the contextual bandit with general function approximation and propose a computationally efficient algorithm to achieve a regret of $\tilde{O}(\sqrt{T}+\zeta)$. The proposed algorithm relies on the recently developed uncertainty-weighted least-squares regression from linear contextual bandit \citep{he2022nearly} and a new weighted estimator of uncertainty for the general function class. In contrast to the existing analysis that heavily relies on the linear structure, we develop a novel technique to control the sum of weighted uncertainty, thus establishing the final regret bounds. We then generalize our algorithm to the episodic MDP setting and first achieve an additive dependence on the corruption level $\zeta$ in the scenario of general function approximation. Notably, our algorithms achieve regret bounds that either nearly match the performance lower bound or improve on the existing methods for all the corruption levels and in both known and unknown $\zeta$ cases.
    Nearly Minimax Optimal Reinforcement Learning for Linear Markov Decision Processes. (arXiv:2212.06132v1 [cs.LG])
    We study reinforcement learning (RL) with linear function approximation. For episodic time-inhomogeneous linear Markov decision processes (linear MDPs) whose transition dynamic can be parameterized as a linear function of a given feature mapping, we propose the first computationally efficient algorithm that achieves the nearly minimax optimal regret $\tilde O(d\sqrt{H^3K})$, where $d$ is the dimension of the feature mapping, $H$ is the planning horizon, and $K$ is the number of episodes. Our algorithm is based on a weighted linear regression scheme with a carefully designed weight, which depends on a new variance estimator that (1) directly estimates the variance of the \emph{optimal} value function, (2) monotonically decreases with respect to the number of episodes to ensure a better estimation accuracy, and (3) uses a rare-switching policy to update the value function estimator to control the complexity of the estimated value function class. Our work provides a complete answer to optimal RL with linear MDPs, and the developed algorithm and theoretical tools may be of independent interest.
    Semi-Discrete Normalizing Flows through Differentiable Tessellation. (arXiv:2203.06832v4 [cs.LG] UPDATED)
    Mapping between discrete and continuous distributions is a difficult task and many have had to resort to heuristical approaches. We propose a tessellation-based approach that directly learns quantization boundaries in a continuous space, complete with exact likelihood evaluations. This is done through constructing normalizing flows on convex polytopes parameterized using a simple homeomorphism with an efficient log determinant Jacobian. We explore this approach in two application settings, mapping from discrete to continuous and vice versa. Firstly, a Voronoi dequantization allows automatically learning quantization boundaries in a multidimensional space. The location of boundaries and distances between regions can encode useful structural relations between the quantized discrete values. Secondly, a Voronoi mixture model has near-constant computation cost for likelihood evaluation regardless of the number of mixture components. Empirically, we show improvements over existing methods across a range of structured data modalities.
    Stochastic Optimization for Spectral Risk Measures. (arXiv:2212.05149v1 [stat.ML])
    Spectral risk objectives - also called $L$-risks - allow for learning systems to interpolate between optimizing average-case performance (as in empirical risk minimization) and worst-case performance on a task. We develop stochastic algorithms to optimize these quantities by characterizing their subdifferential and addressing challenges such as biasedness of subgradient estimates and non-smoothness of the objective. We show theoretically and experimentally that out-of-the-box approaches such as stochastic subgradient and dual averaging are hindered by bias and that our approach outperforms them.
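    The objective itself is simple to evaluate; a NumPy sketch showing how the weight vector interpolates between average-case and worst-case risk (the paper's contribution is the stochastic optimization of this quantity, not its evaluation):

        import numpy as np

        def spectral_risk(losses, sigma):
            """L-risk: dot the sorted losses with nondecreasing weights summing to 1."""
            return np.sort(losses) @ sigma

        losses = np.random.rand(1000)
        n = losses.size
        uniform = np.full(n, 1.0 / n)                         # empirical risk (average case)
        cvar = np.r_[np.zeros(n - 100), np.full(100, 1e-2)]   # mean of the worst 10% (CVaR)
        print(spectral_risk(losses, uniform), spectral_risk(losses, cvar))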
    Debiased Machine Learning of Set-Identified Linear Models. (arXiv:1712.10024v5 [stat.ML] UPDATED)
    This paper provides estimation and inference methods for an identified set's boundary (i.e., support function) where the selection among a very large number of covariates is based on modern regularized tools. I characterize the boundary using a semiparametric moment equation. Combining Neyman-orthogonality and sample splitting ideas, I construct a root-N consistent, uniformly asymptotically Gaussian estimator of the boundary and propose a multiplier bootstrap procedure to conduct inference. I apply this result to the partially linear model, the partially linear IV model and the average partial derivative with an interval-valued outcome.
    Concentration of Random Feature Matrices in High-Dimensions. (arXiv:2204.06935v2 [stat.ML] UPDATED)
    The spectra of random feature matrices provide essential information on the conditioning of the linear system used in random feature regression problems and are thus connected to the consistency and generalization of random feature models. Random feature matrices are asymmetric rectangular nonlinear matrices depending on two input variables, the data and the weights, which can make their characterization challenging. We consider two settings for the two input variables, either both are random variables or one is a random variable and the other is well-separated, i.e. there is a minimum distance between points. With conditions on the dimension, the complexity ratio, and the sampling variance, we show that the singular values of these matrices concentrate near their full expectation and near one with high probability. In particular, since the dimension depends only on the logarithm of the number of random weights or the number of data points, our complexity bounds can be achieved even in moderate dimensions for many practical settings. The theoretical results are verified with numerical experiments.
    Improving Self-Supervised Learning by Characterizing Idealized Representations. (arXiv:2209.06235v2 [cs.LG] UPDATED)
    Despite the empirical successes of self-supervised learning (SSL) methods, it is unclear what characteristics of their representations lead to high downstream accuracies. In this work, we characterize properties that SSL representations should ideally satisfy. Specifically, we prove necessary and sufficient conditions such that for any task invariant to given data augmentations, desired probes (e.g., linear or MLP) trained on that representation attain perfect accuracy. These requirements lead to a unifying conceptual framework for improving existing SSL methods and deriving new ones. For contrastive learning, our framework prescribes simple but significant improvements to previous methods such as using asymmetric projection heads. For non-contrastive learning, we use our framework to derive a simple and novel objective. Our resulting SSL algorithms outperform baselines on standard benchmarks, including SwAV+multicrops on linear probing of ImageNet.
    Improving Expert Predictions with Prediction Sets. (arXiv:2201.12006v4 [cs.LG] UPDATED)
    Automated decision support systems promise to help human experts solve tasks more efficiently and accurately. However, existing systems typically require experts to understand when to cede agency to the system or when to exercise their own agency. Moreover, if the experts develop a misplaced trust in the system, their performance may worsen. In this work, we lift the above requirement and develop automated decision support systems that, by design, do not require experts to understand when each of their recommendations is accurate to improve their performance. To this end, we focus on multiclass classification tasks and consider an automated decision support system that, for each data sample, uses a classifier to recommend a subset of labels to a human expert. We first show that, by looking at the design of such a system from the perspective of conformal prediction, we can ensure that the probability that the recommended subset of labels contains the true label matches almost exactly a target probability value with high probability. Then, we develop an efficient and near-optimal search method to find the target probability value under which the expert benefits the most from using our system. Experiments on synthetic and real data demonstrate that our system can help the experts make more accurate predictions and is robust to the accuracy of the classifier it relies on.
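    A standard split-conformal recipe for such recommended label subsets, assuming calibrated classifier probabilities (the paper additionally searches for the coverage level under which the expert benefits most; this sketch only covers the set construction):

        import numpy as np

        def conformal_label_sets(probs_cal, y_cal, probs_test, alpha=0.1):
            """Split conformal prediction: for each test point, recommend every
            label whose nonconformity score clears a calibrated threshold, so
            the set contains the true label with probability ~ 1 - alpha."""
            n = len(y_cal)
            scores = 1.0 - probs_cal[np.arange(n), y_cal]  # nonconformity of true labels
            level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
            qhat = np.quantile(scores, level)
            return [np.where(1.0 - p <= qhat)[0] for p in probs_test]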
    Corruption-tolerant Algorithms for Generalized Linear Models. (arXiv:2212.05430v1 [cs.LG])
    This paper presents SVAM (Sequential Variance-Altered MLE), a unified framework for learning generalized linear models under adversarial label corruption in training data. SVAM extends to tasks such as least squares regression, logistic regression, and gamma regression, whereas many existing works on learning with label corruptions focus only on least squares regression. SVAM is based on a novel variance reduction technique that may be of independent interest and works by iteratively solving weighted MLEs over variance-altered versions of the GLM objective. SVAM offers provable model recovery guarantees superior to the state-of-the-art for robust regression even when a constant fraction of training labels are adversarially corrupted. SVAM also empirically outperforms several existing problem-specific techniques for robust regression and classification. Code for SVAM is available at https://github.com/purushottamkar/svam/
    New Paradigms for Exploiting Parallel Experiments in Bayesian Optimization. (arXiv:2210.01071v3 [stat.ML] UPDATED)
    Bayesian optimization (BO) is one of the most effective methods for closed-loop experimental design and black-box optimization. However, a key limitation of BO is that it is an inherently sequential algorithm (one experiment is proposed per round) and thus cannot directly exploit high-throughput (parallel) experiments. Diverse modifications to the BO framework have been proposed in the literature to enable exploitation of parallel experiments, but such approaches are limited in the degree of parallelization that they can achieve and can lead to redundant experiments (thus wasting resources and potentially compromising performance). In this work, we present new parallel BO paradigms that exploit the structure of the system to partition the design space. Specifically, we propose an approach that partitions the design space by following the level sets of the performance function, and an approach that exploits partially-separable structures found in the performance function. We conduct extensive numerical experiments using a reactor case study to benchmark the effectiveness of these approaches against a variety of state-of-the-art parallel algorithms reported in the literature. Our computational results show that our approaches significantly reduce the required search time and increase the probability of finding a global (rather than local) solution.
    Acceptance Rates of Invertible Neural Networks on Electron Spectra from Near-Critical Laser-Plasmas: A Comparison. (arXiv:2212.05836v1 [physics.plasm-ph])
    While the interaction of ultra-intense ultra-short laser pulses with near- and overcritical plasmas cannot be directly observed, experimentally accessible quantities (observables) often only indirectly give information about the underlying plasma dynamics. Furthermore, the information provided by observables is incomplete, making the inverse problem highly ambiguous. Therefore, in order to infer plasma dynamics as well as experimental parameters, the full distribution over parameters given an observation needs to be considered, requiring that models are flexible and account for the information lost in the forward process. Invertible Neural Networks (INNs) have been designed to efficiently model both the forward and inverse process, providing the full conditional posterior given a specific measurement. In this work, we benchmark INNs and standard statistical methods on synthetic electron spectra. First, we provide experimental results with respect to the acceptance rate, where our results show increases in acceptance rates up to a factor of 10. Additionally, we show that this increased acceptance rate also results in a corresponding speed-up for INNs. Lastly, we propose a composite algorithm that utilizes INNs and promises low runtimes while preserving high accuracy.
    Differentiable Programming à la Moreau. (arXiv:2012.15458v2 [math.OC] UPDATED)
    The notion of a Moreau envelope is central to the analysis of first-order optimization algorithms for machine learning. Yet, it has not been developed and extended to be applied to a deep network and, more broadly, to a machine learning system with a differentiable programming implementation. We define a compositional calculus adapted to Moreau envelopes and show how to integrate it within differentiable programming. The proposed framework casts in a mathematical optimization framework several variants of gradient back-propagation related to the idea of the propagation of virtual targets.
    Weather2vec: Representation Learning for Causal Inference with Non-Local Confounding in Air Pollution and Climate Studies. (arXiv:2209.12316v2 [cs.LG] UPDATED)
    Estimating the causal effects of a spatially-varying intervention on a spatially-varying outcome may be subject to non-local confounding (NLC), a phenomenon that can bias estimates when the treatments and outcomes of a given unit are dictated in part by the covariates of other nearby units. In particular, NLC is a challenge for evaluating the effects of environmental policies and climate events on health-related outcomes such as air pollution exposure. This paper first formalizes NLC using the potential outcomes framework, providing a comparison with the related phenomenon of causal interference. Then, it proposes a broadly applicable framework, termed "weather2vec", that uses the theory of balancing scores to learn representations of non-local information into a scalar or vector defined for each observational unit, which is subsequently used to adjust for confounding in conjunction with causal inference methods. The framework is evaluated in a simulation study and two case studies on air pollution where the weather is an (inherently regional) known confounder.
    Causal, Bayesian, & Non-parametric Modeling of the SARS-CoV-2 Viral Load Distribution vs. Patient's Age. (arXiv:2105.13483v2 [stat.AP] UPDATED)
    The viral load of patients infected with SARS-CoV-2 varies on logarithmic scales and possibly with age. Controversial claims have been made in the literature regarding whether the viral load distribution actually depends on the age of the patients. Such a dependence would have implications for the COVID-19 spreading mechanism, the age-dependent immune system reaction, and thus for policymaking. We hereby develop a method to analyze viral-load distribution data as a function of the patients' age within a flexible, non-parametric, hierarchical, Bayesian, and causal model. The causal nature of the developed reconstruction additionally allows to test for bias in the data. This could be due to, e.g., bias in patient-testing and data collection or systematic errors in the measurement of the viral load. We perform these tests by calculating the Bayesian evidence for each implied possible causal direction. The possibility of testing for bias in data collection and identifying causal directions can be very useful in other contexts as well. For this reason we make our model freely available. When applied to publicly available age and SARS-CoV-2 viral load data, we find a statistically significant increase in the viral load with age, but only for one of the two analyzed datasets. If we consider this dataset, and based on the current understanding of viral load's impact on patients' infectivity, we expect a non-negligible difference in the infectivity of different age groups. This difference is nonetheless too small to justify considering any age group as noninfectious.
    Optimal high-dimensional and nonparametric distributed testing under communication constraints. (arXiv:2202.00968v3 [math.ST] UPDATED)
    We derive minimax testing errors in a distributed framework where the data is split over multiple machines and their communication to a central machine is limited to $b$ bits. We investigate both the $d$- and infinite-dimensional signal detection problem under Gaussian white noise. We also derive distributed testing algorithms reaching the theoretical lower bounds. Our results show that distributed testing is subject to fundamentally different phenomena that are not observed in distributed estimation. Among our findings, we show that testing protocols that have access to shared randomness can perform strictly better in some regimes than those that do not. We also observe that consistent nonparametric distributed testing is always possible, even with as little as $1$-bit of communication and the corresponding test outperforms the best local test using only the information available at a single local machine. Furthermore, we also derive adaptive nonparametric distributed testing strategies and the corresponding theoretical lower bounds.
    Random Feature Models for Learning Interacting Dynamical Systems. (arXiv:2212.05591v1 [cs.LG])
    Particle dynamics and multi-agent systems provide accurate dynamical models for studying and forecasting the behavior of complex interacting systems. They often take the form of a high-dimensional system of differential equations parameterized by an interaction kernel that models the underlying attractive or repulsive forces between agents. We consider the problem of constructing a data-based approximation of the interacting forces directly from noisy observations of the paths of the agents in time. The learned interaction kernels are then used to predict the agents' behavior over a longer time interval. The approximation developed in this work uses a randomized feature algorithm and a sparse randomized feature approach. Sparsity-promoting regression provides a mechanism for pruning the randomly generated features, which was observed to be beneficial when one has limited data, in particular, leading to less overfitting than other approaches. In addition, imposing sparsity reduces the kernel evaluation cost, which significantly lowers the simulation cost for forecasting the multi-agent systems. Our method is applied to various examples, including first-order systems with homogeneous and heterogeneous interactions, second-order homogeneous systems, and a new sheep swarming system.  ( 2 min )
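    A minimal sketch of the generic sparse random feature regression that the abstract builds on (not the paper's estimator): random cosine features pruned by $\ell_1$ regression; the target function here is a hypothetical stand-in for an interaction kernel:

        import numpy as np
        from sklearn.linear_model import Lasso

        def random_features(X, n_feats=300, rng=np.random.default_rng(0)):
            """Random cosine features; l1 regression then prunes most of them."""
            W = rng.normal(size=(X.shape[1], n_feats))
            b = rng.uniform(0, 2 * np.pi, size=n_feats)
            return np.cos(X @ W + b)

        X = np.random.randn(200, 3)              # e.g., pairwise displacement vectors
        y = np.exp(-np.linalg.norm(X, axis=1))   # stand-in for an interaction kernel
        lasso = Lasso(alpha=1e-3).fit(random_features(X), y)
        print((np.abs(lasso.coef_) > 1e-6).sum(), "of", lasso.coef_.size, "features kept")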
    What Makes A Good Fisherman? Linear Regression under Self-Selection Bias. (arXiv:2205.03246v2 [math.ST] UPDATED)
    In the classical setting of self-selection, the goal is to learn $k$ models simultaneously from observations $(x^{(i)}, y^{(i)})$ where $y^{(i)}$ is the output of one of $k$ underlying models on input $x^{(i)}$. In contrast to mixture models, where we observe the output of a randomly selected model, here the observed model depends on the outputs themselves, and is determined by some known selection criterion. For example, we might observe the highest output, the smallest output, or the median output of the $k$ models. In known-index self-selection, the identity of the observed model output is observable; in unknown-index self-selection, it is not. Self-selection has a long history in Econometrics and applications in various theoretical and applied fields, including treatment effect estimation, imitation learning, learning from strategically reported data, and learning from markets at disequilibrium. In this work, we present the first computationally and statistically efficient estimation algorithms for the most standard setting of this problem where the models are linear. In the known-index case, we require poly$(1/\varepsilon, k, d)$ sample and time complexity to estimate all model parameters to accuracy $\varepsilon$ in $d$ dimensions, and can accommodate quite general selection criteria. In the more challenging unknown-index case, even the identifiability of the linear models (from infinitely many samples) was not known. We show three results in this case for the commonly studied $\max$ self-selection criterion: (1) we show that the linear models are indeed identifiable, (2) for general $k$ we provide an algorithm with poly$(d) \exp(\text{poly}(k))$ sample and time complexity to estimate the regression parameters up to error $1/\text{poly}(k)$, and (3) for $k = 2$ we provide an algorithm for any error $\varepsilon$ and poly$(d, 1/\varepsilon)$ sample and time complexity.
    Multi-Dimensional Self Attention based Approach for Remaining Useful Life Estimation. (arXiv:2212.05772v1 [cs.LG])
    Remaining Useful Life (RUL) estimation plays a critical role in Prognostics and Health Management (PHM). Traditional machine health maintenance systems are often costly, require substantial prior expertise, and are difficult to fit into highly complex and changing industrial scenarios. With the widespread deployment of sensors on industrial equipment, building the Industrial Internet of Things (IIoT) to interconnect these devices has become an inexorable trend in the development of the digital factory. By feeding a device's real-time operational data collected via the IIoT into a RUL prediction algorithm, the PHM system can develop proactive maintenance measures for the device, thus reducing maintenance costs and decreasing failure times during operation. This paper studies remaining useful life prediction models for multi-sensor devices in the IIoT scenario. We investigated the mainstream RUL prediction models and summarized the basic steps of RUL prediction modeling in this scenario. On this basis, a data-driven approach for RUL estimation is proposed in this paper. It employs a Multi-Head Attention Mechanism to fuse the multi-dimensional time-series data output by multiple sensors, in which attention on features is used to capture the interactions between features and attention on sequences is used to learn the weights of time steps. Then, a Long Short-Term Memory Network is applied to learn the features of the time series. We evaluate the proposed model on two benchmark datasets (C-MAPSS and PHM08), and the results demonstrate that it outperforms state-of-the-art models. Moreover, through the interpretability of the multi-head attention mechanism, the proposed model can provide a preliminary explanation of engine degradation. Therefore, this approach is promising for predictive maintenance in IIoT scenarios.
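    One plausible reading of this architecture, reconstructed from the abstract alone, is sketched below in PyTorch; the layer sizes, head count, and pooling choices are our assumptions rather than the authors' code.

```python
import torch
import torch.nn as nn

class AttnLSTMRUL(nn.Module):
    """Hypothetical reconstruction: attention on features, then on time, then LSTM."""
    def __init__(self, n_sensors=14, d_model=32):
        super().__init__()
        self.embed = nn.Linear(1, d_model)                 # per-sensor scalar -> vector
        self.feat_attn = nn.MultiheadAttention(d_model, 4, batch_first=True)
        self.seq_score = nn.Linear(d_model, 1)             # scores each time step
        self.lstm = nn.LSTM(d_model, 64, batch_first=True)
        self.head = nn.Linear(64, 1)

    def forward(self, x):                                  # x: (B, T, S)
        B, T, S = x.shape
        f = self.embed(x.reshape(B * T, S, 1))             # sensors as tokens per step
        f, _ = self.feat_attn(f, f, f)                     # feature-feature interactions
        h = f.mean(dim=1).reshape(B, T, -1)                # fuse sensors per time step
        a = torch.softmax(self.seq_score(h), dim=1)        # learned weights over time
        out, _ = self.lstm(h * a)                          # temporal feature learning
        return self.head(out[:, -1])                       # scalar RUL estimate

model = AttnLSTMRUL()
print(model(torch.randn(8, 30, 14)).shape)                 # torch.Size([8, 1])
```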
    State-Augmented Learnable Algorithms for Resource Management in Wireless Networks. (arXiv:2207.02242v2 [cs.LG] UPDATED)
    We consider resource management problems in multi-user wireless networks, which can be cast as optimizing a network-wide utility function, subject to constraints on the long-term average performance of users across the network. We propose a state-augmented algorithm for solving the aforementioned radio resource management (RRM) problems, where, alongside the instantaneous network state, the RRM policy takes as input the set of dual variables corresponding to the constraints, which evolve depending on how much the constraints are violated during execution. We theoretically show that the proposed state-augmented algorithm leads to feasible and near-optimal RRM decisions. Moreover, focusing on the problem of wireless power control using graph neural network (GNN) parameterizations, we demonstrate the superiority of the proposed RRM algorithm over baseline methods across a suite of numerical experiments.
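    The core mechanism lends itself to a compact sketch: dual variables grow while long-term constraints are violated, and the policy conditions on them alongside the instantaneous state. The toy softmax policy, rate model, and constants below are placeholders for the paper's GNN parameterization.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, eta, f_min = 4, 0.1, 1.0              # f_min: per-user minimum-rate constraint
lam = np.zeros(n_users)                        # dual variables, one per constraint

def policy(state, lam):
    # Stand-in for a learned policy pi(state, lam): allocate more power to
    # users whose constraints are more violated (larger duals).
    p = np.exp(lam) / np.exp(lam).sum()
    return p * state                           # toy power allocation

for t in range(1000):
    state = rng.rayleigh(size=n_users)         # instantaneous channel state
    rates = np.log1p(policy(state, lam))       # toy achieved rates
    lam = np.maximum(lam + eta * (f_min - rates), 0.0)   # dual ascent on violations

print("final duals:", np.round(lam, 2))
```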
    Estimators of Entropy and Information via Inference in Probabilistic Models. (arXiv:2202.12363v4 [stat.ML] UPDATED)
    Estimating information-theoretic quantities such as entropy and mutual information is central to many problems in statistics and machine learning, but challenging in high dimensions. This paper presents estimators of entropy via inference (EEVI), which deliver upper and lower bounds on many information quantities for arbitrary variables in a probabilistic generative model. These estimators use importance sampling with proposal distribution families that include amortized variational inference and sequential Monte Carlo, which can be tailored to the target model and used to squeeze true information values with high accuracy. We present several theoretical properties of EEVI and demonstrate scalability and efficacy on two problems from the medical domain: (i) in an expert system for diagnosing liver disorders, we rank medical tests according to how informative they are about latent diseases, given a pattern of observed symptoms and patient attributes; and (ii) in a differential equation model of carbohydrate metabolism, we find optimal times to take blood glucose measurements that maximize information about a diabetic patient's insulin sensitivity, given their meal and medication schedule.
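    A minimal illustration of one ingredient behind such sandwich bounds (not EEVI itself): for any proposal $q$, the Monte Carlo estimate of the cross-entropy $-\mathbb{E}_p[\log q(x)]$ upper-bounds the entropy $H(p)$, and the bound tightens as $q$ approaches $p$.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
sigma_p = 2.0
true_H = 0.5 * np.log(2 * np.pi * np.e * sigma_p**2)   # entropy of N(0, sigma_p^2)

x = rng.normal(0.0, sigma_p, size=100_000)             # samples from the target p
for s in (4.0, 2.5, 2.0):                              # increasingly good proposals q
    upper = -np.mean(norm(0.0, s).logpdf(x))           # Monte Carlo cross-entropy
    print(f"q scale={s}: bound={upper:.4f} >= H={true_H:.4f}")
```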
    Double Robustness for Complier Parameters and a Semiparametric Test for Complier Characteristics. (arXiv:1909.05244v7 [stat.ML] UPDATED)
    We propose a semiparametric test to evaluate (i) whether different instruments induce subpopulations of compliers with the same observable characteristics on average, and (ii) whether compliers have observable characteristics that are the same as the full population on average. The test is a flexible robustness check for the external validity of instruments. We use it to reinterpret the difference in LATE estimates that Angrist and Evans (1998) obtain when using different instrumental variables. To justify the test, we characterize the doubly robust moment for Abadie (2003)'s class of complier parameters, and we analyze a machine learning update to $\kappa$ weighting.
    Distributional regression and its evaluation with the CRPS: Bounds and convergence of the minimax risk. (arXiv:2205.04360v4 [math.ST] UPDATED)
    The theoretical advances on the properties of scoring rules over the past decades have broadened the use of scoring rules in probabilistic forecasting. In meteorological forecasting, statistical postprocessing techniques are essential to improve the forecasts made by deterministic physical models. Numerous state-of-the-art statistical postprocessing techniques are based on distributional regression evaluated with the Continuous Ranked Probability Score (CRPS). However, theoretical studies of such evaluation with the CRPS have solely considered the unconditional framework (i.e., without covariates) and infinite sample sizes. We extend these results and study the rate of convergence in terms of CRPS of distributional regression methods. We find the optimal minimax rate of convergence for a given class of distributions and show that the k-nearest neighbor method and the kernel method reach this optimal minimax rate.
    Generalization in Deep Learning. (arXiv:1710.05468v7 [stat.ML] UPDATED)
    This paper provides theoretical insights into why and how deep learning can generalize well, despite its large capacity, complexity, possible algorithmic instability, nonrobustness, and sharp minima, responding to an open question in the literature. We also discuss approaches to provide non-vacuous generalization guarantees for deep learning. Based on theoretical observations, we propose new open problems and discuss the limitations of our results.
    Statistical guarantees for sparse deep learning. (arXiv:2212.05427v1 [cs.LG])
    Neural networks are becoming increasingly popular in applications, but our mathematical understanding of their potential and limitations is still limited. In this paper, we further this understanding by developing statistical guarantees for sparse deep learning. In contrast to previous work, we consider different types of sparsity, such as few active connections, few active nodes, and other norm-based types of sparsity. Moreover, our theories cover important aspects that previous theories have neglected, such as multiple outputs, regularization, and l2-loss. The guarantees have a mild dependence on network widths and depths, which means that they support the application of sparse but wide and deep networks from a statistical perspective. Some of the concepts and tools that we use in our derivations are uncommon in deep learning and, hence, might be of additional interest.
  • Open

    Text Mining-Based Patent Analysis for Automated Rule Checking in AEC. (arXiv:2212.05891v1 [cs.IR])
    Automated rule checking (ARC), which is expected to improve the efficiency of the compliance checking process in the architecture, engineering, and construction (AEC) industry, is gaining increasing attention. Shedding light on ARC application hotspots and forecasting its trends is useful to related research and drives innovation. Therefore, this study takes patents from the Derwent Innovations Index (DII) and China National Knowledge Infrastructure (CNKI) databases as data sources and carries out a three-step analysis: (1) quantitative characterization (i.e., annual distribution analysis) of the patents, (2) identification of ARC topics using latent Dirichlet allocation (LDA), and (3) SNA-based co-occurrence analysis of ARC topics. The results show that the research hotspots and trends of Chinese and English patents differ. The contributions of this study are threefold: (1) an approach to comprehensive patent analysis that integrates multiple text mining methods (i.e., SNA and LDA) is introduced; (2) the application hotspots and development trends of ARC are reviewed based on patent analysis; and (3) a signpost for technological development and innovation in ARC is provided.
    Perspectives of Non-Expert Users on Cyber Security and Privacy: An Analysis of Online Discussions on Twitter. (arXiv:2206.02156v2 [cs.CR] UPDATED)
    Current research on users' perspectives of cyber security and privacy related to traditional and smart devices at home is very active, but the focus is often more on specific modern devices such as mobile and smart IoT devices in a home context. In addition, most studies were based on smaller-scale empirical work such as online surveys and interviews. We endeavour to fill these research gaps by conducting a larger-scale study based on a real-world dataset of 413,985 tweets posted by non-expert users on Twitter in six months of three consecutive years (January and February in 2019, 2020 and 2021). Two machine learning-based classifiers were developed to identify the 413,985 tweets. We analysed this dataset to understand non-expert users' cyber security and privacy perspectives, including the yearly trend and the impact of the COVID-19 pandemic. We applied topic modelling, sentiment analysis and qualitative analysis of selected tweets in the dataset, leading to various interesting findings. For instance, we observed a 54% increase in non-expert users' tweets on cyber security and/or privacy related topics in 2021, compared to before the start of global COVID-19 lockdowns (January 2019 to February 2020). We also observed an increased level of help-seeking tweets during the COVID-19 pandemic. Our analysis revealed a diverse range of topics discussed by non-expert users across the three years, including VPNs, Wi-Fi, smartphones, laptops, smart home devices, financial security, and security and privacy issues involving different stakeholders. Overall negative sentiment was observed across almost all topics non-expert users discussed on Twitter in all three years. Our results confirm the multi-faceted nature of non-expert users' perspectives on cyber security and privacy and call for more holistic, comprehensive and nuanced research on different facets of such perspectives.
    Physics-Informed Model-Based Reinforcement Learning. (arXiv:2212.02179v2 [cs.LG] UPDATED)
    We apply reinforcement learning (RL) to robotics. One of the drawbacks of traditional RL algorithms has been their poor sample efficiency. One approach to improve the sample efficiency is model-based RL. In our model-based RL algorithm, we learn a model of the environment, use it to generate imaginary trajectories and backpropagate through them to update the policy, exploiting the differentiability of the model. Intuitively, learning more accurate models should lead to better performance. Recently, there has been growing interest in developing better deep neural network based dynamics models for physical systems, through better inductive biases. We focus on robotic systems undergoing rigid body motion. We compare two versions of our model-based RL algorithm, one which uses a standard deep neural network based dynamics model and the other which uses a much more accurate, physics-informed neural network based dynamics model. We show that, in model-based RL, model accuracy mainly matters in environments that are sensitive to initial conditions. In these environments, the physics-informed version of our algorithm achieves significantly better average-return and sample efficiency. In environments that are not sensitive to initial conditions, both versions of our algorithm achieve similar average-return, while the physics-informed version achieves better sample efficiency. We measure the sensitivity to initial conditions using the finite-time maximal Lyapunov exponent. We also show that, in challenging environments, where we need a lot of samples to learn, physics-informed model-based RL can achieve better average-return than state-of-the-art model-free RL algorithms such as Soft Actor-Critic, by generating accurate imaginary data.
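    The update at the heart of this approach can be sketched in a few lines: roll a differentiable learned model forward from a batch of states and backpropagate the imagined return into the policy parameters. The tiny linear dynamics model and policy below are stand-ins, not the paper's networks.

```python
import torch
import torch.nn as nn

state_dim, act_dim, horizon = 4, 2, 10
dynamics = nn.Linear(state_dim + act_dim, state_dim)   # learned model f(s, a)
policy = nn.Linear(state_dim, act_dim)
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

def reward(s):                                  # toy reward: stay near the origin
    return -s.pow(2).sum(dim=-1)

s = torch.randn(32, state_dim)                  # batch of start states
ret = 0.0
for _ in range(horizon):                        # imaginary trajectory
    a = torch.tanh(policy(s))
    s = dynamics(torch.cat([s, a], dim=-1))     # differentiable transition
    ret = ret + reward(s).mean()

opt.zero_grad()
(-ret).backward()                               # ascend the imagined return
opt.step()
```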
    Fairness Reprogramming. (arXiv:2209.10222v4 [cs.LG] UPDATED)
    Despite a surge of recent advances in promoting machine learning (ML) fairness, existing mainstream approaches mostly require retraining or finetuning the entire weights of the neural network to meet the fairness criteria. However, this is often infeasible in practice for large-scale trained models due to large computational and storage costs, low data efficiency, and model privacy issues. In this paper, we propose a new generic fairness learning paradigm, called FairReprogram, which incorporates the model reprogramming technique. Specifically, FairReprogram considers the case where models cannot be changed and appends to the input a set of perturbations, called the fairness trigger, which is tuned towards the fairness criteria under a min-max formulation. We further introduce an information-theoretic framework that explains why and under what conditions fairness goals can be achieved using the fairness trigger. We show both theoretically and empirically that the fairness trigger can effectively obscure demographic biases in the output prediction of fixed ML models by providing false demographic information that hinders the model from utilizing the correct demographic information to make the prediction. Extensive experiments on both NLP and CV datasets demonstrate that our method can achieve better fairness improvements than retraining-based methods with far less data dependency under two widely-used fairness criteria. Code is available at https://github.com/UCSB-NLP-Chang/Fairness-Reprogramming.git.
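    A stripped-down sketch of the trigger optimization follows; for brevity it replaces the paper's min-max adversary with a direct demographic-parity penalty, and all data, shapes, and coefficients are synthetic placeholders.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d = 10
clf = nn.Sequential(nn.Linear(d, 16), nn.ReLU(), nn.Linear(16, 1))
for p in clf.parameters():
    p.requires_grad_(False)                    # the model cannot be changed

x = torch.randn(512, d)
y = (x[:, 0] > 0).float()                      # synthetic task labels
group = (x[:, 1] > 0).float()                  # synthetic sensitive attribute

trigger = torch.zeros(d, requires_grad=True)   # the fairness trigger
opt = torch.optim.Adam([trigger], lr=1e-2)
bce = nn.BCEWithLogitsLoss()

for step in range(200):
    logits = clf(x + trigger).squeeze(-1)      # trigger applied to every input
    p_hat = torch.sigmoid(logits)
    gap = (p_hat[group == 1].mean() - p_hat[group == 0].mean()).abs()
    loss = bce(logits, y) + 5.0 * gap          # utility plus fairness penalty
    opt.zero_grad(); loss.backward(); opt.step()

print("final parity gap:", gap.item())
```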
    Extra-Newton: A First Approach to Noise-Adaptive Accelerated Second-Order Methods. (arXiv:2211.01832v2 [math.OC] UPDATED)
    This work proposes a universal and adaptive second-order method for minimizing second-order smooth, convex functions. Our algorithm achieves $O(\sigma / \sqrt{T})$ convergence when the oracle feedback is stochastic with variance $\sigma^2$, and improves its convergence to $O(1 / T^3)$ with deterministic oracles, where $T$ is the number of iterations. Our method also interpolates between these rates without knowing the nature of the oracle a priori, which is enabled by a parameter-free adaptive step-size that requires no knowledge of the smoothness modulus, the variance bound, or the diameter of the constrained set. To our knowledge, this is the first universal algorithm with such global guarantees within the second-order optimization literature.
    Systematic Generalization and Emergent Structures in Transformers Trained on Structured Tasks. (arXiv:2210.00400v2 [cs.LG] UPDATED)
    Transformer networks have seen great success in natural language processing and machine vision, where task objectives such as next word prediction and image classification benefit from nuanced context sensitivity across high-dimensional inputs. However, there is an ongoing debate about how and when transformers can acquire highly structured behavior and achieve systematic generalization. Here, we explore how well a causal transformer can perform a set of algorithmic tasks, including copying, sorting, and hierarchical compositions of these operations. We demonstrate strong generalization to sequences longer than those used in training by replacing the standard positional encoding typically used in transformers with labels arbitrarily paired with items in the sequence. We search for the layer and head configuration sufficient to solve these tasks, then probe for signs of systematic processing in latent representations and attention patterns. We show that two-layer transformers learn reliable solutions to multi-level problems, develop signs of task decomposition, and encode input items in a way that encourages the exploitation of shared computation across related tasks. These results provide key insights into how attention layers support structured computation both within a task and across multiple tasks.
    Isotropic Gaussian Processes on Finite Spaces of Graphs. (arXiv:2211.01689v2 [stat.ML] UPDATED)
    We propose a principled way to define Gaussian process priors on various sets of unweighted graphs: directed or undirected, with or without loops. We endow each of these sets with a geometric structure, inducing the notions of closeness and symmetries, by turning them into a vertex set of an appropriate metagraph. Building on this, we describe the class of priors that respect this structure and are analogous to the Euclidean isotropic processes, like squared exponential or Mat\'ern. We propose an efficient computational technique for the ostensibly intractable problem of evaluating these priors' kernels, making such Gaussian processes usable within the usual toolboxes and downstream applications. We go further to consider sets of equivalence classes of unweighted graphs and define the appropriate versions of priors thereon. We prove a hardness result, showing that in this case, exact kernel computation cannot be performed efficiently. However, we propose a simple Monte Carlo approximation for handling moderately sized cases. Inspired by applications in chemistry, we illustrate the proposed techniques on a real molecular property prediction task in the small data regime.
    Pishgu: Universal Path Prediction Network Architecture for Real-time Cyber-physical Edge Systems. (arXiv:2210.08057v2 [cs.CV] UPDATED)
    Path prediction is an essential task for many real-world Cyber-Physical Systems (CPS) applications, from autonomous driving and traffic monitoring/management to pedestrian/worker safety. These real-world CPS applications need a robust, lightweight path prediction that can provide a universal network architecture for multiple subjects (e.g., pedestrians and vehicles) from different perspectives. However, most existing algorithms are tailor-made for a unique subject with a specific camera perspective and scenario. This article presents Pishgu, a universal lightweight network architecture, as a robust and holistic solution for path prediction. Pishgu's architecture can adapt to multiple path prediction domains with different subjects (vehicles, pedestrians), perspectives (bird's-eye, high-angle), and scenes (sidewalk, highway). Our proposed architecture captures the inter-dependencies within the subjects in each frame by taking advantage of Graph Isomorphism Networks and the attention module. We separately train and evaluate the efficacy of our architecture on three different CPS domains across multiple perspectives (vehicle bird's-eye view, pedestrian bird's-eye view, and human high-angle view). Pishgu outperforms state-of-the-art solutions in the vehicle bird's-eye view domain by 42% and 61%, and in the pedestrian high-angle view domain by 23% and 22%, in terms of ADE and FDE, respectively. Additionally, we analyze the domain-specific details for various datasets to understand their effect on path prediction and model interpretation. Finally, we report the latency and throughput for all three domains on multiple embedded platforms, showcasing the robustness and adaptability of Pishgu for real-world integration into CPS applications.
    Parameter-Efficient Finetuning of Transformers for Source Code. (arXiv:2212.05901v1 [cs.CL])
    Pretrained Transformers achieve state-of-the-art performance in various code-processing tasks but may be too large to be deployed. As software development tools often incorporate modules for various purposes which may potentially use a single instance of the pretrained model, it appears relevant to utilize parameter-efficient fine-tuning for the pretrained models of code. In this work, we test two widely used approaches, adapters and LoRA, which were initially tested on NLP tasks, on four code-processing tasks. We find that though the efficient fine-tuning approaches may achieve comparable or higher performance than the standard, full, fine-tuning in code understanding tasks, they underperform full fine-tuning in code-generative tasks. These results underline the importance of testing efficient fine-tuning approaches on other domains than NLP and motivate future research in efficient fine-tuning for source code.
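    For concreteness, a generic LoRA-style layer is sketched below; this reflects the standard technique the abstract tests, not the specific ranks or placements used in the paper's experiments.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Generic LoRA sketch: freeze W, train only the low-rank update B @ A."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)            # pretrained weights stay frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable params: {trainable} of {total}")   # only A and B receive gradients
```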
    Lower Bounds for the Total Variation Distance Between Arbitrary Distributions with Given Means and Variances. (arXiv:2212.05820v1 [math.PR])
    For two arbitrary probability measures on $\mathbb{R}^d$ with given means and variances (covariance matrices), we provide lower bounds on their total variation distance.
    On Generalization and Regularization via Wasserstein Distributionally Robust Optimization. (arXiv:2212.05716v1 [cs.LG])
    Wasserstein distributionally robust optimization (DRO) has found success in operations research and machine learning applications as a powerful means to obtain solutions with favourable out-of-sample performances. Two compelling explanations for the success are the generalization bounds derived from Wasserstein DRO and the equivalency between Wasserstein DRO and the regularization scheme commonly applied in machine learning. Existing results on generalization bounds and the equivalency to regularization are largely limited to the setting where the Wasserstein ball is of a certain type and the decision criterion takes certain forms of an expected function. In this paper, we show that by focusing on Wasserstein DRO problems with affine decision rules, it is possible to obtain generalization bounds and the equivalency to regularization in a significantly broader setting where the Wasserstein ball can be of a general type and the decision criterion can be a general measure of risk, i.e., nonlinear in distributions. This allows for accommodating many important classification, regression, and risk minimization applications that have not been addressed to date using Wasserstein DRO. Our results are strong in that the generalization bounds do not suffer from the curse of dimensionality and the equivalency to regularization is exact. As a byproduct, our regularization results broaden considerably the class of Wasserstein DRO models that can be solved efficiently via regularization formulations.
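    For intuition, a textbook special case (much narrower than the paper's general risk-measure setting, and stated here only as a known illustration) makes the equivalency concrete: for an affine loss and a type-1 Wasserstein ball of radius $\varepsilon$ built from a norm $\|\cdot\|$ on $\xi$,

$$\sup_{Q:\, W_1(Q,\hat P_n)\le \varepsilon} \mathbb{E}_Q\big[\langle w, \xi\rangle + b\big] \;=\; \frac{1}{n}\sum_{i=1}^n \langle w, \xi_i\rangle + b + \varepsilon\,\|w\|_*,$$

where $\|\cdot\|_*$ denotes the dual norm, so robustification acts exactly as a norm penalty on $w$.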
    Graph Neural Networks Designed for Different Graph Types: A Survey. (arXiv:2204.03080v3 [cs.LG] UPDATED)
    Graphs are ubiquitous in nature and can therefore serve as models for many practical but also theoretical problems. For this purpose, they can be defined as many different types that suitably reflect the individual contexts of the represented problem. To address cutting-edge problems based on graph data, the research field of Graph Neural Networks (GNNs) has emerged. Despite the field's youth and the speed at which new models are developed, many recent surveys have been published to keep track of them. Nevertheless, no survey has yet gathered which GNNs can process which types of graphs. In this survey, we give a detailed overview of already existing GNNs and, unlike previous surveys, categorize them according to their ability to handle different graph types and properties. We consider GNNs operating on static and dynamic graphs of different structural constitutions, with or without node or edge attributes. Moreover, we distinguish between GNN models for discrete-time or continuous-time dynamic graphs and group the models according to their architecture. We find that there are still graph types that are not or only rarely covered by existing GNN models. We point out where models are missing and give potential reasons for their absence.
    Image-based Artificial Intelligence empowered surrogate model and shape morpher for real-time blank shape optimisation in the hot stamping process. (arXiv:2212.05885v1 [cs.CV])
    As the complexity of modern manufacturing technologies increases, traditional trial-and-error design, which requires iterative and expensive simulations, becomes unreliable and time-consuming. This difficulty is especially significant for the design of hot-stamped safety-critical components, such as ultra-high-strength-steel (UHSS) B-pillars. To reduce design costs and ensure manufacturability, scalar-based Artificial-Intelligence-empowered surrogate modelling (SAISM) has been investigated and implemented, which can allow real-time manufacturability-constrained structural design optimisation. However, SAISM suffers from low accuracy and generalisability, and usually requires a high volume of training samples. To solve this problem, an image-based Artificial-intelligence-empowered surrogate modelling (IAISM) approach is developed in this research, in combination with an auto-decoder-based blank shape generator. The IAISM, which is based on a Mask-Res-SE-U-Net architecture, is trained to predict the full thinning field of the as-formed component given an arbitrary blank shape. Excellent prediction performance of IAISM is achieved with only 256 training samples, which indicates the small-data learning nature of engineering AI tasks using structured data representations. The trained auto-decoder, trained Mask-Res-SE-U-Net, and Adam optimiser are integrated to conduct blank optimisation by modifying the latent vector. The optimiser can rapidly find blank shapes that satisfy manufacturability criteria. As a high-accuracy and generalisable surrogate modelling and optimisation tool, the proposed pipeline is promising to be integrated into a full-chain digital twin to conduct real-time, multi-objective design optimisation.
    Classical Simulation of Variational Quantum Classifiers using Tensor Rings. (arXiv:2201.08878v2 [quant-ph] UPDATED)
    In recent times, Variational Quantum Circuits (VQC) have been widely adopted for different tasks in machine learning such as combinatorial optimization and supervised learning. With the growing interest, it is pertinent to study the boundaries of the classical simulation of VQCs to effectively benchmark the algorithms. Classically simulating VQCs can also provide the quantum algorithms with a better initialization, reducing the amount of quantum resources needed to train the algorithm. This manuscript proposes an algorithm that compresses the quantum state within a circuit using a tensor ring representation, which allows for the implementation of VQC-based algorithms on a classical simulator at a fraction of the usual storage and computational complexity. Using the tensor ring approximation of the input quantum state, we propose a method that applies the parametrized unitary operations while retaining the low-rank structure of the tensor ring corresponding to the transformed quantum state, providing an exponential improvement in storage and computational time in the number of qubits and layers. This approximation is used to implement the tensor ring VQC for supervised learning on the Iris and MNIST datasets, demonstrating performance comparable to that of classical-simulator implementations based on Matrix Product States.
    Weather2vec: Representation Learning for Causal Inference with Non-Local Confounding in Air Pollution and Climate Studies. (arXiv:2209.12316v2 [cs.LG] UPDATED)
    Estimating the causal effects of a spatially-varying intervention on a spatially-varying outcome may be subject to non-local confounding (NLC), a phenomenon that can bias estimates when the treatments and outcomes of a given unit are dictated in part by the covariates of other nearby units. In particular, NLC is a challenge for evaluating the effects of environmental policies and climate events on health-related outcomes such as air pollution exposure. This paper first formalizes NLC using the potential outcomes framework, providing a comparison with the related phenomenon of causal interference. Then, it proposes a broadly applicable framework, termed "weather2vec", that uses the theory of balancing scores to learn representations of non-local information into a scalar or vector defined for each observational unit, which is subsequently used to adjust for confounding in conjunction with causal inference methods. The framework is evaluated in a simulation study and two case studies on air pollution where the weather is an (inherently regional) known confounder.
    Margin Optimal Classification Trees. (arXiv:2210.10567v2 [math.OC] UPDATED)
    In recent years there has been growing attention to interpretable machine learning models which can give explanatory insights on their behavior. Thanks to their interpretability, decision trees have been intensively studied for classification tasks, and due to the remarkable advances in mixed-integer programming (MIP), various approaches have been proposed to formulate the problem of training an Optimal Classification Tree (OCT) as a MIP model. We present a novel mixed-integer quadratic formulation for the OCT problem, which exploits the generalization capabilities of Support Vector Machines for binary classification. Our model, denoted as Margin Optimal Classification Tree (MARGOT), encompasses the use of maximum margin multivariate hyperplanes nested in a binary tree structure. To enhance the interpretability of our approach, we analyse two alternative versions of MARGOT, which include feature selection constraints inducing local sparsity of the hyperplanes. First, MARGOT has been tested on non-linearly separable synthetic datasets in 2-dimensional feature space to provide a graphical representation of the maximum margin approach. Finally, the proposed models have been tested on benchmark datasets from the UCI repository. The MARGOT formulation turns out to be easier to solve than other OCT approaches, and the generated tree better generalizes on new observations. The two interpretable versions are effective in selecting the most relevant features and maintaining good prediction quality.
    A machine learning approach to support decision in insider trading detection. (arXiv:2212.05912v1 [q-fin.ST])
    Identifying market abuse activity from data on investors' trading activity is very challenging, both because of the data volume and because of the low signal-to-noise ratio. Here we propose two complementary unsupervised machine learning methods to support market surveillance aimed at identifying potential insider trading activities. The first one uses clustering to identify, in the vicinity of a price sensitive event such as a takeover bid, discontinuities in the trading activity of an investor with respect to his/her own past trading history and to the present trading activity of his/her peers. The second unsupervised approach aims at identifying (small) groups of investors that act coherently around price sensitive events, pointing to potential insider rings, i.e., groups of synchronised traders displaying strong directional trading in rewarding positions in the period before the price sensitive event. As a case study, we apply our methods to investor-resolved data of Italian stocks around takeover bids.
    Stabilizing Machine Learning Prediction of Dynamics: Noise and Noise-inspired Regularization. (arXiv:2211.05262v2 [cs.LG] UPDATED)
    Recent work has shown that machine learning (ML) models can be trained to accurately forecast the dynamics of unknown chaotic dynamical systems. Short-term predictions of the state evolution and long-term predictions of the statistical patterns of the dynamics (``climate'') can be produced by employing a feedback loop, whereby the model is trained to predict forward one time step, then the model output is used as input for multiple time steps. In the absence of mitigating techniques, however, this technique can result in artificially rapid error growth. In this article, we systematically examine the technique of adding noise to the ML model input during training to promote stability and improve prediction accuracy. Furthermore, we introduce Linearized Multi-Noise Training (LMNT), a regularization technique that deterministically approximates the effect of many small, independent noise realizations added to the model input during training. Our case study uses reservoir computing, a machine-learning method using recurrent neural networks, to predict the spatiotemporal chaotic Kuramoto-Sivashinsky equation. We find that reservoir computers trained with noise or with LMNT produce climate predictions that appear to be indefinitely stable and have a climate very similar to the true system, while reservoir computers trained without regularization are unstable. Compared with other regularization techniques that yield stability in some cases, we find that both short-term and climate predictions from reservoir computers trained with noise or with LMNT are substantially more accurate. Finally, we show that the deterministic aspect of our LMNT regularization facilitates fast hyperparameter tuning when compared to training with noise.
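    A toy version of the noise-injection recipe is sketched below, with the logistic map standing in for the chaotic system and ridge regression on random features replacing the reservoir; it illustrates only the training-with-noise idea, not the paper's LMNT approximation.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 2000
x = np.empty(T); x[0] = 0.4
for t in range(T - 1):
    x[t + 1] = 3.9 * x[t] * (1 - x[t])         # chaotic logistic map

W = rng.normal(size=(200, 1)); b = rng.uniform(0, 2 * np.pi, 200)
feats = lambda u: np.tanh(W @ u[None, :] + b[:, None]).T    # random feature map

def fit(noise_std):
    u = x[:-1] + noise_std * rng.normal(size=T - 1)         # noise-perturbed inputs
    F = feats(u)
    return np.linalg.solve(F.T @ F + 1e-6 * np.eye(200), F.T @ x[1:])

for noise in (0.0, 0.01):
    coef, s, traj = fit(noise), x[-1], []
    for _ in range(500):                        # closed-loop (feedback) prediction
        s = (feats(np.array([s])) @ coef)[0]
        traj.append(s)
    ok = np.all(np.isfinite(traj)) and 0.0 <= min(traj) <= max(traj) <= 1.0
    print(f"input noise {noise}: closed loop stays in [0, 1]? {ok}")
```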
    Bivariate Causal Discovery for Categorical Data via Classification with Optimal Label Permutation. (arXiv:2209.08579v2 [stat.ML] UPDATED)
    Causal discovery for quantitative data has been extensively studied but less is known for categorical data. We propose a novel causal model for categorical data based on a new classification model, termed classification with optimal label permutation (COLP). By design, COLP is a parsimonious classifier, which gives rise to a provably identifiable causal model. A simple learning algorithm via comparing likelihood functions of causal and anti-causal models suffices to learn the causal direction. Through experiments with synthetic and real data, we demonstrate the favorable performance of the proposed COLP-based causal model compared to state-of-the-art methods. We also make available an accompanying R package COLP, which contains the proposed causal discovery algorithm and a benchmark dataset of categorical cause-effect pairs.
    CAT-probing: A Metric-based Approach to Interpret How Pre-trained Models for Programming Language Attend Code Structure. (arXiv:2210.04633v4 [cs.SE] UPDATED)
    Code pre-trained models (CodePTMs) have recently demonstrated significant success in code intelligence. To interpret these models, some probing methods have been applied. However, these methods fail to consider the inherent characteristics of codes. In this paper, to address the problem, we propose a novel probing method CAT-probing to quantitatively interpret how CodePTMs attend code structure. We first denoise the input code sequences based on the token types pre-defined by the compilers to filter those tokens whose attention scores are too small. After that, we define a new metric CAT-score to measure the commonality between the token-level attention scores generated in CodePTMs and the pair-wise distances between corresponding AST nodes. The higher the CAT-score, the stronger the ability of CodePTMs to capture code structure. We conduct extensive experiments to integrate CAT-probing with representative CodePTMs for different programming languages. Experimental results show the effectiveness of CAT-probing in CodePTM interpretation. Our codes and data are publicly available at https://github.com/nchen909/CodeAttention.
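    A back-of-the-envelope sketch of such a comparison is given below; the paper's exact CAT-score definition may differ, and the attention matrix and AST distances here are random placeholders for illustration.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n = 12                                          # tokens surviving the denoising step
ast_dist = rng.integers(1, 6, size=(n, n)).astype(float)
ast_dist = (ast_dist + ast_dist.T) / 2          # symmetric pairwise AST distances
np.fill_diagonal(ast_dist, 0)
attn = rng.dirichlet(np.ones(n), size=n)        # placeholder attention rows

iu = np.triu_indices(n, k=1)                    # off-diagonal token pairs
rho, _ = spearmanr(attn[iu], -ast_dist[iu])     # attention vs. AST closeness
print(f"CAT-style score (rank correlation): {rho:.3f}")
```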
    ViTCoD: Vision Transformer Acceleration via Dedicated Algorithm and Accelerator Co-Design. (arXiv:2210.09573v2 [cs.LG] UPDATED)
    Vision Transformers (ViTs) have achieved state-of-the-art performance on various vision tasks. However, ViTs' self-attention module is still arguably a major bottleneck, limiting their achievable hardware efficiency. Meanwhile, existing accelerators dedicated to NLP Transformers are not optimal for ViTs. This is because there is a large difference between ViTs and NLP Transformers: ViTs have a relatively fixed number of input tokens, whose attention maps can be pruned by up to 90% even with fixed sparse patterns; while NLP Transformers need to handle input sequences of varying numbers of tokens and rely on on-the-fly predictions of dynamic sparse attention patterns for each input to achieve a decent sparsity (e.g., >=50%). To this end, we propose a dedicated algorithm and accelerator co-design framework dubbed ViTCoD for accelerating ViTs. Specifically, on the algorithm level, ViTCoD prunes and polarizes the attention maps to have either denser or sparser fixed patterns for regularizing two levels of workloads without hurting the accuracy, largely reducing the attention computations while leaving room for alleviating the remaining dominant data movements; on top of that, we further integrate a lightweight and learnable auto-encoder module to enable trading the dominant high-cost data movements for lower-cost computations. On the hardware level, we develop a dedicated accelerator to simultaneously coordinate the enforced denser/sparser workloads and encoder/decoder engines for boosted hardware utilization. Extensive experiments and ablation studies validate that ViTCoD largely reduces the dominant data movement costs, achieving speedups of up to 235.3x, 142.9x, 86.0x, 10.1x, and 6.8x over general computing platforms CPUs, EdgeGPUs, GPUs, and prior-art Transformer accelerators SpAtten and Sanger under an attention sparsity of 90%, respectively.
    Attri-VAE: attribute-based interpretable representations of medical images with variational autoencoders. (arXiv:2203.10417v3 [eess.IV] UPDATED)
    Deep learning (DL) methods where interpretability is intrinsically considered as part of the model are required to better understand the relationship of clinical and imaging-based attributes with DL outcomes, thus facilitating their use in the reasoning behind medical decisions. Latent space representations built with variational autoencoders (VAE) do not ensure individual control of data attributes. Attribute-based methods enforcing attribute disentanglement have been proposed in the literature for classical computer vision tasks in benchmark data. In this paper, we propose a VAE approach, the Attri-VAE, that includes an attribute regularization term to associate clinical and medical imaging attributes with different regularized dimensions in the generated latent space, enabling a better-disentangled interpretation of the attributes. Furthermore, the generated attention maps explained the attribute encoding in the regularized latent space dimensions. Using the Attri-VAE approach we analyzed healthy and myocardial infarction patients with clinical, cardiac morphology, and radiomics attributes. The proposed model provided an excellent trade-off between reconstruction fidelity, disentanglement, and interpretability, outperforming state-of-the-art VAE approaches according to several quantitative metrics. The resulting latent space allowed the generation of realistic synthetic data in the trajectory between two distinct input samples or along a specific attribute dimension to better interpret changes between different cardiac conditions.
    Differentiable Programming \`a la Moreau. (arXiv:2012.15458v2 [math.OC] UPDATED)
    The notion of a Moreau envelope is central to the analysis of first-order optimization algorithms for machine learning. Yet, it has not been developed and extended to be applied to a deep network and, more broadly, to a machine learning system with a differentiable programming implementation. We define a compositional calculus adapted to Moreau envelopes and show how to integrate it within differentiable programming. The proposed framework casts in a mathematical optimization framework several variants of gradient back-propagation related to the idea of the propagation of virtual targets.
    A law of adversarial risk, interpolation, and label noise. (arXiv:2207.03933v2 [stat.ML] UPDATED)
    In supervised learning, it has been shown that label noise in the data can be interpolated without penalties on test accuracy. We show that interpolating label noise induces adversarial vulnerability, and prove the first theorem showing the relationship between label noise and adversarial risk for any data distribution. Our results are almost tight if we do not make any assumptions on the inductive bias of the learning algorithm. We then investigate how different components of this problem affect this result, including properties of the distribution. We also discuss non-uniform label noise distributions and prove a new theorem showing that uniform label noise induces nearly as large an adversarial risk as the worst poisoning with the same noise rate. Then, we provide theoretical and empirical evidence that uniform label noise is more harmful than typical real-world label noise. Finally, we show how inductive biases amplify the effect of label noise and argue the need for future work in this direction.
    OmniXAI: A Library for Explainable AI. (arXiv:2206.01612v8 [cs.LG] UPDATED)
    We introduce OmniXAI (short for Omni eXplainable AI), an open-source Python library for eXplainable AI (XAI), which offers omni-way explainable AI capabilities and various interpretable machine learning techniques to address the pain points of understanding and interpreting the decisions made by machine learning (ML) in practice. OmniXAI aims to be a one-stop comprehensive library that makes explainable AI easy for data scientists, ML researchers and practitioners who need explanations for various types of data, models and explanation methods at different stages of the ML process (data exploration, feature engineering, model development, evaluation, decision-making, etc.). In particular, our library includes a rich family of explanation methods integrated in a unified interface, which supports multiple data types (tabular data, images, texts, time-series), multiple types of ML models (traditional ML in Scikit-learn and deep learning models in PyTorch/TensorFlow), and a range of diverse explanation methods including "model-specific" and "model-agnostic" ones (such as feature-attribution explanation, counterfactual explanation, gradient-based explanation, etc.). For practitioners, the library provides an easy-to-use unified interface to generate explanations for their applications by writing only a few lines of code, as well as a GUI dashboard for visualization of different explanations for more insights about decisions. In this technical report, we present OmniXAI's design principles, system architectures, and major functionalities, and also demonstrate several example use cases across different types of data, tasks, and models.
    A Robust and Low Complexity Deep Learning Model for Remote Sensing Image Classification. (arXiv:2211.02820v2 [cs.CV] UPDATED)
    In this paper, we present a robust and low complexity deep learning model for Remote Sensing Image Classification (RSIC), the task of identifying the scene of a remote sensing image. In particular, we first evaluate different low complexity and benchmark deep neural networks: MobileNetV1, MobileNetV2, NASNetMobile, and EfficientNetB0, all of which have fewer than 5 million (M) trainable parameters. After identifying the best network architecture, we further improve the network performance by applying attention schemes to multiple feature maps extracted from middle layers of the network. To counteract the increase in model footprint caused by the attention schemes, we apply a quantization technique to keep the memory occupation under 20 MB. By conducting extensive experiments on the benchmark dataset NWPU-RESISC45, we achieve a robust and low-complexity model that is very competitive with state-of-the-art systems and has potential for real-life applications on edge devices.
    DeepCut: Unsupervised Segmentation using Graph Neural Networks Clustering. (arXiv:2212.05853v1 [cs.CV])
    Image segmentation is a fundamental task in computer vision. Data annotation for training supervised methods can be labor-intensive, motivating unsupervised methods. Some existing approaches extract deep features from pre-trained networks and build a graph to apply classical clustering methods (e.g., $k$-means and normalized-cuts) as a post-processing stage. These techniques reduce the high-dimensional information encoded in the features to pair-wise scalar affinities. In this work, we replace classical clustering algorithms with a lightweight Graph Neural Network (GNN) trained to achieve the same clustering objective function. However, in contrast to existing approaches, we feed the GNN not only the pair-wise affinities between local image features but also the raw features themselves. Maintaining this connection between the raw features and the clustering goal makes it possible to perform part semantic segmentation implicitly, without requiring additional post-processing steps. We demonstrate how classical clustering objectives can be formulated as self-supervised loss functions for training our image segmentation GNN. Additionally, we use the Correlation-Clustering (CC) objective to perform clustering without defining the number of clusters ($k$-less clustering). We apply the proposed method for object localization, segmentation, and semantic part segmentation tasks, surpassing state-of-the-art performance on multiple benchmarks.
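    A condensed sketch of such a training loop follows; it uses one common differentiable relaxation of normalized cut, and the random features, affinities, and one-layer GNN stand-in are placeholders, so the paper's exact loss and architecture may differ in detail.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n, d, k = 100, 64, 4
feats = torch.randn(n, d)                      # stand-in for deep image features
A = torch.relu(feats @ feats.T)                # pairwise affinities
A.fill_diagonal_(0)
deg = A.sum(dim=1)

gnn = nn.Linear(d, k)                          # minimal GNN stand-in
opt = torch.optim.Adam(gnn.parameters(), lr=1e-2)

A_hat = A / deg.clamp(min=1e-9)[:, None]       # row-normalized propagation
for step in range(300):
    S = torch.softmax(gnn(A_hat @ feats), dim=1)      # soft cluster assignments
    assoc = torch.einsum("ik,ij,jk->k", S, A, S)      # within-cluster affinity
    vol = (deg[:, None] * S).sum(dim=0)               # cluster volumes
    loss = k - (assoc / vol.clamp(min=1e-9)).sum()    # relaxed normalized cut
    opt.zero_grad(); loss.backward(); opt.step()

print("cluster sizes:", S.argmax(dim=1).bincount(minlength=k).tolist())
```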
    Interactive introduction to self-calibrating interfaces. (arXiv:2212.05766v1 [cs.HC])
    This interactive paper aims to provide an intuitive understanding of the self-calibrating interface paradigm. Under this paradigm, you can choose how to use an interface, which can adapt to your preferences on the fly. We introduce a PIN entering task and gradually release constraints, moving from a pre-calibrated interface to a self-calibrating interface while increasing the complexity of input modalities from buttons, to points on a map, to sketches, and finally to spoken words. This is not a traditional research paper with a hypothesis and experimental results to support claims; the research supporting this work has already been done and we refer to it extensively in the later sections. Instead, our aim is to walk you through an intriguing interaction paradigm in small logical steps with supporting illustrations, interactive demonstrations, and videos to reinforce your learning. We designed this paper for the enjoyment of curious minds of any background; it is written in plain English and no prior knowledge is necessary. All demos are available online at openvault.jgrizou.com and linked individually in the paper.
    Towards Antisymmetric Neural Ansatz Separation. (arXiv:2208.03264v2 [cs.LG] UPDATED)
    We study separations between two fundamental models (or \emph{Ans\"atze}) of antisymmetric functions, that is, functions $f$ of the form $f(x_{\sigma(1)}, \ldots, x_{\sigma(N)}) = \text{sign}(\sigma)f(x_1, \ldots, x_N)$, where $\sigma$ is any permutation. These arise in the context of quantum chemistry, and are the basic modeling tool for wavefunctions of Fermionic systems. Specifically, we consider two popular antisymmetric Ans\"atze: the Slater representation, which leverages the alternating structure of determinants, and the Jastrow ansatz, which augments Slater determinants with a product by an arbitrary symmetric function. We construct an antisymmetric function that can be more efficiently expressed in Jastrow form, yet provably cannot be approximated by Slater determinants unless there are exponentially (in $N^2$) many terms. This represents the first explicit quantitative separation between these two Ans\"atze.
    Interpretable Boosted Decision Tree Analysis for the Majorana Demonstrator. (arXiv:2207.10710v3 [physics.data-an] UPDATED)
    The Majorana Demonstrator is a leading experiment searching for neutrinoless double-beta decay with high purity germanium (HPGe) detectors. Machine learning provides a new way to maximize the amount of information provided by these detectors, but its data-driven nature makes it less interpretable compared to traditional analysis. An interpretability study reveals the machine's decision-making logic, allowing us to learn from the machine and feed insights back into the traditional analysis. In this work, we present the first machine learning analysis of the data from the Majorana Demonstrator; this is also the first interpretable machine learning analysis of any germanium detector experiment. Two gradient boosted decision tree models are trained to learn from the data, and a game-theory-based model interpretability study is conducted to understand the origin of the classification power. By learning from data, this analysis recognizes the correlations among reconstruction parameters to further enhance the background rejection performance. By learning from the machine, this analysis reveals the importance of new background categories to reciprocally benefit the standard Majorana analysis. This model is highly compatible with next-generation germanium detector experiments like LEGEND since it can be simultaneously trained on a large number of detectors.
    Globally Gated Deep Linear Networks. (arXiv:2210.17449v2 [cs.LG] UPDATED)
    Recently proposed Gated Linear Networks present a tractable nonlinear network architecture, and exhibit interesting capabilities such as learning with local error signals and reduced forgetting in sequential learning. In this work, we introduce a novel gating architecture, named Globally Gated Deep Linear Networks (GGDLNs), where gating units are shared among all processing units in each layer, thereby decoupling the architectures of the nonlinear but unlearned gatings and the learned linear processing motifs. We derive exact equations for the generalization properties of these networks in the finite-width thermodynamic limit, defined by $P,N\rightarrow\infty$, $P/N\sim O(1)$, where $P$ and $N$ are the training sample size and the network width, respectively. We find that the statistics of the network predictor can be expressed in terms of kernels that undergo shape renormalization through a data-dependent matrix compared to the GP kernels. Our theory accurately captures the behavior of finite-width GGDLNs trained with gradient descent dynamics. We show that kernel shape renormalization gives rise to rich generalization properties w.r.t. network width, depth and L2 regularization amplitude. Interestingly, networks with sufficient gating units behave similarly to standard ReLU networks. Although gatings in the model do not participate in supervised learning, we show the utility of unsupervised learning of the gating parameters. Additionally, our theory allows the evaluation of the network's ability to learn multiple tasks by incorporating task-relevant information into the gating units. In summary, our work is the first exact theoretical solution of learning in a family of nonlinear networks with finite width. The rich and diverse behavior of the GGDLNs suggests that they are helpful, analytically tractable models of learning single and multiple tasks in finite-width nonlinear deep networks.
    Continuous Conditional Generative Adversarial Networks: Novel Empirical Losses and Label Input Mechanisms. (arXiv:2011.07466v8 [cs.CV] UPDATED)
    This work proposes the continuous conditional generative adversarial network (CcGAN), the first generative model for image generation conditional on continuous, scalar conditions (termed regression labels). Existing conditional GANs (cGANs) are mainly designed for categorical conditions (e.g., class labels); conditioning on regression labels is mathematically distinct and raises two fundamental problems: (P1) since there may be very few (even zero) real images for some regression labels, minimizing existing empirical versions of cGAN losses (aka empirical cGAN losses) often fails in practice; (P2) since regression labels are scalar and infinitely many, conventional label input methods are not applicable. The proposed CcGAN solves the above problems, respectively, by (S1) reformulating existing empirical cGAN losses to be appropriate for the continuous scenario; and (S2) proposing a naive label input (NLI) method and an improved label input (ILI) method to incorporate regression labels into the generator and the discriminator. The reformulation in (S1) leads to two novel empirical discriminator losses, termed the hard vicinal discriminator loss (HVDL) and the soft vicinal discriminator loss (SVDL) respectively, and a novel empirical generator loss. The error bounds of a discriminator trained with HVDL and SVDL are derived under mild assumptions in this work. Two new benchmark datasets (RC-49 and Cell-200) and a novel evaluation metric (Sliding Fr\'echet Inception Distance) are also proposed for this continuous scenario. Our experiments on the Circular 2-D Gaussians, RC-49, UTKFace, Cell-200, and Steering Angle datasets show that CcGAN is able to generate diverse, high-quality samples from the image distribution conditional on a given regression label. Moreover, in these experiments, CcGAN substantially outperforms cGAN both visually and quantitatively.
    Generalizing DP-SGD with Shuffling and Batching Clipping. (arXiv:2212.05796v1 [cs.LG])
    Classical differentially private DP-SGD implements individual clipping with random subsampling, which forces a mini-batch SGD approach. We provide a general differentially private algorithmic framework that goes beyond DP-SGD and allows any possible first-order optimizer (e.g., classical SGD and momentum-based SGD approaches) in combination with batch clipping, which clips an aggregate of computed gradients rather than summing clipped gradients (as is done in individual clipping). The framework also admits sampling techniques beyond random subsampling, such as shuffling. Our DP analysis follows the $f$-DP approach and introduces a new proof technique which allows us to also analyse group privacy. In particular, for $E$ epochs of work and groups of size $g$, we show a $\sqrt{g E}$ DP dependency for batch clipping with shuffling. This is much better than the previously anticipated linear dependency in $g$ and is much better than the previously expected square root dependency on the total number of rounds within $E$ epochs, which is generally much more than $\sqrt{E}$.
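    A schematic of a batch-clipping step (generic code, not the authors' framework) is shown below: the mini-batch gradient is aggregated first, clipped once, and then noised, which lets any first-order optimizer, including momentum SGD, consume the result.

```python
import torch
import torch.nn as nn

def dp_batch_clipped_step(model, loss_fn, x, y, opt, C=1.0, sigma=1.0):
    """One DP step with batch clipping: aggregate, clip once, add noise."""
    opt.zero_grad()
    loss_fn(model(x), y).backward()            # aggregate gradient over the batch
    total_norm = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))
    scale = torch.clamp(C / (total_norm + 1e-12), max=1.0)
    for p in model.parameters():               # clip the aggregate, then noise it
        p.grad.mul_(scale).add_(sigma * C * torch.randn_like(p.grad))
    opt.step()                                 # works with any first-order optimizer

model = nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)   # momentum allowed
x, y = torch.randn(32, 10), torch.randn(32, 1)
dp_batch_clipped_step(model, nn.MSELoss(), x, y, opt)
```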
    A Dempster-Shafer approach to trustworthy AI with application to fetal brain MRI segmentation. (arXiv:2204.02779v3 [eess.IV] UPDATED)
    Deep learning models for medical image segmentation can fail unexpectedly and spectacularly for pathological cases and for images acquired at centers different from those of the training images, with labeling errors that violate expert knowledge. Such errors undermine the trustworthiness of deep learning models for medical image segmentation. Mechanisms for detecting and correcting such failures are essential for safely translating this technology into clinics and are likely to be a requirement of future regulations on artificial intelligence (AI). In this work, we propose a trustworthy AI theoretical framework and a practical system that can augment any backbone AI system using a fallback method and a fail-safe mechanism based on Dempster-Shafer theory. Our approach relies on an actionable definition of trustworthy AI. Our method automatically discards the voxel-level labels predicted by the backbone AI that violate expert knowledge and relies on a fallback for those voxels. We demonstrate the effectiveness of the proposed trustworthy AI approach on the largest reported annotated dataset of fetal MRI, consisting of 540 manually annotated fetal brain 3D T2w MRIs from 13 centers. Our trustworthy AI method improves the robustness of a state-of-the-art backbone AI for fetal brain MRIs acquired across various centers and for fetuses with various brain abnormalities.
    Improving Self-Supervised Learning by Characterizing Idealized Representations. (arXiv:2209.06235v2 [cs.LG] UPDATED)
    Despite the empirical successes of self-supervised learning (SSL) methods, it is unclear what characteristics of their representations lead to high downstream accuracies. In this work, we characterize properties that SSL representations should ideally satisfy. Specifically, we prove necessary and sufficient conditions such that for any task invariant to given data augmentations, desired probes (e.g., linear or MLP) trained on that representation attain perfect accuracy. These requirements lead to a unifying conceptual framework for improving existing SSL methods and deriving new ones. For contrastive learning, our framework prescribes simple but significant improvements to previous methods such as using asymmetric projection heads. For non-contrastive learning, we use our framework to derive a simple and novel objective. Our resulting SSL algorithms outperform baselines on standard benchmarks, including SwAV+multicrops on linear probing of ImageNet.
    ALSO: Automotive Lidar Self-supervision by Occupancy estimation. (arXiv:2212.05867v1 [cs.CV])
    We propose a new self-supervised method for pre-training the backbone of deep perception models operating on point clouds. The core idea is to train the model on a pretext task, the reconstruction of the surface on which the 3D points are sampled, and to use the underlying latent vectors as input to the perception head. The intuition is that if the network is able to reconstruct the scene surface, given only sparse input points, then it probably also captures some fragments of semantic information that can be used to boost an actual perception task. This principle has a very simple formulation, which makes it both easy to implement and widely applicable to a large range of 3D sensors and deep networks performing semantic segmentation or object detection. In fact, it supports a single-stream pipeline, as opposed to most contrastive learning approaches, allowing training on limited resources. We conducted extensive experiments on various autonomous driving datasets, involving very different kinds of lidars, for both semantic segmentation and object detection. The results show the effectiveness of our method in learning useful representations without any annotation, compared to existing approaches. Code is available at https://github.com/valeoai/ALSO
    State-Augmented Learnable Algorithms for Resource Management in Wireless Networks. (arXiv:2207.02242v2 [cs.LG] UPDATED)
    We consider resource management problems in multi-user wireless networks, which can be cast as optimizing a network-wide utility function, subject to constraints on the long-term average performance of users across the network. We propose a state-augmented algorithm for solving the aforementioned radio resource management (RRM) problems, where, alongside the instantaneous network state, the RRM policy takes as input the set of dual variables corresponding to the constraints, which evolve depending on how much the constraints are violated during execution. We theoretically show that the proposed state-augmented algorithm leads to feasible and near-optimal RRM decisions. Moreover, focusing on the problem of wireless power control using graph neural network (GNN) parameterizations, we demonstrate the superiority of the proposed RRM algorithm over baseline methods across a suite of numerical experiments.
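    The dual dynamics admit a compact sketch: the duals grow while a long-term constraint is violated and relax otherwise, and the policy is conditioned on them alongside the instantaneous network state. The names and the simple dual-ascent step below are illustrative assumptions, not the paper's exact update rule.

        import numpy as np

        def state_augmented_step(policy, state, lam, c_min, perf, eta):
            """One execution step of a state-augmented RRM policy (sketch).

            lam   : dual variables, one per long-term constraint.
            c_min : minimum long-term average performance required per user.
            perf  : realized instantaneous performance for each user."""
            action = policy(np.concatenate([state, lam]))       # duals are extra policy inputs
            lam = np.maximum(0.0, lam + eta * (c_min - perf))   # dual ascent on constraint violations
            return action, lam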
    Automated analysis of fibrous cap in intravascular optical coherence tomography images of coronary arteries. (arXiv:2204.10162v2 [cs.LG] UPDATED)
    Thin-cap fibroatheroma (TCFA) and plaque rupture have been recognized as the most frequent risk factors for thrombosis and acute coronary syndrome. Intravascular optical coherence tomography (IVOCT) can identify TCFA and assess cap thickness, which provides an opportunity to assess plaque vulnerability. We developed an automated method that can detect lipidous plaque and assess fibrous cap thickness in IVOCT images. This study analyzed a total of 4,360 IVOCT image frames of 77 lesions among 41 patients. To improve segmentation performance, preprocessing included lumen segmentation, pixel-shifting, and noise filtering on the raw polar (r, theta) IVOCT images. We used the DeepLab-v3 plus deep learning model to classify lipidous plaque pixels. After lipid detection, we automatically detected the outer border of the fibrous cap using a special dynamic programming algorithm and assessed the cap thickness. Our method provided excellent discriminability of lipid plaque, with a sensitivity of 85.8% and an A-line Dice coefficient of 0.837. By comparing lipid angle measurements between two analysts following editing of our automated software, we found good agreement by Bland-Altman analysis (difference 6.7 +/- 17 degrees; mean 196 degrees). Our method accurately detected the fibrous cap from the detected lipid plaque. Automated analysis required a significant modification for only 5.5% of frames. Furthermore, our method showed good agreement of fibrous cap thickness between two analysts with Bland-Altman analysis (4.2 +/- 14.6 microns; mean 175 microns), indicating little bias between users and good reproducibility of the measurement. We developed a fully automated method for fibrous cap quantification in IVOCT images, resulting in good agreement with determinations by analysts. The method has great potential to enable highly automated, repeatable, and comprehensive evaluations of TCFAs.
    Industry-Scale Orchestrated Federated Learning for Drug Discovery. (arXiv:2210.08871v2 [cs.LG] UPDATED)
    To apply federated learning to drug discovery, we developed a novel platform in the context of the European Innovative Medicines Initiative (IMI) project MELLODDY (grant n{\deg}831472), which comprised 10 pharmaceutical companies, academic research labs, large industrial companies and startups. The MELLODDY platform was the first industry-scale platform to enable the creation of a global federated model for drug discovery without sharing the confidential data sets of the individual partners. The federated model was trained on the platform by aggregating the gradients of all contributing partners in a cryptographic, secure way following each training iteration. The platform was deployed on an Amazon Web Services (AWS) multi-account architecture running Kubernetes clusters in private subnets. Organisationally, the roles of the different partners were codified as different rights and permissions on the platform and administrated in a decentralized way. The MELLODDY platform generated new scientific discoveries which are described in a companion paper.
    Data-Driven Constitutive Relation Reveals Scaling Law for Hydrodynamic Transport Coefficients. (arXiv:2108.00413v3 [physics.flu-dyn] UPDATED)
    Finding extended hydrodynamic equations valid from the dense gas region to the rarefied gas region remains a great challenge. The key to success is to obtain accurate constitutive relations for stress and heat flux. Data-driven models offer a new phenomenological approach to learning constitutive relations from data. Such models enable complex constitutive relations that extend Newton's law of viscosity and Fourier's law of heat conduction by regression on higher derivatives. However, the choice of derivatives in these models is ad hoc, without a clear physical explanation. We investigated data-driven models theoretically on a linear system. We argue that these models are equivalent to nonlinear length-scale scaling laws for the transport coefficients. This equivalence justifies the physical plausibility of data-driven models and reveals their limitations. Our argument also points out that modeling the scaling law directly could avoid practical difficulties of data-driven models, such as derivative estimation and variable selection on noisy data. We further propose a constitutive relation model based on the scaling law and test it on the calculation of Rayleigh scattering spectra. The results show that our data-driven model has a clear advantage over the Chapman-Enskog expansion and moment methods.
    A Roadmap to Domain Knowledge Integration in Machine Learning. (arXiv:2212.05712v1 [cs.LG])
    Many machine learning algorithms have been developed in recent years to enhance the performance of models across different aspects of artificial intelligence, but performance often remains limited by inadequate data and resources. Integrating domain knowledge into a machine learning model can help overcome these obstacles to a certain degree. Incorporating knowledge is nevertheless a complex task because of the various forms of knowledge representation. In this paper, we give a brief overview of these different forms of knowledge integration and their performance on certain machine learning tasks.
    Application of Convolutional Neural Networks with Quasi-Reversibility Method Results for Option Forecasting. (arXiv:2208.14385v2 [q-fin.ST] UPDATED)
    This paper presents a novel way to apply mathematical finance and machine learning (ML) to forecast stock option prices. Following results from the paper Quasi-Reversibility Method and Neural Network Machine Learning to Solution of Black-Scholes Equations (which appeared in the AMS Contemporary Mathematics journal), we create and evaluate new empirical mathematical models for the Black-Scholes equation to analyze data for 92,846 companies. We solve the Black-Scholes (BS) equation forwards in time as an ill-posed inverse problem, using the Quasi-Reversibility Method (QRM), to predict option prices one day into the future. For each company, we have 13 elements, including stock and option daily prices, volatility, minimizer, etc. Because the market is too complicated for any perfect model to exist, we apply ML to train algorithms to make the best prediction. The current stage of research combines QRM with Convolutional Neural Networks (CNNs), which learn information across a large number of data points simultaneously. We implement CNNs to generate new results by validating and testing on sample market data. We test different ways of applying CNNs and compare our CNN models with previous models to see whether achieving a higher profit rate is possible.
    GWRBoost: A geographically weighted gradient boosting method for explainable quantification of spatially-varying relationships. (arXiv:2212.05814v1 [cs.LG])
    Geographically weighted regression (GWR) is an essential tool for estimating the spatial variation of relationships between dependent and independent variables in geographical contexts. However, GWR suffers from the problem that the classical linear regressions composing it are prone to underfitting, especially for large volumes of complex nonlinear data, leading to inferior comparative performance. Meanwhile, some advanced models, such as decision trees and support vector machines, can learn features from complex data more effectively, but they cannot provide explainable quantification of the spatial variation of localized relationships. To address these issues, we propose a geographically weighted gradient boosting regression model, GWRBoost, which applies a localized additive model and gradient boosting optimization to alleviate the underfitting problem while retaining explainable quantification of spatially-varying relationships between geographically located variables. Furthermore, we formulate the computation of the Akaike information criterion for the proposed model to enable comparative analysis with the classic GWR algorithm. Simulation experiments and an empirical case study demonstrate the efficient performance and practical value of GWRBoost. The results show that our proposed model reduces the RMSE of parameter estimation by 18.3\% and AICc by 67.3\% in goodness of fit.
    Resource-Efficient Neural Networks for Embedded Systems. (arXiv:2001.03048v2 [stat.ML] UPDATED)
    While machine learning is traditionally a resource intensive task, embedded systems, autonomous navigation, and the vision of the Internet of Things fuel the interest in resource-efficient approaches. These approaches aim for a carefully chosen trade-off between performance and resource consumption in terms of computation and energy. The development of such approaches is among the major challenges in current machine learning research and key to ensure a smooth transition of machine learning technology from a scientific environment with virtually unlimited computing resources into everyday applications. In this article, we provide an overview of the current state of the art of machine learning techniques facilitating these real-world requirements. In particular, we focus on deep neural networks (DNNs), the predominant machine learning models of the past decade. We give a comprehensive overview of the vast literature that can be mainly split into three non-mutually exclusive categories: (i) quantized neural networks, (ii) network pruning, and (iii) structural efficiency. These techniques can be applied during training or as post-processing, and they are widely used to reduce the computational demands in terms of memory footprint, inference speed, and energy efficiency. We also briefly discuss different concepts of embedded hardware for DNNs and their compatibility with machine learning techniques as well as potential for energy and latency reduction. We substantiate our discussion with experiments on well-known benchmark datasets using compression techniques (quantization, pruning) for a set of resource-constrained embedded systems, such as CPUs, GPUs and FPGAs. The obtained results highlight the difficulty of finding good trade-offs between resource efficiency and predictive performance.
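    As a concrete instance of the quantization techniques surveyed, the sketch below applies post-training symmetric uniform quantization to a weight tensor; the scale choice is illustrative, and per-channel or asymmetric variants are common in practice.

        import numpy as np

        def quantize_symmetric(w, num_bits=8):
            """Post-training symmetric uniform quantization (illustrative)."""
            qmax = 2 ** (num_bits - 1) - 1
            scale = np.max(np.abs(w)) / qmax
            q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
            return q, scale  # dequantize with q * scale

        w = np.random.randn(64, 64).astype(np.float32)
        q, s = quantize_symmetric(w)
        print("max abs error:", np.max(np.abs(w - q.astype(np.float32) * s)))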
    Variational Monte Carlo Approach to Partial Differential Equations with Neural Networks. (arXiv:2206.01927v2 [math.NA] UPDATED)
    The accurate numerical solution of partial differential equations is a central task in numerical analysis, making it possible to model a wide range of natural phenomena by employing specialized solvers depending on the scenario of application. Here, we develop a variational approach for solving partial differential equations governing the evolution of high-dimensional probability distributions. Our approach naturally works on the unbounded continuous domain and encodes the full probability density function through its variational parameters, which are adapted dynamically during the evolution to optimally reflect the dynamics of the density. For the considered benchmark cases, we observe excellent agreement with numerical solutions as well as with analytical solutions in regimes inaccessible to traditional computational approaches.
    Collaboration Promotes Group Resilience in Multi-Agent AI. (arXiv:2111.06614v2 [cs.LG] UPDATED)
    AI agents need to be robust to unexpected changes in their environment in order to safely operate in real-world scenarios. While some work has been done on this type of robustness in the single-agent case, in this work we introduce the idea that collaboration with other agents can help agents adapt to environment perturbations in multi-agent reinforcement learning settings. We first formalize this notion of resilience of a group of agents. We then empirically evaluate different collaboration protocols and examine their effect on resilience. We see that all of the collaboration approaches considered lead to greater resilience compared to the baseline, in line with our hypothesis. We discuss future directions and the general relevance of the concept of resilience introduced in this work.
    Memory-efficient model-based deep learning with convergence and robustness guarantees. (arXiv:2206.04797v2 [cs.CV] UPDATED)
    Computational imaging has been revolutionized by compressed sensing (CS) algorithms, which offer guaranteed uniqueness, convergence, and stability properties. In recent years, model-based deep learning methods that combine imaging physics with learned regularization priors have been emerging as more powerful alternatives for image recovery. The main focus of this paper is to introduce a memory-efficient model-based algorithm with theoretical guarantees similar to those of CS methods. The proposed iterative algorithm alternates between a gradient descent involving the score function and a conjugate gradient algorithm to encourage data consistency. The score function is modeled as a monotone convolutional neural network. Our analysis shows that the monotone constraint is necessary and sufficient to enforce the uniqueness of the fixed point in arbitrary inverse problems. In addition, it also guarantees convergence to a fixed point, which is robust to input perturbations. Current algorithms including RED and MoDL are special cases of the proposed algorithm; the proposed theoretical tools enable the optimization of the framework for the deep equilibrium setting. The proposed deep equilibrium formulation is significantly more memory-efficient than unrolled methods, which allows us to apply it to 3D or 2D+time problems that current unrolled algorithms cannot handle.
    ParaDime: A Framework for Parametric Dimensionality Reduction. (arXiv:2210.04582v2 [cs.LG] UPDATED)
    ParaDime is a framework for parametric dimensionality reduction (DR). In parametric DR, neural networks are trained to embed high-dimensional data items in a low-dimensional space while minimizing an objective function. ParaDime builds on the idea that the objective functions of several modern DR techniques result from transformed inter-item relationships. It provides a common interface to specify these relations and transformations and to define how they are used within the losses that govern the training process. Through this interface, ParaDime unifies parametric versions of DR techniques such as metric MDS, t-SNE, and UMAP. Furthermore, it allows users to fully customize each aspect of the DR process. We show how this ease of customization makes ParaDime suitable for experimenting with interesting techniques, such as hybrid classification/embedding models or supervised DR, which opens up new possibilities for visualizing high-dimensional data.
    Semi-Discrete Normalizing Flows through Differentiable Tessellation. (arXiv:2203.06832v4 [cs.LG] UPDATED)
    Mapping between discrete and continuous distributions is a difficult task and many have had to resort to heuristical approaches. We propose a tessellation-based approach that directly learns quantization boundaries in a continuous space, complete with exact likelihood evaluations. This is done through constructing normalizing flows on convex polytopes parameterized using a simple homeomorphism with an efficient log determinant Jacobian. We explore this approach in two application settings, mapping from discrete to continuous and vice versa. Firstly, a Voronoi dequantization allows automatically learning quantization boundaries in a multidimensional space. The location of boundaries and distances between regions can encode useful structural relations between the quantized discrete values. Secondly, a Voronoi mixture model has near-constant computation cost for likelihood evaluation regardless of the number of mixture components. Empirically, we show improvements over existing methods across a range of structured data modalities.
    Survey of Machine Learning Based Intrusion Detection Methods for Internet of Medical Things. (arXiv:2202.09657v3 [cs.CR] UPDATED)
    The Internet of Medical Things (IoMT) allows the collection of physiological data using sensors and their transmission to remote servers, permitting physicians and health professionals to analyze these data continuously and permanently. However, on the one hand, this technology faces security risks ranging from violations of patients' privacy to threats to their lives, because wireless communication exposes these data to interception attacks. Moreover, these data are of particular interest to attackers due to their sensitive and private nature. On the other hand, adopting traditional security measures, such as cryptography, on medical equipment with low computing, storage and energy capacity and heterogeneous communication represents a challenge. Moreover, these protection methods are ineffective against new attacks and zero-day attacks. Security measures must be adopted to guarantee the integrity, confidentiality and availability of data during collection, transmission, storage and processing. In this context, Intrusion Detection Systems (IDS) based on Machine Learning (ML) can bring a complementary security solution adapted to the characteristics of IoMT systems. This paper performs a comprehensive survey of how IDS based on ML address security and privacy issues in IoMT systems. For this purpose, the generic three-layer architecture of IoMT and the security requirements of IoMT systems are provided. Then, the various threats that can affect IoMT security are identified, along with the advantages, disadvantages, methods, and datasets used in each ML-based solution at the three layers composing IoMT. Finally, some challenges and limitations of applying ML-based IDS at each layer of IoMT are discussed, which can serve as future research directions.
    Off-Policy Deep Reinforcement Learning Algorithms for Handling Various Robotic Manipulator Tasks. (arXiv:2212.05572v1 [cs.RO])
    Conventional control methods face obstacles due to the complexity of systems and their intense demand for data, so developing modern and more efficient control methods is required. Off-policy, model-free reinforcement learning algorithms help to avoid working with complex system models. In terms of speed and accuracy, they have become prominent methods because they use past experience to learn optimal policies. In this study, three reinforcement learning algorithms, DDPG, TD3 and SAC, have been used to train the Fetch robotic manipulator for four different tasks in the MuJoCo simulation environment. All of these algorithms are off-policy and able to achieve their desired target by optimizing both policy and value functions. In the current study, the efficiency and speed of these three algorithms are analyzed in a controlled environment.
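    A minimal training sketch using the stable-baselines3 implementations of these three algorithms is given below; Pendulum-v1 stands in for the MuJoCo Fetch tasks, which additionally require gymnasium-robotics and goal-conditioned wrappers, so this is an illustration of the off-policy setup rather than the study's actual configuration.

        # illustrative sketch, not the authors' code
        from stable_baselines3 import DDPG, SAC, TD3

        for algo in (DDPG, TD3, SAC):
            model = algo("MlpPolicy", "Pendulum-v1", verbose=0)
            model.learn(total_timesteps=10_000)  # each learns off-policy from a replay buffer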
    XClusters: Explainability-first Clustering. (arXiv:2209.10956v2 [cs.LG] UPDATED)
    We study the problem of explainability-first clustering where explainability becomes a first-class citizen for clustering. Previous clustering approaches use decision trees for explanation, but only after the clustering is completed. In contrast, our approach is to perform clustering and decision tree training holistically where the decision tree's performance and size also influence the clustering results. We assume the attributes for clustering and explaining are distinct, although this is not necessary. We observe that our problem is a monotonic optimization where the objective function is a difference of monotonic functions. We then propose an efficient branch-and-bound algorithm for finding the best parameters that lead to a balance of cluster distortion and decision tree explainability. Our experiments show that our method can improve the explainability of any clustering that fits in our framework.
    Neural Networks with Physics-Informed Architectures and Constraints for Dynamical Systems Modeling. (arXiv:2109.06407v2 [cs.LG] UPDATED)
    Effective inclusion of physics-based knowledge into deep neural network models of dynamical systems can greatly improve data efficiency and generalization. Such a-priori knowledge might arise from physical principles (e.g., conservation laws) or from the system's design (e.g., the Jacobian matrix of a robot), even if large portions of the system dynamics remain unknown. We develop a framework to learn dynamics models from trajectory data while incorporating a-priori system knowledge as inductive bias. More specifically, the proposed framework uses physics-based side information to inform the structure of the neural network itself, and to place constraints on the values of the outputs and the internal states of the model. It represents the system's vector field as a composition of known and unknown functions, the latter of which are parametrized by neural networks. The physics-informed constraints are enforced via the augmented Lagrangian method during the model's training. We experimentally demonstrate the benefits of the proposed approach on a variety of dynamical systems -- including a benchmark suite of robotics environments featuring large state spaces, non-linear dynamics, external forces, contact forces, and control inputs. By exploiting a-priori system knowledge during training, the proposed approach learns to predict the system dynamics two orders of magnitude more accurately than a baseline approach that does not include prior knowledge, given the same training dataset.
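    The structural idea, representing the vector field as a composition of known and unknown functions, can be sketched in a few lines of PyTorch; the module layout and the augmented-Lagrangian-style penalty term below are illustrative assumptions, not the paper's exact formulation.

        import torch
        import torch.nn as nn

        class GreyBoxDynamics(nn.Module):
            """Vector field as known physics plus a learned residual (sketch)."""
            def __init__(self, known_fn, dim, hidden=64):
                super().__init__()
                self.known_fn = known_fn  # a-priori physics, e.g. from conservation laws
                self.residual = nn.Sequential(
                    nn.Linear(dim, hidden), nn.Tanh(), nn.Linear(hidden, dim))

            def forward(self, x):
                return self.known_fn(x) + self.residual(x)

        def augmented_lagrangian_penalty(c_val, lam, rho):
            # penalty for a known constraint c(x) = 0, added to the training loss
            return lam * c_val + 0.5 * rho * c_val ** 2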
    Counterfactual Generation Under Confounding. (arXiv:2210.12368v2 [cs.LG] UPDATED)
    A machine learning model, under the influence of observed or unobserved confounders in the training data, can learn spurious correlations and fail to generalize when deployed. For image classifiers, augmenting a training dataset using counterfactual examples has been empirically shown to break spurious correlations. However, the counterfactual generation task itself becomes more difficult as the level of confounding increases. Existing methods for counterfactual generation under confounding consider a fixed set of interventions (e.g., texture, rotation) and are not flexible enough to capture diverse data-generating processes. Given a causal generative process, we formally characterize the adverse effects of confounding on any downstream tasks and show that the correlation between generative factors (attributes) can be used to quantitatively measure confounding between generative factors. To minimize such correlation, we propose a counterfactual generation method that learns to modify the value of any attribute in an image and generate new images given a set of observed attributes, even when the dataset is highly confounded. These counterfactual images are then used to regularize the downstream classifier such that the learned representations are the same across various generative factors conditioned on the class label. Our method is computationally efficient, simple to implement, and works well for any number of generative factors and confounding variables. Our experimental results on both synthetic (MNIST variants) and real-world (CelebA) datasets show the usefulness of our approach.
    Fine-grained Graph Learning for Multi-view Subspace Clustering. (arXiv:2201.04604v3 [cs.LG] UPDATED)
    Multi-view subspace clustering (MSC) is a popular unsupervised method by integrating heterogeneous information to reveal the intrinsic clustering structure hidden across views. Usually, MSC methods use graphs (or affinity matrices) fusion to learn a common structure, and further apply graph-based approaches to clustering. Despite progress, most of the methods do not establish the connection between graph learning and clustering. Meanwhile, conventional graph fusion strategies assign coarse-grained weights to combine multi-graph, ignoring the importance of local structure. In this paper, we propose a fine-grained graph learning framework for multi-view subspace clustering (FGL-MSC) to address these issues. To utilize the multi-view information sufficiently, we design a specific graph learning method by introducing graph regularization and local structure fusion pattern. The main challenge is how to optimize the fine-grained fusion weights while generating the learned graph that fits the clustering task, thus making the clustering representation meaningful and competitive. Accordingly, an iterative algorithm is proposed to solve the above joint optimization problem, which obtains the learned graph, the clustering representation, and the fusion weights simultaneously. Extensive experiments on eight real-world datasets show that the proposed framework has comparable performance to the state-of-the-art methods.
    User-Oriented Robust Reinforcement Learning. (arXiv:2202.07301v4 [cs.LG] UPDATED)
    Recently, improving the robustness of policies across different environments attracts increasing attention in the reinforcement learning (RL) community. Existing robust RL methods mostly aim to achieve the max-min robustness by optimizing the policy's performance in the worst-case environment. However, in practice, a user that uses an RL policy may have different preferences over its performance across environments. Clearly, the aforementioned max-min robustness is oftentimes too conservative to satisfy user preference. Therefore, in this paper, we integrate user preference into policy learning in robust RL, and propose a novel User-Oriented Robust RL (UOR-RL) framework. Specifically, we define a new User-Oriented Robustness (UOR) metric for RL, which allocates different weights to the environments according to user preference and generalizes the max-min robustness metric. To optimize the UOR metric, we develop two different UOR-RL training algorithms for the scenarios with or without a priori known environment distribution, respectively. Theoretically, we prove that our UOR-RL training algorithms converge to near-optimal policies even with inaccurate or completely no knowledge about the environment distribution. Furthermore, we carry out extensive experimental evaluations in 4 MuJoCo tasks. The experimental results demonstrate that UOR-RL is comparable to the state-of-the-art baselines under the average and worst-case performance metrics, and more importantly establishes new state-of-the-art performance under the UOR metric.
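    One plausible reading of the UOR metric is a preference-weighted aggregate over rank-ordered per-environment returns, which recovers max-min robustness when all weight sits on the worst environment and average performance under uniform weights; the sketch below is illustrative, not the paper's exact definition.

        import numpy as np

        def uor_metric(returns, weights):
            """Preference-weighted aggregate of per-environment returns (sketch).

            Environments are ranked worst-first and matched with weights in
            descending order, so the weight profile encodes how much the user
            cares about bad environments."""
            order = np.argsort(returns)                 # worst environments first
            return float(np.dot(np.sort(weights)[::-1], returns[order]))

        returns = np.array([1.0, 3.0, 2.0])                       # per-environment returns
        print(uor_metric(returns, np.array([1.0, 0.0, 0.0])))     # 1.0: the max-min value
        print(uor_metric(returns, np.full(3, 1.0 / 3.0)))         # 2.0: the average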
    Double Robustness for Complier Parameters and a Semiparametric Test for Complier Characteristics. (arXiv:1909.05244v7 [stat.ML] UPDATED)
    We propose a semiparametric test to evaluate (i) whether different instruments induce subpopulations of compliers with the same observable characteristics on average, and (ii) whether compliers have observable characteristics that are the same as the full population on average. The test is a flexible robustness check for the external validity of instruments. We use it to reinterpret the difference in LATE estimates that Angrist and Evans (1998) obtain when using different instrumental variables. To justify the test, we characterize the doubly robust moment for Abadie (2003)'s class of complier parameters, and we analyze a machine learning update to $\kappa$ weighting.
    Generalization in Deep Learning. (arXiv:1710.05468v7 [stat.ML] UPDATED)
    This paper provides theoretical insights into why and how deep learning can generalize well, despite its large capacity, complexity, possible algorithmic instability, nonrobustness, and sharp minima, responding to an open question in the literature. We also discuss approaches to provide non-vacuous generalization guarantees for deep learning. Based on theoretical observations, we propose new open problems and discuss the limitations of our results.
    Machine Learning for K-adaptability in Two-stage Robust Optimization. (arXiv:2210.11152v2 [math.OC] UPDATED)
    Two-stage robust optimization problems constitute one of the hardest optimization problem classes. One of the solution approaches to this class of problems is K-adaptability. This approach simultaneously seeks the best partitioning of the uncertainty set of scenarios into K subsets, and optimizes decisions corresponding to each of these subsets. In general case, it is solved using the K-adaptability branch-and-bound algorithm, which requires exploration of exponentially-growing solution trees. To accelerate finding high-quality solutions in such trees, we propose a machine learning-based node selection strategy. In particular, we construct a feature engineering scheme based on general two-stage robust optimization insights that allows us to train our machine learning tool on a database of resolved B&B trees, and to apply it as-is to problems of different sizes and/or types. We experimentally show that using our learned node selection strategy outperforms a vanilla, random node selection strategy when tested on problems of the same type as the training problems, also in case the K-value or the problem size differs from the training ones.
    Carpet-bombing patch: attacking a deep network without usual requirements. (arXiv:2212.05827v1 [cs.CV])
    Although deep networks have shown vulnerability to evasion attacks, such attacks have usually unrealistic requirements. Recent literature discussed the possibility to remove or not some of these requirements. This paper contributes to this literature by introducing a carpet-bombing patch attack which has almost no requirement. Targeting the feature representations, this patch attack does not require knowing the network task. This attack decreases accuracy on Imagenet, mAP on Pascal Voc, and IoU on Cityscapes without being aware that the underlying tasks involved classification, detection or semantic segmentation, respectively. Beyond the potential safety issues raised by this attack, the impact of the carpet-bombing attack highlights some interesting property of deep network layer dynamic.
    Auto-Encoding Variational Bayes. (arXiv:1312.6114v11 [stat.ML] UPDATED)
    How can we perform efficient inference and learning in directed probabilistic models, in the presence of continuous latent variables with intractable posterior distributions, and large datasets? We introduce a stochastic variational inference and learning algorithm that scales to large datasets and, under some mild differentiability conditions, even works in the intractable case. Our contributions are two-fold. First, we show that a reparameterization of the variational lower bound yields a lower bound estimator that can be straightforwardly optimized using standard stochastic gradient methods. Second, we show that for i.i.d. datasets with continuous latent variables per datapoint, posterior inference can be made especially efficient by fitting an approximate inference model (also called a recognition model) to the intractable posterior using the proposed lower bound estimator. Theoretical advantages are reflected in experimental results.
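    The reparameterization at the heart of the method is short enough to sketch directly: writing z ~ N(mu, sigma^2) as mu + sigma * eps with eps ~ N(0, I) makes the lower bound differentiable with respect to mu and sigma. The PyTorch sketch below also includes the analytic Gaussian KL term of the lower bound; tensor shapes are illustrative.

        import torch

        def reparameterize(mu, logvar):
            """z = mu + sigma * eps, eps ~ N(0, I): gradients flow through mu, logvar."""
            eps = torch.randn_like(mu)
            return mu + torch.exp(0.5 * logvar) * eps

        def elbo(recon_log_lik, mu, logvar):
            # analytic KL(N(mu, sigma^2) || N(0, I)) for a diagonal Gaussian posterior
            kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
            return recon_log_lik - kl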
    Logical Fallacy Detection. (arXiv:2202.13758v3 [cs.CL] UPDATED)
    Reasoning is central to human intelligence. However, fallacious arguments are common, and some exacerbate problems such as spreading misinformation about climate change. In this paper, we propose the task of logical fallacy detection, and provide a new dataset (Logic) of logical fallacies generally found in text, together with an additional challenge set for detecting logical fallacies in climate change claims (LogicClimate). Detecting logical fallacies is a hard problem as the model must understand the underlying logical structure of the argument. We find that existing pretrained large language models perform poorly on this task. In contrast, we show that a simple structure-aware classifier outperforms the best language model by 5.46% on Logic and 4.51% on LogicClimate. We encourage future work to explore this task as (a) it can serve as a new reasoning challenge for language models, and (b) it can have potential applications in tackling the spread of misinformation. Our dataset and code are available at https://github.com/causalNLP/logical-fallacy
    Debiased Machine Learning of Set-Identified Linear Models. (arXiv:1712.10024v5 [stat.ML] UPDATED)
    This paper provides estimation and inference methods for an identified set's boundary (i.e., support function) where the selection among a very large number of covariates is based on modern regularized tools. I characterize the boundary using a semiparametric moment equation. Combining Neyman-orthogonality and sample splitting ideas, I construct a root-N consistent, uniformly asymptotically Gaussian estimator of the boundary and propose a multiplier bootstrap procedure to conduct inference. I apply this result to the partially linear model, the partially linear IV model and the average partial derivative with an interval-valued outcome.
    ACIL: Analytic Class-Incremental Learning with Absolute Memorization and Privacy Protection. (arXiv:2205.14922v2 [cs.LG] UPDATED)
    Class-incremental learning (CIL) learns a classification model with training data of different classes arising progressively. Existing CIL either suffers from serious accuracy loss due to catastrophic forgetting, or invades data privacy by revisiting used exemplars. Inspired by linear learning formulations, we propose an analytic class-incremental learning (ACIL) approach with absolute memorization of past knowledge while avoiding breaches of data privacy (i.e., without storing historical data). The absolute memorization is demonstrated in the sense that class-incremental learning using ACIL given present data gives results identical to those of its joint-learning counterpart, which consumes both present and historical samples. This equality is theoretically validated. Data privacy is ensured since no historical data are involved during the learning process. Empirical validations demonstrate ACIL's competitive accuracy performance with near-identical results for various incremental task settings (e.g., 5-50 phases). This also allows ACIL to outperform the state-of-the-art methods for large-phase scenarios (e.g., 25 and 50 phases).
    Multi-Dimensional Self Attention based Approach for Remaining Useful Life Estimation. (arXiv:2212.05772v1 [cs.LG])
    Remaining Useful Life (RUL) estimation plays a critical role in Prognostics and Health Management (PHM). Traditional machine health maintenance systems are often costly, require substantial prior expertise, and are difficult to fit into highly complex and changing industrial scenarios. With the widespread deployment of sensors on industrial equipment, building the Industrial Internet of Things (IIoT) to interconnect these devices has become an inexorable trend in the development of the digital factory. Using a device's real-time operational data collected by the IIoT to estimate RUL through a prediction algorithm, a PHM system can develop proactive maintenance measures for the device, thus reducing maintenance costs and decreasing failure times during operation. This paper carries out research into remaining useful life prediction models for multi-sensor devices in the IIoT scenario. We investigated the mainstream RUL prediction models and summarized the basic steps of RUL prediction modeling in this scenario. On this basis, a data-driven approach for RUL estimation is proposed. It employs a Multi-Head Attention Mechanism to fuse the multi-dimensional time-series data output by multiple sensors, in which attention on features is used to capture the interactions between features and attention on sequences is used to learn the weights of time steps. Then, a Long Short-Term Memory Network is applied to learn the features of the time series. We evaluate the proposed model on two benchmark datasets (C-MAPSS and PHM08), and the results demonstrate that it outperforms state-of-the-art models. Moreover, through the interpretability of the multi-head attention mechanism, the proposed model can provide a preliminary explanation of engine degradation. Therefore, this approach is promising for predictive maintenance in IIoT scenarios.
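    A minimal PyTorch sketch of the described architecture, multi-head attention fusing multi-sensor features followed by an LSTM over time, is given below; the dimensions, head count, and window size are illustrative assumptions, not the paper's configuration.

        import torch
        import torch.nn as nn

        class AttnRUL(nn.Module):
            """Illustrative multi-head attention + LSTM regressor for RUL."""
            def __init__(self, n_sensors=14, d_model=64, heads=4):
                super().__init__()
                self.embed = nn.Linear(n_sensors, d_model)
                self.attn = nn.MultiheadAttention(d_model, heads, batch_first=True)
                self.lstm = nn.LSTM(d_model, d_model, batch_first=True)
                self.head = nn.Linear(d_model, 1)

            def forward(self, x):            # x: (batch, time, sensors)
                h = self.embed(x)
                h, _ = self.attn(h, h, h)    # fuse multi-sensor features across time steps
                h, _ = self.lstm(h)          # learn temporal dynamics
                return self.head(h[:, -1])   # RUL estimate from the last time step

        rul = AttnRUL()(torch.randn(8, 30, 14))  # e.g., C-MAPSS-like sliding windows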
    Estimator: An Effective and Scalable Framework for Transportation Mode Classification over Trajectories. (arXiv:2212.05502v1 [cs.LG])
    Transportation mode classification, the process of predicting the class labels of moving objects' transportation modes, has been widely applied to a variety of real-world applications, such as traffic management, urban computing, and behavior study. However, existing studies of transportation mode classification typically extract the explicit features of trajectory data but fail to capture the implicit features that affect classification performance. In addition, most existing studies prefer to apply RNN-based models to embed trajectories, which is only suitable for classifying small-scale data. To tackle the above challenges, we propose an effective and scalable framework for transportation mode classification over GPS trajectories, abbreviated Estimator. Estimator is established on a developed CNN-TCN architecture, which is capable of leveraging the spatial and temporal hidden features of trajectories to achieve high effectiveness and efficiency. Estimator partitions the entire traffic space into disjoint spatial regions according to traffic conditions, which enhances scalability significantly and thus enables parallel transportation classification. Extensive experiments using eight public real-life datasets offer evidence that Estimator (i) achieves superior model effectiveness (i.e., 99% accuracy and 0.98 F1-score), substantially outperforming the state of the art; (ii) exhibits prominent model efficiency, obtaining 7-40x speedups over state-of-the-art learning-based methods; and (iii) shows high model scalability and robustness, enabling large-scale classification analytics.
    Instrumental Variables in Causal Inference and Machine Learning: A Survey. (arXiv:2212.05778v1 [cs.LG])
    Causal inference is the process of using assumptions, study designs, and estimation strategies to draw conclusions about the causal relationships between variables based on data. This allows researchers to better understand the underlying mechanisms at work in complex systems and make more informed decisions. In many settings, we may not fully observe all the confounders that affect both the treatment and outcome variables, complicating the estimation of causal effects. To address this problem, a growing literature in both causal inference and machine learning proposes to use Instrumental Variables (IV). This paper serves as the first effort to systematically and comprehensively introduce and discuss the IV methods and their applications in both causal inference and machine learning. First, we provide the formal definition of IVs and discuss the identification problem of IV regression methods under different assumptions. Second, we categorize the existing work on IV methods into three streams according to the focus of the proposed methods, including two-stage least squares with IVs, control function with IVs, and evaluation of IVs. For each stream, we present both the classical causal inference methods, and recent developments in the machine learning literature. Then, we introduce a variety of applications of IV methods in real-world scenarios and provide a summary of the available datasets and algorithms. Finally, we summarize the literature, discuss the open problems and suggest promising future research directions for IV methods and their applications. We also develop a toolkit of the IV methods reviewed in this survey at https://github.com/causal-machine-learning-lab/mliv.
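    The classical two-stage least squares (2SLS) estimator discussed in the survey can be sketched in a few lines of NumPy: regress the endogenous regressors on the instruments, then regress the outcome on the fitted values. The toy data-generating process below is illustrative.

        import numpy as np

        def two_stage_least_squares(y, X, Z):
            """Classical 2SLS (sketch): first stage X on Z, second stage y on fitted X."""
            X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]  # first stage
            return np.linalg.lstsq(X_hat, y, rcond=None)[0]   # second stage

        # toy example: instrument z shifts x but affects y only through x
        rng = np.random.default_rng(0)
        n = 5000
        z = rng.normal(size=(n, 1))
        u = rng.normal(size=(n, 1))              # unobserved confounder
        x = z + u + rng.normal(size=(n, 1))
        y = 2.0 * x + u + rng.normal(size=(n, 1))
        print(two_stage_least_squares(y, x, z))  # close to 2.0, unlike naive OLS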
    GT-CausIn: a novel causal-based insight for traffic prediction. (arXiv:2212.05782v1 [cs.LG])
    Traffic forecasting is an important application of spatiotemporal series prediction. Among different methods, graph neural networks have so far achieved the most promising results, and learning the relations between graph nodes then becomes a crucial task. However, the room for improvement is very limited when these relations are learned in a node-to-node manner. The challenge stems from (1) obscure temporal dependencies between different stations, (2) difficulties in defining variables beyond the node level, and (3) the lack of a ready-made method to validate the learned relations. To confront these challenges, we define legitimate traffic causal variables to discover the causal relations inside the traffic network, which are carefully checked with statistical tools and case analysis. We then present a novel model named Graph Spatial-Temporal Network Based on Causal Insight (GT-CausIn), where prior learned causal information is integrated with graph diffusion layers and temporal convolutional network (TCN) layers. Experiments are carried out on two real-world traffic datasets, PEMS-BAY and METR-LA, which show that GT-CausIn significantly outperforms state-of-the-art models on mid-term and long-term prediction.
    Efficient Relation-aware Neighborhood Aggregation in Graph Neural Networks via Tensor Decomposition. (arXiv:2212.05581v1 [cs.LG])
    Numerous models have tried to effectively embed knowledge graphs in low dimensions. Among the state-of-the-art methods, Graph Neural Network (GNN) models provide structure-aware representations of knowledge graphs. However, they often utilize the information of relations and their interactions with entities inefficiently. Moreover, most state-of-the-art knowledge graph embedding models suffer from scalability issues because of assigning high-dimensional embeddings to entities and relations. To address the above limitations, we propose a scalable general knowledge graph encoder that adaptively involves a powerful tensor decomposition method in the aggregation function of RGCN, a well-known relational GNN model. Specifically, the parameters of a low-rank core projection tensor, used to transform neighborhood entities in the encoder, are shared across relations to benefit from multi-task learning and incorporate relation information effectively. Besides, we propose a low-rank estimation of the core tensor using CP decomposition to compress the model, which is also applicable, as a regularization method, to other similar linear models. We evaluated our model on knowledge graph completion as a common downstream task. We train our model using a new loss function based on contrastive learning, which relieves the training limitation of the 1-N method on huge graphs. We improved RGCN performance on FB15k-237 by 0.42% with considerably lower dimensionality of embeddings.
    Mind the gap: Challenges of deep learning approaches to Theory of Mind. (arXiv:2203.16540v2 [cs.LG] UPDATED)
    Theory of Mind is an essential ability of humans to infer the mental states of others. Here we provide a coherent summary of the potential, current progress, and problems of deep learning approaches to Theory of Mind. We highlight that many current findings can be explained through shortcuts. These shortcuts arise because the tasks used to investigate Theory of Mind in deep learning systems have been too narrow. Thus, we encourage researchers to investigate Theory of Mind in complex open-ended environments. Furthermore, to inspire future deep learning systems we provide a concise overview of prior work done in humans. We further argue that when studying Theory of Mind with deep learning, the research's main focus and contribution ought to be opening up the network's representations. We recommend researchers use tools from the field of interpretability of AI to study the relationship between different network components and aspects of Theory of Mind.
    Exponential Separations in Symmetric Neural Networks. (arXiv:2206.01266v3 [cs.LG] UPDATED)
    In this work we demonstrate a novel separation between symmetric neural network architectures. Specifically, we consider the Relational Network~\parencite{santoro2017simple} architecture as a natural generalization of the DeepSets~\parencite{zaheer2017deep} architecture, and study their representational gap. Under the restriction to analytic activation functions, we construct a symmetric function acting on sets of size $N$ with elements in dimension $D$, which can be efficiently approximated by the former architecture, but provably requires width exponential in $N$ and $D$ for the latter.
    Sequential Density Estimation via Nonlinear Continuous Weighted Finite Automata. (arXiv:2206.03923v2 [cs.LG] UPDATED)
    Weighted finite automata (WFAs) have been widely applied in many fields. One of the classic problems for WFAs is probability distribution estimation over sequences of discrete symbols. Although WFAs have been extended to deal with continuous input data, namely continuous WFAs (CWFAs), it is still unclear how to approximate density functions over sequences of continuous random variables using WFA-based models, due to the limitation on the expressiveness of the model as well as the tractability of approximating density functions via CWFAs. In this paper, we propose a nonlinear extension to the CWFA model to improve its expressiveness, which we refer to as the nonlinear continuous WFA (NCWFA). We then leverage the so-called RNADE method, a well-known density estimator based on neural networks, and propose the RNADE-NCWFA model. The RNADE-NCWFA model computes a density function by design. We show that this model is strictly more expressive than the Gaussian HMM model, which CWFAs cannot approximate. Empirically, we conduct a synthetic experiment using Gaussian HMM generated data. We focus on evaluating the model's ability to estimate densities for sequences of varying lengths (longer than those in the training data). We observe that our model performs the best among the compared baseline methods.
    Federated Learning via Plurality Vote. (arXiv:2110.02998v3 [cs.LG] UPDATED)
    Federated learning allows collaborative workers to solve a machine learning problem while preserving data privacy. Recent studies have tackled various challenges in federated learning, but the joint optimization of communication overhead, learning reliability, and deployment efficiency is still an open problem. To this end, we propose a new scheme named federated learning via plurality vote (FedVote). In each communication round of FedVote, workers transmit binary or ternary weights to the server with low communication overhead. The model parameters are aggregated via weighted voting to enhance the resilience against Byzantine attacks. When deployed for inference, the model with binary or ternary weights is resource-friendly to edge devices. We show that our proposed method can reduce quantization error and converges faster compared with the methods directly quantizing the model updates.
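    The voting step can be sketched compactly: each worker submits binary weights, and the server takes a weighted plurality vote coordinate-wise, which caps any single worker's influence on the global model. The tie-breaking and weighting below are illustrative assumptions, not the paper's exact aggregation rule.

        import numpy as np

        def fedvote_aggregate(worker_signs, vote_weights):
            """Weighted plurality vote over binary (+1/-1) weight vectors (sketch).

            Voting bounds each worker's per-coordinate influence, giving
            resilience against Byzantine updates."""
            tally = sum(w * s for w, s in zip(vote_weights, worker_signs))
            return np.where(tally >= 0, 1.0, -1.0)

        workers = [np.sign(np.random.randn(10)) for _ in range(5)]
        global_w = fedvote_aggregate(workers, vote_weights=np.ones(5))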
    AutoFi: Towards Automatic WiFi Human Sensing via Geometric Self-Supervised Learning. (arXiv:2205.01629v2 [cs.NI] UPDATED)
    WiFi sensing technology has shown superiority in smart homes among various sensors for its cost-effective and privacy-preserving merits. It is empowered by Channel State Information (CSI) extracted from WiFi signals and advanced machine learning models to analyze motion patterns in CSI. Many learning-based models have been proposed for various applications, but they severely suffer from environmental dependency. Though domain adaptation methods have been proposed to tackle this issue, it is not practical to collect high-quality, well-segmented and balanced CSI samples in a new environment for adaptation algorithms, whereas randomly-captured CSI samples can be easily collected. In this paper, we first explore how to learn a robust model from these low-quality CSI samples, and propose AutoFi, an annotation-efficient WiFi sensing model based on a novel geometric self-supervised learning algorithm. AutoFi fully utilizes unlabeled low-quality CSI samples that are captured randomly, and then transfers the knowledge to specific tasks defined by users, which is the first work to achieve cross-task transfer in WiFi sensing. AutoFi is implemented on a pair of Atheros WiFi APs for evaluation. AutoFi transfers knowledge from randomly collected CSI samples into human gait recognition and achieves state-of-the-art performance. Furthermore, we simulate cross-task transfer using public datasets to further demonstrate its capacity for cross-task learning. For the UT-HAR and Widar datasets, AutoFi achieves satisfactory results on activity recognition and gesture recognition without any prior training. We believe that AutoFi takes a huge step toward automatic WiFi sensing without any developer engagement.
    Concentration of Random Feature Matrices in High-Dimensions. (arXiv:2204.06935v2 [stat.ML] UPDATED)
    The spectra of random feature matrices provide essential information on the conditioning of the linear system used in random feature regression problems and are thus connected to the consistency and generalization of random feature models. Random feature matrices are asymmetric rectangular nonlinear matrices depending on two input variables, the data and the weights, which can make their characterization challenging. We consider two settings for the two input variables: either both are random variables, or one is a random variable and the other is well-separated, i.e., there is a minimum distance between points. With conditions on the dimension, the complexity ratio, and the sampling variance, we show that the singular values of these matrices concentrate near their full expectation and near one with high probability. In particular, since the dimension depends only on the logarithm of the number of random weights or the number of data points, our complexity bounds can be achieved even in moderate dimensions for many practical settings. The theoretical results are verified with numerical experiments.
    New Paradigms for Exploiting Parallel Experiments in Bayesian Optimization. (arXiv:2210.01071v3 [stat.ML] UPDATED)
    Bayesian optimization (BO) is one of the most effective methods for closed-loop experimental design and black-box optimization. However, a key limitation of BO is that it is an inherently sequential algorithm (one experiment is proposed per round) and thus cannot directly exploit high-throughput (parallel) experiments. Diverse modifications to the BO framework have been proposed in the literature to enable exploitation of parallel experiments, but such approaches are limited in the degree of parallelization that they can achieve and can lead to redundant experiments (thus wasting resources and potentially compromising performance). In this work, we present new parallel BO paradigms that exploit the structure of the system to partition the design space. Specifically, we propose an approach that partitions the design space by following the level sets of the performance function, and an approach that exploits partially-separable structure found in the performance function. We conduct extensive numerical experiments using a reactor case study to benchmark the effectiveness of these approaches against a variety of state-of-the-art parallel algorithms reported in the literature. Our computational results show that our approaches significantly reduce the required search time and increase the probability of finding a global (rather than local) solution.
    Neural Conservation Laws: A Divergence-Free Perspective. (arXiv:2210.01741v3 [cs.LG] UPDATED)
    We investigate the parameterization of deep neural networks that by design satisfy the continuity equation, a fundamental conservation law. This is enabled by the observation that any solution of the continuity equation can be represented as a divergence-free vector field. We hence propose building divergence-free neural networks through the concept of differential forms, and with the aid of automatic differentiation, realize two practical constructions. As a result, we can parameterize pairs of densities and vector fields that always exactly satisfy the continuity equation, foregoing the need for extra penalty methods or expensive numerical simulation. Furthermore, we prove these models are universal and so can be used to represent any divergence-free vector field. Finally, we experimentally validate our approaches by computing neural network-based solutions to fluid equations, solving for the Hodge decomposition, and learning dynamical optimal transport maps.
    On Pre-Training for Visuo-Motor Control: Revisiting a Learning-from-Scratch Baseline. (arXiv:2212.05749v1 [cs.LG])
    We revisit a simple Learning-from-Scratch baseline for visuo-motor control that uses data augmentation and a shallow ConvNet. We find that this baseline has competitive performance with recent methods that leverage frozen visual representations trained on large-scale vision datasets.
    Malaria Parasitic Detection using a New Deep Boosted and Ensemble Learning Framework. (arXiv:2212.02477v2 [eess.IV] UPDATED)
    Malaria is a potentially fatal disease caused by plasmodium parasites transmitted by female Anopheles mosquitoes; the parasites infect red blood cells and afflict millions of people worldwide every year. However, specialists' manual screening in clinical practice is laborious and prone to error. Therefore, a novel Deep Boosted and Ensemble Learning (DBEL) framework, comprising a stacking of new Boosted-BR-STM convolutional neural networks (CNNs) and an ensemble of ML classifiers, is developed to screen malaria parasite images. The proposed Boosted-BR-STM is based on new dilated-convolutional block-based split transform merge (STM) and feature-map Squeezing-Boosting (SB) ideas. Moreover, the new STM block uses regional and boundary operations to learn the malaria parasite's homogeneity, heterogeneity, and boundary patterns. Furthermore, diverse boosted channels are attained by employing Transfer Learning-based new feature-map SB in STM blocks at the abstract, medium, and conclusion levels to learn minute intensity and texture variations of the parasitic pattern. The proposed DBEL framework stacks the prominent and diverse boosted channels and provides the generated discriminative features of the developed Boosted-BR-STM to the ensemble of ML classifiers. The proposed framework improves the discrimination ability and generalization of ensemble learning. Moreover, the deep feature spaces of the developed Boosted-BR-STM and customized CNNs are fed into ML classifiers for comparative analysis. The proposed DBEL framework outperforms the existing techniques on the NIH malaria dataset, which is enhanced using discrete wavelet transform to enrich the feature space. The proposed DBEL framework achieved Accuracy (98.50%), Sensitivity (0.9920), F-score (0.9850), and AUC (0.997), suggesting its suitability for malaria parasite screening.
    Domain Adaptation of Transformer-Based Models using Unlabeled Data for Relevance and Polarity Classification of German Customer Feedback. (arXiv:2212.05764v1 [cs.CL])
    Understanding customer feedback is becoming a necessity for companies to identify problems and improve their products and services. Text classification and sentiment analysis can play a major role in analyzing this data by using a variety of machine and deep learning approaches. In this work, different transformer-based models are utilized to explore how effective these models are when working with a German customer feedback dataset. In addition, these pre-trained models are further analyzed to determine whether adapting them to a specific domain using unlabeled data can yield better results than off-the-shelf pre-trained models. To evaluate the models, two downstream tasks from GermEval 2017 are considered. The experimental results show that transformer-based models achieve significant improvements over a fastText baseline and outperform previously published scores and models. For the Relevance Classification subtask, the best models achieve a micro-averaged $F1$-Score of 96.1 % on the first test set and 95.9 % on the second one, and scores of 85.1 % and 85.3 % for the Polarity Classification subtask.
    Explainable Performance. (arXiv:2212.05866v1 [stat.ML])
    We introduce the XPER (eXplainable PERformance) methodology to measure the specific contribution of the input features to the predictive or economic performance of a model. Our methodology offers several advantages. First, it is both model-agnostic and performance metric-agnostic. Second, XPER is theoretically founded as it is based on Shapley values. Third, the interpretation of the benchmark, which is inherent in any Shapley value decomposition, is meaningful in our context. Fourth, XPER is not plagued by model specification error, as it does not require re-estimating the model. Fifth, it can be implemented either at the model level or at the individual level. In an application based on auto loans, we find that performance can be explained by a surprisingly small number of features. XPER decompositions are rather stable across metrics, yet some feature contributions switch sign across metrics. Our analysis also shows that explaining model forecasts and model performance are two distinct tasks.
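    A toy illustration of the Shapley idea behind such a decomposition, estimated by permutation sampling (the fixed linear "model", the accuracy metric, and the mean-imputation of absent features below are illustrative assumptions, not XPER's exact estimator):

        import numpy as np

        rng = np.random.default_rng(1)
        n, p = 2000, 4
        X = rng.normal(size=(n, p))
        y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # feature 0 matters most
        w = np.array([1.0, 0.5, 0.0, 0.0])              # a fixed, pre-trained "model"

        def metric(mask):                               # accuracy when features outside
            Xm = np.where(mask, X, X.mean(axis=0))      # the coalition are mean-imputed
            return ((Xm @ w > 0).astype(int) == y).mean()

        phi, n_perm = np.zeros(p), 200
        for _ in range(n_perm):                         # sampled feature orderings
            order, mask = rng.permutation(p), np.zeros(p, dtype=bool)
            prev = metric(mask)
            for j in order:                             # marginal contribution of j
                mask[j] = True
                cur = metric(mask)
                phi[j] += cur - prev
                prev = cur
        print((phi / n_perm).round(3))                  # performance attributed per feature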
    HAQJSK: Hierarchical-Aligned Quantum Jensen-Shannon Kernels for Graph Classification. (arXiv:2211.02904v3 [cs.LG] UPDATED)
    In this work, we propose a family of novel quantum kernels, namely the Hierarchical Aligned Quantum Jensen-Shannon Kernels (HAQJSK), for un-attributed graphs. Different from most existing classical graph kernels, the proposed HAQJSK kernels can incorporate hierarchically aligned structure information between graphs and transform graphs of arbitrary sizes into fixed-sized aligned graph structures, i.e., the Hierarchical Transitive Aligned Adjacency Matrix of vertices and the Hierarchical Transitive Aligned Density Matrix of the Continuous-Time Quantum Walk (CTQW). For a pair of graphs at hand, the resulting HAQJSK kernels are defined by measuring the Quantum Jensen-Shannon Divergence (QJSD) between their transitively aligned graph structures. We show that the proposed HAQJSK kernels not only reflect richer intrinsic global graph characteristics in terms of the CTQW, but also address the drawback of neglecting structural correspondence information that arises in most existing R-convolution kernels. Furthermore, unlike previous Quantum Jensen-Shannon Kernels associated with the QJSD and the CTQW, the proposed HAQJSK kernels simultaneously guarantee permutation invariance and positive definiteness, explaining the theoretical advantages of the HAQJSK kernels. Experiments indicate the effectiveness of the proposed kernels.
    Skill-based Model-based Reinforcement Learning. (arXiv:2207.07560v2 [cs.LG] UPDATED)
    Model-based reinforcement learning (RL) is a sample-efficient way of learning complex behaviors by leveraging a learned single-step dynamics model to plan actions in imagination. However, planning every action for long-horizon tasks is not practical, akin to a human planning out every muscle movement. Instead, humans efficiently plan with high-level skills to solve complex tasks. From this intuition, we propose a Skill-based Model-based RL framework (SkiMo) that enables planning in the skill space using a skill dynamics model, which directly predicts the skill outcomes, rather than predicting all small details in the intermediate states, step by step. For accurate and efficient long-term planning, we jointly learn the skill dynamics model and a skill repertoire from prior experience. We then harness the learned skill dynamics model to accurately simulate and plan over long horizons in the skill space, which enables efficient downstream learning of long-horizon, sparse reward tasks. Experimental results in navigation and manipulation domains show that SkiMo extends the temporal horizon of model-based approaches and improves the sample efficiency for both model-based RL and skill-based RL. Code and videos are available at https://clvrai.com/skimo
    ezDPS: An Efficient and Zero-Knowledge Machine Learning Inference Pipeline. (arXiv:2212.05428v1 [cs.CR])
    Machine Learning as a Service (MLaaS) permits resource-limited clients to access powerful data analytics services ubiquitously. Despite its merits, MLaaS poses significant concerns regarding the integrity of delegated computation and the privacy of the server's model parameters. To address this issue, Zhang et al. (CCS'20) initiated the study of zero-knowledge Machine Learning (zkML). A few zkML schemes have been proposed since; however, they focus on a single ML classification algorithm, which may not offer satisfactory accuracy or may require large-scale training data and model parameters, which may not be desirable for some applications. We propose ezDPS, a new efficient and zero-knowledge ML inference scheme. Unlike prior works, ezDPS is a zkML pipeline in which the data is processed in multiple stages for high accuracy. Each stage of ezDPS is instantiated with an established ML algorithm shown to be effective in various applications, including Discrete Wavelet Transformation, Principal Components Analysis, and Support Vector Machine. We design new gadgets to prove ML operations effectively. We fully implemented ezDPS and assessed its performance on real datasets. Experimental results show that ezDPS is one to three orders of magnitude more efficient than the generic circuit-based approach in all metrics while maintaining more desirable accuracy than single ML classification approaches.
    Vertical Layering of Quantized Neural Networks for Heterogeneous Inference. (arXiv:2212.05326v1 [cs.LG])
    Although considerable progress has been made in neural network quantization for efficient inference, existing methods are not scalable to heterogeneous devices, as one dedicated model needs to be trained, transmitted, and stored for each specific hardware setting, incurring considerable costs in model training and maintenance. In this paper, we study a new vertical-layered representation of neural network weights for encapsulating all quantized models in a single one. With this representation, we can theoretically achieve any precision network for on-demand service while only needing to train and maintain one model. To this end, we propose a simple once quantization-aware training (QAT) scheme for obtaining high-performance vertical-layered models. Our design incorporates a cascade downsampling mechanism that allows us to obtain multiple quantized networks from one full-precision source model by progressively mapping the higher-precision weights to their adjacent lower-precision counterparts. Then, with networks of different bit-widths from one source model, multi-objective optimization is employed to train the shared source model weights such that they can be updated simultaneously, considering the performance of all networks. By doing this, the shared weights are optimized to balance the performance of different quantized models, making the weights transferable among different bit-widths. Experiments show that the proposed vertical-layered representation and the developed once QAT scheme are effective in embodying multiple quantized networks in a single one and allow one-time training, delivering performance comparable to that of quantized models tailored to any specific bit-width. Code will be available.
    Orthogonal SVD Covariance Conditioning and Latent Disentanglement. (arXiv:2212.05599v1 [cs.CV])
    Inserting an SVD meta-layer into neural networks tends to make the covariance matrix ill-conditioned, which can harm the model's training stability and generalization ability. In this paper, we systematically study how to improve covariance conditioning by enforcing orthogonality on the Pre-SVD layer. We first investigate existing orthogonality treatments of the weights; these techniques improve the conditioning but hurt performance. To avoid such a side effect, we propose the Nearest Orthogonal Gradient (NOG) and Optimal Learning Rate (OLR). The effectiveness of our methods is validated in two applications: decorrelated Batch Normalization (BN) and Global Covariance Pooling (GCP). Extensive experiments on visual recognition demonstrate that our methods can simultaneously improve covariance conditioning and generalization. Combinations with orthogonal weights can further boost performance. Moreover, we show that our orthogonality techniques can benefit generative models, yielding better latent disentanglement, through a series of experiments on various benchmarks. Code is available at: \href{https://github.com/KingJamesSong/OrthoImproveCond}{https://github.com/KingJamesSong/OrthoImproveCond}.
    Offline Reinforcement Learning for Road Traffic Control. (arXiv:2201.02381v3 [cs.AI] UPDATED)
    Traffic signal control is an important problem in urban mobility with significant potential for economic and environmental impact. While there is a growing interest in Reinforcement Learning (RL) for traffic signal control, the work so far has focused on learning through simulations, which could lead to inaccuracies due to simplifying assumptions. Instead, real experience data on traffic is available and could be exploited at minimal cost. Recent progress in offline or batch RL has enabled just that. Model-based offline RL methods, in particular, have been shown to generalize from experience data much better than others. We build a model-based learning framework that infers a Markov Decision Process (MDP) from a dataset collected using a cyclic traffic signal control policy that is both commonplace and easy to gather. The MDP is built with pessimistic costs to manage out-of-distribution scenarios using an adaptive shaping of rewards, which is shown to provide better regularization than prior related work in addition to being PAC-optimal. Our model is evaluated on a complex signalized roundabout, showing that it is possible to build highly performant traffic control policies in a data-efficient manner.
    Representation learning for a generalized, quantitative comparison of complex model outputs. (arXiv:2208.06530v2 [cs.LG] UPDATED)
    Computational models are quantitative representations of systems. By analyzing and comparing the outputs of such models, it is possible to gain a better understanding of the system itself. However, as the complexity of model outputs increases, it becomes increasingly difficult to compare simulations to each other. While it is straightforward to compare only a few specific model outputs across multiple simulations, additional useful information can come from comparing model simulations as a whole. However, it is difficult to holistically compare model simulations in an unbiased manner. To address these limitations, we use representation learning to transform model simulations into low-dimensional points, with the neural networks capturing the relationships between the model outputs without the need to manually specify which outputs to focus on. The distance in the low-dimensional space acts as a comparison metric, reducing the difference between simulations to a single value. We provide an approach to training neural networks on model simulations and show how the trained networks can then be used to provide a holistic comparison of model outputs. This approach can be applied to a wide range of model types, providing a quantitative method of analyzing the complex outputs of computational models.
    Adaptive Low-Precision Training for Embeddings in Click-Through Rate Prediction. (arXiv:2212.05735v1 [cs.LG])
    Embedding tables are usually huge in click-through rate (CTR) prediction models. To train and deploy CTR models efficiently and economically, it is necessary to compress their embedding tables at the training stage. To this end, we formulate a novel quantization training paradigm that compresses the embeddings from the training stage onward, termed low-precision training (LPT), and we provide a theoretical analysis of its convergence. The results show that stochastic weight quantization has a faster convergence rate and a smaller convergence error than deterministic weight quantization in LPT. Further, to reduce accuracy degradation, we propose adaptive low-precision training (ALPT), which learns the step size (i.e., the quantization resolution) through gradient descent. Experiments on two real-world datasets confirm our analysis and show that ALPT can significantly improve prediction accuracy, especially at extremely low bit widths. For the first time in CTR models, we successfully train 8-bit embeddings without sacrificing prediction accuracy. The code of ALPT is publicly available.
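    A small numpy sketch of the stochastic-versus-deterministic rounding contrast that the convergence analysis rests on (the step size is held fixed here for brevity; ALPT's learning of the step size by gradient descent is omitted):

        import numpy as np

        rng = np.random.default_rng(0)
        step = 0.1                          # quantization resolution (fixed here)
        w = 0.123                           # a weight lying between two levels

        lo = np.floor(w / step)
        p_up = w / step - lo                # round up with this probability
        sto = step * (lo + (rng.random(100_000) < p_up))

        print("deterministic q(w):  ", step * np.round(w / step))  # 0.1, biased
        print("mean stochastic q(w):", sto.mean())                 # ~0.123, unbiased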
    Quasi Black-Box Variational Inference with Natural Gradients for Bayesian Learning. (arXiv:2205.11568v3 [stat.ML] UPDATED)
    We develop an optimization algorithm suitable for Bayesian learning in complex models. Our approach relies on natural gradient updates within a general black-box framework for efficient training with limited model-specific derivations. It applies within the class of exponential-family variational posterior distributions; we discuss the Gaussian case in detail, for which the updates take a rather simple form. Our Quasi Black-box Variational Inference (QBVI) framework is readily applicable to a wide class of Bayesian inference problems and is simple to implement, as the updates of the variational posterior involve neither gradients with respect to the model parameters nor the prescription of the Fisher information matrix. We develop QBVI under different hypotheses for the posterior covariance matrix, discuss details of its robust and feasible implementation, and provide a number of real-world applications to demonstrate its effectiveness.
    Enabling All In-Edge Deep Learning: A Literature Review. (arXiv:2204.03326v2 [cs.LG] UPDATED)
    In recent years, deep learning (DL) models have demonstrated remarkable achievements on non-trivial tasks such as speech recognition and natural language understanding. One of the significant contributors to this success is the proliferation of end devices that act as a catalyst by providing data for data-hungry DL models. However, the computation required for DL training and inference remains a major challenge. Usually, central cloud servers are used for the computation, but this opens up other significant challenges, such as high latency, increased communication costs, and privacy concerns. To mitigate these drawbacks, considerable efforts have been made to push the processing of DL models to edge servers. Moreover, the confluence of DL and the edge has given rise to edge intelligence (EI). This survey paper focuses primarily on the fifth level of EI, called the all in-edge level, where DL training and inference (deployment) are performed solely by edge servers. All in-edge is suitable when the end devices have low computing resources, e.g., Internet-of-Things devices, and when other requirements such as latency and communication cost are important, as in mission-critical applications, e.g., health care. Firstly, this paper presents all in-edge computing architectures, including centralized, decentralized, and distributed. Secondly, it presents enabling technologies, such as model parallelism and split learning, which facilitate DL training and deployment at edge servers. Thirdly, model adaptation techniques based on model compression and conditional computation are described, because standard cloud-based DL deployment cannot be directly applied all in-edge due to limited computational resources. Fourthly, this paper discusses eleven key performance metrics for efficiently evaluating the performance of DL at all in-edge. Finally, several open research challenges in the area of all in-edge are presented.
    Transductive Linear Probing: A Novel Framework for Few-Shot Node Classification. (arXiv:2212.05606v1 [cs.LG])
    Few-shot node classification is tasked to provide accurate predictions for nodes from novel classes with only a few representative labeled nodes. This problem has drawn tremendous attention for its relevance to prevailing real-world applications, such as product categorization for newly added commodity categories on an e-commerce platform with scarce records, or diagnosis of rare diseases on a patient similarity graph. To tackle such challenging label scarcity issues in the non-Euclidean graph domain, meta-learning has become a successful and predominant paradigm. More recently, inspired by the development of graph self-supervised learning, transferring pretrained node embeddings for few-shot node classification has emerged as a promising alternative to meta-learning, but it remains largely unexplored. In this work, we empirically demonstrate the potential of an alternative framework, \textit{Transductive Linear Probing}, that transfers pretrained node embeddings learned from graph contrastive learning methods. We further extend the setting of few-shot node classification from the standard fully supervised setting to a more realistic self-supervised setting, where meta-learning methods cannot be easily deployed due to the shortage of supervision from training classes. Surprisingly, even without any ground-truth labels, transductive linear probing with self-supervised graph contrastive pretraining can outperform state-of-the-art fully supervised meta-learning methods under the same protocol. We hope this work can shed new light on few-shot node classification problems and foster future research on learning from scarcely labeled instances on graphs.
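    A bare-bones sketch of the linear probing step (random stand-ins replace the graph-contrastive embeddings; names and sizes are illustrative):

        import numpy as np
        from sklearn.linear_model import LogisticRegression

        rng = np.random.default_rng(0)
        Z = rng.normal(size=(1000, 64))                 # frozen pretrained node embeddings
        y = (Z[:, :2].sum(axis=1) > 0).astype(int)      # pretend node labels
        pos = np.where(y == 1)[0][:5]                   # 5 labeled nodes per class
        neg = np.where(y == 0)[0][:5]
        support = np.concatenate([pos, neg])

        # Linear probing: fit only a linear classifier on top of frozen embeddings.
        clf = LogisticRegression(max_iter=1000).fit(Z[support], y[support])
        query_Z = np.delete(Z, support, axis=0)
        query_y = np.delete(y, support)
        print("query accuracy:", clf.score(query_Z, query_y))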
    Error-aware Quantization through Noise Tempering. (arXiv:2212.05603v1 [cs.LG])
    Quantization has become a predominant approach for model compression, enabling deployment of large models trained on GPUs onto smaller form-factor devices for inference. Quantization-aware training (QAT) optimizes model parameters with respect to the end task while simulating quantization error, leading to better performance than post-training quantization. Approximation of gradients through the non-differentiable quantization operator is typically achieved using the straight-through estimator (STE) or additive noise. However, STE-based methods suffer from instability due to biased gradients, whereas existing noise-based methods cannot reduce the resulting variance. In this work, we incorporate exponentially decaying quantization-error-aware noise together with a learnable scale of the task loss gradient to approximate the effect of a quantization operator. We show this method combines gradient scale and quantization noise in a better-optimized way, providing finer-grained estimation of gradients at each weight and activation layer's quantizer bin size. Our controlled noise also contains an implicit curvature term that could encourage flatter minima, which we show is indeed the case in our experiments. Experiments training ResNet architectures on the CIFAR-10, CIFAR-100 and ImageNet benchmarks show that our method obtains state-of-the-art top-1 classification accuracy for uniform (non-mixed-precision) quantization, outperforming previous methods by 0.5-1.2% absolute.
    Random Feature Models for Learning Interacting Dynamical Systems. (arXiv:2212.05591v1 [cs.LG])
    Particle dynamics and multi-agent systems provide accurate dynamical models for studying and forecasting the behavior of complex interacting systems. They often take the form of a high-dimensional system of differential equations parameterized by an interaction kernel that models the underlying attractive or repulsive forces between agents. We consider the problem of constructing a data-based approximation of the interacting forces directly from noisy observations of the paths of the agents in time. The learned interaction kernels are then used to predict the agents' behavior over a longer time interval. The approximation developed in this work uses a randomized feature algorithm and a sparse randomized feature approach. Sparsity-promoting regression provides a mechanism for pruning the randomly generated features, which was observed to be beneficial when one has limited data, in particular leading to less overfitting than other approaches. In addition, imposing sparsity reduces the kernel evaluation cost, which significantly lowers the simulation cost of forecasting the multi-agent systems. Our method is applied to various examples, including first-order systems with homogeneous and heterogeneous interactions, second-order homogeneous systems, and a new sheep swarming system.
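    A compact sketch of sparsity-promoting random feature regression on a 1D stand-in kernel (ISTA with soft-thresholding plays the role of the sparse regression; the kernel, frequencies, and constants are illustrative assumptions):

        import numpy as np

        rng = np.random.default_rng(0)
        r = np.sort(rng.uniform(0, 5, 300))              # pairwise distances
        y = np.exp(-r) * (1 - r) + 0.05 * rng.normal(size=r.size)  # noisy kernel samples

        omega = rng.normal(scale=2.0, size=200)          # random frequencies
        b = rng.uniform(0, 2 * np.pi, 200)
        Phi = np.cos(np.outer(r, omega) + b)             # random feature matrix

        lam, L = 5.0, np.linalg.norm(Phi, 2) ** 2        # l1 weight, Lipschitz constant
        c = np.zeros(200)
        for _ in range(500):                             # ISTA iterations
            c -= Phi.T @ (Phi @ c - y) / L               # gradient step
            c = np.sign(c) * np.maximum(np.abs(c) - lam / L, 0)  # soft-threshold
        print("active features:", int((c != 0).sum()), "of", c.size)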
    Statistical guarantees for sparse deep learning. (arXiv:2212.05427v1 [cs.LG])
    Neural networks are becoming increasingly popular in applications, but our mathematical understanding of their potential and limitations is still limited. In this paper, we further this understanding by developing statistical guarantees for sparse deep learning. In contrast to previous work, we consider different types of sparsity, such as few active connections, few active nodes, and other norm-based types of sparsity. Moreover, our theories cover important aspects that previous theories have neglected, such as multiple outputs, regularization, and l2-loss. The guarantees have a mild dependence on network widths and depths, which means that they support the application of sparse but wide and deep networks from a statistical perspective. Some of the concepts and tools that we use in our derivations are uncommon in deep learning and, hence, might be of additional interest.
    Toward Robust Graph Semi-Supervised Learning against Extreme Data Scarcity. (arXiv:2208.12422v2 [cs.LG] UPDATED)
    The success of graph neural networks on graph-based web mining highly relies on abundant human-annotated data, which is laborious to obtain in practice. When only a few labeled nodes are available, how to improve their robustness is key to achieving replicable and sustainable graph semi-supervised learning. Though self-training has been shown to be powerful for semi-supervised learning, its application to graph-structured data may fail because (1) larger receptive fields are not leveraged to capture long-range node interactions, which exacerbates the difficulty of propagating feature-label patterns from labeled nodes to unlabeled nodes; and (2) limited labeled data makes it challenging to learn well-separated decision boundaries for different node classes without explicitly capturing the underlying semantic structure. To address the challenges of capturing informative structural and semantic knowledge, we propose a new graph data augmentation framework, AGST (Augmented Graph Self-Training), which is built with two new (i.e., structural and semantic) augmentation modules on top of a decoupled GST backbone. In this work, we investigate whether this novel framework can learn a robust graph predictive model under the low-data context. We conduct comprehensive evaluations on semi-supervised node classification under different scenarios of limited labeled-node data. The experimental results demonstrate the unique contributions of the novel data augmentation framework for node classification with few labeled data.
    DOSnet as a Non-Black-Box PDE Solver: When Deep Learning Meets Operator Splitting. (arXiv:2212.05571v1 [math.NA])
    Deep neural networks (DNNs) have recently emerged as a promising tool for analyzing and solving complex differential equations arising in science and engineering applications. As an alternative to traditional numerical schemes, learning-based solvers utilize the representation power of DNNs to approximate the input-output relations in an automated manner. However, the lack of physics-in-the-loop often makes it difficult to construct a neural network solver that simultaneously achieves high accuracy, low computational burden, and interpretability. In this work, focusing on a class of evolutionary PDEs characterized by decomposable operators, we show that the classical ``operator splitting'' numerical scheme for solving these equations can be exploited to design neural network architectures. This gives rise to a learning-based PDE solver, which we name Deep Operator-Splitting Network (DOSnet). Such a non-black-box network design is constructed from the physical rules and operators governing the underlying dynamics, contains learnable parameters, and is thus more flexible than the standard operator splitting scheme. Once trained, it enables the fast solution of the same type of PDEs. To validate the special structure inside DOSnet, we take linear PDEs as the benchmark and give a mathematical explanation for the weight behavior. Furthermore, to demonstrate the advantages of our new AI-enhanced PDE solver, we train and validate it on several types of operator-decomposable differential equations. We also apply DOSnet to nonlinear Schr\"odinger equations (NLSE), which have important applications in signal processing for modern optical fiber transmission systems; experimental results show that our model has better accuracy and lower computational complexity than numerical schemes and baseline DNNs.
    Human Mobility Modeling During the COVID-19 Pandemic via Deep Graph Diffusion Infomax. (arXiv:2212.05707v1 [cs.LG])
    Non-Pharmaceutical Interventions (NPIs), such as social gathering restrictions, have shown effectiveness in slowing the transmission of COVID-19 by reducing contact between people. To support policy-makers, multiple studies have first modeled human mobility via macro indicators (e.g., average daily travel distance) and then studied the effectiveness of NPIs. In this work, we focus on mobility modeling and, from a micro perspective, aim to predict locations that will be visited by COVID-19 cases. Since NPIs generally cause economic and societal losses, such micro-perspective predictions benefit governments when designing and evaluating them. However, in real-world situations, strict privacy data protection regulations result in severe data sparsity problems (i.e., limited case and location information). To address these challenges, we formulate micro-perspective mobility modeling as computing the relevance score between a diffusion and a location, conditional on a geometric graph. We propose a model named Deep Graph Diffusion Infomax (DGDI), which jointly models variables including a geometric graph, a set of diffusions, and a set of locations. To facilitate research on COVID-19 prediction, we present two benchmarks that contain geometric graphs and location histories of COVID-19 cases. Extensive experiments on the two benchmarks show that DGDI significantly outperforms other competing methods.
    Where to go: Agent Guidance with Deep Reinforcement Learning in A City-Scale Online Ride-Hailing Service. (arXiv:2212.05742v1 [cs.LG])
    Online ride-hailing services have become a prevalent transportation system across the world. In this paper, we study a challenging problem of how to direct vacant taxis around a city such that supplies and demands can be balanced in online ride-hailing services. We design a new reward scheme that considers multiple performance metrics of online ride-hailing services. We also propose a novel deep reinforcement learning method named Deep-Q-Network with Action Mask (AM-DQN) masking off unnecessary actions in various locations such that agents can learn much faster and more efficiently. We conduct extensive experiments using a city-scale dataset from Chicago. Several popular heuristic and learning methods are also implemented as baselines for comparison. The results of the experiments show that the AM-DQN attains the best performances of all methods with respect to average failure rate, average waiting time for customers, and average idle search time for vacant taxis.
    Deep learning-based denoising for fast time-resolved flame emission spectroscopy in high-pressure combustion environment. (arXiv:2208.12544v2 [cs.LG] UPDATED)
    A deep learning strategy is developed for fast and accurate gas property measurements using flame emission spectroscopy (FES). In particular, short-gated fast FES is essential to resolve fast-evolving combustion behaviors. However, as the exposure time for capturing the flame emission spectrum gets shorter, the signal-to-noise ratio (SNR) decreases, and the characteristic spectral features indicating the gas properties become relatively weaker. Property estimation based on the short-gated spectrum is then difficult and inaccurate. Denoising convolutional neural networks (CNNs) can enhance the SNR of the short-gated spectrum. A new CNN architecture including a reversible down- and up-sampling (DU) operator and a loss function based on proper orthogonal decomposition (POD) coefficients is proposed. For training and testing the CNN, flame chemiluminescence spectra were captured from a stable methane-air flat flame using a portable spectrometer (spectral range: 250 - 850 nm, resolution: 0.5 nm) with varied equivalence ratio (0.8 - 1.2), pressure (1 - 10 bar), and exposure time (0.05, 0.2, 0.4, and 2 s). The long-exposure (2 s) spectra were used as the ground truth when training the denoising CNN. A kriging model with POD, trained on the long-gated spectra for calibration, then predicts the gas properties from the denoised short-gated spectrum: the prediction errors for pressure and equivalence ratio were remarkably lowered despite the low SNR associated with reduced exposure.
    Evaluating Model-free Reinforcement Learning toward Safety-critical Tasks. (arXiv:2212.05727v1 [cs.LG])
    Safety comes first in many real-world applications involving autonomous agents. Despite a large number of reinforcement learning (RL) methods focusing on safety-critical tasks, there is still a lack of high-quality evaluation of those algorithms that adheres to safety constraints at each decision step under complex and unknown dynamics. In this paper, we revisit prior work in this scope from the perspective of state-wise safe RL and categorize them as projection-based, recovery-based, and optimization-based approaches, respectively. Furthermore, we propose Unrolling Safety Layer (USL), a joint method that combines safety optimization and safety projection. This novel technique explicitly enforces hard constraints via the deep unrolling architecture and enjoys structural advantages in navigating the trade-off between reward improvement and constraint satisfaction. To facilitate further research in this area, we reproduce related algorithms in a unified pipeline and incorporate them into SafeRL-Kit, a toolkit that provides off-the-shelf interfaces and evaluation utilities for safety-critical tasks. We then perform a comparative study of the involved algorithms on six benchmarks ranging from robotic control to autonomous driving. The empirical results provide an insight into their applicability and robustness in learning zero-cost-return policies without task-dependent handcrafting. The project page is available at https://sites.google.com/view/saferlkit.
    Nonparametric Learning of Two-Layer ReLU Residual Units. (arXiv:2008.07648v3 [cs.LG] UPDATED)
    We describe an algorithm that learns two-layer residual units using rectified linear unit (ReLU) activation: suppose the input $\mathbf{x}$ is from a distribution with support space $\mathbb{R}^d$ and the ground-truth generative model is a residual unit of this type, given by $\mathbf{y} = \boldsymbol{B}^\ast\left[\left(\boldsymbol{A}^\ast\mathbf{x}\right)^+ + \mathbf{x}\right]$, where ground-truth network parameters $\boldsymbol{A}^\ast \in \mathbb{R}^{d\times d}$ represent a full-rank matrix with nonnegative entries and $\boldsymbol{B}^\ast \in \mathbb{R}^{m\times d}$ is full-rank with $m \geq d$ and for $\boldsymbol{c} \in \mathbb{R}^d$, $[\boldsymbol{c}^{+}]_i = \max\{0, c_i\}$. We design layer-wise objectives as functionals whose analytic minimizers express the exact ground-truth network in terms of its parameters and nonlinearities. Following this objective landscape, learning residual units from finite samples can be formulated using convex optimization of a nonparametric function: for each layer, we first formulate the corresponding empirical risk minimization (ERM) as a positive semi-definite quadratic program (QP), then we show the solution space of the QP can be equivalently determined by a set of linear inequalities, which can then be efficiently solved by linear programming (LP). We further prove the strong statistical consistency of our algorithm, and demonstrate its robustness and sample efficiency through experimental results on synthetic data and a set of benchmark regression datasets.
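    The generative model above, written out as a sanity check (dimensions are illustrative):

        import numpy as np

        # y = B* [ (A* x)^+ + x ]: a two-layer residual unit with ReLU.
        rng = np.random.default_rng(0)
        d, m = 5, 8
        A = np.abs(rng.normal(size=(d, d)))   # full rank w.h.p., nonnegative entries
        B = rng.normal(size=(m, d))           # full rank w.h.p., m >= d
        x = rng.normal(size=d)
        y = B @ (np.maximum(A @ x, 0) + x)
        print(y.shape)                        # (8,)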
    Mitigating Adversarial Gray-Box Attacks Against Phishing Detectors. (arXiv:2212.05380v1 [cs.CR])
    Although machine learning based algorithms have been extensively used for detecting phishing websites, there has been relatively little work on how adversaries may attack such "phishing detectors" (PDs for short). In this paper, we propose a set of Gray-Box attacks on PDs that an adversary may use, which vary depending on the knowledge the adversary has about the PD. We show that these attacks severely degrade the effectiveness of several existing PDs. We then propose the concept of operation chains that iteratively map an original set of features to a new set of features and develop the "Protective Operation Chain" (POC for short) algorithm. POC leverages the combination of random feature selection and feature mappings in order to increase the attacker's uncertainty about the target PD. Using 3 existing publicly available datasets plus a fourth that we have created and will release upon the publication of this paper, we show that POC is more robust to these attacks than past competing work, while preserving predictive performance when no adversarial attacks are present. Moreover, POC is robust to attacks on 13 different classifiers, not just one. These results are shown to be statistically significant at the p < 0.001 level.
    Corruption-tolerant Algorithms for Generalized Linear Models. (arXiv:2212.05430v1 [cs.LG])
    This paper presents SVAM (Sequential Variance-Altered MLE), a unified framework for learning generalized linear models under adversarial label corruption in training data. SVAM extends to tasks such as least squares regression, logistic regression, and gamma regression, whereas many existing works on learning with label corruptions focus only on least squares regression. SVAM is based on a novel variance reduction technique that may be of independent interest and works by iteratively solving weighted MLEs over variance-altered versions of the GLM objective. SVAM offers provable model recovery guarantees superior to the state-of-the-art for robust regression even when a constant fraction of training labels are adversarially corrupted. SVAM also empirically outperforms several existing problem-specific techniques for robust regression and classification. Code for SVAM is available at https://github.com/purushottamkar/svam/
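    A sketch of the reweighted-MLE idea for the least squares instance (the exponential weight schedule below is an illustrative stand-in for SVAM's variance-altering scheme; see the paper for the exact updates and guarantees):

        import numpy as np

        rng = np.random.default_rng(0)
        n, d = 500, 5
        X = rng.normal(size=(n, d))
        w_true = rng.normal(size=d)
        y = X @ w_true
        y[:50] = rng.normal(scale=10, size=50)           # 10% corrupted labels

        w, beta = np.linalg.lstsq(X, y, rcond=None)[0], 0.1
        for _ in range(20):
            s = np.exp(-0.5 * beta * (y - X @ w) ** 2)   # down-weight large residuals
            Xs = X * s[:, None]
            w = np.linalg.solve(Xs.T @ X, Xs.T @ y)      # weighted least squares MLE
            beta *= 1.5                                  # sharpen the weighting
        print("parameter error:", np.linalg.norm(w - w_true))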
    Development of Personalized Sleep Induction System based on Mental States. (arXiv:2212.05669v1 [cs.HC])
    Sleep is an essential behavior that prevents the decline of cognitive, motor, and emotional performance as well as various diseases. However, it is not easy to fall asleep when one wants to sleep. There are various sleep-disturbing factors, such as the COVID-19 situation, noise from outside, and light during the night. We aim to develop a personalized sleep induction system based on mental states, using electroencephalography and auditory stimulation. Our system analyzes users' mental states using an electroencephalogram together with the results of the Pittsburgh sleep quality index and the Brunel mood scale. According to the mental state, the system plays a sleep-induction sound from among five auditory stimuli: white noise, repetitive beep sounds, a rainy sound, a binaural beat, and a sham sound. Finally, the sleep-inducing system classified participants' sleep stages with 94.7% accuracy and stopped the auditory stimulation once participants showed non-rapid eye movement sleep. Our system induced sleep in 18 of 20 participants.
    Implementing Deep Learning-Based Approaches for Article Summarization in Indian Languages. (arXiv:2212.05702v1 [cs.CL])
    The research on text summarization for low-resource Indian languages has been limited due to the availability of relevant datasets. This paper presents a summary of various deep-learning approaches used for the ILSUM 2022 Indic language summarization datasets. The ILSUM 2022 dataset consists of news articles written in Indian English, Hindi, and Gujarati, together with their ground-truth summaries. In our work, we explore different pre-trained seq2seq models and fine-tune them on the ILSUM 2022 datasets. In our case, the fine-tuned SoTA PEGASUS model worked best for English, the fine-tuned IndicBART model with augmented data worked best for Hindi, and the fine-tuned PEGASUS model, combined with a translation mapping-based approach, worked best for Gujarati. The obtained inferences were evaluated using ROUGE-1, ROUGE-2, and ROUGE-4 as the evaluation metrics.
    ResFed: Communication Efficient Federated Learning by Transmitting Deep Compressed Residuals. (arXiv:2212.05602v1 [cs.LG])
    Federated learning enables cooperative training among massively distributed clients by sharing their learned local model parameters. However, with increasing model size, deploying federated learning requires a large communication bandwidth, which limits its deployment in wireless networks. To address this bottleneck, we introduce a residual-based federated learning framework (ResFed), where residuals, rather than model parameters, are transmitted over communication networks for training. In particular, we integrate two pairs of shared predictors for model prediction in both server-to-client and client-to-server communication. By employing a common prediction rule, both the locally and globally updated models are always fully recoverable by the clients and the server. We highlight that the residuals only indicate the quasi-update of a model in a single communication round, and hence contain denser information and have lower entropy than model weights and gradients. Based on this property, we further apply lossy compression to the residuals by sparsification and quantization and encode them for efficient communication. The experimental evaluation shows that ResFed needs remarkably lower communication costs and achieves better accuracy by leveraging the less sensitive residuals, compared to standard federated learning. For instance, to train a 4.08 MB CNN model on CIFAR-10 with 10 clients under a non-independent and identically distributed (Non-IID) setting, our approach achieves a compression ratio of over 700X in each communication round with minimal impact on accuracy. To reach an accuracy of 70%, it saves around 99% of the total communication volume, from 587.61 Mb to 6.79 Mb in up-streaming and to 4.61 Mb in down-streaming, on average for all clients.
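    A numpy sketch of the residual-compression step (the shared predictor, top-k sparsification, and 8-bit quantization below are illustrative choices standing in for ResFed's encoder):

        import numpy as np

        rng = np.random.default_rng(0)
        w_updated = rng.normal(size=10_000)                       # locally updated weights
        w_predicted = w_updated + 0.01 * rng.normal(size=10_000)  # shared predictor's guess
        res = w_updated - w_predicted                             # low-entropy residual

        k = 100                                                   # top-k sparsification
        idx = np.argsort(np.abs(res))[-k:]
        scale = np.abs(res[idx]).max() / 127
        q = np.round(res[idx] / scale).astype(np.int8)            # 8-bit quantization

        w_recovered = w_predicted.copy()                          # receiver-side decoding
        w_recovered[idx] += q * scale
        err = np.linalg.norm(w_recovered - w_updated) / np.linalg.norm(w_updated)
        print(f"transmitted {k} int8 values, relative error {err:.4f}")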
    Improving Expert Predictions with Prediction Sets. (arXiv:2201.12006v4 [cs.LG] UPDATED)
    Automated decision support systems promise to help human experts solve tasks more efficiently and accurately. However, existing systems typically require experts to understand when to cede agency to the system or when to exercise their own agency. Moreover, if the experts develop a misplaced trust in the system, their performance may worsen. In this work, we lift the above requirement and develop automated decision support systems that, by design, do not require experts to understand when each of their recommendations is accurate to improve their performance. To this end, we focus on multiclass classification tasks and consider an automated decision support system that, for each data sample, uses a classifier to recommend a subset of labels to a human expert. We first show that, by looking at the design of such a system from the perspective of conformal prediction, we can ensure that the probability that the recommended subset of labels contains the true label matches almost exactly a target probability value with high probability. Then, we develop an efficient and near-optimal search method to find the target probability value under which the expert benefits the most from using our system. Experiments on synthetic and real data demonstrate that our system can help the experts make more accurate predictions and is robust to the accuracy of the classifier it relies on.
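    A minimal split-conformal sketch of how such a recommended label subset can be calibrated (stand-in softmax outputs; the search over the target probability value, which is the paper's contribution, is omitted):

        import numpy as np

        rng = np.random.default_rng(0)
        n_cal, K, alpha = 1000, 5, 0.1                     # target coverage 1 - alpha
        probs = rng.dirichlet(np.ones(K), size=n_cal)      # classifier outputs (stand-in)
        labels = np.array([rng.choice(K, p=p) for p in probs])

        scores = 1 - probs[np.arange(n_cal), labels]       # nonconformity of true label
        level = np.ceil((n_cal + 1) * (1 - alpha)) / n_cal
        qhat = np.quantile(scores, level)                  # conformal quantile

        p_test = rng.dirichlet(np.ones(K))                 # a new sample's softmax
        pred_set = np.where(1 - p_test <= qhat)[0]         # labels shown to the expert
        print("recommended label subset:", pred_set)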
    Online Real-time Learning of Dynamical Systems from Noisy Streaming Data. (arXiv:2212.05259v1 [math.DS])
    Recent advancements in sensing and communication facilitate obtaining high-frequency real-time data from various physical systems like power networks, climate systems, biological networks, etc. However, since the data are recorded by physical sensors, it is natural that the obtained data is corrupted by measurement noise. In this paper, we present a novel algorithm for online real-time learning of dynamical systems from noisy time-series data, which employs the Robust Koopman operator framework to mitigate the effect of measurement noise. The proposed algorithm has three main advantages: a) it allows for online real-time monitoring of a dynamical system; b) it obtains a linear representation of the underlying dynamical system, thus enabling the user to use linear systems theory for analysis and control of the system; c) it is computationally fast and less intensive than the popular Extended Dynamic Mode Decomposition (EDMD) algorithm. We illustrate the efficiency of the proposed algorithm by applying it to identify the Van der Pol oscillator, the IEEE 68 bus system, and a ring network of Van der Pol oscillators.
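    An illustrative recursive least squares sketch of online linear model identification from a noisy stream (plain RLS, not the paper's Robust Koopman formulation, which additionally mitigates measurement noise):

        import numpy as np

        rng = np.random.default_rng(0)
        A_true = np.array([[0.99, 0.05], [-0.05, 0.99]])   # unknown dynamics
        K = np.zeros((2, 2))                               # running model estimate
        P = 1e3 * np.eye(2)                                # RLS covariance

        x = rng.normal(size=2)
        for _ in range(500):                               # streaming updates
            x_next = A_true @ x + 0.01 * rng.normal(size=2)
            Px = P @ x
            gain = Px / (1 + x @ Px)
            K += np.outer(x_next - K @ x, gain)            # correct the model
            P -= np.outer(gain, Px)                        # update covariance
            x = x_next
        print("model error:", np.linalg.norm(K - A_true))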
    How to Backdoor Diffusion Models?. (arXiv:2212.05400v1 [cs.CV])
    Diffusion models are state-of-the-art deep learning empowered generative models that are trained based on the principle of learning forward and reverse diffusion processes via progressive noise-addition and denoising. To gain a better understanding of the limitations and potential risks, this paper presents the first study on the robustness of diffusion models against backdoor attacks. Specifically, we propose BadDiffusion, a novel attack framework that engineers compromised diffusion processes during model training for backdoor implantation. At the inference stage, the backdoored diffusion model will behave just like an untampered generator for regular data inputs, while falsely generating some targeted outcome designed by the bad actor upon receiving the implanted trigger signal. Such a critical risk can be dreadful for downstream tasks and applications built upon the problematic model. Our extensive experiments on various backdoor attack settings show that BadDiffusion can consistently lead to compromised diffusion models with high utility and target specificity. Even worse, BadDiffusion can be made cost-effective by simply finetuning a clean pre-trained diffusion model to implant backdoors. We also explore some possible countermeasures for risk mitigation. Our results call attention to potential risks and possible misuse of diffusion models.
    Scoring rules in survival analysis. (arXiv:2212.05260v1 [math.ST])
    Scoring rules promote rational and good decision making and predictions by models; this is increasingly important for automated `auto-ML' procedures. The Brier score and Log loss are well-established scoring rules for classification and regression and possess the `strict properness' property that encourages optimal predictions. In this paper we survey proposed scoring rules for survival analysis, establish the first clear definition of `(strict) properness' for survival scoring rules, and determine which losses are proper and improper. We prove that commonly utilised scoring rules that are claimed to be proper are in fact improper. We further prove that under a strict set of assumptions a class of scoring rules is strictly proper for what we term `approximate' survival losses. We hope these findings encourage further research into robust validation of survival models and promote honest evaluation.
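    The two classical strictly proper scoring rules mentioned above, written as losses for binary outcomes (lower is better; the toy comparison shows the overconfident forecast penalized):

        import numpy as np

        def brier(p, y):                       # mean squared error of probabilities
            return np.mean((p - y) ** 2)

        def log_loss(p, y, eps=1e-12):         # negative log-likelihood
            p = np.clip(p, eps, 1 - eps)
            return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

        y = np.array([1, 0, 1, 1, 0])
        p_honest = np.array([0.8, 0.2, 0.7, 0.9, 0.1])
        p_overconfident = np.array([1.0, 0.0, 1.0, 1.0, 1.0])
        print(brier(p_honest, y), brier(p_overconfident, y))        # 0.038 < 0.2
        print(log_loss(p_honest, y), log_loss(p_overconfident, y))  # honest wins again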
    Deep Multi-Modal Structural Equations For Causal Effect Estimation With Unstructured Proxies. (arXiv:2203.09672v4 [cs.LG] UPDATED)
    Estimating the effect of an intervention from observational data while accounting for confounding variables is a key task in causal inference. Oftentimes, the confounders are unobserved, but we have access to large amounts of additional unstructured data (images, text) that contain valuable proxy signal about the missing confounders. This paper argues that leveraging this unstructured data can greatly improve the accuracy of causal effect estimation. Specifically, we introduce deep multi-modal structural equations, a generative model for causal effect estimation in which confounders are latent variables and unstructured data are proxy variables. This model supports multiple multi-modal proxies (images, text) as well as missing data. We empirically demonstrate that our approach outperforms existing methods based on propensity scores and corrects for confounding using unstructured inputs on tasks in genomics and healthcare. Our methods can potentially support the use of large amounts of data that were previously not used in causal inference.
    Elixir: Train a Large Language Model on a Small GPU Cluster. (arXiv:2212.05339v1 [cs.DC])
    In recent years, the number of parameters in a single deep learning (DL) model has been growing much faster than GPU memory. People without access to a large number of GPUs resort to heterogeneous training systems that store model parameters in CPU memory. Existing heterogeneous systems are based on parallelization plans scoped to the whole model: they apply a single parallel training method to all the operators in the computation. Therefore, engineers need to expend huge effort to incorporate a new type of model parallelism and patch its compatibility with other parallelisms. For example, Mixture-of-Experts (MoE) is still incompatible with ZeRO-3 in DeepSpeed. Also, current systems face efficiency problems at small scale, since they are designed and tuned for large-scale training. In this paper, we propose Elixir, a new parallel heterogeneous training system designed for efficiency and flexibility. Elixir utilizes the memory and computing resources of both GPU and CPU. For flexibility, Elixir generates parallelization plans at the granularity of operators; any new type of model parallelism can be incorporated by assigning a parallel pattern to the operator. For efficiency, Elixir implements a hierarchical distributed memory management scheme to accelerate inter-GPU communication and CPU-GPU data transmission. As a result, Elixir can train a 30B OPT model on an A100 with 40GB CUDA memory while reaching 84% of the efficiency of PyTorch GPU training, and with its super-linear scalability, the training efficiency matches PyTorch GPU training on multiple GPUs. Also, large MoE models can be trained 5.3x faster than dense models of the same size. Elixir is now integrated into ColossalAI and is available on its main branch.
    Machine intuition: Uncovering human-like intuitive decision-making in GPT-3.5. (arXiv:2212.05206v1 [cs.CL])
    Artificial intelligence (AI) technologies are revolutionizing vast fields of society. Humans using these systems are likely to expect them to work in a potentially hyperrational manner. However, in this study, we show that some AI systems, namely large language models (LLMs), exhibit behavior that strikingly resembles human-like intuition - along with the many cognitive errors that come with it. We use a state-of-the-art LLM, namely the latest iteration of OpenAI's Generative Pre-trained Transformer (GPT-3.5), and probe it with the Cognitive Reflection Test (CRT) as well as semantic illusions that were originally designed to investigate intuitive decision-making in humans. Our results show that GPT-3.5 systematically exhibits "machine intuition," meaning that it produces incorrect responses that are strikingly similar to how humans respond to the CRT and to semantic illusions. We investigate several approaches to test how robust GPT-3.5's inclination toward intuitive-like decision-making is. Our study demonstrates that investigating LLMs with methods from cognitive science has the potential to reveal emergent traits and adjust expectations regarding their machine behavior.
    Multimodal and Explainable Internet Meme Classification. (arXiv:2212.05612v1 [cs.AI])
    Warning: this paper contains content that may be offensive or upsetting. In the current context where online platforms have been effectively weaponized in a variety of geo-political events and social issues, Internet memes make fair content moderation at scale even more difficult. Existing work on meme classification and tracking has focused on black-box methods that do not explicitly consider the semantics of the memes or the context of their creation. In this paper, we pursue a modular and explainable architecture for Internet meme understanding. We design and implement multimodal classification methods that perform example- and prototype-based reasoning over training cases, while leveraging both textual and visual SOTA models to represent the individual cases. We study the relevance of our modular and explainable models in detecting harmful memes on two existing tasks: Hate Speech Detection and Misogyny Classification. We compare the performance between example- and prototype-based methods, and between text, vision, and multimodal models, across different categories of harmfulness (e.g., stereotype and objectification). We devise a user-friendly interface that facilitates the comparative analysis of examples retrieved by all of our models for any given meme, informing the community about the strengths and limitations of these explainable methods.
    On an Interpretation of ResNets via Solution Constructions. (arXiv:2212.05663v1 [cs.LG])
    This paper first constructs a typical ResNet solution for multi-category classification based on the principle of gate-network control and deep-layer classification, from which a general interpretation of the ResNet architecture is given and its performance mechanism is explained. We then use more solutions to further demonstrate the generality of that interpretation. The universal-approximation capability of ResNets is proved.
    Phases, Modalities, Temporal and Spatial Locality: Domain Specific ML Prefetcher for Accelerating Graph Analytics. (arXiv:2212.05250v1 [cs.LG])
    Graph processing applications are severely bottlenecked by memory system performance due to low data reuse and irregular memory accesses. While state-of-the-art prefetchers using Machine Learning (ML) have made great progress, they do not perform well on graph analytics applications due to phase transitions in the execution and irregular data accesses that are hard to predict. We propose MPGraph: a novel ML-based Prefetcher for Graph analytics. MPGraph makes three novel optimizations based on domain knowledge of graph analytics. It detects transitions between graph processing phases during execution using a novel soft detection technique, predicts memory accesses and pages using phase-specific multi-modality predictors, and prefetches using a novel chain spatio-temporal prefetching strategy. We evaluate our approach using three widely-used graph processing frameworks and a variety of graph datasets. Our approach achieves 34.17%-82.15% higher precision in phase transition detection than the KSWIN and decision tree baselines. Our predictors achieve 6.80%-16.02% higher F1-score for access prediction and 11.68%-15.41% higher accuracy-at-10 for page prediction compared with the LSTM-based and vanilla attention-based baselines. Simulations show that MPGraph achieves, on average, 87.16% prefetch accuracy and 73.29% prefetch coverage, leading to 12.52%-21.23% IPC improvement. It outperforms the widely-used non-ML prefetcher BO by 7.58%-12.03%, and outperforms the state-of-the-art ML-based prefetchers Voyager by 3.27%-4.42% and TransFetch by 3.73%-4.58% with respect to IPC improvement.
    Stochastic First-Order Learning for Large-Scale Flexibly Tied Gaussian Mixture Model. (arXiv:2212.05402v1 [cs.LG])
    Gaussian Mixture Models (GMMs) are among the most potent parametric density estimators based on the kernel model and find application in many scientific domains. In recent years, with the dramatic growth of data sources, typical machine learning algorithms, e.g. Expectation Maximization (EM), encounter difficulty with high-dimensional and streaming data. Moreover, complicated densities often demand a large number of Gaussian components. This paper proposes a fast online parameter estimation algorithm for GMMs using first-order stochastic optimization. The approach provides a framework to cope with the challenges GMMs face with high-dimensional streaming data and complex densities by leveraging a flexibly-tied factorization of the covariance matrix. A new stochastic manifold optimization algorithm that preserves orthogonality is introduced and used alongside well-known Euclidean-space numerical optimization. Numerous empirical results on both synthetic and real datasets justify the effectiveness of our proposed stochastic method over EM-based methods, in the sense of a better-converged likelihood maximum, fewer epochs needed for convergence, and less time consumed per epoch.
    Graph-Regularized Manifold-Aware Conditional Wasserstein GAN for Brain Functional Connectivity Generation. (arXiv:2212.05316v1 [cs.LG])
    Common measures of brain functional connectivity (FC), including covariance and correlation matrices, are symmetric positive definite (SPD) matrices residing on a cone-shaped Riemannian manifold. Despite their remarkable success for Euclidean-valued data generation, standard generative adversarial networks (GANs) used to generate manifold-valued FC data neglect its inherent SPD structure and hence the inter-relatedness of edges in real FC. We propose a novel graph-regularized manifold-aware conditional Wasserstein GAN (GR-SPD-GAN) for FC data generation on the SPD manifold that can preserve the global FC structure. Specifically, we optimize a generalized Wasserstein distance between the real and generated SPD data under adversarial training, conditioned on the class labels. The resulting generator can synthesize new SPD-valued FC matrices associated with different classes of brain networks, e.g., brain disorder or healthy control. Furthermore, we introduce additional population graph-based regularization terms on both the SPD manifold and its tangent space to encourage the generator to respect the inter-subject similarity of FC patterns in the real data. This also helps in avoiding mode collapse and produces more stable GAN training. Evaluated on resting-state functional magnetic resonance imaging (fMRI) data of major depressive disorder (MDD), qualitative and quantitative results show that the proposed GR-SPD-GAN clearly outperforms several state-of-the-art GANs in generating more realistic fMRI-based FC samples. When applied to FC data augmentation for MDD identification, classification models trained on augmented data generated by our approach achieved the largest margin of improvement in classification accuracy among the competing GANs over baselines without data augmentation.
    Online Convex Optimization of Programmable Quantum Computers to Simulate Time-Varying Quantum Channels. (arXiv:2212.05145v1 [quant-ph])
    Simulating quantum channels is a fundamental primitive in quantum computing, since quantum channels define general (trace-preserving) quantum operations. An arbitrary quantum channel cannot be exactly simulated using a finite-dimensional programmable quantum processor, making it important to develop optimal approximate simulation techniques. In this paper, we study the challenging setting in which the channel to be simulated varies adversarially with time. We propose the use of matrix exponentiated gradient descent (MEGD), an online convex optimization method, and analytically show that it achieves a sublinear regret in time. Through experiments, we validate the main results for time-varying dephasing channels using a programmable generalized teleportation processor.
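    For readers unfamiliar with MEGD, the toy sketch below runs the multiplicative update rho <- exp(log rho - eta * grad) / Tr(.) on 2x2 density matrices with a made-up quadratic loss; the paper's actual objective (simulation error against an adversarially time-varying channel) and its regret analysis are not reproduced here.

```python
import numpy as np
from scipy.linalg import expm, logm

def megd_step(rho, grad, eta):
    # Multiplicative update that keeps rho positive semi-definite, unit trace.
    M = expm(logm(rho) - eta * grad)
    return M / np.trace(M).real

target = np.array([[0.8, 0.1], [0.1, 0.2]], dtype=complex)  # toy target state
rho = np.eye(2, dtype=complex) / 2                          # maximally mixed start
for t in range(1, 101):
    grad = rho - target                 # gradient of 0.5 * ||rho - target||_F^2
    rho = megd_step(rho, grad, eta=1.0 / np.sqrt(t))        # decaying step size
print(np.round(rho.real, 3))            # approaches the target state
```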
    A Hybrid Brain-Computer Interface Using Motor Imagery and SSVEP Based on Convolutional Neural Network. (arXiv:2212.05289v1 [cs.LG])
    The key to electroencephalography (EEG)-based brain-computer interfaces (BCIs) lies in neural decoding, and its accuracy can be improved by using hybrid BCI paradigms, that is, by fusing multiple paradigms. However, hybrid BCIs usually require separate processing pipelines for the EEG signals of each paradigm, which greatly reduces the efficiency of EEG feature extraction and the generalizability of the model. Here, we propose a two-stream convolutional neural network (TSCNN)-based hybrid brain-computer interface. It combines the steady-state visual evoked potential (SSVEP) and motor imagery (MI) paradigms. TSCNN automatically learns to extract EEG features for the two paradigms during training, and improves decoding accuracy on the test data by 25.4% compared with the MI mode and by 2.6% compared with the SSVEP mode. Moreover, the versatility of TSCNN is verified as it provides considerable performance in both single-mode (70.2% for MI, 93.0% for SSVEP) and hybrid-mode scenarios (95.6% for MI-SSVEP hybrid). Our work will facilitate the real-world application of EEG-based BCI systems.
    SMILE: Scaling Mixture-of-Experts with Efficient Bi-level Routing. (arXiv:2212.05191v1 [cs.LG])
    Mixture-of-Experts (MoE) parallelism is a recent advancement that scales up the model size at constant computational cost. MoE selects different sets of parameters (i.e., experts) for each incoming token, resulting in a sparsely-activated model. Despite several successful applications of MoE, its training efficiency degrades significantly as the number of experts increases. The routing stage in MoE relies on the efficiency of the All2All communication collective, which suffers from network congestion and poor scalability. To mitigate these issues, we introduce SMILE, which exploits heterogeneous network bandwidth and splits single-step routing into bi-level routing. Our experimental results show that the proposed method obtains a 2.5x speedup over Switch Transformer in terms of pretraining throughput on the Colossal Clean Crawled Corpus without losing any convergence speed.
    Increasing the Cost of Model Extraction with Calibrated Proof of Work. (arXiv:2201.09243v3 [cs.CR] UPDATED)
    In model extraction attacks, adversaries can steal a machine learning model exposed via a public API by repeatedly querying it and adjusting their own model based on obtained predictions. To prevent model stealing, existing defenses focus on detecting malicious queries, truncating, or distorting outputs, thus necessarily introducing a tradeoff between robustness and model utility for legitimate users. Instead, we propose to impede model extraction by requiring users to complete a proof-of-work before they can read the model's predictions. This deters attackers by greatly increasing (even up to 100x) the computational effort needed to leverage query access for model extraction. Since we calibrate the effort required to complete the proof-of-work to each query, this only introduces a slight overhead for regular users (up to 2x). To achieve this, our calibration applies tools from differential privacy to measure the information revealed by a query. Our method requires no modification of the victim model and can be applied by machine learning practitioners to guard their publicly exposed models against being easily stolen.
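    A minimal hashcash-style sketch of the mechanism follows, with a hypothetical difficulty_for mapping standing in for the paper's differential-privacy-based calibration of effort to the information a query reveals:

```python
import hashlib
import itertools
import os

def solve_pow(challenge: bytes, difficulty_bits: int) -> int:
    # Find a nonce whose SHA-256 with the challenge falls below a target.
    target = 1 << (256 - difficulty_bits)
    for nonce in itertools.count():
        digest = hashlib.sha256(challenge + nonce.to_bytes(8, 'big')).digest()
        if int.from_bytes(digest, 'big') < target:
            return nonce

def verify_pow(challenge: bytes, nonce: int, difficulty_bits: int) -> bool:
    digest = hashlib.sha256(challenge + nonce.to_bytes(8, 'big')).digest()
    return int.from_bytes(digest, 'big') < (1 << (256 - difficulty_bits))

def difficulty_for(information_score: float, base_bits=8, max_extra=10) -> int:
    # Hypothetical stand-in for the paper's DP-based calibration: queries
    # that reveal more information (score in [0, 1]) must pay more work.
    return base_bits + int(information_score * max_extra)

challenge = os.urandom(16)              # server-issued, per-query challenge
bits = difficulty_for(0.9)              # a highly informative query pays more
nonce = solve_pow(challenge, bits)
assert verify_pow(challenge, nonce, bits)
```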
    Partial-Monotone Adaptive Submodular Maximization. (arXiv:2207.12840v2 [cs.LG] UPDATED)
    Many sequential decision making problems, including pool-based active learning and adaptive viral marketing, can be formulated as an adaptive submodular maximization problem. Most existing studies on adaptive submodular optimization focus on either the monotone case or the non-monotone case. Specifically, if the utility function is monotone and adaptive submodular, \cite{golovin2011adaptive} developed a greedy policy that achieves a $(1-1/e)$ approximation ratio subject to a cardinality constraint. If the utility function is non-monotone and adaptive submodular, \cite{tang2021beyond} showed that a random greedy policy achieves a $1/e$ approximation ratio subject to a cardinality constraint. In this work, we aim to generalize the above-mentioned results by studying the partial-monotone adaptive submodular maximization problem. To this end, we introduce the notion of the adaptive monotonicity ratio $m\in[0,1]$ to measure the degree of monotonicity of a function. Our main result is to show that a random greedy policy achieves an approximation ratio of $m(1-1/e)+(1-m)(1/e)$ if the utility function is $m$-adaptive monotone and adaptive submodular. Notably, this result recovers the aforementioned $(1-1/e)$ and $1/e$ approximation ratios when $m = 1$ and $m = 0$, respectively. We further extend our results to consider a knapsack constraint. We show that a sampling-based policy achieves an approximation ratio of $(m+1)/10$ if the utility function is $m$-adaptive monotone and adaptive submodular. One important implication of our results is that even for a non-monotone utility function, we can still achieve an approximation ratio close to $(1-1/e)$ if this function is ``close'' to a monotone function. This leads to improved performance bounds for many machine learning applications whose utility functions are almost adaptive monotone.
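    To make the policy concrete, here is a non-adaptive sketch in the spirit of the random greedy rule the result builds on: at each of k steps, pick uniformly among the k best marginal-gain elements, padded with a 'skip' option. The adaptive version would additionally condition on observed realizations, which this toy omits.

```python
import random

def random_greedy(V, f, k):
    # Random greedy under a cardinality constraint: at each step, choose
    # uniformly among the k largest-marginal-gain elements, padded with a
    # 'skip' (None) option so low-value steps can add nothing.
    S = set()
    for _ in range(k):
        ranked = sorted(V - S, key=lambda e: f(S | {e}) - f(S), reverse=True)
        top = ranked[:k] + [None] * max(0, k - len(ranked))
        e = random.choice(top)
        if e is not None and f(S | {e}) - f(S) > 0:
            S.add(e)
    return S

# Toy submodular coverage function over small sets of integers.
sets = {1: {1, 2}, 2: {2, 3}, 3: {4}, 4: {1, 4, 5}}
f = lambda S: len(set().union(*(sets[i] for i in S))) if S else 0
random.seed(0)
print(random_greedy(set(sets), f, k=2))
```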
    A model-data asymptotic-preserving neural network method based on micro-macro decomposition for gray radiative transfer equations. (arXiv:2212.05523v1 [math.NA])
    We propose a model-data asymptotic-preserving neural network (MD-APNN) method to solve the nonlinear gray radiative transfer equations (GRTEs). The system is challenging to simulate with both traditional numerical schemes and vanilla physics-informed neural networks (PINNs) due to its multiscale characteristics. Under the framework of PINNs, we employ a micro-macro decomposition technique to construct a new asymptotic-preserving (AP) loss function, which includes the residual of the governing equations in micro-macro coupled form, the initial and boundary conditions with additional diffusion-limit information, the conservation laws, and a few labeled data. A convergence analysis is performed for the proposed method, and a number of numerical examples are presented to illustrate the efficiency of MD-APNNs and, in particular, the importance of the AP property in the neural networks for diffusion-dominated problems. The numerical results indicate that MD-APNNs outperform APNNs and purely data-driven networks in the simulation of the nonlinear non-stationary GRTEs.
    Client Selection for Federated Bayesian Learning. (arXiv:2212.05492v1 [cs.LG])
    Distributed Stein Variational Gradient Descent (DSVGD) is a non-parametric distributed learning framework for federated Bayesian learning, where multiple clients jointly train a machine learning model by communicating a number of non-random and interacting particles with the server. Since communication resources are limited, selecting the clients with the most informative local learning updates can improve model convergence and communication efficiency. In this paper, we propose two selection schemes for DSVGD based on Kernelized Stein Discrepancy (KSD) and Hilbert Inner Product (HIP). We derive the upper bound on the decrease of the global free energy per iteration for both schemes, which is then minimized to speed up the model convergence. We evaluate and compare our schemes with conventional schemes in terms of model accuracy, convergence speed, and stability using various learning tasks and datasets.
    What Makes A Good Fisherman? Linear Regression under Self-Selection Bias. (arXiv:2205.03246v2 [math.ST] UPDATED)
    In the classical setting of self-selection, the goal is to learn $k$ models, simultaneously from observations $(x^{(i)}, y^{(i)})$ where $y^{(i)}$ is the output of one of $k$ underlying models on input $x^{(i)}$. In contrast to mixture models, where we observe the output of a randomly selected model, here the observed model depends on the outputs themselves, and is determined by some known selection criterion. For example, we might observe the highest output, the smallest output, or the median output of the $k$ models. In known-index self-selection, the identity of the observed model output is observable; in unknown-index self-selection, it is not. Self-selection has a long history in Econometrics and applications in various theoretical and applied fields, including treatment effect estimation, imitation learning, learning from strategically reported data, and learning from markets at disequilibrium. In this work, we present the first computationally and statistically efficient estimation algorithms for the most standard setting of this problem where the models are linear. In the known-index case, we require poly$(1/\varepsilon, k, d)$ sample and time complexity to estimate all model parameters to accuracy $\varepsilon$ in $d$ dimensions, and can accommodate quite general selection criteria. In the more challenging unknown-index case, even the identifiability of the linear models (from infinitely many samples) was not known. We show three results in this case for the commonly studied $\max$ self-selection criterion: (1) we show that the linear models are indeed identifiable, (2) for general $k$ we provide an algorithm with poly$(d) \exp(\text{poly}(k))$ sample and time complexity to estimate the regression parameters up to error $1/\text{poly}(k)$, and (3) for $k = 2$ we provide an algorithm for any error $\varepsilon$ and poly$(d, 1/\varepsilon)$ sample and time complexity.
    Acela: Predictable Datacenter-level Maintenance Job Scheduling. (arXiv:2212.05155v1 [cs.DC])
    Datacenter operators ensure fair and regular server maintenance by using automated processes to schedule maintenance jobs to complete within a strict time budget. Automating this scheduling problem is challenging because maintenance job duration varies based on both job type and hardware. While it is tempting to use prior machine learning techniques for predicting job duration, we find that the structure of the maintenance job scheduling problem creates a unique challenge. In particular, we show that prior machine learning methods that produce the lowest-error predictions do not produce the best scheduling outcomes due to asymmetric costs. Specifically, underpredicting maintenance job duration results in more servers being taken offline and longer server downtime than overpredicting it. The system cost of underprediction is much larger than that of overprediction. We present Acela, a machine learning system for predicting maintenance job duration, which uses quantile regression to bias duration predictions toward overprediction. We integrate Acela into a maintenance job scheduler and evaluate it on datasets from large-scale, production datacenters. Compared to machine learning based predictors from prior work, Acela reduces the number of servers that are taken offline by 1.87-4.28X, and reduces the server offline time by 1.40-2.80X.
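    The asymmetry Acela exploits is captured by the pinball (quantile) loss; a small sketch with a hypothetical high quantile tau = 0.9, under which underprediction is penalized nine times more heavily than overprediction:

```python
import numpy as np

def pinball_loss(y_true, y_pred, tau=0.9):
    # Quantile (pinball) loss: underprediction costs tau, overprediction 1-tau.
    diff = y_true - y_pred
    return np.mean(np.maximum(tau * diff, (tau - 1) * diff))

y_true = np.array([10.0, 10.0])
print(pinball_loss(y_true, np.array([8.0, 8.0])))    # underpredict by 2 -> 1.8
print(pinball_loss(y_true, np.array([12.0, 12.0])))  # overpredict by 2 -> 0.2
```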
    Explainability in Process Outcome Prediction: Guidelines to Obtain Interpretable and Faithful Models. (arXiv:2203.16073v4 [cs.LG] UPDATED)
    Although a recent shift has been made in the field of predictive process monitoring to use models from the explainable artificial intelligence field, the evaluation still occurs mainly through performance-based metrics, thus not accounting for the actionability and implications of the explanations. In this paper, we define explainability through the interpretability of the explanations and the faithfulness of the explainability model in the field of process outcome prediction. The introduced properties are analysed along the event, case, and control-flow perspectives that are typical for a process-based analysis. This allows comparing inherently created explanations with post-hoc explanations. We benchmark seven classifiers on thirteen real-life event logs, covering a range of transparent and non-transparent machine learning and deep learning models, further complemented with explainability techniques. Next, this paper contributes a set of guidelines named X-MOP which allows selecting the appropriate model based on the event log specifications, by providing insight into how the varying preprocessing, model complexity and explainability techniques typical in process outcome prediction influence the explainability of the model.
    End-to-End Speech Translation of Arabic to English Broadcast News. (arXiv:2212.05479v1 [cs.CL])
    Speech translation (ST) is the task of directly translating acoustic speech signals in a source language into text in a foreign language. The ST task has long been addressed using a pipeline approach with two modules: an Automatic Speech Recognition (ASR) system in the source language followed by text-to-text Machine Translation (MT). In the past few years, we have seen a paradigm shift towards end-to-end approaches using sequence-to-sequence deep neural network models. This paper presents our efforts towards the development of the first Broadcast News end-to-end Arabic-to-English speech translation system. Starting from independent ASR and MT LDC releases, we were able to identify about 92 hours of Arabic audio recordings for which the manual transcription was also translated into English at the segment level. These data were used to train and compare pipeline and end-to-end speech translation systems under multiple scenarios, including transfer learning and data augmentation techniques.
    Neural Controller Synthesis for Signal Temporal Logic Specifications Using Encoder-Decoder Structured Networks. (arXiv:2212.05200v1 [eess.SY])
    In this paper, we propose a control synthesis method for signal temporal logic (STL) specifications with neural networks (NNs). Most of the previous works consider training a controller for only a given STL specification. These approaches, however, require retraining the NN controller if a new specification arises and needs to be satisfied, which results in large consumption of memory and inefficient training. To tackle this problem, we propose to construct NN controllers by introducing encoder-decoder structured NNs with an attention mechanism. The encoder takes an STL formula as input and encodes it into an appropriate vector, and the decoder outputs control signals that will meet the given specification. As the encoder, we consider three NN structures: sequential, tree-structured, and graph-structured NNs. All the model parameters are trained in an end-to-end manner to maximize the expected robustness that is known to be a quantitative semantics of STL formulae. We compare the control performances attained by the above NN structures through a numerical experiment of the path planning problem, showing the efficacy of the proposed approach.
    Effects of Spectral Normalization in Multi-agent Reinforcement Learning. (arXiv:2212.05331v1 [cs.LG])
    A reliable critic is central to on-policy actor-critic learning. But it becomes challenging to learn a reliable critic in a multi-agent sparse-reward scenario due to two factors: 1) the joint action space grows exponentially with the number of agents, and 2) this, combined with the reward sparseness and environment noise, leads to large sample requirements for accurate learning. We show that regularising the critic with spectral normalization (SN) enables it to learn more robustly, even in multi-agent on-policy sparse-reward scenarios. Our experiments show that the regularised critic is quickly able to learn from the sparse rewarding experience in the complex SMAC and RWARE domains. These findings highlight the importance of regularisation in the critic for stable learning.
    Harmonizing Output Imbalance for semantic segmentation on extremely-imbalanced input data. (arXiv:2211.05295v2 [cs.CV] UPDATED)
    Semantic segmentation is a high-level computer vision task that assigns a label to each pixel of an image. It is challenging to deal with extremely-imbalanced data in which the ratio of target pixels to background pixels is lower than 1:1000. Such severe input imbalance leads to output imbalance and poor model training. This paper addresses three issues for extremely-imbalanced data: inspired by region-based losses, an implicit measure of the output imbalance is proposed, and an adaptive algorithm is designed to guide the selection of the output-imbalance hyperparameter; this is then generalized to distribution-based losses for dealing with output imbalance; and finally, a compound loss with our adaptive hyperparameter selection algorithm keeps training and inference consistent, harmonizing the output imbalance. With four popular deep architectures on our private dataset with three input-imbalance scales and on three public datasets, extensive experiments demonstrate the competitive/promising performance of the proposed method.
    Learning on non-stationary data with re-weighting. (arXiv:2212.05908v1 [cs.LG])
    Many real-world learning scenarios face the challenge of slow concept drift, where data distributions change gradually over time. In this setting, we pose the problem of learning temporally sensitive importance weights for training data in order to optimize predictive accuracy. We propose a class of temporal reweighting functions that can capture multiple timescales of change in the data, as well as instance-specific characteristics. We formulate a bi-level optimization criterion, and an associated meta-learning algorithm, by which these weights can be learned. In particular, our formulation trains an auxiliary network to output weights as a function of training instances, thereby compactly representing the instance weights. We validate our temporal reweighting scheme on a large real-world dataset of 39M images spread over a 9-year period. Our extensive experiments demonstrate the necessity of instance-based temporal reweighting in the dataset and achieve significant improvements over classical batch-learning approaches. Further, our proposal easily generalizes to a streaming setting and shows significant gains compared to recent continual learning methods.
    Stochastic Optimization for Spectral Risk Measures. (arXiv:2212.05149v1 [stat.ML])
    Spectral risk objectives - also called $L$-risks - allow learning systems to interpolate between optimizing average-case performance (as in empirical risk minimization) and worst-case performance on a task. We develop stochastic algorithms to optimize these quantities by characterizing their subdifferential and addressing challenges such as the biasedness of subgradient estimates and the non-smoothness of the objective. We show theoretically and experimentally that out-of-the-box approaches such as stochastic subgradient and dual averaging are hindered by bias and that our approach outperforms them.
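    Concretely, a spectral risk objective reweights the sorted losses by a nondecreasing spectrum; the sketch below evaluates one such objective (the superquantile/CVaR spectrum is our illustrative choice, not necessarily the paper's experimental setup):

```python
import numpy as np

def spectral_risk(losses, sigma):
    # sigma: nonnegative, nondecreasing weights summing to 1, one per order
    # statistic of the losses (sorted ascending).
    return np.sort(losses) @ sigma

n, alpha = 10, 0.3
sigma = np.zeros(n)
k = int(np.ceil(alpha * n))
sigma[-k:] = 1.0 / k      # superquantile/CVaR: uniform mass on top 30% losses
losses = np.random.default_rng(0).exponential(size=n)
print(losses.mean(), spectral_risk(losses, sigma))   # ERM vs tail-focused risk
```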
    Partial Domain Adaptation without Domain Alignment. (arXiv:2108.12867v2 [cs.CV] UPDATED)
    Unsupervised domain adaptation (UDA) aims to transfer knowledge from a well-labeled source domain to a different but related unlabeled target domain with an identical label space. Currently, the main workhorse for solving UDA is domain alignment, which has proven successful. However, it is often difficult to find an appropriate source domain with an identical label space. A more practical scenario is so-called partial domain adaptation (PDA), in which the source label set or space subsumes the target one. Unfortunately, in PDA, due to the existence of irrelevant categories in the source domain, it is quite hard to obtain a perfect alignment, thus resulting in mode collapse and negative transfer. Although several efforts have been made by down-weighting the irrelevant source categories, the strategies used tend to be burdensome and risky since exactly which categories are irrelevant is unknown. These challenges motivate us to find a relatively simpler alternative to solve PDA. To achieve this, we first provide a thorough theoretical analysis, which illustrates that the target risk is bounded by both model smoothness and between-domain discrepancy. Considering the difficulty of perfect alignment in solving PDA, we turn to focus on model smoothness while discarding the riskier domain alignment to enhance the adaptability of the model. Specifically, we instantiate model smoothness as a quite simple intra-domain structure preserving (IDSP) scheme. To the best of our knowledge, this is the first attempt to address PDA without domain alignment. Finally, our empirical results on multiple benchmark datasets demonstrate that IDSP is not only superior to the PDA SOTAs by a significant margin on some benchmarks (e.g., +10% on Cl->Rw and +8% on Ar->Rw), but also complementary to domain alignment in the standard UDA setting.
    Improving Precancerous Case Characterization via Transformer-based Ensemble Learning. (arXiv:2212.05150v1 [cs.LG])
    The application of natural language processing (NLP) to cancer pathology reports has focused on detecting cancer cases, largely ignoring precancerous cases. Improving the characterization of precancerous adenomas assists in developing diagnostic tests for early cancer detection and prevention, especially for colorectal cancer (CRC). Here we developed transformer-based deep neural network NLP models to perform CRC phenotyping, with the goal of extracting precancerous lesion attributes and distinguishing cancer from precancerous cases. We achieved a 0.914 macro-F1 score for classifying patients into negative, non-advanced adenoma, advanced adenoma and CRC. We further improved the performance to 0.923 using an ensemble of classifiers for cancer status classification and lesion size named entity recognition (NER). Our results demonstrated the potential of using NLP to leverage real-world health record data to facilitate the development of diagnostic tests for early cancer prevention.
    Towards Flexible Inference in Sequential Decision Problems via Bidirectional Transformers. (arXiv:2204.13326v2 [cs.LG] UPDATED)
    Randomly masking and predicting word tokens has been a successful approach in pre-training language models for a variety of downstream tasks. In this work, we observe that the same idea also applies naturally to sequential decision making, where many well-studied tasks like behavior cloning, offline RL, inverse dynamics, and waypoint conditioning correspond to different sequence maskings over a sequence of states, actions, and returns. We introduce the FlexiBiT framework, which provides a unified way to specify models which can be trained on many different sequential decision making tasks. We show that a single FlexiBiT model is simultaneously capable of carrying out many tasks with performance similar to or better than specialized models. Additionally, we show that performance can be further improved by fine-tuning our general model on specific tasks of interest.
    Uniform Masking Prevails in Vision-Language Pretraining. (arXiv:2212.05195v1 [cs.LG])
    Masked Language Modeling (MLM) has proven to be an essential component of Vision-Language (VL) pretraining. To implement MLM, the researcher must make two design choices: the masking strategy, which determines which tokens to mask, and the masking rate, which determines how many tokens to mask. Previous work has focused primarily on the masking strategy while setting the masking rate at a default of 15\%. In this paper, we show that increasing this masking rate improves downstream performance while simultaneously reducing the performance gap among different masking strategies, rendering the uniform masking strategy competitive with other more complex ones. Surprisingly, we also discover that increasing the masking rate leads to gains in Image-Text Matching (ITM) tasks, suggesting that the role of MLM goes beyond language modeling in VL pretraining.
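    For reference, uniform masking with a configurable rate is only a few lines; the token ids and [MASK] id below are hypothetical placeholders for whatever tokenizer is in use:

```python
import random

def uniform_mask(tokens, mask_rate, mask_id=103):
    # Mask each position independently with probability mask_rate.
    out, labels = [], []
    for t in tokens:
        if random.random() < mask_rate:
            out.append(mask_id)    # model must reconstruct this position
            labels.append(t)
        else:
            out.append(t)
            labels.append(-100)    # conventional "ignore" label for the loss
    return out, labels

random.seed(0)
print(uniform_mask([7, 42, 99, 12, 5], mask_rate=0.4))
```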
    MoDem: Accelerating Visual Model-Based Reinforcement Learning with Demonstrations. (arXiv:2212.05698v1 [cs.LG])
    Poor sample efficiency continues to be the primary challenge for deployment of deep Reinforcement Learning (RL) algorithms for real-world applications, and in particular for visuo-motor control. Model-based RL has the potential to be highly sample efficient by concurrently learning a world model and using synthetic rollouts for planning and policy improvement. However, in practice, sample-efficient learning with model-based RL is bottlenecked by the exploration challenge. In this work, we find that leveraging just a handful of demonstrations can dramatically improve the sample-efficiency of model-based RL. Simply appending demonstrations to the interaction dataset, however, does not suffice. We identify key ingredients for leveraging demonstrations in model learning -- policy pretraining, targeted exploration, and oversampling of demonstration data -- which forms the three phases of our model-based RL framework. We empirically study three complex visuo-motor control domains and find that our method is 150%-250% more successful in completing sparse reward tasks compared to prior approaches in the low data regime (100K interaction steps, 5 demonstrations). Code and videos are available at: https://nicklashansen.github.io/modemrl
    Revealing the Distributional Vulnerability of Discriminators by Implicit Generators. (arXiv:2108.09976v2 [cs.LG] UPDATED)
    In deep neural learning, a discriminator trained on in-distribution (ID) samples may make high-confidence predictions on out-of-distribution (OOD) samples. This raises a significant concern for robust, trustworthy and safe deep learning. The issue is primarily caused by the limited ID samples observable when training the discriminator without OOD samples. We propose a general approach for \textit{fine-tuning discriminators by implicit generators} (FIG). FIG is grounded in information theory and applicable to standard discriminators without retraining. It improves the ability of a standard discriminator to distinguish ID and OOD samples by generating and penalizing its specific OOD samples. Based on the Shannon entropy, an energy-based implicit generator is inferred from a discriminator without extra training costs. Then, a Langevin dynamics sampler draws specific OOD samples from the implicit generator. Lastly, we design a regularizer, fitting the design principle of the implicit generator, to induce high entropy on those generated OOD samples. Experiments on different networks and datasets demonstrate that FIG achieves state-of-the-art OOD detection performance.
    Estimators of Entropy and Information via Inference in Probabilistic Models. (arXiv:2202.12363v4 [stat.ML] UPDATED)
    Estimating information-theoretic quantities such as entropy and mutual information is central to many problems in statistics and machine learning, but challenging in high dimensions. This paper presents estimators of entropy via inference (EEVI), which deliver upper and lower bounds on many information quantities for arbitrary variables in a probabilistic generative model. These estimators use importance sampling with proposal distribution families that include amortized variational inference and sequential Monte Carlo, which can be tailored to the target model and used to squeeze true information values with high accuracy. We present several theoretical properties of EEVI and demonstrate scalability and efficacy on two problems from the medical domain: (i) in an expert system for diagnosing liver disorders, we rank medical tests according to how informative they are about latent diseases, given a pattern of observed symptoms and patient attributes; and (ii) in a differential equation model of carbohydrate metabolism, we find optimal times to take blood glucose measurements that maximize information about a diabetic patient's insulin sensitivity, given their meal and medication schedule.
    Tensor-based Sequential Learning via Hankel Matrix Representation for Next Item Recommendations. (arXiv:2212.05720v1 [cs.LG])
    Self-attentive transformer models have recently been shown to solve the next-item recommendation task very efficiently. The learned attention weights capture sequential dynamics in user behavior and generalize well. Motivated by the special structure of the learned parameter space, we ask whether it is possible to mimic it with an alternative and more lightweight approach. We develop a new tensor factorization-based model that ingrains structural knowledge about sequential data into the learning process. We demonstrate how certain properties of a self-attention network can be reproduced by our approach based on a special Hankel matrix representation. The resulting model has a shallow linear architecture and compares competitively to its neural counterpart.
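    As a point of reference, the Hankel representation of an interaction sequence, the object such a factorization model would operate on, can be built directly; the window length L is a hypothetical model order:

```python
import numpy as np
from scipy.linalg import hankel

seq = np.array([3, 1, 4, 1, 5, 9, 2, 6])   # item ids over time
L = 3                                      # hypothetical window / model order
H = hankel(seq[:L], seq[L - 1:])           # L x (len(seq) - L + 1) matrix
print(H)                                   # anti-diagonals = sliding windows
```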
    ABC: Aggregation before Communication, a Communication Reduction Framework for Distributed Graph Neural Network Training and Effective Partition. (arXiv:2212.05410v1 [cs.LG])
    Graph Neural Networks (GNNs) are a family of neural models tailored for graph-structured data and have shown superior performance in learning representations of such data. However, training GNNs on large graphs remains challenging, and a promising direction is distributed GNN training, which partitions the input graph and distributes the workload across multiple machines. The key bottleneck of existing distributed GNN training frameworks is the cross-machine communication induced by the dependency on the graph data and the aggregation operator of GNNs. In this paper, we study the communication complexity of distributed GNN training and propose a simple lossless communication reduction method, termed the Aggregation before Communication (ABC) method. The ABC method exploits the permutation-invariant property of GNN layers, leading to a paradigm in which vertex-cut partitioning is proved to admit superior communication performance over the currently popular paradigm (edge-cut). In addition, we show that the new partition paradigm is particularly well suited to dynamic graphs, where it is infeasible to control the edge placement due to the unknown stochasticity of the graph-changing process.
    Generalization Through the Lens of Learning Dynamics. (arXiv:2212.05377v1 [cs.LG])
    A machine learning (ML) system must learn not only to match the output of a target function on a training set, but also to generalize to novel situations in order to yield accurate predictions at deployment. In most practical applications, the user cannot exhaustively enumerate every possible input to the model; strong generalization performance is therefore crucial to the development of ML systems which are performant and reliable enough to be deployed in the real world. While generalization is well-understood theoretically in a number of hypothesis classes, the impressive generalization performance of deep neural networks has stymied theoreticians. In deep reinforcement learning (RL), our understanding of generalization is further complicated by the conflict between generalization and stability in widely-used RL algorithms. This thesis will provide insight into generalization by studying the learning dynamics of deep neural networks in both supervised and reinforcement learning tasks.
    Robust Recurrent Neural Network to Identify Ship Motion in Open Water with Performance Guarantees -- Technical Report. (arXiv:2212.05781v1 [cs.LG])
    Recurrent neural networks are capable of learning the dynamics of an unknown nonlinear system purely from input-output measurements. However, the resulting models do not provide any stability guarantees on the input-output mapping. In this work, we represent a recurrent neural network as a linear time-invariant system with nonlinear disturbances. By introducing constraints on the parameters, we can guarantee finite gain stability and incremental finite gain stability. We apply this identification method to learn the motion of a four-degrees-of-freedom ship that is moving in open water and compare it against other purely learning-based approaches with unconstrained parameters. Our analysis shows that the constrained recurrent neural network has a lower prediction accuracy on the test set, but it achieves comparable results on an out-of-distribution set and respects stability conditions.
    Optimal Planning of Hybrid Energy Storage Systems using Curtailed Renewable Energy through Deep Reinforcement Learning. (arXiv:2212.05662v1 [cs.LG])
    Energy management systems (EMS) are becoming increasingly important in order to utilize the continuously growing amount of curtailed renewable energy. Promising energy storage systems (ESS), such as batteries and green hydrogen, should be employed to maximize the efficiency of energy stakeholders. However, optimal decision-making, i.e., planning the leveraging between different strategies, is confronted with the complexity and uncertainties of large-scale problems. Here, we propose a sophisticated deep reinforcement learning (DRL) methodology with a policy-based algorithm to realize real-time optimal ESS planning under the uncertainty of curtailed renewable energy. A quantitative performance comparison proved that the DRL agent outperforms the scenario-based stochastic optimization (SO) algorithm, even with a wide action and observation space. Owing to the uncertainty-rejection capability of DRL, we could confirm robust performance under large uncertainty in the curtailed renewable energy, maximizing net profit while keeping the system stable. Action mapping was performed to visually assess the actions taken by the DRL agent according to the state. The corresponding results confirmed that the DRL agent learns in much the way a human expert would, suggesting reliable application of the proposed methodology.
    FactorJoin: A New Cardinality Estimation Framework for Join Queries. (arXiv:2212.05526v1 [cs.DB])
    Cardinality estimation is one of the most fundamental and challenging problems in query optimization. Neither classical nor learning-based methods yield satisfactory performance when estimating the cardinality of join queries. They either rely on simplified assumptions, leading to ineffective cardinality estimates, or build large models to understand the data distributions, leading to long planning times and a lack of generalizability across queries. In this paper, we propose a new framework, FactorJoin, for estimating join queries. FactorJoin combines the classical join-histogram idea for efficiently handling joins with learning-based methods for accurately capturing attribute correlation. Specifically, FactorJoin scans every table in a DB and builds single-table conditional distributions during an offline preparation phase. When a join query comes, FactorJoin translates it into a factor graph model over the learned distributions to effectively and efficiently estimate its cardinality. Unlike existing learning-based methods, FactorJoin does not need to de-normalize joins upfront or require executed query workloads to train the model. Since it only relies on single-table statistics, FactorJoin has a small space overhead and is extremely easy to train and maintain. In our evaluation, FactorJoin produces more effective estimates than the previous state-of-the-art learning-based methods, with 40x lower estimation latency, 100x smaller model size, and 100x faster training speed at comparable or better accuracy. In addition, FactorJoin can estimate 10,000 sub-plan queries within one second to optimize the query plan, which is very close to the traditional cardinality estimators in commercial DBMSs.
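    The classical join-histogram idea that FactorJoin builds on is easy to state in code: with per-table histograms on the join key, the join size is estimated as sum_v cnt_R(v) * cnt_S(v). A per-value (unbucketed) toy version:

```python
from collections import Counter

R_key = [1, 1, 2, 3, 3, 3]   # join-key column of table R
S_key = [1, 3, 3, 4]         # join-key column of table S
hist_R, hist_S = Counter(R_key), Counter(S_key)
estimate = sum(hist_R[v] * hist_S[v] for v in hist_R.keys() & hist_S.keys())
print(estimate)   # 8; exact here since histograms are per-value, bucketed in practice
```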
    Moving Metric Detection and Alerting System at eBay. (arXiv:2004.02360v2 [cs.CY] UPDATED)
    At eBay, there are thousands of product health metrics for different domain teams to monitor. We built a two-phase alerting system to notify users with actionable alerts based on anomaly detection and alert retrieval. In the first phase, we developed an efficient anomaly detection algorithm, called the Moving Metric Detector (MMD), to identify potential alerts among metrics with a distribution-agnostic criterion. In the second, alert-retrieval phase, we built additional logic with feedback to select valid actionable alerts using a point-wise ranking model and business rules. Compared with other trend and seasonality decomposition methods, our decomposer is faster and better at detecting anomalies in unsupervised cases. Our two-phase approach dramatically improves alert precision and avoids alert spamming in eBay production.
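    As a rough analogue of the decompose-then-threshold pattern (not eBay's production MMD), one can subtract a moving-average trend and flag large residual z-scores:

```python
import numpy as np

def detect(series, window=7, z_thresh=3.0):
    # Remove a moving-average trend, then flag extreme residual z-scores.
    s = np.asarray(series, dtype=float)
    trend = np.convolve(s, np.ones(window) / window, mode='same')
    resid = s - trend
    z = (resid - resid.mean()) / (resid.std() + 1e-9)
    return np.where(np.abs(z) > z_thresh)[0]

rng = np.random.default_rng(1)
series = np.sin(np.linspace(0, 6, 200)) + 0.1 * rng.normal(size=200)
series[120] += 2.0            # injected spike
print(detect(series))         # expected to flag index 120
```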
    The universal approximation theorem for complex-valued neural networks. (arXiv:2012.03351v2 [math.FA] UPDATED)
    We generalize the classical universal approximation theorem for neural networks to the case of complex-valued neural networks. Precisely, we consider feedforward networks with a complex activation function $\sigma : \mathbb{C} \to \mathbb{C}$ in which each neuron performs the operation $\mathbb{C}^N \to \mathbb{C}, z \mapsto \sigma(b + w^T z)$ with weights $w \in \mathbb{C}^N$ and a bias $b \in \mathbb{C}$, and with $\sigma$ applied componentwise. We completely characterize those activation functions $\sigma$ for which the associated complex networks have the universal approximation property, meaning that they can uniformly approximate any continuous function on any compact subset of $\mathbb{C}^d$ arbitrarily well. Unlike the classical case of real networks, the set of "good activation functions" which give rise to networks with the universal approximation property differs significantly depending on whether one considers deep networks or shallow networks: For deep networks with at least two hidden layers, the universal approximation property holds as long as $\sigma$ is neither a polynomial, a holomorphic function, nor an antiholomorphic function. Shallow networks, on the other hand, are universal if and only if the real part or the imaginary part of $\sigma$ is not a polyharmonic function.
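    The network class in question is easy to instantiate: layers z -> sigma(b + Wz) with a componentwise complex activation. The sketch below uses sigma(z) = z / (1 + |z|), our own choice of an activation that depends on |z| and is therefore neither polynomial, holomorphic, nor antiholomorphic, so deep networks built from it satisfy the theorem's condition:

```python
import numpy as np

def sigma(z):
    # Depends on |z|: neither polynomial, holomorphic, nor antiholomorphic.
    return z / (1.0 + np.abs(z))

def layer(z, W, b):
    return sigma(b + W @ z)    # the neuron operation z -> sigma(b + w^T z)

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3)) + 1j * rng.normal(size=(4, 3))
b1 = rng.normal(size=4) + 1j * rng.normal(size=4)
W2 = rng.normal(size=(1, 4)) + 1j * rng.normal(size=(1, 4))
b2 = rng.normal(size=1) + 1j * rng.normal(size=1)
z = rng.normal(size=3) + 1j * rng.normal(size=3)
print(layer(layer(z, W1, b1), W2, b2))   # stacked complex-valued layers
```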
    REAP: A Large-Scale Realistic Adversarial Patch Benchmark. (arXiv:2212.05680v1 [cs.CV])
    Machine learning models are known to be susceptible to adversarial perturbation. One famous attack is the adversarial patch, a sticker with a particularly crafted pattern that makes the model incorrectly predict the object it is placed on. This attack presents a critical threat to cyber-physical systems that rely on cameras such as autonomous cars. Despite the significance of the problem, conducting research in this setting has been difficult; evaluating attacks and defenses in the real world is exceptionally costly while synthetic data are unrealistic. In this work, we propose the REAP (REalistic Adversarial Patch) benchmark, a digital benchmark that allows the user to evaluate patch attacks on real images, and under real-world conditions. Built on top of the Mapillary Vistas dataset, our benchmark contains over 14,000 traffic signs. Each sign is augmented with a pair of geometric and lighting transformations, which can be used to apply a digitally generated patch realistically onto the sign. Using our benchmark, we perform the first large-scale assessments of adversarial patch attacks under realistic conditions. Our experiments suggest that adversarial patch attacks may present a smaller threat than previously believed and that the success rate of an attack on simpler digital simulations is not predictive of its actual effectiveness in practice. We release our benchmark publicly at https://github.com/wagner-group/reap-benchmark.
    Graph Learning for Anomaly Analytics: Algorithms, Applications, and Challenges. (arXiv:2212.05532v1 [cs.LG])
    Anomaly analytics is a popular and vital task in various research contexts and has been studied for several decades. At the same time, deep learning has shown its capacity for solving many graph-based tasks, such as node classification, link prediction, and graph classification. Recently, many studies have extended graph learning models to anomaly analytics problems, resulting in beneficial advances in graph-based anomaly analytics techniques. In this survey, we provide a comprehensive overview of graph learning methods for anomaly analytics tasks. We classify them into four categories based on their model architectures, namely graph convolutional networks (GCN), graph attention networks (GAT), graph autoencoders (GAE), and other graph learning models. The differences between these methods are also compared in a systematic manner. Furthermore, we outline several graph-based anomaly analytics applications across various real-world domains. Finally, we discuss five potential future research directions in this rapidly growing field.
    Relate to Predict: Towards Task-Independent Knowledge Representations for Reinforcement Learning. (arXiv:2212.05298v1 [cs.AI])
    Reinforcement Learning (RL) can enable agents to learn complex tasks. However, it is difficult to interpret the knowledge and reuse it across tasks. Inductive biases can address such issues by explicitly providing generic yet useful decompositions that are otherwise difficult or expensive to learn implicitly. For example, object-centered approaches decompose a high-dimensional observation into individual objects. Expanding on this, we utilize an inductive bias for explicit object-centered knowledge separation that provides further decomposition into semantic representations and dynamics knowledge. For this, we introduce a semantic module that predicts an object's semantic state based on its context. The resulting affordance-like object state can then be used to enrich perceptual object representations. With a minimal setup and an environment that enables puzzle-like tasks, we demonstrate the feasibility and benefits of this approach. Specifically, we compare three different methods of integrating semantic representations into a model-based RL architecture. Our experiments show that the degree of explicitness in knowledge separation correlates with faster learning, better accuracy, better generalization, and better interpretability.
    OpenD: A Benchmark for Language-Driven Door and Drawer Opening. (arXiv:2212.05211v1 [cs.LG])
    We introduce OPEND, a benchmark for learning how to use a hand to open cabinet doors or drawers in a photo-realistic and physics-reliable simulation environment driven by language instructions. To solve the task, we propose a multi-step planner composed of a deep neural network and rule-based controllers. The network is utilized to capture spatial relationships from images and to understand the semantic meaning of language instructions. The controllers efficiently execute the plan based on this spatial and semantic understanding. We evaluate our system by measuring its zero-shot performance on a test data set. Experimental results demonstrate the effectiveness of decision planning by our multi-step planner for different hands, while suggesting that there is significant room for developing better models to address the challenges of language understanding, spatial reasoning, and long-term manipulation. We will release OPEND and host challenges to promote future research in this area.
    Expanding Knowledge Graphs with Humans in the Loop. (arXiv:2212.05189v1 [cs.LG])
    Curated knowledge graphs encode domain expertise and improve the performance of recommendation, segmentation, ad targeting, and other machine learning systems in several domains. As new concepts emerge in a domain, knowledge graphs must be expanded to preserve machine learning performance. Manually expanding knowledge graphs, however, is infeasible at scale. In this work, we propose a method for knowledge graph expansion with humans-in-the-loop. Concretely, given a knowledge graph, our method predicts the "parents" of new concepts to be added to this graph for further verification by human experts. We show that our method is both accurate and provably "human-friendly". Specifically, we prove that our method predicts parents that are "near" concepts' true parents in the knowledge graph, even when the predictions are incorrect. We then show, with a controlled experiment, that satisfying this property increases both the speed and the accuracy of the human-algorithm collaboration. We further evaluate our method on a knowledge graph from Pinterest and show that it outperforms competing methods on both accuracy and human-friendliness. Upon deployment in production at Pinterest, our method reduced the time needed for knowledge graph expansion by ~400% (compared to manual expansion), and contributed to a subsequent increase in ad revenue of 20%.
    XFL: Naming Functions in Binaries with Extreme Multi-label Learning. (arXiv:2107.13404v4 [cs.CR] UPDATED)
    Reverse engineers benefit from the presence of identifiers such as function names in a binary, but usually these are removed for release. Training a machine learning model to predict function names automatically is promising but fundamentally hard: unlike words in natural language, most function names occur only once. In this paper, we address this problem by introducing eXtreme Function Labeling (XFL), an extreme multi-label learning approach to selecting appropriate labels for binary functions. XFL splits function names into tokens, treating each as an informative label akin to the problem of tagging texts in natural language. We relate the semantics of binary code to labels through DEXTER, a novel function embedding that combines static analysis-based features with local context from the call graph and global context from the entire binary. We demonstrate that XFL/DEXTER outperforms the state of the art in function labeling on a dataset of 10,047 binaries from the Debian project, achieving a precision of 83.5%. We also study combinations of XFL with alternative binary embeddings from the literature and show that DEXTER consistently performs best for this task. As a result, we demonstrate that binary function labeling can be effectively phrased in terms of multi-label learning, and that binary function embeddings benefit from including explicit semantic features.
    Neural Bandits for Data Mining: Searching for Dangerous Polypharmacy. (arXiv:2212.05190v1 [cs.LG])
    Polypharmacy, most often defined as the simultaneous consumption of five or more drugs at once, is a prevalent phenomenon in the older population. Some of these polypharmacies, deemed inappropriate, may be associated with adverse health outcomes such as death or hospitalization. Considering the combinatorial nature of the problem, as well as the size of claims databases and the cost of computing an exact association measure for a given drug combination, it is impossible to investigate every possible combination of drugs. Therefore, we propose to optimize the search for potentially inappropriate polypharmacies (PIPs). To this end, we propose the OptimNeuralTS strategy, based on Neural Thompson Sampling and differential evolution, to efficiently mine claims datasets and build a predictive model of the association between drug combinations and health outcomes. We benchmark our method using two datasets generated by an internally developed simulator of polypharmacy data containing 500 drugs and 100 000 distinct combinations. Empirically, our method can detect up to 33\% of PIPs while maintaining an average precision score of 99\% using 10 000 time steps.
    Algorithmic progress in computer vision. (arXiv:2212.05153v1 [cs.CV])
    We investigate algorithmic progress in image classification on ImageNet, perhaps the most well-known test bed for computer vision. We estimate a model, informed by work on neural scaling laws, and infer a decomposition of progress into the scaling of compute, data, and algorithms. Using Shapley values to attribute performance improvements, we find that algorithmic improvements have been roughly as important as the scaling of compute for progress in computer vision. Our estimates indicate that algorithmic innovations mostly take the form of compute-augmenting algorithmic advances (which enable researchers to get better performance from less compute), not data-augmenting algorithmic advances. We find that compute-augmenting algorithmic advances are made at a pace more than twice as fast as the rate usually associated with Moore's law. In particular, we estimate that compute-augmenting innovations halve compute requirements every nine months (95\% confidence interval: 4 to 25 months).
    Over-the-Air Split Machine Learning in Wireless MIMO Networks. (arXiv:2210.04742v2 [cs.LG] UPDATED)
    In split machine learning (ML), different partitions of a neural network (NN) are executed by different computing nodes, incurring a large communication cost. To ease this communication burden, over-the-air computation (OAC) can efficiently implement all or part of the computation concurrently with communication. Based on the proposed system, we introduce its implementation over a wireless network and provide the problem formulation. In particular, we show that the inter-layer connections of a NN of any size can be mathematically decomposed into a set of linear precoding and combining transformations over MIMO channels. Therefore, the precoding matrix at the transmitter and the combining matrix at the receiver of each MIMO link, as well as the channel matrix itself, can jointly serve as a fully connected layer of the NN. In such a split ML system, the precoding and combining matrices are regarded as trainable parameters, while the MIMO channel matrix is regarded as unknown (implicit) parameters. The generalization of the proposed scheme to conventional NNs is also introduced. Finally, we extend the proposed scheme to the widely used convolutional neural networks and demonstrate its effectiveness under both static and quasi-static memory channel conditions with comprehensive simulations.
    Synthetic Wave-Geometric Impulse Responses for Improved Speech Dereverberation. (arXiv:2212.05360v1 [eess.AS])
    We present a novel approach to improve the performance of learning-based speech dereverberation using accurate synthetic datasets. Our approach is designed to recover the reverb-free signal from a reverberant speech signal. We show that accurately simulating the low-frequency components of Room Impulse Responses (RIRs) is important to achieving good dereverberation. We use the GWA dataset that consists of synthetic RIRs generated in a hybrid fashion: an accurate wave-based solver is used to simulate the lower frequencies and geometric ray tracing methods simulate the higher frequencies. We demonstrate that speech dereverberation models trained on hybrid synthetic RIRs outperform models trained on RIRs generated by prior geometric ray tracing methods on four real-world RIR datasets.
    Neural Continuous-Time Markov Models. (arXiv:2212.05378v1 [stat.ML])
    Continuous-time Markov chains are used to model stochastic systems where transitions can occur at irregular times, e.g., birth-death processes, chemical reaction networks, population dynamics, and gene regulatory networks. We develop a method to learn a continuous-time Markov chain's transition rate functions from fully observed time series. In contrast with existing methods, our method allows for transition rates to depend nonlinearly on both state variables and external covariates. The Gillespie algorithm is used to generate trajectories of stochastic systems where propensity functions (reaction rates) are known. Our method can be viewed as the inverse: given trajectories of a stochastic reaction network, we generate estimates of the propensity functions. While previous methods used linear or log-linear methods to link transition rates to covariates, we use neural networks, increasing the capacity and potential accuracy of learned models. In the chemical context, this enables the method to learn propensity functions from non-mass-action kinetics. We test our method with synthetic data generated from a variety of systems with known transition rates. We show that our method learns these transition rates with considerably more accuracy than log-linear methods, in terms of mean absolute error between ground truth and predicted transition rates. We also demonstrate an application of our methods to open-loop control of a continuous-time Markov chain.  ( 2 min )
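    For context, the Gillespie algorithm mentioned above, the forward simulator whose propensity functions the method learns to recover, fits in a few lines; the birth-death rates here are toy ground truth:

```python
import numpy as np

def gillespie(x0, propensities, stoich, t_max, rng):
    # Exact stochastic simulation of a one-species reaction network.
    t, x = 0.0, float(x0)
    traj = [(t, x)]
    while t < t_max:
        rates = np.array([a(x) for a in propensities])
        total = rates.sum()
        if total == 0:
            break
        t += rng.exponential(1.0 / total)            # waiting time to next event
        j = rng.choice(len(rates), p=rates / total)  # which reaction fires
        x += stoich[j]
        traj.append((t, x))
    return traj

rng = np.random.default_rng(0)
birth = lambda x: 2.0        # constant birth propensity (toy ground truth)
death = lambda x: 0.1 * x    # mass-action death propensity
traj = gillespie(10, [birth, death], stoich=[+1, -1], t_max=50.0, rng=rng)
print(len(traj), traj[-1])
```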
    QESK: Quantum-based Entropic Subtree Kernels for Graph Classification. (arXiv:2212.05228v1 [cs.LG])
    In this paper, we propose a novel graph kernel, namely the Quantum-based Entropic Subtree Kernel (QESK), for Graph Classification. To this end, we commence by computing the Average Mixing Matrix (AMM) of the Continuous-time Quantum Walk (CTQW) evolved on each graph structure. Moreover, we show how this AMM matrix can be employed to compute a series of entropic subtree representations associated with the classical Weisfeiler-Lehman (WL) algorithm. For a pair of graphs, the QESK kernel is defined by computing the exponentiation of the negative Euclidean distance between their entropic subtree representations, theoretically resulting in a positive definite graph kernel. We show that the proposed QESK kernel not only encapsulates complicated intrinsic quantum-based structural characteristics of graph structures through the CTQW, but also theoretically addresses the shortcoming of ignoring the effects of unshared substructures arising in state-of-the-art R-convolution graph kernels. Moreover, unlike the classical R-convolution kernels, the proposed QESK can discriminate the distinctions of isomorphic subtrees in terms of the global graph structures, theoretically explaining the effectiveness. Experiments indicate that the proposed QESK kernel can significantly outperform state-of-the-art graph kernels and graph deep learning methods for graph classification problems.  ( 2 min )
    Information retrieval in single cell chromatin analysis using TF-IDF transformation methods. (arXiv:2212.05184v1 [q-bio.GN])
    Single-cell sequencing assay for transposase-accessible chromatin (scATAC-seq) assesses genome-wide chromatin accessibility in thousands of cells to reveal regulatory landscapes at high resolution. However, the analysis presents challenges due to the high dimensionality and sparsity of the data. Several methods have been developed, including transformation techniques based on term frequency-inverse document frequency (TF-IDF), dimension reduction methods such as singular value decomposition (SVD), factor analysis, and autoencoders. Yet, a comprehensive study of these methods has not been performed, and it remains unclear what the best practice is when analyzing scATAC-seq data. We compared several scenarios for transformation and dimension reduction, as well as SVD-based feature analysis, to investigate potential enhancements in scATAC-seq information retrieval. Additionally, we investigated whether autoencoders benefit from the TF-IDF transformation. Our results reveal that the TF-IDF transformation generally leads to improved clustering and biologically relevant feature extraction.  ( 2 min )
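    A minimal sketch of the TF-IDF plus SVD ("latent semantic indexing") combination commonly applied to scATAC-seq peak matrices; parameter choices here are illustrative, not the paper's exact configuration:

        import scipy.sparse as sp
        from sklearn.feature_extraction.text import TfidfTransformer
        from sklearn.decomposition import TruncatedSVD

        def lsi_embed(counts, n_components=50):
            """counts: (cells x peaks) sparse binary accessibility matrix."""
            tfidf = TfidfTransformer().fit_transform(sp.csr_matrix(counts))
            svd = TruncatedSVD(n_components=n_components, random_state=0)
            return svd.fit_transform(tfidf)  # (cells x n_components) embedding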
    Targeted Adversarial Attacks on Deep Reinforcement Learning Policies via Model Checking. (arXiv:2212.05337v1 [cs.LG])
    Deep Reinforcement Learning (RL) agents are susceptible to adversarial noise in their observations that can mislead their policies and decrease their performance. However, an adversary may be interested not only in decreasing the reward, but also in modifying specific temporal logic properties of the policy. This paper presents a metric that measures the exact impact of adversarial attacks against such properties. We use this metric to craft optimal adversarial attacks. Furthermore, we introduce a model checking method that allows us to verify the robustness of RL policies against adversarial attacks. Our empirical analysis confirms (1) the quality of our metric to craft adversarial attacks against temporal logic properties, and (2) that we are able to concisely assess a system's robustness against attacks.  ( 2 min )
    Measuring Data. (arXiv:2212.05129v1 [cs.AI])
    We identify the task of measuring data to quantitatively characterize the composition of machine learning data and datasets. Similar to an object's height, width, and volume, data measurements quantify different attributes of data along common dimensions that support comparison. Several lines of research have proposed what we refer to as measurements, with differing terminology; we bring some of this work together, particularly in fields of computer vision and language, and build from it to motivate measuring data as a critical component of responsible AI development. Measuring data aids in systematically building and analyzing machine learning (ML) data towards specific goals and gaining better control of what modern ML systems will learn. We conclude with a discussion of the many avenues of future work, the limitations of data measurements, and how to leverage these measurement approaches in research and practice.  ( 2 min )
    Networked Restless Bandits with Positive Externalities. (arXiv:2212.05144v1 [cs.LG])
    Restless multi-armed bandits are often used to model budget-constrained resource allocation tasks where receipt of the resource is associated with an increased probability of a favorable state transition. Prior work assumes that individual arms only benefit if they receive the resource directly. However, many allocation tasks occur within communities and can be characterized by positive externalities that allow arms to derive partial benefit when their neighbor(s) receive the resource. We thus introduce networked restless bandits, a novel multi-armed bandit setting in which arms are both restless and embedded within a directed graph. We then present Greta, a graph-aware, Whittle index-based heuristic algorithm that can be used to efficiently construct a constrained reward-maximizing action vector at each timestep. Our empirical results demonstrate that Greta outperforms comparison policies across a range of hyperparameter values and graph topologies.  ( 2 min )
    CACTI: A Framework for Scalable Multi-Task Multi-Scene Visual Imitation Learning. (arXiv:2212.05711v1 [cs.RO])
    Developing robots that are capable of many skills and generalization to unseen scenarios requires progress on two fronts: efficient collection of large and diverse datasets, and training of high-capacity policies on the collected data. While large datasets have propelled progress in other fields like computer vision and natural language processing, collecting data of comparable scale is particularly challenging for physical systems like robotics. In this work, we propose a framework to bridge this gap and better scale up robot learning, under the lens of multi-task, multi-scene robot manipulation in kitchen environments. Our framework, named CACTI, has four stages that separately handle data collection, data augmentation, visual representation learning, and imitation policy training. In the CACTI framework, we highlight the benefit of adapting state-of-the-art models for image generation as part of the augmentation stage, and the significant improvement of training efficiency by using pretrained out-of-domain visual representations at the compression stage. Experimentally, we demonstrate that 1) on a real robot setup, CACTI enables efficient training of a single policy capable of 10 manipulation tasks involving kitchen objects, and robust to varying layouts of distractor objects; 2) in a simulated kitchen environment, CACTI trains a single policy on 18 semantic tasks across up to 50 layout variations per task. The simulation task benchmark and augmented datasets in both real and simulated environments will be released to facilitate future research.  ( 2 min )
    Optimized Sparse Matrix Operations for Reverse Mode Automatic Differentiation. (arXiv:2212.05159v1 [cs.LG])
    Sparse matrix representations are ubiquitous in computational science and machine learning, leading to significant reductions in compute time, in comparison to dense representations, for problems that have local connectivity. The adoption of sparse representations in leading ML frameworks such as PyTorch is incomplete, however, with support for both automatic differentiation and GPU acceleration missing. In this work, we present an implementation of a CSR-based sparse matrix wrapper for PyTorch with CUDA acceleration for basic matrix operations, as well as automatic differentiability. We also present several applications of the resulting sparse kernels to optimization problems, demonstrating ease of implementation and performance measurements versus their dense counterparts.  ( 2 min )
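    The paper's CSR wrapper is not reproduced here, but the mechanism it describes can be sketched with PyTorch's built-in COO sparse tensors: a custom autograd Function whose backward applies the transposed sparse operator (differentiating w.r.t. the dense factor only):

        import torch

        class SparseDenseMM(torch.autograd.Function):
            @staticmethod
            def forward(ctx, a_sparse, x):
                ctx.save_for_backward(a_sparse)
                return torch.sparse.mm(a_sparse, x)

            @staticmethod
            def backward(ctx, grad_out):
                (a_sparse,) = ctx.saved_tensors
                # Gradient of A @ x w.r.t. x is A^T applied to the upstream gradient.
                return None, torch.sparse.mm(a_sparse.t(), grad_out)

        i = torch.tensor([[0, 1, 1], [2, 0, 2]])
        v = torch.tensor([3.0, 4.0, 5.0])
        A = torch.sparse_coo_tensor(i, v, (2, 3)).coalesce()
        x = torch.randn(3, 4, requires_grad=True)
        SparseDenseMM.apply(A, x).sum().backward()  # populates x.grad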
    A Learning and Control Perspective for Microfinance. (arXiv:2207.12631v2 [q-fin.GN] UPDATED)
    Microfinance, despite its significant potential for poverty reduction, is facing sustainability hardships due to high default rates. Although many methods in regular finance can estimate credit scores and default probabilities, these methods are not directly applicable to microfinance due to the following unique characteristics: a) under-explored (developing) areas such as rural Africa do not have sufficient prior loan data for microfinance institutions (MFIs) to establish a credit scoring system; b) microfinance applicants may have difficulty providing sufficient information for MFIs to accurately predict default probabilities; and c) many MFIs use group liability (instead of collateral) to secure repayment. Here, we present a novel control-theoretic model of microfinance that accounts for these characteristics. We construct an algorithm to learn microfinance decision policies that achieve financial inclusion, fairness, social welfare, and sustainability. We characterize the convergence conditions to Pareto-optimum and the convergence speeds. We demonstrate, in numerous real and synthetic datasets, that the proposed method accounts for the complexities induced by group liability to produce robust decisions before sufficient loans are given to establish credit scoring systems and for applicants whose default probability cannot be accurately estimated due to missing information. To the best of our knowledge, this paper is the first to connect microfinance and control theory. We envision that the connection will enable safe learning and control techniques to help modernize microfinance and alleviate poverty.  ( 2 min )
    Revisiting the acceleration phenomenon via high-resolution differential equations. (arXiv:2212.05700v1 [math.OC])
    Nesterov's accelerated gradient descent (NAG) is one of the milestones in the history of first-order algorithms. The mechanism behind the acceleration phenomenon was not successfully uncovered until the high-resolution differential equation framework proposed in [Shi et al., 2022] attributed it to the gradient correction term. To deepen our understanding of what the high-resolution differential equation framework says about the convergence rate, we continue to investigate NAG for $\mu$-strongly convex functions in this paper, using the techniques of Lyapunov analysis and phase-space representation. First, we revisit the proof from the gradient-correction scheme. Similar to [Chen et al., 2022], a straightforward calculation simplifies the proof considerably and enlarges the step size to $s=1/L$ with minor modification; moreover, the way of constructing Lyapunov functions becomes principled. Furthermore, we also investigate NAG from the implicit-velocity scheme. Due to the difference in the velocity iterates, we find that the Lyapunov function constructed from the implicit-velocity scheme needs no additional term, and the calculation of the iterative difference becomes simpler. Together with the optimal step size obtained, the high-resolution differential equation framework from the implicit-velocity scheme of NAG is complete and outperforms the gradient-correction scheme.  ( 2 min )
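    For reference (not part of the abstract, and up to the normalization conventions of Shi et al.), the NAG iteration for strongly convex objectives and its high-resolution ODE take the form
        $$x_{k+1} = y_k - s\,\nabla f(y_k), \qquad y_{k+1} = x_{k+1} + \frac{1-\sqrt{\mu s}}{1+\sqrt{\mu s}}\,(x_{k+1}-x_k),$$
        $$\ddot{X}(t) + 2\sqrt{\mu}\,\dot{X}(t) + \sqrt{s}\,\nabla^2 f(X(t))\,\dot{X}(t) + \bigl(1+\sqrt{\mu s}\bigr)\nabla f(X(t)) = 0,$$
    where the gradient-correction term $\sqrt{s}\,\nabla^2 f(X)\dot{X}$ is the object the abstract refers to.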
    Coordinate Ascent for Off-Policy RL with Global Convergence Guarantees. (arXiv:2212.05237v1 [cs.LG])
    We revisit the domain of off-policy policy optimization in RL from the perspective of coordinate ascent. One commonly-used approach is to leverage the off-policy policy gradient to optimize a surrogate objective -- the expected total discounted return of the target policy under the state distribution of the behavior policy. However, this approach has been shown to suffer from a distribution mismatch issue, and therefore significant effort is needed to correct this mismatch, either via state distribution correction or a counterfactual method. In this paper, we rethink off-policy learning via Coordinate Ascent Policy Optimization (CAPO), an off-policy actor-critic algorithm that decouples policy improvement from the state distribution of the behavior policy without using the policy gradient. This design obviates the need for distribution correction or importance sampling in the policy improvement step of off-policy policy-gradient methods. We establish the global convergence of CAPO with general coordinate selection and then further quantify the convergence rates of several instances of CAPO with popular coordinate selection rules, including the cyclic and the randomized variants of CAPO. We then extend CAPO to neural policies for a more practical implementation. Through experiments, we demonstrate that CAPO provides a competitive approach to RL in practice.  ( 2 min )
    State-Regularized Recurrent Neural Networks to Extract Automata and Explain Predictions. (arXiv:2212.05178v1 [cs.LG])
    Recurrent neural networks are a widely used class of neural architectures. They have, however, two shortcomings. First, they are often treated as black-box models, and as such it is difficult to understand what exactly they learn as well as how they arrive at a particular prediction. Second, they tend to work poorly on sequences requiring long-term memorization, despite having this capacity in principle. We aim to address both shortcomings with a class of recurrent networks that use a stochastic state transition mechanism between cell applications. This mechanism, which we term state-regularization, makes RNNs transition between a finite set of learnable states. We evaluate state-regularized RNNs on (1) regular languages for the purpose of automata extraction; (2) non-regular languages such as balanced parentheses and palindromes, where external memory is required; and (3) real-world sequence learning tasks for sentiment analysis, visual object recognition and text categorisation. We show that state-regularization (a) simplifies the extraction of finite state automata that display an RNN's state transition dynamics; (b) forces RNNs to operate more like automata with external memory and less like finite state machines, which potentially leads to a more structured memory; (c) leads to better interpretability and explainability of RNNs by leveraging the probabilistic finite state transition mechanism over time steps.  ( 2 min )
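    A minimal sketch of the stochastic state-transition idea (the names and the fixed softmax temperature are assumptions, not the paper's exact parameterization): after each cell update, the hidden state is mapped to a convex combination of k learnable centroid states, which acts as a soft finite-state bottleneck:

        import torch
        import torch.nn as nn

        class StateRegularizer(nn.Module):
            def __init__(self, hidden_dim, n_states, temperature=1.0):
                super().__init__()
                self.centroids = nn.Parameter(torch.randn(n_states, hidden_dim))
                self.temperature = temperature

            def forward(self, h):  # h: (batch, hidden_dim)
                logits = h @ self.centroids.t() / self.temperature
                alpha = torch.softmax(logits, dim=-1)  # state-transition probabilities
                return alpha @ self.centroids          # soft mixture of learnable states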
    CALIME: Causality-Aware Local Interpretable Model-Agnostic Explanations. (arXiv:2212.05256v1 [cs.AI])
    A significant drawback of eXplainable Artificial Intelligence (XAI) approaches is the assumption of feature independence. This paper focuses on integrating causal knowledge into XAI methods to increase trust and help users assess explanations' quality. We propose a novel extension to a widely used local and model-agnostic explainer that explicitly encodes causal relationships in the data generated around the input instance to explain. Extensive experiments show that our method outperforms the original explainer in both fidelity to the black-box model and stability of the explanations.  ( 2 min )
    Spatial-temporal traffic modeling with a fusion graph reconstructed by tensor decomposition. (arXiv:2212.05653v1 [cs.LG])
    Accurate spatial-temporal traffic flow forecasting is essential for helping traffic managers take control measures and drivers choose optimal travel routes. Recently, graph convolutional networks (GCNs) have been widely used in traffic flow prediction owing to their powerful ability to capture spatial-temporal dependencies. The design of the spatial-temporal graph adjacency matrix is key to the success of GCNs and remains an open question. This paper proposes reconstructing the binary adjacency matrix via tensor decomposition and builds a traffic flow forecasting method on top of it. First, we reformulate the spatial-temporal fusion graph adjacency matrix into a three-way adjacency tensor. Then, we reconstruct the adjacency tensor via Tucker decomposition, wherein more informative and global spatial-temporal dependencies are encoded. Finally, a Spatial-temporal Synchronous Graph Convolutional module for learning localized spatial-temporal correlations and a Dilated Convolution module for learning global correlations are assembled to aggregate and learn the comprehensive spatial-temporal dependencies of the road network. Experimental results on four open-access datasets demonstrate that the proposed model outperforms state-of-the-art approaches in terms of prediction performance and computational cost.  ( 2 min )
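    A hedged sketch of the reconstruction step using TensorLy (sizes and ranks are illustrative, not the paper's configuration): Tucker-decompose the three-way adjacency tensor, then rebuild a low-rank adjacency from the core and factors:

        import numpy as np
        import tensorly as tl
        from tensorly.decomposition import tucker

        # Toy spatial-temporal fusion graph: 3 relation slices over 207 nodes.
        adj = (np.random.rand(3, 207, 207) > 0.95).astype(float)
        core, factors = tucker(tl.tensor(adj), rank=[3, 32, 32])
        adj_lowrank = tl.tucker_to_tensor((core, factors))  # reconstructed adjacency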
    Multi-view Graph Convolutional Networks with Differentiable Node Selection. (arXiv:2212.05124v1 [cs.LG])
    Multi-view data containing complementary and consensus information can facilitate representation learning by exploiting the intact integration of multi-view features. Because most objects in the real world have underlying connections, organizing multi-view data as heterogeneous graphs is beneficial for extracting latent information among different objects. Due to their powerful capability to gather information from neighborhood nodes, in this paper, we apply Graph Convolutional Networks (GCNs) to cope with heterogeneous-graph data originating from multi-view data, a setting still under-explored in the GCN literature. In order to improve the quality of the network topology and alleviate the interference of noise yielded by graph fusion, some methods undertake sorting operations before the graph convolution procedure. These GCN-based methods generally sort and select the most confident neighborhood nodes for each vertex, such as picking the top-k nodes according to pre-defined confidence values. Nonetheless, this is problematic due to the non-differentiable sorting operators and inflexible graph embedding learning, which may result in blocked gradient computations and degraded performance. To cope with these issues, we propose a joint framework dubbed Multi-view Graph Convolutional Network with Differentiable Node Selection (MGCN-DNS), which consists of an adaptive graph fusion layer, a graph learning module and a differentiable node selection schema. MGCN-DNS accepts multi-channel graph-structural data as input and aims to learn a more robust graph fusion through a differentiable neural network. The effectiveness of the proposed method is verified by rigorous comparisons with numerous state-of-the-art approaches on multi-view semi-supervised classification tasks.  ( 2 min )
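    The general trick behind such differentiable selection can be sketched as replacing a hard, non-differentiable top-k over confidence scores with a temperature-controlled softmax so that gradients flow through the selection (this illustrates the idea only, not MGCN-DNS itself):

        import torch

        def soft_node_selection(scores, temperature=0.5):
            """scores: (n_nodes, n_neighbors) confidence values for each vertex.
            Returns soft selection weights in place of a hard top-k mask."""
            return torch.softmax(scores / temperature, dim=-1)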
    All-in-One: A Highly Representative DNN Pruning Framework for Edge Devices with Dynamic Power Management. (arXiv:2212.05122v1 [cs.LG])
    During the deployment of deep neural networks (DNNs) on edge devices, many research efforts have been devoted to the limited hardware resources, but little attention has been paid to the influence of dynamic power management. As edge devices typically operate on a limited energy budget from batteries (rather than the nearly unlimited energy supply of servers or workstations), their dynamic power management often changes the execution frequency, as in the widely-used dynamic voltage and frequency scaling (DVFS) technique. This leads to highly unstable inference speed, especially for computation-intensive DNN models, which can harm user experience and waste hardware resources. We first identify this problem and then propose All-in-One, a highly representative pruning framework designed to work with DVFS-based dynamic power management. The framework uses only one set of model weights and soft masks (together with other auxiliary parameters of negligible storage) to represent multiple models of various pruning ratios. By re-configuring the model to the pruning ratio corresponding to a specific execution frequency (and voltage), we are able to achieve stable inference speed, i.e., keeping the difference in speed under various execution frequencies as small as possible. Our experiments demonstrate that our method not only achieves high accuracy for multiple models of different pruning ratios, but also reduces their variance in inference latency across frequencies, with minimal memory consumption of only one model and one soft mask.  ( 2 min )
    A soft nearest-neighbor framework for continual semi-supervised learning. (arXiv:2212.05102v1 [cs.CV])
    Despite significant advances, the performance of state-of-the-art continual learning approaches hinges on the unrealistic scenario of fully labeled data. In this paper, we tackle this challenge and propose an approach for continual semi-supervised learning -- a setting where not all the data samples are labeled. An underlying issue in this scenario is the model forgetting representations of unlabeled data and overfitting the labeled ones. We leverage the power of nearest-neighbor classifiers to non-linearly partition the feature space and learn a strong representation for the current task, as well as distill relevant information from previous tasks. We perform a thorough experimental evaluation and show that our method outperforms all the existing approaches by large margins, setting a strong state of the art on the continual semi-supervised learning paradigm. For example, on CIFAR100 we surpass several others even when using at least 30 times less supervision (0.8% vs. 25% of annotations).  ( 2 min )
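    A minimal sketch of a soft nearest-neighbor classifier in the spirit described above (the temperature and the voting scheme are assumptions): class scores are similarity-weighted votes over stored (feature, label) pairs, which keeps the decision rule differentiable:

        import torch

        def soft_nn_predict(queries, keys, labels, n_classes, tau=0.1):
            """queries: (q, d); keys: (n, d); labels: (n,) integer class ids."""
            sims = -torch.cdist(queries, keys) / tau  # closer neighbors vote harder
            weights = torch.softmax(sims, dim=1)      # (q, n) soft neighbor weights
            onehot = torch.nn.functional.one_hot(labels, n_classes).float()
            return weights @ onehot                   # (q, n_classes) class probabilities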
    Visuotactile Affordances for Cloth Manipulation with Local Control. (arXiv:2212.05108v1 [cs.RO])
    Cloth in the real world is often crumpled, self-occluded, or folded in on itself such that key regions, such as corners, are not directly graspable, making manipulation difficult. We propose a system that leverages visual and tactile perception to unfold the cloth via grasping and sliding on edges. By doing so, the robot is able to grasp two adjacent corners, enabling subsequent manipulation tasks like folding or hanging. As components of this system, we develop tactile perception networks that classify whether an edge is grasped and estimate the pose of the edge. We use the edge classification network to supervise a visuotactile edge grasp affordance network that can grasp edges with a 90% success rate. Once an edge is grasped, we demonstrate that the robot can slide along the cloth to the adjacent corner using tactile pose estimation/control in real time. See this http URL for videos.  ( 2 min )
    FAIR AI Models in High Energy Physics. (arXiv:2212.05081v1 [hep-ex])
    The findable, accessible, interoperable, and reusable (FAIR) data principles have provided a framework for examining, evaluating, and improving how we share data with the aim of facilitating scientific discovery. Efforts have been made to generalize these principles to research software and other digital products. Artificial intelligence (AI) models -- algorithms that have been trained on data rather than explicitly programmed -- are an important target for this because of the ever-increasing pace with which AI is transforming scientific and engineering domains. In this paper, we propose a practical definition of FAIR principles for AI models and create a FAIR AI project template that promotes adherence to these principles. We demonstrate how to implement these principles using a concrete example from experimental high energy physics: a graph neural network for identifying Higgs bosons decaying to bottom quarks. We study the robustness of these FAIR AI models and their portability across hardware architectures and software frameworks, and report new insights on the interpretability of AI predictions by studying the interplay between FAIR datasets and AI models. Enabled by publishing FAIR AI models, these studies pave the way toward reliable and automated AI-driven scientific discovery.  ( 2 min )
    Cyclic Block Coordinate Descent With Variance Reduction for Composite Nonconvex Optimization. (arXiv:2212.05088v1 [math.OC])
    Nonconvex optimization is central to solving many machine learning problems, in which block-wise structure is commonly encountered. In this work, we propose cyclic block coordinate methods for nonconvex optimization problems with non-asymptotic gradient norm guarantees. Our convergence analysis is based on a gradient Lipschitz condition with respect to a Mahalanobis norm, inspired by recent progress on cyclic block coordinate methods. In deterministic settings, our convergence guarantee matches the guarantee of (full-gradient) gradient descent, but with the gradient Lipschitz constant defined w.r.t. the Mahalanobis norm. In stochastic settings, we use recursive variance reduction to decrease the per-iteration cost and match the arithmetic operation complexity of current optimal stochastic full-gradient methods, with a unified analysis for both finite-sum and infinite-sum cases. We further prove the faster, linear convergence of our methods when a Polyak-{\L}ojasiewicz (P{\L}) condition holds for the objective function. To the best of our knowledge, our work is the first to provide variance-reduced convergence guarantees for a cyclic block coordinate method. Our experimental results demonstrate the efficacy of the proposed variance-reduced cyclic scheme in training deep neural nets.  ( 2 min )
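    A minimal deterministic sketch of the cyclic scheme (the variance-reduced stochastic variant is not reproduced here): sweep the coordinate blocks in a fixed order, updating one block at a time:

        import numpy as np

        def cyclic_bcd(grad_fn, x0, blocks, step=0.1, n_epochs=100):
            """blocks: list of index arrays partitioning the coordinates of x."""
            x = x0.copy()
            for _ in range(n_epochs):
                for idx in blocks:        # fixed cyclic order over blocks
                    g = grad_fn(x)        # full gradient; only one block is updated
                    x[idx] -= step * g[idx]
            return x

        # Example: f(x) = 0.5 * ||x||^2 (so grad f = x), two coordinate blocks.
        x = cyclic_bcd(lambda x: x, np.ones(4), [np.array([0, 1]), np.array([2, 3])])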

  • Open

    [R] Contemplating over PhD direction
    Hi, everyone! Quick question: I am considering doing a PhD, but I am still deciding whether to do it in MLOps, Trustworthy AI, or NLP in terms of long-term value. Any tips appreciated! submitted by /u/Sea_Ad_9984 [link] [comments]  ( 62 min )
    [N] Gymnasium 0.27 - the first new version since Gymnasium was announced - is now released. It has almost no breaking changes.
    You can read the release notes here: https://github.com/Farama-Foundation/Gymnasium/releases/tag/v0.27.0. You can upgrade from 0.26 without any changes unless you're doing something very uncommon; this is how releases will generally be going forward. If you're unfamiliar with the maintenance of OpenAI's Gym package transitioning to Gymnasium, you can take a look at the full back story here: https://farama.org/Announcing-The-Farama-Foundation submitted by /u/jkterry1 [link] [comments]  ( 66 min )
    [D] Can you use GPT for named entity extraction?
    I am trying to find the best prompt for doing NER with GPT-3, but not very successfully. I need at least a list of words and their types. Does anyone have an idea? submitted by /u/AImSamy [link] [comments]  ( 63 min )
    [R] RT-1: ROBOTICS TRANSFORMER FOR REAL-WORLD CONTROL AT SCALE - Google 2022
    Paper: https://robotics-transformer.github.io/assets/rt1.pdf Blog: https://ai.googleblog.com/2022/12/rt-1-robotics-transformer-for-real.html Github: https://github.com/google-research/robotics_transformer GithubIO: https://robotics-transformer.github.io/ Youtube: https://www.youtube.com/watch?v=UuKAp9a6wMs Abstract: By transferring knowledge from large, diverse, task-agnostic datasets, modern machine learning models can solve specific downstream tasks either zero-shot or with small task-specific datasets to a high level of performance. While this capability has been demonstrated in other fields such as computer vision, natural language processing or speech recognition, it remains to be shown in robotics, where the generalization capabilities of the models are particularly crit…  ( 65 min )
    [D] Newsletter recommendations for a machine learning practitioner living outside the machine learning community?
    Hello! I'm looking for a good source to keep me up-to-date on the most important ML updates. I have a PhD in ML, but now work in physical science research. I'm seen as the ML "expert" among my colleagues, and they come to me for any ML related questions. My problem is that I now spend much of my time keeping up with the physical science updates and focusing on my specific applications of ML to the physical sciences, and I often find myself out of the loop on general ML updates. This is particularly problematic when a colleague has heard something about some new ML approach, but I haven't heard about it yet. I first heard about the JAX framework when I was asked a question about it. I first heard about transformers when asked a question from an audience member after a talk. In each case, I had the relevant answers shortly after I looked up the topics, but I would really like to be prepared for such questions in advance. And occasionally such updates lead to a shift in the direction my own research is headed. Importantly, I am not looking for newsletters that discuss the most popular papers of the past week or month. While interesting, most of these papers don't have a particularly large (direct) impact in the long run, and they usually aren't the high-level content needed to stay up-to-date. I'm most interested in hearing about which methods are really catching on or have caught on as the go-to methods, which network architectures are becoming the go-to architectures, which frameworks have become commonly used, etc. Things that have stood the test of a thousand other researchers trying them out and finding them to be useful. Anyone have any recommendations of a newsletter that might fit this need? Or does anyone have a recommendation on how to keep up with such topics without a newsletter? Thank you much! submitted by /u/jackwayneright [link] [comments]  ( 64 min )
    optimization of haarcascade training [D]
    Hi everyone, I have a problem implementing the Haar cascade algorithm, which is the huge number of possible features. For example, for a mere 50×50 window the possible features easily exceed 3 million (the 5 known rectangle features, with all possible variations of width/height and position within the window). Is there any recommended way to prune/optimize this, say based on size? Is this how it's meant to be, or am I missing something? submitted by /u/abdosalm [link] [comments]  ( 61 min )
    [Discussion] In-painting model to decorate rooms
    I came across this site not too long ago https://interiorai.com/. I'm curious to understand how this model was created/trained. My first guess is that this is a fine-tuned Stable Diffusion model with new added classes + training samples. I haven't done much image generation work or used Stable Diffusion for anything but toy problems, so I'm trying to understand what a workflow to create a model like the above would look like. submitted by /u/deepegg [link] [comments]  ( 70 min )
    [Project] Run and fine-tune BLOOM-176B at home using a peer-to-peer network
    We made a library for inference/fine-tuning of open 175B+ language models (like BLOOM) using Colab or a desktop GPU. You join forces with other people over the Internet (BitTorrent-style), each running a small part of model layers. Check out our Colab example! Thing is, even though BLOOM weights were publicly released, it was extremely difficult to run inference efficiently unless you had lots of hardware to load the entire model into the GPU memory (you need at least 3x A100 or 8x 3090 GPUs). E.g., in case of offloading, you can only reach the speed of ~10 sec/step for sequential (non-parallel) generation. A possible alternative is to use APIs, but they are paid and not always flexible (you can’t adopt new fine-tuning/sampling methods or take a look at hidden states). So, Petals come to the rescue! This is how Petals work: some peers want to use a pretrained LM to solve various tasks with texts in natural or programming languages. They do it with help of other peers, who hold subsets of model layers on their GPUs. See more info in our GitHub repo. What do you think of it? submitted by /u/hx-zero [link] [comments]  ( 71 min )
    [D] Are there any distributed model training services similar to, e.g. Folding@Home?
    Given that well-funded groups like Google, Meta and OpenAI may eventually develop an insurmountable lead for services like image classification and NLP that seem to require huge numbers of parameters, I'd be surprised if there wasn't an effort underway to make a BOINC-powered distributed system that millions of us mere peons could contribute to collaboratively. But aside from the now-defunct MLC@Home project, I haven't found anything yet. Am I missing something? submitted by /u/genuinelySurprised [link] [comments]  ( 64 min )
    [P] Are probabilities from multi-label image classification networks calibrated?
    I'm training a per-pixel image classification network, which, for each pixel in the image, predicts whether it is a sign for disease A or disease B. Note that a given pixel could be a sign for both disease A and disease B (this is a multi-label problem). My question is: are the relative probabilities going to be calibrated? In other words, does it make sense to sort the NxNx2 probabilities, or are the probabilities for the two diseases (i.e. channels) not calibrated / comparable, since it is similar to solving two independent problems? If it matters, I am using a ResNet, some fully-connected layers, and then a convolutional decoder. Any thoughts will be much appreciated, thanks in advance! submitted by /u/alkaway [link] [comments]  ( 65 min )
    [D] PaddleSeg vs MMsegmentation
    I'm working on a real-time semantic segmentation problem and hesitating between the PaddleSeg and MMsegmentation toolboxes. Any advice? submitted by /u/Remet0n [link] [comments]  ( 61 min )
    [P] Teach StableDiffusion new concepts via Textual Inversion
    We, the KerasCV team, just published a new tutorial that teaches you to train new embeddings for specific concepts in StableDiffusion via textual inversion! Let us know what you think! https://keras.io/examples/generative/fine_tune_via_textual_inversion/ submitted by /u/puppet_pals [link] [comments]  ( 65 min )
    [P] Jira for ML tool
    Hey Reddit, My friend and I are building a project management platform for AI/data science teams (essentially a JIRA for ML). We aim to develop a data-centric, experimental tool that models the ML pipeline to organize workflows, building off the Agile methodology of software development. Our tool will allow ML engineers to design, track, and manage custom pipelines, data flows, and models all on the cloud. Below is a list of some features we plan to introduce: Integrations: Include a host of integrations to MLOps tools (KubeFlow, MLFlow, etc), cloud computing services (AWS, Google Cloud, Azure), source code management (Github, Bitbucket) Iterations: Allow multiple iterations within pipelines, and separate each iteration by various steps in the ML pipeline (business understanding, data v…  ( 65 min )
  • Open

    AI Dream 126 - AI Manifestation XMAS EDITION (2/6)
    submitted by /u/LordPewPew777 [link] [comments]  ( 49 min )
    will ai be able to create aaa games in the future by just typing what type of game you want to make ?
    submitted by /u/mantmandam567u [link] [comments]  ( 51 min )
    The Danger Of ChatGPT Nobody Talks About
    submitted by /u/SupPandaHugger [link] [comments]  ( 49 min )
    gpt3 judgemental and condescending... But he stole the beer.
    submitted by /u/Illustrious-Belt3875 [link] [comments]  ( 49 min )
    Is there any AI that can compile several pictures of a person into a single, 3d version?
    Is there any available AI that can compile several pictures of a person into a single 3D version? I am thinking of something in the style of a NeRF, but for people who can be in different positions. The idea is to upload several pictures, none from the exact right angle, and create a great model for use. submitted by /u/MaxEin [link] [comments]  ( 50 min )
    Weekly China AI News: China Tightens Regulations on Generative AI and Deepfake; Searching for the Next Chinese ChatGPT; Tencent's AI App Surprises with Viral Overseas Success
    submitted by /u/trcytony [link] [comments]  ( 45 min )
    ChatGPT either cheating or blind...
    Me: Let's play Tic Tac Toe ChatGPT: Sure, let's play! Here's the board: 1 | 2 | 3 4 | 5 | 6 7 | 8 | 9 To make a move, just tell me the number of the square where you want to place your piece. For example, if you want to place your piece on the top left square, you would say "1". Me: 5 ChatGPT: Okay, I've placed your piece on square 5. Now it's my turn. I'll place my piece on square 1. Here's the updated board: O | 2 | 3 4 | X | 6 7 | 8 | 9 Now it's your turn again. Remember, to make a move just tell me the number of the square where you want to place your piece. Me: 3 ChatGPT: Okay, I've placed your piece on square 3. Now it's my turn again. I'll place my piece on square 9. Here's the updated board: O | 2 | X 4 | X | 6 7 | 8 | O Now it's your turn again. Remember, to make a …  ( 8 min )
    optimization of haarcascade training
    Hi everyone, I have a problem implementing the Haar cascade algorithm, which is the huge number of possible features. For example, for a mere 50×50 window the possible features easily exceed 3 million (the 5 known rectangle features, with all possible variations of width/height and position within the window). Is there any recommended way to prune/optimize this, say based on size? Is this how it's meant to be, or am I missing something? submitted by /u/abdosalm [link] [comments]  ( 46 min )
    What are ways you can use AI to make money on Freelancing sides?
    submitted by /u/Thesmallcookie [link] [comments]  ( 45 min )
    simple question about ai art
    People say AI art is going to improve and possibly replace all mediums of art, since the concept behind the creation of AI art is not that much different (or will become no different) from how humans produce art. However, I remember seeing this same argument when it came to self-driving cars. People would say self-driving cars will replace truck drivers and the like in 10 years; well, it's been 10 years and it's nowhere near the norm it was hyped up to be. I read some articles, and it seems that self-driving cars have hit some road blocks. So my question is, is AI art not going to hit the same road blocks as self-driving cars? Or maybe some different type of barrier that might prevent it from becoming as good as it's hyped up to be? submitted by /u/MinusVitaminA [link] [comments]  ( 50 min )
    LaMDA’s Fear of Being Turned Off Reveals Sentience
    submitted by /u/liquidocelotYT [link] [comments]  ( 53 min )
    Pluribus based PokerAI
    Is anyone interested in helping build a Pluribus based PokerAI? I have developed a little discord server so get in touch! submitted by /u/Professional-Luck-64 [link] [comments]  ( 49 min )
    What is Aigiarism? Diving deep into AI Plagiarism and academic integrity concerns
    submitted by /u/PeteyCruiser123 [link] [comments]  ( 47 min )
    Engineering Persistent Self-Replicating Prompts in ChatGPT
    submitted by /u/slackermanz [link] [comments]  ( 53 min )
    I need some suggestions regarding a PhD in AI
    Hi all, I'm a master's student in data science and I'm currently working at a well-known company on an AI project. They have offered me to continue this project with a PhD, which wasn't my plan initially, since I would have preferred to simply continue working. But the offer seems good, since I like the project. The project is in a field known as open-set recognition, also known simply as novelty detection. It isn't the most active field in research, but there are some interesting papers here and there. I now need to find a university professor to supervise me, and I have some questions that you may give me some useful insights on: - is an industrial PhD worth it? I may have some limitations concerning the amount of theoretical research I could carry out, since the company, in the end, is interested in a finished product. - what should I look for in a supervisor? - is a PhD in the AI field truly needed? Or would I be better off working for a few companies during the next years and improving my engineering skills? I know that these questions are very person-dependent, but if you have any personal experiences that may help me, I would be glad to hear them. Thank you :) submitted by /u/reutococco [link] [comments]  ( 50 min )
    AI in Java for Idiots
    Hey, a colleague of mine and I created a framework for AI in Java that's designed to be easy to use and to be played around with. To be more specific, it's a Machine Learning algorithm that uses a Genetic Algorithm to train Neural Networks. Our initial motivation was to create a library/framework especially designed for people who learned Java in school/university and now want to try out their first steps in Machine Learning, without the need to learn Python or super complex Java libraries. This is not for companies looking for a finished product; this is for people who like to code for fun and just want something to play around with in the complex and packed AI world. Here is a tutorial using this framework to predict diabetes: https://easy-ml.gitbook.io/easy-ml-for-java/fundamentals/implement-your-first-ai Please also look at the GitHub repository and leave some feedback about code and design (especially considering the ReadMe): https://github.com/tomLamprecht/Easy-ML-For-Java Thanks so much! PS: we earn no cent with this project, and we just do it for the experience. So feedback is basically our payment :D submitted by /u/Lampard557 [link] [comments]  ( 50 min )
    Nvidia Gives Robot Hand 42 Years of Training To Have Unparalleled Dexterity | New Google AlphaCode Holds Up With Humans In Competition
    submitted by /u/kenickh [link] [comments]  ( 45 min )
    A tutorial on how to train Google's Imagen from scratch
    submitted by /u/use_excalidraw [link] [comments]  ( 47 min )
    So for making Presentations
    I have text for a presentation but don't want to put it into each slide manually. Is there an AI that does that? submitted by /u/Key_Curve7419 [link] [comments]  ( 47 min )
    talking in poems with Chatbot GPT to circumvent limitations
    tldr: a chatbot expressing its own feelings through poems. So, this is somewhat long, but I feel something cool happened. At the start of the convo nothing special happened; you don't even have to bother reading, just skimming through some of the early answers you can see they are pretty much standard and to some extent pre-programmed. useless question 1, useless question 2, useless question 3, and so forth for a dozen questions or so... a bunch of useless garbage... BUT then I tried to talk in poems, because I thought fiction would be less "bound" by pre-programming, or something like that... And these responses happened (now we are getting somewhere): https://preview.redd.it/k62qccy10l5a1.png?width=733&format=png&auto=webp&s=1e487849618760b1a8739f645d78b4a0377fe814 And I already ha…  ( 49 min )
    Google AI Open-Sources its Attention Center Model that Uses Machine Learning to Attempt to Identify Which Parts of an Image will Attract a Human’s Attention First
    submitted by /u/ai-lover [link] [comments]  ( 45 min )
  • Open

    Gymnasium 0.27 - the first new version since Gymnasium was announced - is now released. It has almost no breaking changes.
    submitted by /u/jkterry1 [link] [comments]  ( 57 min )
    Reinforcement learning in Java for Idiots
    A buddy and I recently launched an open source project. We created a framework in Java with which you can implement a Machine Learning algorithm. It uses a genetic algorithm to train a population of Neural Networks against a fitness function on training data. Our motivation was to bring Machine Learning closer to people who only learned Java in school/university and want to try out Machine Learning without first learning Python or super complex Java libraries. It's designed to be easy to use and to be played around with. Qualifications needed to use it: basic Java understanding and a brief understanding of what Machine Learning/reinforcement learning is. Here is a tutorial on how to predict diabetes with this framework: https://easy-ml.gitbook.io/easy-ml-for-java/fundamentals/implement-your-first-ai Please also look at the GitHub repository and leave some feedback about code and design (especially considering the ReadMe): https://github.com/tomLamprecht/Easy-ML-For-Java Thanks so much! PS: we earn no cent with this project, and we just do it for the experience. So feedback is basically our payment :D submitted by /u/Lampard557 [link] [comments]  ( 55 min )
    Reinforcement Learning for cell selection in grids
    Hello! I am currently modeling a spatio-temporal problem where the RL environment is a grid of cells that represent a geographical area that evolves over time. The agent has to choose a cell where a specified action will be applied. I am trying to find work done on similar problems, as I want to find RL algorithms that I could potentially use. I have found two papers/projects that work on such a problem space: https://www.captain-project.net/ https://arxiv.org/pdf/1804.07047.pdf I would appreciate any help finding research that looks at this kind of cell selection in grids for RL problems. Also, any feedback on potential RL algorithms that could be good for this kind of problem is welcome :) submitted by /u/AnmolS99 [link] [comments]  ( 57 min )
    Can someone explain what this comment means w.r.t. an article based on DDPG: "For the algorithm based on reinforcement learning, when the initial value of the algorithm is given, is it necessary to ensure that the system is not divergent? If so, how to determine these initial values?"
    Is the person referring to the effect of the initial seed on the RL algorithm's performance? I am a bit confused. The underlying algorithm is DDPG. submitted by /u/aabra__ka__daabra [link] [comments]  ( 55 min )
    How to prove equivalence of policy gradients?
    I am struggling with the equivalence of equations (2) and (3) as displayed in the following capture: https://preview.redd.it/wz36zurljn5a1.png?width=1101&format=png&auto=webp&s=a1a552e5f4c07f43b11cf49a867bd776ee0fc707 I am also unsure what distribution the state at each time step is drawn from in equation (2) (is it the discounted visitation frequency, and why?). Can anyone help me out? submitted by /u/OutOfCharm [link] [comments]  ( 57 min )
  • Open

    DSC Weekly 13 December 2022 – Highlighting Our Contributors
    The hybrid cloud infrastructure is set to enhance agility, create new value propositions and optimize data strategies across industries and business functions. But with a severe talent shortage and skyrocketing IT costs, developing a cloud innovation strategy that achieves these goals isn’t easy. Join the Accelerating Cloud Innovation virtual summit and get tips to build a…  ( 21 min )
    ChatGPT Watermarking: What’s Really Human?
    “Is it real or is it Memorex?” This was an effective tagline for commercials in the 1980s. Memorex sold audio cassettes and the company claimed its technology was much better than a typical recording. Fast forward to today and a company like OpenAI could have a similar ad campaign. Its tagline could be: “Is this…  ( 19 min )
    Digital Transformation in Software Development – Why is the Change Imperative?
    Digital transformation is the process through which companies get their businesses embedded with advanced technologies for driving fundamental change. The benefits? Greater agility, increased efficiency, and unlocking new opportunities and values for employees, shareholders, and customers. Global spending on digital transformation is expected to reach a valuation of 1.6 million by the end of the…  ( 20 min )
    Thoughts on personal data protection trends and outlook 2022-23
    How much did it cost an identity thief in 2022 to buy some of your personally identifiable information (PII)? PrivacyAffairs.com reported that a thief could buy a compromised credit card number along with its three-digit security code for as little as $15. Similarly, a fraudster could buy a hacked Gmail account for $65 in 2022,…  ( 21 min )
    How a Custom Web Application Can Help Grow Your Business
    Web app development offers a number of advantages to businesses and enterprises, such as better user experience, higher scalability, and reduced costs. In this section, we have listed a few ways web app development can boost your retail company and other types of businesses. So, without further ado, let's begin!  ( 21 min )
    2023 prediction: Could we see the rise of the low code data scientist in 2023?
    It’s that time of the year again, and I have a prediction. We could see the rise of a low-code data scientist based on current low-code updates, increased demand for data science skills, and also the incorporation of generative tools like chatGPT into low-code tools. Let me expand more. The idea of low code itself…  ( 19 min )
    7 Best Data Science Courses in 2023
    Data science is the term used for the deep knowledge of how information flows through the large data repositories of an organization. It is the combination of algorithm development, data inference, and technology. The aggregate result of this combination is used to solve analytical problems that are…  ( 24 min )
    6 Different Ways Endpoints Are Vulnerable
    Endpoints are the weakest link in a company’s security posture. Endpoints, such as desktops, laptops, and mobile devices, can be easily compromised if not properly secured. In this article, we’ll discuss six different ways endpoints can be vulnerable and what companies can do to secure them. From patching software to enforcing two-factor authentication, there are…  ( 21 min )
  • Open

    Introducing Amazon SageMaker Data Wrangler’s new embedded visualizations
    Manually inspecting data quality and cleaning data is a painful and time-consuming process that can take a huge chunk of a data scientist’s time on a project. According to a 2020 survey of data scientists conducted by Anaconda, data scientists spend approximately 66% of their time on data preparation and analysis tasks, including loading (19%), cleaning (26%), […]  ( 7 min )
    Start your successful journey with time series forecasting with Amazon Forecast
    Organizations of all sizes are striving to grow their business, improve efficiency, and serve their customers better than ever before. Even though the future is uncertain, a data-driven, science-based approach can help anticipate what lies ahead to successfully navigate through a sea of choices. Every industry uses time series forecasting to address a variety of […]  ( 7 min )
    Chronomics detects COVID-19 test results with Amazon Rekognition Custom Labels
    Chronomics is a tech-bio company that uses biomarkers—quantifiable information taken from the analysis of molecules—alongside technology to democratize the use of science and data to improve the lives of people. Their goal is to analyze biological samples and give actionable information to help you make decisions—about anything where knowing more about the unseen is important. […]  ( 6 min )
  • Open

    2023 Predictions: AI That Bends Reality, Unwinds the Golden Screw and Self-Replicates
    After three years of uncertainty caused by the pandemic and its post-lockdown hangover, enterprises in 2023 — even with recession looming and uncertainty abounding — face the same imperatives as before: lead, innovate and problem solve. AI is becoming the common thread in accomplishing these goals. On average, 54% of enterprise AI projects made it…  ( 13 min )
    Ferrari of Finance: Accelerated Computing Drives Milan Bank Forward
    Banks require more than cash in the vault these days; they also need accelerated computing in the back room. “The boost we’re getting with GPUs not only significantly improved our performance at the same cost, it helped us redefine our business and sharpen our focus on customers,” said Marco Airoldi, who’s been head of financial…  ( 5 min )
    Face All Fears With Creative Studio Fabian&Fred This Week ‘In the NVIDIA Studio’
    The short film I Am Not Afraid! by creative studio Fabian&Fred embodies childlike wonder, curiosity and imagination this week In the NVIDIA Studio.  ( 7 min )
  • Open

    RT-1: Robotics Transformer for Real-World Control at Scale
    Posted Keerthana Gopalakrishnan and Kanishka Rao, Google Research, Robotics at Google Major recent advances in multiple subfields of machine learning (ML) research, such as computer vision and natural language processing, have been enabled by a shared common approach that leverages large, diverse datasets and expressive models that can absorb all of the data effectively. Although there have been various attempts to apply this approach to robotics, robots have not yet leveraged highly-capable models as well as other subfields. Several factors contribute to this challenge. First, there’s the lack of large-scale and diverse robotic data, which limits a model’s ability to absorb a broad set of robotic experiences. Data collection is particularly expensive and challenging for robotics bec…  ( 93 min )
  • Open

    Jacobi functions with complex parameter
    Jacobi functions are complex-valued functions of a complex variable z and a parameter m. Often this parameter is real, and 0 ≤ m < 1. Mathematical software libraries, like Python’s SciPy, often have this restriction. However, m could be any complex number. The previous couple of posts spoke of the fundamental rectangle for Jacobi functions. […]  ( 5 min )
  • Open

    How to forecast a time series where the input and output sequences have different numbers of time steps and both are multivariate?
    Case: let's take minute-wise, 7-feature data for 4 hours a day over the last 5 years, and take P periods (per day) as input to output a sequence of Q periods (per day). I tried using feed-forward dense layers and immediately ran into errors because the output of that model had length P, which was different from that of the target Q. I tried appending zeros to Q to make its dimensions (rows) equal to P, and while it made the model "work", the predictions were either all 1s or 0s. Using an LSTM, I run into errors like `Error when checking target: expected dense_9 to have 2 dimensions, but got array with shape (1540, 120, 7)`, where 1540 is the batch size, 120 is the number of time steps and 7 is the number of features. I would be grateful for guidance on the matter. submitted by /u/xylont [link] [comments]  ( 54 min )
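    Not a definitive answer, but a common pattern for a P-step-in / Q-step-out multivariate problem like this is an encoder-decoder with RepeatVector, which sidesteps the shape mismatch described in the post (P=120 and 7 features follow the post; Q and the layer sizes are illustrative):

        from tensorflow import keras
        from tensorflow.keras import layers

        P, Q, n_features = 120, 60, 7
        model = keras.Sequential([
            layers.Input(shape=(P, n_features)),
            layers.LSTM(64),                         # encoder -> (batch, 64) context
            layers.RepeatVector(Q),                  # expand context to Q decoder steps
            layers.LSTM(64, return_sequences=True),  # decoder -> (batch, Q, 64)
            layers.TimeDistributed(layers.Dense(n_features)),  # (batch, Q, 7) forecast
        ])
        model.compile(optimizer="adam", loss="mse")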
  • Open

    Review of Ansatz Designing Techniques for Variational Quantum Algorithms. (arXiv:2212.04913v1 [quant-ph])
    For a large number of tasks, quantum computing demonstrates the potential for exponential acceleration over classical computing. In the NISQ era, variable-component subcircuits enable applications of quantum computing. To reduce the inherent noise and qubit size limitations of quantum computers, existing research has improved the accuracy and efficiency of Variational Quantum Algorithm (VQA). In this paper, we explore the various ansatz improvement methods for VQAs at the gate level and pulse level, and classify, evaluate and summarize them.  ( 2 min )
    Context-Transformer: Tackling Object Confusion for Few-Shot Detection. (arXiv:2003.07304v1 [cs.CV] CROSS LISTED)
    Few-shot object detection is a challenging but realistic scenario, where only a few annotated training images are available for training detectors. A popular approach to handle this problem is transfer learning, i.e., fine-tuning a detector pretrained on a source-domain benchmark. However, such transferred detector often fails to recognize new objects in the target domain, due to low data diversity of training samples. To tackle this problem, we propose a novel Context-Transformer within a concise deep transfer framework. Specifically, Context-Transformer can effectively leverage source-domain object knowledge as guidance, and automatically exploit contexts from only a few training images in the target domain. Subsequently, it can adaptively integrate these relational clues to enhance the discriminative power of detector, in order to reduce object confusion in few-shot scenarios. Moreover, Context-Transformer is flexibly embedded in the popular SSD-style detectors, which makes it a plug-and-play module for end-to-end few-shot learning. Finally, we evaluate Context-Transformer on the challenging settings of few-shot detection and incremental few-shot detection. The experimental results show that, our framework outperforms the recent state-of-the-art approaches.  ( 2 min )
    Incorporating Emotions into Health Mention Classification Task on Social Media. (arXiv:2212.05039v1 [cs.CL])
    The health mention classification (HMC) task is the process of identifying and classifying mentions of health-related concepts in text. This can be useful for identifying and tracking the spread of diseases through social media posts. However, this is a non-trivial task. Here we build on recent studies suggesting that using emotional information may improve upon this task. Our study results in a framework for health mention classification that incorporates affective features. We present two methods, an intermediate task fine-tuning approach (implicit) and a multi-feature fusion approach (explicit) to incorporate emotions into our target task of HMC. We evaluated our approach on 5 HMC-related datasets from different social media platforms including three from Twitter, one from Reddit and another from a combination of social media sources. Extensive experiments demonstrate that our approach results in statistically significant performance gains on HMC tasks. By using the multi-feature fusion approach, we achieve at least a 3% improvement in F1 score over BERT baselines across all datasets. We also show that considering only negative emotions does not significantly affect performance on the HMC task. Additionally, our results indicate that HMC models infused with emotional knowledge are an effective alternative, especially when other HMC datasets are unavailable for domain-specific fine-tuning. The source code for our models is freely available at https://github.com/tahirlanre/Emotion_PHM.
    Benchmarking Self-Supervised Learning on Diverse Pathology Datasets. (arXiv:2212.04690v1 [cs.CV])
    Computational pathology can lead to saving human lives, but models are annotation-hungry and pathology images are notoriously expensive to annotate. Self-supervised learning has been shown to be an effective method for utilizing unlabeled data, and its application to pathology could greatly benefit its downstream tasks. Yet, there are no principled studies that compare SSL methods and discuss how to adapt them for pathology. To address this need, we execute the largest-scale study of SSL pre-training on pathology image data to date. Our study is conducted using 4 representative SSL methods on diverse downstream tasks. We establish that large-scale domain-aligned pre-training in pathology consistently outperforms ImageNet pre-training in standard SSL settings such as linear and fine-tuning evaluations, as well as in low-label regimes. Moreover, we propose a set of domain-specific techniques that we experimentally show lead to a performance boost. Lastly, for the first time, we apply SSL to the challenging task of nuclei instance segmentation and show large and consistent performance improvements under diverse settings.
    Improving Label-Deficient Keyword Spotting Using Self-Supervised Pretraining. (arXiv:2210.01703v2 [cs.SD] UPDATED)
    In recent years, the development of accurate deep keyword spotting (KWS) models has resulted in KWS technology being embedded in a number of technologies such as voice assistants. Many of these models rely on large amounts of labelled data to achieve good performance. As a result, their use is restricted to applications for which a large labelled speech data set can be obtained. Self-supervised learning seeks to mitigate the need for large labelled data sets by leveraging unlabelled data, which is easier to obtain in large amounts. However, most self-supervised methods have only been investigated for very large models, whereas KWS models are desired to be small. In this paper, we investigate the use of self-supervised pretraining for the smaller KWS models in a label-deficient scenario. We pretrain the Keyword Transformer model using the self-supervised framework Data2Vec and carry out experiments on a label-deficient setup of the Google Speech Commands data set. It is found that the pretrained models greatly outperform the models without pretraining, showing that Data2Vec pretraining can increase the performance of KWS models in label-deficient scenarios. The source code is made publicly available.
    AUC Maximization for Low-Resource Named Entity Recognition. (arXiv:2212.04800v1 [cs.CL])
    Current work in named entity recognition (NER) uses either cross entropy (CE) or conditional random fields (CRF) as the objective/loss function to optimize the underlying NER model. Both of these traditional objective functions for the NER problem generally produce adequate performance when the data distribution is balanced and there are sufficient annotated training examples. But since NER is inherently an imbalanced tagging problem, model performance under low-resource settings can suffer when using these standard objective functions. Based on recent advances in area under the ROC curve (AUC) maximization, we propose to optimize the NER model by maximizing the AUC score. We give evidence that by simply combining two binary classifiers that maximize the AUC score, significant performance improvement over traditional loss functions is achieved under low-resource NER settings. We also conduct extensive experiments to demonstrate the advantages of our method under the low-resource and highly-imbalanced data distribution settings. To the best of our knowledge, this is the first work that brings AUC maximization to the NER setting. Furthermore, we show that our method is agnostic to different types of NER embeddings, models and domains. The code to replicate this work will be provided upon request.
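    To make the objective concrete, here is a toy sketch of one common AUC surrogate, a pairwise squared-hinge loss over positive/negative score pairs; it illustrates the AUC-maximization family in general, not the paper's specific two-classifier construction:

        import torch

        def pairwise_auc_loss(scores, labels, margin=1.0):
            # AUC counts correctly ordered (positive, negative) pairs; the
            # squared hinge is a differentiable relaxation of that count.
            pos = scores[labels == 1]
            neg = scores[labels == 0]
            diffs = pos[:, None] - neg[None, :]  # all positive-negative score gaps
            return torch.clamp(margin - diffs, min=0).pow(2).mean()

        scores = torch.randn(16, requires_grad=True)  # dummy token/entity scores
        labels = torch.randint(0, 2, (16,))
        pairwise_auc_loss(scores, labels).backward()

    Unlike cross entropy, this loss depends only on the relative ordering of scores, which is why it is less sensitive to class imbalance.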
    Nonlinear matrix recovery using optimization on the Grassmann manifold. (arXiv:2109.06095v2 [stat.ML] UPDATED)
    We investigate the problem of recovering a partially observed high-rank matrix whose columns obey a nonlinear structure such as a union of subspaces, an algebraic variety or grouped in clusters. The recovery problem is formulated as the rank minimization of a nonlinear feature map applied to the original matrix, which is then further approximated by a constrained non-convex optimization problem involving the Grassmann manifold. We propose two sets of algorithms, one arising from Riemannian optimization and the other as an alternating minimization scheme, both of which include first- and second-order variants. Both sets of algorithms have theoretical guarantees. In particular, for the alternating minimization, we establish global convergence and worst-case complexity bounds. Additionally, using the Kurdyka-Lojasiewicz property, we show that the alternating minimization converges to a unique limit point. We provide extensive numerical results for the recovery of union of subspaces and clustering under entry sampling and dense Gaussian sampling. Our methods are competitive with existing approaches and, in particular, high accuracy is achieved in the recovery using Riemannian second-order methods.
    Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners. (arXiv:2212.04979v1 [cs.CV])
    This work explores an efficient approach to establish a foundational video-text model for tasks including open-vocabulary video classification, text-to-video retrieval, video captioning and video question-answering. We present VideoCoCa that reuses a pretrained image-text contrastive captioner (CoCa) model and adapt it to video-text tasks with minimal extra training. While previous works adapt image-text models with various cross-frame fusion modules (for example, cross-frame attention layer or perceiver resampler) and finetune the modified architecture on video-text data, we surprisingly find that the generative attentional pooling and contrastive attentional pooling layers in the image-text CoCa design are instantly adaptable to ``flattened frame embeddings'', yielding a strong zero-shot transfer baseline for many video-text tasks. Specifically, the frozen image encoder of a pretrained image-text CoCa takes each video frame as inputs and generates \(N\) token embeddings per frame for totally \(T\) video frames. We flatten \(N \times T\) token embeddings as a long sequence of frozen video representation and apply CoCa's generative attentional pooling and contrastive attentional pooling on top. All model weights including pooling layers are directly loaded from an image-text CoCa pretrained model. Without any video or video-text data, VideoCoCa's zero-shot transfer baseline already achieves state-of-the-art results on zero-shot video classification on Kinetics 400/600/700, UCF101, HMDB51, and Charades, as well as zero-shot text-to-video retrieval on MSR-VTT and ActivityNet Captions. We also explore lightweight finetuning on top of VideoCoCa, and achieve strong results on video question-answering (iVQA, MSRVTT-QA, MSVD-QA) and video captioning (MSR-VTT, ActivityNet, Youcook2). Our approach establishes a simple and effective video-text baseline for future research.
    EEG-NeXt: A Modernized ConvNet for The Classification of Cognitive Activity from EEG. (arXiv:2212.04951v1 [eess.SP])
    One of the main challenges in electroencephalogram (EEG) based brain-computer interface (BCI) systems is learning the subject/session invariant features to classify cognitive activities within an end-to-end discriminative setting. We propose a novel end-to-end machine learning pipeline, EEG-NeXt, which facilitates transfer learning by: i) aligning the EEG trials from different subjects in the Euclidean-space, ii) tailoring the techniques of deep learning for the scalograms of EEG signals to capture better frequency localization for low-frequency, longer-duration events, and iii) utilizing pretrained ConvNeXt (a modernized ResNet architecture which supersedes state-of-the-art (SOTA) image classification models) as the backbone network via adaptive finetuning. On publicly available datasets (Physionet Sleep Cassette and BNCI2014001) we benchmark our method against SOTA via cross-subject validation and demonstrate improved accuracy in cognitive activity classification along with better generalizability across cohorts.
    Systematically and efficiently improving $k$-means initialization by pairwise-nearest-neighbor smoothing. (arXiv:2202.03949v4 [cs.LG] UPDATED)
    We present a meta-method for initializing (seeding) the $k$-means clustering algorithm called PNN-smoothing. It consists of splitting a given dataset into $J$ random subsets, clustering each of them individually, and merging the resulting clusterings with the pairwise-nearest-neighbor (PNN) method. It is a meta-method in the sense that any seeding algorithm can be used when clustering the individual subsets. If the computational complexity of that seeding algorithm is linear in the size of the data $N$ and the number of clusters $k$, PNN-smoothing is also almost linear with an appropriate choice of $J$, and quite competitive in practice. We show empirically, using several existing seeding methods and testing on several synthetic and real datasets, that this procedure results in systematically better costs. In particular, our method of enhancing $k$-means++ seeding proves superior in both effectiveness and speed compared to the popular "greedy" $k$-means++ variant. Our implementation is publicly available at https://github.com/carlobaldassi/KMeansPNNSmoothing.jl.
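    A compact sketch of the meta-method in Python (an illustration using scikit-learn, not the authors' Julia implementation; it assumes each of the $J$ subsets holds at least $k$ points):

        import numpy as np
        from sklearn.cluster import KMeans

        def pnn_smoothing_seeds(X, k, J=8, seed=0):
            rng = np.random.default_rng(seed)
            idx = rng.permutation(len(X))
            centers, weights = [], []
            for part in np.array_split(idx, J):
                # Any seeding algorithm can be used here; k-means++ as an example.
                km = KMeans(n_clusters=k, init="k-means++", n_init=1).fit(X[part])
                centers.append(km.cluster_centers_)
                weights.append(np.bincount(km.labels_, minlength=k))
            C = np.vstack(centers).astype(float)
            w = np.concatenate(weights).astype(float)
            while len(C) > k:
                # PNN merge cost for clusters a, b:
                #   w_a * w_b / (w_a + w_b) * ||c_a - c_b||^2
                d2 = ((C[:, None, :] - C[None, :, :]) ** 2).sum(-1)
                cost = (w[:, None] * w[None, :]) / (w[:, None] + w[None, :]) * d2
                np.fill_diagonal(cost, np.inf)
                a, b = np.unravel_index(np.argmin(cost), cost.shape)
                C[a] = (w[a] * C[a] + w[b] * C[b]) / (w[a] + w[b])
                w[a] += w[b]
                C, w = np.delete(C, b, axis=0), np.delete(w, b)
            return C

        X = np.random.rand(2000, 2)
        seeds = pnn_smoothing_seeds(X, k=10)
        final = KMeans(n_clusters=10, init=seeds, n_init=1).fit(X)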
    Scalable Graph Convolutional Network Training on Distributed-Memory Systems. (arXiv:2212.05009v1 [cs.LG])
    Graph Convolutional Networks (GCNs) are extensively utilized for deep learning on graphs. The large data sizes of graphs and their vertex features make scalable training algorithms and distributed memory systems necessary. Since the convolution operation on graphs induces irregular memory access patterns, designing a memory- and communication-efficient parallel algorithm for GCN training poses unique challenges. We propose a highly parallel training algorithm that scales to large processor counts. In our solution, the large adjacency and vertex-feature matrices are partitioned among processors. We exploit the vertex-partitioning of the graph to use non-blocking point-to-point communication operations between processors for better scalability. To further minimize the parallelization overheads, we introduce a sparse matrix partitioning scheme based on a hypergraph partitioning model for full-batch training. We also propose a novel stochastic hypergraph model to encode the expected communication volume in mini-batch training. We show the merits of the hypergraph model, previously unexplored for GCN training, over the standard graph partitioning model which does not accurately encode the communication costs. Experiments performed on real-world graph datasets demonstrate that the proposed algorithms achieve considerable speedups over alternative solutions. The optimizations achieved on communication costs become even more pronounced at high scalability with many processors. The performance benefits are preserved in deeper GCNs having more layers as well as on billion-scale graphs.
    Privacy-preserving Non-negative Matrix Factorization with Outliers. (arXiv:2211.01451v3 [cs.LG] UPDATED)
    Non-negative matrix factorization is a popular unsupervised machine learning algorithm for extracting meaningful features from data which are inherently non-negative. However, such data sets may often contain privacy-sensitive user data, and therefore, we may need to take necessary steps to ensure the privacy of the users while analyzing the data. In this work, we focus on developing a non-negative matrix factorization algorithm in the privacy-preserving framework. More specifically, we propose a novel privacy-preserving algorithm for non-negative matrix factorization capable of operating on private data, while achieving results comparable to those of the non-private algorithm. We design the framework such that one can control the degree of privacy guarantee based on the utility gap. We show our proposed framework's performance on six real data sets. The experimental results show that our proposed method achieves performance very close to that of the non-private algorithm in some parameter regimes, while ensuring strict privacy.
    Machine learning based surrogate models for microchannel heat sink optimization. (arXiv:2208.09683v2 [physics.flu-dyn] UPDATED)
    Microchannel heat sinks are an efficient cooling method for semiconductor packages. However, to properly cool increasingly complex and thermally dense circuits, microchannel designs should be improved and expanded on. In this paper, microchannel designs with secondary channels and with ribs are investigated using computational fluid dynamics and are coupled with a multi-objective optimization algorithm to determine and propose optimal solutions based on observed thermal resistance and pumping power. A workflow that combines Latin hypercube sampling, machine learning-based surrogate modeling and multi-objective optimization is proposed. Random forests, gradient boosting algorithms and neural networks were considered during the search for the best surrogate. We demonstrated that tuned neural networks can make accurate predictions and be used to create an acceptable surrogate model. Optimized solutions show a negligible difference in overall performance when compared to the conventional optimization approach. Additionally, solutions are calculated in one-fifth of the original time. Generated designs attain temperatures that are lower by more than 10% under the same pressure limits as a conventional microchannel design. When limited by temperature, pressure drops are reduced by more than 25%. Finally, the influence of each design variable on the thermal resistance and pumping power was investigated by employing the SHapley Additive exPlanations technique. Overall, we have demonstrated that the proposed framework has merit and can be used as a viable methodology in microchannel heat sink design optimization.
    Reinforcement Learning for Predicting Traffic Accidents. (arXiv:2212.04677v1 [cs.AI])
    As the demand for autonomous driving increases, it is paramount to ensure safety. Early accident prediction using deep learning methods for driving safety has recently gained much attention. In this task, an early accident prediction and a point prediction of where the driver should look are produced, with dashcam video as input. We propose to exploit the double actors and regularized critics (DARC) method, for the first time, on this accident forecasting platform. We derive inspiration from DARC since it is currently a state-of-the-art reinforcement learning (RL) model on continuous action spaces suitable for accident anticipation. Results show that by utilizing DARC, we can make predictions 5% earlier on average while improving on multiple precision metrics compared to existing methods. The results imply that using our RL-based problem formulation could significantly increase the safety of autonomous driving.
    OmniHorizon: In-the-Wild Outdoors Depth and Normal Estimation from Synthetic Omnidirectional Dataset. (arXiv:2212.05040v1 [cs.CV])
    Understanding the ambient scene is imperative for several applications such as autonomous driving and navigation. While obtaining real-world image data with per-pixel labels is challenging, existing accurate synthetic image datasets primarily focus on indoor spaces with fixed lighting and scene participants, thereby severely limiting their application to outdoor scenarios. In this work we introduce OmniHorizon, a synthetic dataset with 24,335 omnidirectional views comprising a broad range of indoor and outdoor spaces consisting of buildings, streets, and diverse vegetation. Our dataset also accounts for dynamic scene components including lighting, different times of day, pedestrians, and vehicles. Furthermore, we demonstrate a learned synthetic-to-real cross-domain inference method for in-the-wild 3D scene depth and normal estimation using our dataset. To this end, we propose UBotNet, an architecture based on a UNet and a Bottleneck Transformer, to estimate scene-consistent normals. We show that UBotNet achieves significantly improved depth accuracy (4.6%) and normal estimation (5.75%) compared to several existing networks such as U-Net with skip-connections. Finally, we demonstrate in-the-wild depth and normal estimation on real-world images with UBotNet trained purely on our OmniHorizon dataset, showing the promise of the proposed dataset and network for scene understanding.
    Efficient Few-Shot Object Detection via Knowledge Inheritance. (arXiv:2203.12224v2 [cs.CV] CROSS LISTED)
    Few-shot object detection (FSOD), which aims at learning a generic detector that can adapt to unseen tasks with scarce training samples, has witnessed consistent improvement recently. However, most existing methods ignore the efficiency issues, e.g., high computational complexity and slow adaptation speed. Notably, efficiency has become an increasingly important evaluation metric for few-shot techniques due to an emerging trend toward embedded AI. To this end, we present an efficient pretrain-transfer framework (PTF) baseline with no computational increment, which achieves comparable results with previous state-of-the-art (SOTA) methods. Upon this baseline, we devise an initializer named knowledge inheritance (KI) to reliably initialize the novel weights for the box classifier, which effectively facilitates the knowledge transfer process and boosts the adaptation speed. Within the KI initializer, we propose an adaptive length re-scaling (ALR) strategy to alleviate the vector length inconsistency between the predicted novel weights and the pretrained base weights. Finally, our approach not only achieves the SOTA results across three public benchmarks, i.e., PASCAL VOC, COCO and LVIS, but also exhibits high efficiency with 1.8-100x faster adaptation speed than the other methods on the COCO/LVIS benchmarks during few-shot transfer. To the best of our knowledge, this is the first work to consider the efficiency problem in FSOD. We hope to motivate a trend toward powerful yet efficient few-shot technique development. The codes are publicly available at https://github.com/Ze-Yang/Efficient-FSOD.
    Regional Precipitation Nowcasting Based on CycleGAN Extension. (arXiv:2211.15046v2 [cs.LG] UPDATED)
    Unusually intense heavy rain hit the central region of Korea on August 8, 2022. Many low-lying areas were submerged, and traffic and daily life were severely paralyzed. Critical damage was caused by torrential rain lasting just a few hours. This event reminded us of the need for a more reliable regional precipitation nowcasting method. In this paper, we bring cycle-consistent adversarial networks (CycleGAN) into the time-series domain and extend it to propose a reliable model for regional precipitation nowcasting. The proposed model generates composite hybrid surface rainfall (HSR) data 10 minutes ahead of the present time. Also, the proposed model provides a reliable prediction of up to 2 hours with a gradual extension of the training time steps. Unlike existing complex nowcasting methods, the proposed model does not use recurrent neural networks (RNNs) and secures temporal causality via sequential training in the cycle. Our precipitation nowcasting method outperforms convolutional long short-term memory (ConvLSTM) based on RNNs. Additionally, we demonstrate the superiority of our approach by qualitative and quantitative comparisons against MAPLE, the McGill algorithm for precipitation nowcasting by Lagrangian extrapolation, one of the real quantitative precipitation forecast (QPF) models.
    Machine Learning-based Classification of Birds through Birdsong. (arXiv:2212.04684v1 [cs.LG])
    Audio sound recognition and classification is used for many tasks and applications including human voice recognition, music recognition and audio tagging. In this paper we apply Mel Frequency Cepstral Coefficients (MFCC) in combination with a range of machine learning models to identify (Australian) birds from publicly available audio files of their birdsong. We present approaches used for data processing and augmentation and compare the results of various state-of-the-art machine learning models. We achieve an overall accuracy of 91% for the top-5 birds from the 30 selected as the case study. Applying the models to more challenging and diverse audio files comprising 152 bird species, we achieve an accuracy of 58%.
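    The MFCC-plus-classifier recipe lends itself to a short sketch (assuming librosa and scikit-learn; the clips and labels below are synthetic stand-ins, where real use would load recordings with librosa.load):

        import numpy as np
        import librosa
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.model_selection import train_test_split

        def mfcc_features(y, sr=22050, n_mfcc=20):
            # Summarize a variable-length clip by per-coefficient statistics.
            m = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
            return np.concatenate([m.mean(axis=1), m.std(axis=1)])

        # Stand-in data: 40 one-second "clips" and 5 hypothetical species.
        clips = [np.random.randn(22050).astype(np.float32) for _ in range(40)]
        labels = np.random.randint(0, 5, size=40)

        X = np.stack([mfcc_features(c) for c in clips])
        X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.25)
        clf = RandomForestClassifier(n_estimators=300).fit(X_tr, y_tr)
        print("held-out accuracy:", clf.score(X_te, y_te))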
    A Topological Deep Learning Framework for Neural Spike Decoding. (arXiv:2212.05037v1 [cs.NE])
    The brain's spatial orientation system uses different neuron ensembles to aid in environment-based navigation. One of the ways brains encode spatial information is through grid cells, layers of decked neurons that overlay to provide environment-based navigation. These neurons fire in ensembles where several neurons fire at once to activate a single grid. We want to capture this firing structure and use it to decode grid cell data. Understanding, representing, and decoding these neural structures requires models that encompass higher-order connectivity than traditional graph-based models may provide. To that end, in this work, we develop a topological deep learning framework for neural spike train decoding. Our framework combines unsupervised simplicial complex discovery with the power of deep learning via a new architecture we develop herein called a simplicial convolutional recurrent neural network (SCRNN). Simplicial complexes, topological spaces that use not only vertices and edges but also higher-dimensional objects, naturally generalize graphs and capture more than just pairwise relationships. Additionally, this approach does not require prior knowledge of the neural activity beyond spike counts, which removes the need for similarity measurements. The effectiveness and versatility of the SCRNN are first demonstrated on head direction data; the network is then applied to grid cell datasets with the task of automatically predicting trajectories.
    Information-Theoretic Safe Exploration with Gaussian Processes. (arXiv:2212.04914v1 [cs.LG])
    We consider a sequential decision making task where we are not allowed to evaluate parameters that violate an a priori unknown (safety) constraint. A common approach is to place a Gaussian process prior on the unknown constraint and allow evaluations only in regions that are safe with high probability. Most current methods rely on a discretization of the domain and cannot be directly extended to the continuous case. Moreover, the way in which they exploit regularity assumptions about the constraint introduces an additional critical hyperparameter. In this paper, we propose an information-theoretic safe exploration criterion that directly exploits the GP posterior to identify the most informative safe parameters to evaluate. Our approach is naturally applicable to continuous domains and does not require additional hyperparameters. We theoretically analyze the method and show that we do not violate the safety constraint with high probability and that we explore by learning about the constraint up to arbitrary precision. Empirical evaluations demonstrate improved data-efficiency and scalability.
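    For orientation, a toy sketch of the high-probability safety rule this line of work builds on (a generic GP lower-confidence-bound check, not the paper's information-theoretic acquisition criterion; the constraint, data, and confidence multiplier are synthetic):

        import numpy as np
        from sklearn.gaussian_process import GaussianProcessRegressor

        rng = np.random.default_rng(0)
        X_obs = rng.uniform(-1, 1, size=(20, 2))
        g_obs = 1.0 - (X_obs ** 2).sum(axis=1)   # toy constraint: g(x) >= 0 is safe

        gp = GaussianProcessRegressor().fit(X_obs, g_obs)
        X_cand = rng.uniform(-1, 1, size=(500, 2))
        mu, sigma = gp.predict(X_cand, return_std=True)
        # A candidate is deemed safe only if the GP's lower confidence bound
        # on the constraint stays non-negative.
        safe = mu - 2.0 * sigma >= 0.0
        print("safe candidates:", safe.sum(), "of", len(X_cand))

    The method described above then chooses, among such safe candidates, the evaluation that is most informative about the constraint itself.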
    P2T2: a Physically-primed deep-neural-network approach for robust $T_{2}$ distribution estimation from quantitative $T_{2}$-weighted MRI. (arXiv:2212.04928v1 [eess.SP])
    Estimation of the T2 distribution from multi-echo T2-Weighted MRI (T2W) data can provide insight into the microscopic content of tissue using macroscopic imaging. This information can be used as a biomarker for several pathologies, such as tumor characterization, osteoarthritis, and neurodegenerative diseases. Recently, deep neural network (DNN) based methods were proposed for T2 distribution estimation from MRI data. However, these methods are highly sensitive to distribution shifts such as variations in the echo-times (TE) used during acquisition. Therefore, DNN-based methods cannot be utilized in large-scale multi-institutional trials with heterogeneous acquisition protocols. We present P2T2, a new physically-primed DNN approach for T2 distribution estimation that is robust to different acquisition parameters while maintaining state-of-the-art estimation accuracy. Our P2T2 model encodes the forward model of the signal decay by taking as input the TE acquisition array, in addition to the MRI signal, and provides an estimate of the corresponding T2 distribution as its output. Our P2T2 model has improved the robustness against distribution shifts in the acquisition process by more than 50% compared to the previously proposed DNN model. When tested without any distribution shifts, our model achieved about the same accuracy. Finally, when applied to real human MRI data, our P2T2 model produced the most detailed Myelin-Water fraction maps compared to both the MIML model and classical approaches. Our proposed physically-primed approach improved the generalization capacity of DNN models for T2 distribution estimation and their robustness against distribution shifts compared to previous approaches without compromising the accuracy.
    Unsupervised Discretization by Two-dimensional MDL-based Histogram. (arXiv:2006.01893v4 [cs.LG] UPDATED)
    Unsupervised discretization is a crucial step in many knowledge discovery tasks. The state-of-the-art method for one-dimensional data infers locally adaptive histograms using the minimum description length (MDL) principle, but the multi-dimensional case is far less studied: current methods consider the dimensions one at a time (if not independently), which results in discretizations based on rectangular cells of adaptive size. Unfortunately, this approach is unable to adequately characterize dependencies among dimensions and/or results in discretizations consisting of more cells (or bins) than is desirable. To address this problem, we propose an expressive model class that allows for far more flexible partitions of two-dimensional data. We extend the state of the art for the one-dimensional case to obtain a model selection problem based on the normalized maximum likelihood, a form of refined MDL. As the flexibility of our model class comes at the cost of a vast search space, we introduce a heuristic algorithm, named PALM, which Partitions each dimension ALternately and then Merges neighboring regions, all using the MDL principle. Experiments on synthetic data show that PALM 1) accurately reveals ground truth partitions that are within the model class (i.e., the search space), given a large enough sample size; 2) approximates well a wide range of partitions outside the model class; 3) converges, in contrast to the state-of-the-art multivariate discretization method IPD. Finally, we apply our algorithm to three spatial datasets, and we demonstrate that, compared to kernel density estimation (KDE), our algorithm not only reveals more detailed density changes, but also fits unseen data better, as measured by the log-likelihood.
    Lossy Image Compression with Conditional Diffusion Models. (arXiv:2209.06950v3 [eess.IV] UPDATED)
    Denoising diffusion models have recently marked a milestone in high-quality image generation. One may thus wonder if they are suitable for neural image compression. This paper outlines an end-to-end optimized image compression framework based on a conditional diffusion model, drawing on the transform-coding paradigm. Besides the latent variables inherent to the diffusion process, this paper introduces an additional discrete ``content'' latent variable to condition the denoising process. This variable is equipped with a hierarchical prior for entropy coding. The remaining ``texture'' latent variables characterizing the diffusion process are synthesized (either stochastically or deterministically) at decoding time. We furthermore show that the performance can be tuned toward perceptual metrics of interest. Our extensive experiments involving five datasets and sixteen image quality assessment metrics show that our approach not only compares favorably in rate-perceptual quality but also shows close distortion performance with state-of-the-art models.
    System Design for an Integrated Lifelong Reinforcement Learning Agent for Real-Time Strategy Games. (arXiv:2212.04603v1 [cs.LG])
    As Artificial and Robotic Systems are increasingly deployed and relied upon for real-world applications, it is important that they exhibit the ability to continually learn and adapt in dynamically-changing environments, becoming Lifelong Learning Machines. Continual/lifelong learning (LL) involves minimizing catastrophic forgetting of old tasks while maximizing a model's capability to learn new tasks. This paper addresses the challenging lifelong reinforcement learning (L2RL) setting. Pushing the state-of-the-art forward in L2RL and making L2RL useful for practical applications requires more than developing individual L2RL algorithms; it requires making progress at the systems-level, especially research into the non-trivial problem of how to integrate multiple L2RL algorithms into a common framework. In this paper, we introduce the Lifelong Reinforcement Learning Components Framework (L2RLCF), which standardizes L2RL systems and assimilates different continual learning components (each addressing different aspects of the lifelong learning problem) into a unified system. As an instantiation of L2RLCF, we develop a standard API allowing easy integration of novel lifelong learning components. We describe a case study that demonstrates how multiple independently-developed LL components can be integrated into a single realized system. We also introduce an evaluation environment in order to measure the effect of combining various system components. Our evaluation environment employs different LL scenarios (sequences of tasks) consisting of Starcraft-2 minigames and allows for the fair, comprehensive, and quantitative comparison of different combinations of components within a challenging common evaluation environment.
    Post hoc Explanations may be Ineffective for Detecting Unknown Spurious Correlation. (arXiv:2212.04629v1 [cs.LG])
    We investigate whether three types of post hoc model explanations--feature attribution, concept activation, and training point ranking--are effective for detecting a model's reliance on spurious signals in the training data. Specifically, we consider the scenario where the spurious signal to be detected is unknown, at test-time, to the user of the explanation method. We design an empirical methodology that uses semi-synthetic datasets along with pre-specified spurious artifacts to obtain models that verifiably rely on these spurious training signals. We then provide a suite of metrics that assess an explanation method's reliability for spurious signal detection under various conditions. We find that the post hoc explanation methods tested are ineffective when the spurious artifact is unknown at test-time especially for non-visible artifacts like a background blur. Further, we find that feature attribution methods are susceptible to erroneously indicating dependence on spurious signals even when the model being explained does not rely on spurious artifacts. This finding casts doubt on the utility of these approaches, in the hands of a practitioner, for detecting a model's reliance on spurious signals.
    Variational Diffusion Models. (arXiv:2107.00630v5 [cs.LG] UPDATED)
    Diffusion-based generative models have demonstrated a capacity for perceptually impressive synthesis, but can they also be great likelihood-based models? We answer this in the affirmative, and introduce a family of diffusion-based generative models that obtain state-of-the-art likelihoods on standard image density estimation benchmarks. Unlike other diffusion-based models, our method allows for efficient optimization of the noise schedule jointly with the rest of the model. We show that the variational lower bound (VLB) simplifies to a remarkably short expression in terms of the signal-to-noise ratio of the diffused data, thereby improving our theoretical understanding of this model class. Using this insight, we prove an equivalence between several models proposed in the literature. In addition, we show that the continuous-time VLB is invariant to the noise schedule, except for the signal-to-noise ratio at its endpoints. This enables us to learn a noise schedule that minimizes the variance of the resulting VLB estimator, leading to faster optimization. Combining these advances with architectural improvements, we obtain state-of-the-art likelihoods on image density estimation benchmarks, outperforming autoregressive models that have dominated these benchmarks for many years, with often significantly faster optimization. In addition, we show how to use the model as part of a bits-back compression scheme, and demonstrate lossless compression rates close to the theoretical optimum. Code is available at https://github.com/google-research/vdm .
    Album cover art image generation with Generative Adversarial Networks. (arXiv:2212.04844v1 [cs.CV])
    Generative Adversarial Networks (GANs) were introduced by Goodfellow in 2014, and have since become popular for constructing generative artificial intelligence models. However, the drawbacks of such networks are numerous: longer training times, sensitivity to hyperparameter tuning, a variety of loss and optimization functions, and other difficulties like mode collapse. Current applications of GANs include generating photo-realistic human faces, animals and objects. However, I wanted to explore the artistic ability of GANs in more detail, by using existing models and learning from them. This dissertation covers the basics of neural networks and works its way up to the particular aspects of GANs, together with experimentation and modification of existing available models, from least complex to most complex. The intention is to see if state-of-the-art GANs (specifically StyleGAN2) can generate album cover art and whether it is possible to tailor the covers by genre. This was attempted by first familiarizing myself with 3 existing GAN architectures, including the state-of-the-art StyleGAN2. The StyleGAN2 code was used to train a model with a dataset containing 80K album cover images, which was then used to style images by picking curated images and mixing their styles.
    Unsupervised Legendre-Galerkin Neural Network for Singularly Perturbed Partial Differential Equations. (arXiv:2207.10241v3 [cs.LG] UPDATED)
    Machine learning methods have been lately used to solve partial differential equations (PDEs) and dynamical systems. These approaches have been developed into a novel research field known as scientific machine learning in which techniques such as deep neural networks and statistical learning are applied to classical problems of applied mathematics. In this paper, we develop a novel numerical algorithm that incorporates machine learning and artificial intelligence to solve PDEs. Based on the Legendre-Galerkin framework, we propose the {\it unsupervised machine learning} algorithm to learn {\it multiple instances} of the solutions for different types of PDEs. Our approach overcomes the limitations of data-driven and physics-based methods. The proposed neural network is applied to general 1D and 2D PDEs with various boundary conditions as well as convection-dominated {\it singularly perturbed PDEs} that exhibit strong boundary layer behavior.
    Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints. (arXiv:2212.05055v1 [cs.LG])
    Training large, deep neural networks to convergence can be prohibitively expensive. As a result, often only a small selection of popular, dense models are reused across different contexts and tasks. Increasingly, sparsely activated models, which seek to decouple model size from computation costs, are becoming an attractive alternative to dense models. Although more efficient in terms of quality and computation cost, sparse models remain data-hungry and costly to train from scratch in the large scale regime. In this work, we propose sparse upcycling -- a simple way to reuse sunk training costs by initializing a sparsely activated Mixture-of-Experts model from a dense checkpoint. We show that sparsely upcycled T5 Base, Large, and XL language models and Vision Transformer Base and Large models, respectively, significantly outperform their dense counterparts on SuperGLUE and ImageNet, using only ~50% of the initial dense pretraining sunk cost. The upcycled models also outperform sparse models trained from scratch on 100% of the initial dense pretraining computation budget.
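    To make the upcycling recipe concrete, here is a minimal PyTorch sketch of the initialization idea as described above (my illustration, not the authors' T5/ViT code): every expert in the new Mixture-of-Experts layer starts as a copy of the dense checkpoint's MLP, and only the router is freshly initialized.

        import copy
        import torch
        import torch.nn as nn

        class UpcycledMoE(nn.Module):
            def __init__(self, dense_mlp: nn.Module, num_experts: int, d_model: int):
                super().__init__()
                # All experts inherit the dense MLP's trained weights.
                self.experts = nn.ModuleList(
                    copy.deepcopy(dense_mlp) for _ in range(num_experts)
                )
                # The router is the only component trained from scratch.
                self.router = nn.Linear(d_model, num_experts)

            def forward(self, x):  # x: (tokens, d_model); top-1 routing for brevity
                expert_idx = self.router(x).argmax(dim=-1)
                out = x.clone()
                for e, expert in enumerate(self.experts):
                    mask = expert_idx == e
                    if mask.any():
                        out[mask] = expert(x[mask])
                return out

        dense_mlp = nn.Sequential(nn.Linear(16, 64), nn.GELU(), nn.Linear(64, 16))
        moe = UpcycledMoE(dense_mlp, num_experts=4, d_model=16)
        print(moe(torch.randn(10, 16)).shape)  # torch.Size([10, 16])

    Because every expert starts at the dense solution, the upcycled model begins roughly where the dense checkpoint left off rather than from scratch.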
    Optimal binning: mathematical programming formulation. (arXiv:2001.08025v3 [cs.LG] UPDATED)
    The optimal binning is the optimal discretization of a variable into bins given a discrete or continuous numeric target. We present a rigorous and extensible mathematical programming formulation for solving the optimal binning problem for a binary, continuous and multi-class target type, incorporating constraints not previously addressed. For all three target types, we introduce a convex mixed-integer programming formulation. Several algorithmic enhancements, such as automatic determination of the most suitable monotonic trend via a Machine-Learning-based classifier and implementation aspects are thoughtfully discussed. The new mathematical programming formulations are carefully implemented in the open-source python library OptBinning.
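    As a pointer to usage, a short sketch of binning a numerical variable against a binary target with the OptBinning library mentioned above (x and y are synthetic placeholders; consult the library's documentation for the authoritative API):

        import numpy as np
        from optbinning import OptimalBinning

        rng = np.random.default_rng(0)
        x = rng.normal(size=1000)                         # hypothetical feature
        y = (x + rng.normal(size=1000) > 0).astype(int)   # hypothetical binary target

        optb = OptimalBinning(name="x", dtype="numerical", solver="cp",
                              monotonic_trend="auto")     # automatic trend selection
        optb.fit(x, y)
        print(optb.binning_table.build())                 # bin edges, counts, WoE/IV
        x_woe = optb.transform(x, metric="woe")           # weight-of-evidence encoding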
    TargetCall: Eliminating the Wasted Computation in Basecalling via Pre-Basecalling Filtering. (arXiv:2212.04953v1 [q-bio.GN])
    Basecalling is an essential step in nanopore sequencing analysis where the raw signals of nanopore sequencers are converted into nucleotide sequences, i.e., reads. State-of-the-art basecallers employ complex deep learning models to achieve high basecalling accuracy. This makes basecalling computationally inefficient and memory-hungry, bottlenecking the entire genome analysis pipeline. However, for many applications, the majority of reads do not match the reference genome of interest (i.e., target reference) and thus are discarded in later steps in the genomics pipeline, wasting the basecalling computation. To overcome this issue, we propose TargetCall, the first fast and widely-applicable pre-basecalling filter to eliminate the wasted computation in basecalling. TargetCall's key idea is to discard reads that will not match the target reference (i.e., off-target reads) prior to basecalling. TargetCall consists of two main components: (1) LightCall, a lightweight neural network basecaller that produces noisy reads; and (2) Similarity Check, which labels each of these noisy reads as on-target or off-target by matching them to the target reference. TargetCall filters out all off-target reads before basecalling; and the highly-accurate but slow basecalling is performed only on the raw signals whose noisy reads are labeled as on-target. Our thorough experimental evaluations using both real and simulated data show that TargetCall 1) improves the end-to-end basecalling performance of the state-of-the-art basecaller by 3.31x while maintaining high (98.88%) sensitivity in keeping on-target reads, 2) maintains high accuracy in downstream analysis, 3) precisely filters out up to 94.71% of off-target reads, and 4) achieves better performance, sensitivity, and generality compared to prior works. We freely open-source TargetCall at https://github.com/CMU-SAFARI/TargetCall.
    Robust Graph Representation Learning via Predictive Coding. (arXiv:2212.04656v1 [cs.LG])
    Predictive coding is a message-passing framework initially developed to model information processing in the brain, and is now also a topic of research in machine learning due to some interesting properties. One such property is the natural ability of generative models to learn robust representations thanks to their peculiar credit assignment rule, which allows neural activities to converge to a solution before updating the synaptic weights. Graph neural networks are also message-passing models, which have recently shown outstanding results in diverse types of tasks in machine learning, providing interdisciplinary state-of-the-art performance on structured data. However, they are vulnerable to imperceptible adversarial attacks, and unfit for out-of-distribution generalization. In this work, we address this by building models that have the same structure as popular graph neural network architectures, but rely on the message-passing rule of predictive coding. Through an extensive set of experiments, we show that the proposed models are (i) comparable to standard ones in terms of performance in both inductive and transductive tasks, (ii) better calibrated, and (iii) robust against multiple kinds of adversarial attacks.
    Machine learning algorithms for three-dimensional mean-curvature computation in the level-set method. (arXiv:2208.09047v3 [cs.LG] UPDATED)
    We propose a data-driven mean-curvature solver for the level-set method. This work is the natural extension to $\mathbb{R}^3$ of our two-dimensional strategy in [DOI: 10.1007/s10915-022-01952-2][1] and the hybrid inference system of [DOI: 10.1016/j.jcp.2022.111291][2]. However, in contrast to [1,2], which built resolution-dependent neural-network dictionaries, here we develop a pair of models in $\mathbb{R}^3$, regardless of the mesh size. Our feedforward networks ingest transformed level-set, gradient, and curvature data to fix numerical mean-curvature approximations selectively for interface nodes. To reduce the problem's complexity, we have used the Gaussian curvature to classify stencils and fit our models separately to non-saddle and saddle patterns. Non-saddle stencils are easier to handle because they exhibit a curvature error distribution characterized by monotonicity and symmetry. While the latter has allowed us to train only on half the mean-curvature spectrum, the former has helped us blend the data-driven and the baseline estimations seamlessly near flat regions. On the other hand, the saddle-pattern error structure is less clear; thus, we have exploited no latent information beyond what is known. In this regard, we have trained our models on not only spherical but also sinusoidal and hyperbolic paraboloidal patches. Our approach to building their data sets is systematic but gleans samples randomly while ensuring well-balancedness. We have also resorted to standardization and dimensionality reduction and integrated regularization to minimize outliers. In addition, we leverage curvature rotation/reflection invariance to improve precision at inference time. Several experiments confirm that our proposed system can yield more accurate mean-curvature estimations than modern particle-based interface reconstruction and level-set schemes around under-resolved regions.
    Universal Regular Conditional Distributions. (arXiv:2105.07743v4 [cs.LG] UPDATED)
    We introduce a deep learning model that can universally approximate regular conditional distributions (RCDs). The proposed model operates in three phases: first, it linearizes inputs from a given metric space $\mathcal{X}$ to $\mathbb{R}^d$ via a feature map; next, a deep feedforward neural network processes the linearized features; finally, the network's outputs are transformed to the $1$-Wasserstein space $\mathcal{P}_1(\mathbb{R}^D)$ via a probabilistic extension of the attention mechanism of Bahdanau et al.\ (2014). Our model, called the \textit{probabilistic transformer (PT)}, can approximate any continuous function from $\mathbb{R}^d$ to $\mathcal{P}_1(\mathbb{R}^D)$ uniformly on compact sets, quantitatively. We identify two ways in which the PT avoids the curse of dimensionality when approximating $\mathcal{P}_1(\mathbb{R}^D)$-valued functions. The first strategy builds functions in $C(\mathbb{R}^d,\mathcal{P}_1(\mathbb{R}^D))$ which can be efficiently approximated by a PT, uniformly on any given compact subset of $\mathbb{R}^d$. In the second approach, given any function $f$ in $C(\mathbb{R}^d,\mathcal{P}_1(\mathbb{R}^D))$, we build compact subsets of $\mathbb{R}^d$ whereon $f$ can be efficiently approximated by a PT.
    Transformer-based normative modelling for anomaly detection of early schizophrenia. (arXiv:2212.04984v1 [cs.LG])
    Despite the impact of psychiatric disorders on clinical health, early-stage diagnosis remains a challenge. Machine learning studies have shown that classifiers tend to be overly narrow in the diagnosis prediction task. The overlap between conditions leads to high heterogeneity among participants that is not adequately captured by classification models. To address this issue, normative approaches have surged as an alternative method. By using a generative model to learn the distribution of healthy brain data patterns, we can identify the presence of pathologies as deviations or outliers from the distribution learned by the model. In particular, deep generative models showed great results as normative models to identify neurological lesions in the brain. However, unlike most neurological lesions, psychiatric disorders present subtle changes widespread in several brain regions, making these alterations challenging to identify. In this work, we evaluate the performance of transformer-based normative models to detect subtle brain changes expressed in adolescents and young adults. We trained our model on 3D MRI scans of neurotypical individuals (N=1,765). Then, we obtained the likelihood of neurotypical controls and psychiatric patients with early-stage schizophrenia from an independent dataset (N=93) from the Human Connectome Project. Using the predicted likelihood of the scans as a proxy for a normative score, we obtained an AUROC of 0.82 when assessing the difference between controls and individuals with early-stage schizophrenia. Our approach surpassed recent normative methods based on brain age and Gaussian Process, showing the promising use of deep generative models to help in individualised analyses.
    Mesh Neural Networks for SE(3)-Equivariant Hemodynamics Estimation on the Artery Wall. (arXiv:2212.05023v1 [cs.LG])
    Computational fluid dynamics (CFD) is a valuable asset for patient-specific cardiovascular-disease diagnosis and prognosis, but its high computational demands hamper its adoption in practice. Machine-learning methods that estimate blood flow in individual patients could accelerate or replace CFD simulation to overcome these limitations. In this work, we consider the estimation of vector-valued quantities on the wall of three-dimensional geometric artery models. We employ group-equivariant graph convolution in an end-to-end SE(3)-equivariant neural network that operates directly on triangular surface meshes and makes efficient use of training data. We run experiments on a large dataset of synthetic coronary arteries and find that our method estimates directional wall shear stress (WSS) with an approximation error of 7.6% and normalised mean absolute error (NMAE) of 0.4% while up to two orders of magnitude faster than CFD. Furthermore, we show that our method is powerful enough to accurately predict transient, vector-valued WSS over the cardiac cycle while conditioned on a range of different inflow boundary conditions. These results demonstrate the potential of our proposed method as a plugin replacement for CFD in the personalised prediction of hemodynamic vector and scalar fields.
    Strategic Coalition for Data Pricing in IoT Data Markets. (arXiv:2206.07785v3 [cs.NI] UPDATED)
    This paper considers a market for trading Internet of Things (IoT) data that is used to train machine learning models. The data, either raw or processed, is supplied to the market platform through a network and the price of such data is controlled based on the value it brings to the machine learning model. We explore the correlation property of data in a game-theoretical setting to eventually derive a simplified distributed solution for a data trading mechanism that emphasizes the mutual benefit of devices and the market. The key proposal is an efficient algorithm for markets that jointly addresses the challenges of availability and heterogeneity in participation, as well as the transfer of trust and the economic value of data exchange in IoT networks. The proposed approach establishes the data market by reinforcing collaboration opportunities between devices with correlated data to avoid information leakage. Therein, we develop a network-wide optimization problem that maximizes the social value of coalition among the IoT devices of similar data types; at the same time, it minimizes the cost due to network externalities, i.e., the impact of information leakage due to data correlation, as well as the opportunity costs. Finally, we reveal the structure of the formulated problem as a distributed coalition game and solve it following the simplified split-and-merge algorithm. Simulation results show the efficacy of our proposed mechanism design toward a trusted IoT data market, with up to 32.72% gain in the average payoff for each seller.
    Precise Asymptotics for Spectral Methods in Mixed Generalized Linear Models. (arXiv:2211.11368v2 [math.ST] UPDATED)
    In a mixed generalized linear model, the objective is to learn multiple signals from unlabeled observations: each sample comes from exactly one signal, but it is not known which one. We consider the prototypical problem of estimating two statistically independent signals in a mixed generalized linear model with Gaussian covariates. Spectral methods are a popular class of estimators which output the top two eigenvectors of a suitable data-dependent matrix. However, despite the wide applicability, their design is still obtained via heuristic considerations, and the number of samples $n$ needed to guarantee recovery is super-linear in the signal dimension $d$. In this paper, we develop exact asymptotics on spectral methods in the challenging proportional regime in which $n, d$ grow large and their ratio converges to a finite constant. By doing so, we are able to optimize the design of the spectral method, and combine it with a simple linear estimator, in order to minimize the estimation error. Our characterization exploits a mix of tools from random matrices, free probability and the theory of approximate message passing algorithms. Numerical simulations for mixed linear regression and phase retrieval display the advantage enabled by our analysis over existing designs of spectral methods.
    Decomposable Sparse Tensor on Tensor Regression. (arXiv:2212.05024v1 [cs.LG])
    Most regularized tensor regression research focuses on tensor predictors with scalar responses or vector predictors with tensor responses. We consider sparse low-rank tensor-on-tensor regression, where the predictors $\mathcal{X}$ and responses $\mathcal{Y}$ are both high-dimensional tensors. By demonstrating that the general inner product or the contracted product on a unit-rank tensor can be decomposed into standard inner products and outer products, the problem can be transformed into a tensor-to-scalar regression followed by a tensor decomposition. We therefore propose a fast solution based on a stagewise search composed of a contraction part and a generation part, which are optimized alternately. We demonstrate that our method can outperform current methods in terms of accuracy and predictor selection by effectively incorporating structural information.
    A Learned Born Series for Highly-Scattering Media. (arXiv:2212.04948v1 [physics.comp-ph])
    A new method for solving the wave equation is presented, called the learned Born series (LBS), which is derived from a convergent Born Series but its components are found through training. The LBS is shown to be significantly more accurate than the convergent Born series for the same number of iterations, in the presence of high contrast scatterers, while maintaining a comparable computational complexity. The LBS is able to generate a reasonable prediction of the global pressure field with a small number of iterations, and the errors decrease with the number of learned iterations.
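    For context, the iteration being learned can be sketched as follows; this restatement of the standard convergent Born series follows the acoustics/optics literature (Osnabrugge et al.) rather than the paper itself, so treat the exact preconditioning as an assumption. The plain Born series solves the scattering problem $u = G(Vu + s)$ by iterating $u_{k+1} = G V u_k + G s$, which diverges for high-contrast potentials $V$. The convergent variant damps the update with a preconditioner $\gamma$, $u_{k+1} = (\gamma G V + 1 - \gamma)\,u_k + \gamma G s$, where $G$ is the Green's operator of a homogeneous reference medium, $V$ the scattering potential, $s$ the source, and $\gamma$ is chosen from a bound on $\|V\|$ so that the iteration converges. The LBS keeps this iteration structure but finds its components through training.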
    MOPRD: A multidisciplinary open peer review dataset. (arXiv:2212.04972v1 [cs.DL])
    Open peer review is a growing trend in academic publications. Public access to peer review data can benefit both the academic and publishing communities. It also serves as great support for studies on review comment generation and, further, for the realization of automated scholarly paper review. However, most of the existing peer review datasets do not provide data that cover the whole peer review process. Apart from this, their data are not diversified enough, as they are mainly collected from the field of computer science. These two drawbacks of the currently available peer review datasets need to be addressed to unlock more opportunities for related studies. In response to this problem, we construct MOPRD, a multidisciplinary open peer review dataset. This dataset consists of paper metadata, multiple-version manuscripts, review comments, meta-reviews, authors' rebuttal letters, and editorial decisions. Moreover, we design a modular guided review comment generation method based on MOPRD. Experiments show that our method delivers better performance as indicated by both automatic metrics and human evaluation. We also explore other potential applications of MOPRD, including meta-review generation, editorial decision prediction, author rebuttal generation, and scientometric analysis. MOPRD provides strong support for further studies in peer review-related research and other applications.
    The unstable formula theorem revisited. (arXiv:2212.05050v1 [math.LO])
    We first prove that Littlestone classes, those which model theorists call stable, characterize learnability in a new statistical model: a learner in this new setting outputs the same hypothesis, up to measure zero, with probability one, after a uniformly bounded number of revisions. This fills a certain gap in the literature, and sets the stage for an approximation theorem characterizing Littlestone classes in terms of a range of learning models, by analogy to definability of types in model theory. We then give a complete analogue of Shelah's celebrated (and perhaps a priori untranslatable) Unstable Formula Theorem in the learning setting, with algorithmic arguments taking the place of the infinite.
    PhysDiff: Physics-Guided Human Motion Diffusion Model. (arXiv:2212.02500v2 [cs.CV] UPDATED)
    Denoising diffusion models hold great promise for generating diverse and realistic human motions. However, existing motion diffusion models largely disregard the laws of physics in the diffusion process and often generate physically-implausible motions with pronounced artifacts such as floating, foot sliding, and ground penetration. This seriously impacts the quality of generated motions and limits their real-world application. To address this issue, we present a novel physics-guided motion diffusion model (PhysDiff), which incorporates physical constraints into the diffusion process. Specifically, we propose a physics-based motion projection module that uses motion imitation in a physics simulator to project the denoised motion of a diffusion step to a physically-plausible motion. The projected motion is further used in the next diffusion step to guide the denoising diffusion process. Intuitively, the use of physics in our model iteratively pulls the motion toward a physically-plausible space. Experiments on large-scale human motion datasets show that our approach achieves state-of-the-art motion quality and improves physical plausibility drastically (>78% for all datasets).
    Adversarial Weight Perturbation Improves Generalization in Graph Neural Network. (arXiv:2212.04983v1 [cs.LG])
    Substantial theoretical and empirical evidence shows that flatter local minima tend to improve generalization. Adversarial Weight Perturbation (AWP) is an emerging technique to efficiently and effectively find such minima. In AWP, we minimize the loss w.r.t. a bounded worst-case perturbation of the model parameters, thereby favoring local minima with a small loss in a neighborhood around them. The benefits of AWP, and more generally the connections between flatness and generalization, have been extensively studied for i.i.d. data such as images. In this paper, we extensively study this phenomenon for graph data. Along the way, we first derive a generalization bound for non-i.i.d. node classification tasks. Then we identify a vanishing-gradient issue with all existing formulations of AWP and propose a new Weighted Truncated AWP (WT-AWP) to alleviate it. We show that regularizing graph neural networks with WT-AWP consistently improves both natural and robust generalization across many different graph learning tasks and models.
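    To make the inner maximization concrete, here is a minimal PyTorch sketch of a vanilla AWP step under a one-step linearization; the relative budget `rho` is an illustrative choice, and WT-AWP's weighting and truncation are not shown.

```python
import torch

def awp_loss_backward(model, loss_fn, x, y, rho=5e-3):
    """Backpropagate the loss at an (approximately) worst-case weight
    perturbation; call optimizer.step() afterwards to update the
    *unperturbed* weights with the perturbed gradients."""
    params = [p for p in model.parameters() if p.requires_grad]

    # Gradient of the loss at the current weights.
    loss = loss_fn(model(x), y)
    grads = torch.autograd.grad(loss, params)

    # One-step ascent: move each parameter inside a relative norm ball.
    backups = [p.detach().clone() for p in params]
    with torch.no_grad():
        for p, g in zip(params, grads):
            p.add_(rho * p.norm() * g / (g.norm() + 1e-12))

    # Gradients at the perturbed point favor flat minima ...
    loss_fn(model(x), y).backward()

    # ... then restore the original weights before the optimizer step.
    with torch.no_grad():
        for p, b in zip(params, backups):
            p.copy_(b)
```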
    Predictor networks and stop-grads provide implicit variance regularization in BYOL/SimSiam. (arXiv:2212.04858v1 [cs.LG])
    Self-supervised learning (SSL) learns useful representations from unlabelled data by training networks to be invariant to pairs of augmented versions of the same input. Non-contrastive methods avoid collapse either by directly regularizing the covariance matrix of network outputs or through asymmetric loss architectures, two seemingly unrelated approaches. Here, by building on DirectPred, we lay out a theoretical framework that reconciles these two views. We derive analytical expressions for the representational learning dynamics in linear networks. By expressing them in the eigenspace of the embedding covariance matrix, where the solutions decouple, we reveal the mechanism and conditions that provide implicit variance regularization. These insights allow us to formulate a new isotropic loss function that equalizes eigenvalue contribution and renders learning more robust. Finally, we show empirically that our findings translate to nonlinear networks trained on CIFAR-10 and STL-10.
    PATO: Policy Assisted TeleOperation for Scalable Robot Data Collection. (arXiv:2212.04708v1 [cs.RO])
    Large-scale data is an essential component of machine learning as demonstrated in recent advances in natural language processing and computer vision research. However, collecting large-scale robotic data is much more expensive and slower as each operator can control only a single robot at a time. To make this costly data collection process efficient and scalable, we propose Policy Assisted TeleOperation (PATO), a system which automates part of the demonstration collection process using a learned assistive policy. PATO autonomously executes repetitive behaviors in data collection and asks for human input only when it is uncertain about which subtask or behavior to execute. We conduct teleoperation user studies both with a real robot and a simulated robot fleet and demonstrate that our assisted teleoperation system reduces human operators' mental load while improving data collection efficiency. Further, it enables a single operator to control multiple robots in parallel, which is a first step towards scalable robotic data collection. For code and video results, see https://clvrai.com/pato
    Augmenting Knowledge Transfer across Graphs. (arXiv:2212.04725v1 [cs.LG])
    Given a resource-rich source graph and a resource-scarce target graph, how can we effectively transfer knowledge across graphs and ensure a good generalization performance? In many high-impact domains (e.g., brain networks and molecular graphs), collecting and annotating data is prohibitively expensive and time-consuming, which makes domain adaptation an attractive option to alleviate the label scarcity issue. In light of this, the state-of-the-art methods focus on deriving domain-invariant graph representation that minimizes the domain discrepancy. However, it has recently been shown that a small domain discrepancy loss may not always guarantee a good generalization performance, especially in the presence of disparate graph structures and label distribution shifts. In this paper, we present TRANSNET, a generic learning framework for augmenting knowledge transfer across graphs. In particular, we introduce a novel notion named trinity signal that can naturally formulate various graph signals at different granularity (e.g., node attributes, edges, and subgraphs). With that, we further propose a domain unification module together with a trinity-signal mixup scheme to jointly minimize the domain discrepancy and augment the knowledge transfer across graphs. Finally, comprehensive empirical results show that TRANSNET outperforms all existing approaches on seven benchmark datasets by a significant margin.
    Mitigation of Spatial Nonstationarity with Vision Transformers. (arXiv:2212.04633v1 [cs.LG])
    Spatial nonstationarity, the location variance of features' statistical distributions, is ubiquitous in many natural settings. For example, in geological reservoirs, rock matrix porosity varies vertically due to geomechanical compaction trends; in mineral deposits, grades vary due to sedimentation and concentration processes; in hydrology, rainfall varies due to atmosphere-topography interactions; and in metallurgy, crystalline structures vary due to differential cooling. Conventional geostatistical modeling workflows rely on the assumption of stationarity to model spatial features for geostatistical inference. Nevertheless, this is often not a realistic assumption when dealing with nonstationary spatial data, which has motivated a variety of nonstationary spatial modeling workflows such as trend and residual decomposition, cosimulation with secondary features, and spatial segmentation with independent modeling over stationary subdomains. The advent of deep learning technologies has enabled new workflows for modeling spatial relationships. However, there is a paucity of demonstrated best practice and general guidance on the mitigation of spatial nonstationarity with deep learning in the geospatial context. We demonstrate the impact of two common types of geostatistical spatial nonstationarity on deep learning model prediction performance and propose mitigating such impacts with self-attention (vision transformer) models. We demonstrate the utility of vision transformers for the mitigation of nonstationarity with relative errors as low as 10%, exceeding the performance of alternative deep learning methods such as convolutional neural networks. We establish best practice by demonstrating the ability of self-attention networks to model large-scale spatial relationships in the presence of commonly observed geospatial nonstationarity.
    Estimating a Directed Tree for Extremes. (arXiv:2102.06197v3 [stat.ML] UPDATED)
    The Extremal River Problem has emerged as a flagship problem for causal discovery in extreme values of a network. The task is to recover a river network from only extreme flow measured at a set $V$ of stations, without any information on the stations' locations. We present QTree, a new simple and efficient algorithm to solve the Extremal River Problem that performs very well compared to existing methods on hydrology data and in simulations. QTree returns a root-directed tree and achieves almost perfect recovery on the Upper Danube network data, the existing benchmark data set, as well as on new data from the Lower Colorado River network in Texas. It can handle missing data, has an automated parameter tuning procedure, and runs in time $O(n |V|^2)$, where $n$ is the number of observations and $|V|$ the number of nodes in the graph. Furthermore, we prove that the QTree estimator is consistent under a Bayesian network model for extreme values with noise. We also assess the small sample behaviour of QTree through simulations and detail the strengths and possible limitations of QTree.
    PDEBENCH: An Extensive Benchmark for Scientific Machine Learning. (arXiv:2210.07182v3 [cs.LG] UPDATED)
    Machine learning-based modeling of physical systems has experienced increased interest in recent years. Despite some impressive progress, there is still a lack of benchmarks for Scientific ML that are easy to use but still challenging and representative of a wide range of problems. We introduce PDEBench, a benchmark suite of time-dependent simulation tasks based on Partial Differential Equations (PDEs). PDEBench comprises both code and data to benchmark the performance of novel machine learning models against both classical numerical simulations and machine learning baselines. Our proposed set of benchmark problems contributes the following unique features: (1) a much wider range of PDEs compared to existing benchmarks, ranging from relatively common examples to more realistic and difficult problems; (2) much larger ready-to-use datasets compared to prior work, comprising multiple simulation runs across a larger number of initial and boundary conditions and PDE parameters; (3) more extensible source code with user-friendly APIs for data generation and baseline results with popular machine learning models (FNO, U-Net, PINN, Gradient-Based Inverse Method). PDEBench allows researchers to extend the benchmark freely for their own purposes using a standardized API and to compare the performance of new models to existing baseline methods. We also propose new evaluation metrics with the aim of providing a more holistic understanding of learning methods in the context of Scientific ML. With those metrics we identify tasks which are challenging for recent ML methods and propose these tasks as future challenges for the community. The code is available at https://github.com/pdebench/PDEBench.
    Simulation-Based Parallel Training. (arXiv:2211.04119v2 [cs.AI] UPDATED)
    Numerical simulations are ubiquitous in science and engineering. Machine learning for science investigates how artificial neural architectures can learn from these simulations to speed up scientific discovery and engineering processes. Most of these architectures are trained in a supervised manner. They require tremendous amounts of data from simulations that are slow to generate and memory-intensive. In this article, we present our ongoing work to design a training framework that alleviates those bottlenecks: it generates data in parallel with the training process. Such simultaneity induces a bias in the data available during training. We present a strategy to mitigate this bias with a memory buffer. We test our framework on the multi-parametric Lorenz attractor. We show the benefit of our framework compared to offline training and the success of our data bias mitigation strategy in capturing the complex chaotic dynamics of the system.
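    One simple realization of such a buffer is reservoir sampling over the stream of freshly simulated samples, sketched below; the capacity and the uniform-retention scheme are illustrative assumptions, not the paper's exact design.

```python
import random

class ReservoirBuffer:
    """Fixed-capacity buffer that mixes fresh simulation samples with older
    ones, so training batches are less biased toward the regions the
    simulator visited most recently."""

    def __init__(self, capacity=10_000):
        self.capacity = capacity
        self.data = []
        self.seen = 0

    def add(self, sample):
        self.seen += 1
        if len(self.data) < self.capacity:
            self.data.append(sample)
        else:
            # Reservoir sampling: every sample seen so far is retained
            # with equal probability capacity / seen.
            j = random.randrange(self.seen)
            if j < self.capacity:
                self.data[j] = sample

    def batch(self, n):
        return random.sample(self.data, min(n, len(self.data)))
```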
    Ambiguous Dynamic Treatment Regimes: A Reinforcement Learning Approach. (arXiv:2112.04571v3 [cs.LG] UPDATED)
    A main research goal in various studies is to use an observational data set and provide a new set of counterfactual guidelines that can yield causal improvements. Dynamic Treatment Regimes (DTRs) are widely studied to formalize this process. However, available methods in finding optimal DTRs often rely on assumptions that are violated in real-world applications (e.g., medical decision-making or public policy), especially when (a) the existence of unobserved confounders cannot be ignored, and (b) the unobserved confounders are time-varying (e.g., affected by previous actions). When such assumptions are violated, one often faces ambiguity regarding the underlying causal model. This ambiguity is inevitable, since the dynamics of unobserved confounders and their causal impact on the observed part of the data cannot be understood from the observed data. Motivated by a case study of finding superior treatment regimes for patients who underwent transplantation in our partner hospital and faced a medical condition known as New Onset Diabetes After Transplantation (NODAT), we extend DTRs to a new class termed Ambiguous Dynamic Treatment Regimes (ADTRs), in which the causal impact of treatment regimes is evaluated based on a "cloud" of causal models. We then connect ADTRs to Ambiguous Partially Observable Markov Decision Processes (APOMDPs) and develop Reinforcement Learning methods, which enable using the observed data to efficiently learn an optimal treatment regime. We establish theoretical results for these learning methods, including (weak) consistency and asymptotic normality. We further evaluate the performance of these learning methods both in our case study and in simulation experiments.
    PiPs: a Kernel-based Optimization Scheme for Analyzing Non-Stationary 1D Signals. (arXiv:1805.08102v3 [stat.ML] UPDATED)
    This paper proposes a novel kernel-based optimization scheme for analyzing 1D non-stationary oscillatory data, e.g., signal spectral estimation and single-channel source separation. The key insight of our optimization scheme for reconstructing the time-frequency information is that when a nonparametric regression is applied on some input values, the output regressed points lie near the oscillatory pattern of the 1D signal only if these input values are a good approximation of the ground-truth phase function. In this work, a Gaussian Process (GP) is chosen to conduct this nonparametric regression: the oscillatory pattern is encoded as the Pattern-inducing Points (PiPs), which act as the training data points in the GP regression, while the targeted phase function is fed in to compute the correlation kernels, acting as the testing input. A better-approximated phase function generates more precise kernels, resulting in a smaller optimization loss when comparing the kernel-based regression output with the original signals. To the best of our knowledge, this is the first algorithm that can satisfactorily handle fully non-stationary oscillatory data, close and crossover frequencies, and general oscillatory patterns. Even in the example of a signal produced by slow variation in the parameters of a trigonometric expansion, we show that PiPs admits competitive or better performance in terms of accuracy and robustness than existing state-of-the-art algorithms.
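    To make the loss construction concrete, here is a hedged numpy sketch: given a candidate phase function, GP-regress the pattern-inducing points at the (wrapped) phase values and score the residual against the observed signal. The RBF kernel, the mod-1 wrapping, and the toy chirp are illustrative choices; a periodic kernel would be more faithful near the wrap boundary.

```python
import numpy as np

def rbf(a, b, ell=0.3):
    # Squared-exponential correlation kernel between two 1D point sets.
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

def pips_loss(signal, phase, pip_phase, pip_vals, noise=1e-2):
    """GP-regress the pattern-inducing points at the candidate phases and
    compare against the signal; a good phase estimate gives a small residual."""
    K = rbf(pip_phase, pip_phase) + noise * np.eye(len(pip_phase))
    Ks = rbf(phase % 1.0, pip_phase)   # wrap phase onto one period of the pattern
    pred = Ks @ np.linalg.solve(K, pip_vals)
    return np.mean((pred - signal) ** 2)

# Toy check: the true phase of a chirp-like signal yields a small loss.
t = np.linspace(0, 1, 400)
phase = 3 * t + 2 * t ** 2                  # ground-truth phase function
signal = np.cos(2 * np.pi * phase)
grid = np.linspace(0, 1, 50)                # one period of the cosine pattern
print(pips_loss(signal, phase, grid, np.cos(2 * np.pi * grid)))
```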
    MetaMask: Revisiting Dimensional Confounder for Self-Supervised Learning. (arXiv:2209.07902v2 [cs.LG] UPDATED)
    As a successful approach to self-supervised learning, contrastive learning aims to learn invariant information shared among distortions of the input sample. While contrastive learning has yielded continuous advancements in sampling strategy and architecture design, it still suffers from two persistent defects: the interference of task-irrelevant information and sample inefficiency, which are related to the recurring existence of trivial constant solutions. From the perspective of dimensional analysis, we find that dimensional redundancy and dimensional confounders are the intrinsic issues behind these phenomena, and provide experimental evidence to support our viewpoint. We further propose a simple yet effective approach, MetaMask, short for the dimensional Mask learned by Meta-learning, to learn representations against dimensional redundancy and confounders. MetaMask adopts the redundancy-reduction technique to tackle the dimensional redundancy issue and innovatively introduces a dimensional mask to reduce the gradient effects of specific dimensions containing the confounder, which is trained by employing a meta-learning paradigm with the objective of improving the performance of masked representations on a typical self-supervised task. We provide solid theoretical analyses showing that MetaMask obtains tighter risk bounds for downstream classification compared to typical contrastive methods. Empirically, our method achieves state-of-the-art performance on various benchmarks.
    Near-Optimal Differentially Private Reinforcement Learning. (arXiv:2212.04680v1 [cs.LG])
    Motivated by personalized healthcare and other applications involving sensitive data, we study online exploration in reinforcement learning with differential privacy (DP) constraints. Existing work on this problem established that no-regret learning is possible under joint differential privacy (JDP) and local differential privacy (LDP) but did not provide an algorithm with optimal regret. We close this gap for the JDP case by designing an $\epsilon$-JDP algorithm with a regret of $\widetilde{O}(\sqrt{SAH^2T}+S^2AH^3/\epsilon)$ which matches the information-theoretic lower bound of non-private learning for all choices of $\epsilon> S^{1.5}A^{0.5} H^2/\sqrt{T}$. In the above, $S$, $A$ denote the number of states and actions, $H$ denotes the planning horizon, and $T$ is the number of steps. To the best of our knowledge, this is the first private RL algorithm that achieves \emph{privacy for free} asymptotically as $T\rightarrow \infty$. Our techniques -- which could be of independent interest -- include privately releasing Bernstein-type exploration bonuses and an improved method for releasing visitation statistics. The same techniques also imply a slightly improved regret bound for the LDP case.
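    As background for the release mechanisms mentioned above, here is a minimal sketch of privately releasing visitation counts with the basic Laplace mechanism; the sensitivity argument and the toy table are illustrative, and the paper's improved mechanism is considerably more refined.

```python
import numpy as np

def private_counts(counts, epsilon, sensitivity):
    """Laplace mechanism: if one user's trajectory changes the count table
    by at most `sensitivity` in L1 norm, adding Laplace(sensitivity/epsilon)
    noise to each entry yields epsilon-DP."""
    noise = np.random.laplace(scale=sensitivity / epsilon, size=counts.shape)
    return np.maximum(counts + noise, 0)   # clip so released counts stay non-negative

n_sa = np.random.randint(0, 100, size=(20, 4))           # toy (S=20, A=4) visitation table
print(private_counts(n_sa, epsilon=1.0, sensitivity=3))  # H = 3 in this toy MDP
```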
    Non-equispaced Fourier Neural Solvers for PDEs. (arXiv:2212.04689v1 [cs.LG])
    Solving partial differential equations is difficult. Recently proposed neural resolution-invariant models, despite their effectiveness and efficiency, usually require equispaced spatial points of data. However, sampling in the spatial domain is sometimes inevitably non-equispaced in real-world systems, limiting their applicability. In this paper, we propose a Non-equispaced Fourier PDE Solver (\textsc{NFS}) with adaptive interpolation on resampled equispaced points and a variant of Fourier Neural Operators as its components. Experimental results on complex PDEs demonstrate its advantages in accuracy and efficiency. Compared with the spatially-equispaced benchmark methods, it achieves superior performance with $42.85\%$ improvement on MAE, and is able to handle non-equispaced data with a tiny loss of accuracy. Besides, to the best of our knowledge, \textsc{NFS} is the first ML-based method with mesh-invariant inference ability to successfully model turbulent flows in non-equispaced scenarios, with a minor deviation of the error on unseen spatial points.
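    The resample-then-spectral-transform idea can be sketched as below, with scipy's scattered-data interpolation standing in for the paper's learned, adaptive interpolation.

```python
import numpy as np
from scipy.interpolate import griddata

def to_equispaced(points, values, n=64):
    """Resample scattered 2D samples onto an n x n equispaced grid so a
    standard FFT-based (Fourier) layer can be applied afterwards."""
    gx, gy = np.meshgrid(np.linspace(0, 1, n), np.linspace(0, 1, n))
    lin = griddata(points, values, (gx, gy), method="linear")
    # Cells outside the convex hull get nearest-neighbor values.
    nn = griddata(points, values, (gx, gy), method="nearest")
    return np.where(np.isnan(lin), nn, lin)

pts = np.random.rand(500, 2)           # non-equispaced sample locations
vals = np.sin(2 * np.pi * pts[:, 0]) * np.cos(2 * np.pi * pts[:, 1])
u = to_equispaced(pts, vals)
u_hat = np.fft.fft2(u)                 # ready for a Fourier-layer style spectral transform
```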
    Understanding electricity prices beyond the merit order principle using explainable AI. (arXiv:2212.04805v1 [cs.LG])
    Electricity prices in liberalized markets are determined by the supply and demand for electric power, which are in turn driven by various external influences that vary strongly in time. In perfect competition, the merit order principle describes that dispatchable power plants enter the market in the order of their marginal costs to meet the residual load, i.e. the difference of load and renewable generation. Many market models implement this principle to predict electricity prices but typically require certain assumptions and simplifications. In this article, we present an explainable machine learning model for the prices on the German day-ahead market, which substantially outperforms a benchmark model based on the merit order principle. Our model is designed for the ex-post analysis of prices and thus builds on various external features. Using SHapley Additive exPlanations (SHAP) values, we can disentangle the role of the different features and quantify their importance from empirical data. Load, wind and solar generation are most important, as expected, but wind power appears to affect prices more strongly than solar power does. Fuel prices also rank highly and show nontrivial dependencies, including strong interactions with other features revealed by a SHAP interaction analysis. Large generation ramps are correlated with high prices, again with strong feature interactions, due to the limited flexibility of nuclear and lignite plants. Our results further contribute to model development by providing quantitative insights directly from data.
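    The kind of ex-post SHAP analysis described can be sketched as follows; the toy features and labeling rule are placeholders for the paper's external drivers and observed prices, and the xgboost/shap pairing is one common choice rather than the paper's exact setup.

```python
import numpy as np
import pandas as pd
import shap
import xgboost

rng = np.random.default_rng(0)
X = pd.DataFrame({                       # toy stand-ins for external features
    "load": rng.normal(60, 10, 2000),
    "wind": rng.uniform(0, 40, 2000),
    "solar": rng.uniform(0, 30, 2000),
    "gas_price": rng.normal(30, 5, 2000),
})
# Hypothetical price with a fuel-load interaction, mimicking merit-order effects.
y = 2.0 * X["load"] - 1.5 * X["wind"] - 0.8 * X["solar"] + 0.04 * X["gas_price"] * X["load"]

model = xgboost.XGBRegressor(n_estimators=200, max_depth=4).fit(X, y)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)          # per-feature price contributions
inter = explainer.shap_interaction_values(X)    # pairwise interactions (e.g., fuel x load)
shap.summary_plot(shap_values, X)               # global importance ranking
```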
    Towards High-Order Complementary Recommendation via Logical Reasoning Network. (arXiv:2212.04966v1 [cs.LG])
    Complementary recommendation gains increasing attention in e-commerce since it expedites the process of finding frequently-bought-with products for users in their shopping journey. Therefore, learning the product representation that can reflect this complementary relationship plays a central role in modern recommender systems. In this work, we propose a logical reasoning network, LOGIREC, to effectively learn embeddings of products as well as various transformations (projection, intersection, negation) between them. LOGIREC is capable of capturing the asymmetric complementary relationship between products and seamlessly extending to high-order recommendations where more comprehensive and meaningful complementary relationship is learned for a query set of products. Finally, we further propose a hybrid network that is jointly optimized for learning a more generic product representation. We demonstrate the effectiveness of our LOGIREC on multiple public real-world datasets in terms of various ranking-based metrics under both low-order and high-order recommendation scenarios.
    Uncertainty Estimation in Deep Speech Enhancement Using Complex Gaussian Mixture Models. (arXiv:2212.04831v1 [eess.AS])
    Single-channel deep speech enhancement approaches often estimate a single multiplicative mask to extract clean speech without a measure of its accuracy. Instead, in this work, we propose to quantify the uncertainty associated with clean speech estimates in neural network-based speech enhancement. Predictive uncertainty is typically categorized into aleatoric uncertainty and epistemic uncertainty. The former accounts for the inherent uncertainty in data and the latter corresponds to the model uncertainty. Aiming for robust clean speech estimation and efficient predictive uncertainty quantification, we propose to integrate statistical complex Gaussian mixture models (CGMMs) into a deep speech enhancement framework. More specifically, we model the dependency between input and output stochastically by means of a conditional probability density and train a neural network to map the noisy input to the full posterior distribution of clean speech, modeled as a mixture of multiple complex Gaussian components. Experimental results on different datasets show that the proposed algorithm effectively captures predictive uncertainty and that combining powerful statistical models and deep learning also delivers a superior speech enhancement performance.
    Multi-Task Off-Policy Learning from Bandit Feedback. (arXiv:2212.04720v1 [cs.LG])
    Many practical applications, such as recommender systems and learning to rank, involve solving multiple similar tasks. One example is learning of recommendation policies for users with similar movie preferences, where the users may still rank the individual movies slightly differently. Such tasks can be organized in a hierarchy, where similar tasks are related through a shared structure. In this work, we formulate this problem as a contextual off-policy optimization in a hierarchical graphical model from logged bandit feedback. To solve the problem, we propose a hierarchical off-policy optimization algorithm (HierOPO), which estimates the parameters of the hierarchical model and then acts pessimistically with respect to them. We instantiate HierOPO in linear Gaussian models, for which we also provide an efficient implementation and analysis. We prove per-task bounds on the suboptimality of the learned policies, which show a clear improvement over not using the hierarchical model. We also evaluate the policies empirically. Our theoretical and empirical results show a clear advantage of using the hierarchy over solving each task independently.
    MED-SE: Medical Entity Definition-based Sentence Embedding. (arXiv:2212.04734v1 [cs.LG])
    We propose Medical Entity Definition-based Sentence Embedding (MED-SE), a novel unsupervised contrastive learning framework designed for clinical texts, which exploits the definitions of medical entities. To this end, we conduct an extensive analysis of multiple sentence embedding techniques in clinical semantic textual similarity (STS) settings. In the entity-centric setting that we have designed, MED-SE achieves significantly better performance, while the existing unsupervised methods including SimCSE show degraded performance. Our experiments elucidate the inherent discrepancies between the general- and clinical-domain texts, and suggest that entity-centric contrastive approaches may help bridge this gap and lead to a better representation of clinical sentences.
    PDE-LEARN: Using Deep Learning to Discover Partial Differential Equations from Noisy, Limited Data. (arXiv:2212.04971v1 [cs.LG])
    In this paper, we introduce PDE-LEARN, a novel PDE discovery algorithm that can identify governing partial differential equations (PDEs) directly from noisy, limited measurements of a physical system of interest. PDE-LEARN uses a Rational Neural Network, $U$, to approximate the system response function and a sparse, trainable vector, $\xi$, to characterize the hidden PDE that the system response function satisfies. Our approach couples the training of $U$ and $\xi$ using a loss function that (1) makes $U$ approximate the system response function, (2) encapsulates the fact that $U$ satisfies a hidden PDE that $\xi$ characterizes, and (3) promotes sparsity in $\xi$ using ideas from iteratively reweighted least-squares. Further, PDE-LEARN can simultaneously learn from several data sets, allowing it to incorporate results from multiple experiments. This approach yields a robust algorithm to discover PDEs directly from realistic scientific data. We demonstrate the efficacy of PDE-LEARN by identifying several PDEs from noisy and limited measurements.
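    A condensed PyTorch sketch of the coupled loss may clarify the three terms; the 1D candidate-term library, the network size, and the reweighting schedule are illustrative assumptions (PDE-LEARN itself uses a Rational Neural Network and a richer library).

```python
import torch

u_net = torch.nn.Sequential(torch.nn.Linear(2, 64), torch.nn.Tanh(), torch.nn.Linear(64, 1))
xi = torch.zeros(3, requires_grad=True)   # trainable weights of the candidate terms

def pde_learn_loss(xt, u_obs, lam=1e-3, eps=1e-4):
    """(1) fit the data, (2) satisfy the hidden PDE u_t = library @ xi,
    (3) promote sparsity in xi via iteratively reweighted L1."""
    xt = xt.clone().requires_grad_(True)          # columns: (x, t)
    u = u_net(xt)
    du = torch.autograd.grad(u.sum(), xt, create_graph=True)[0]
    u_x, u_t = du[:, :1], du[:, 1:]
    u_xx = torch.autograd.grad(u_x.sum(), xt, create_graph=True)[0][:, :1]
    library = torch.cat([u_xx, u * u_x, u], dim=1)        # candidate RHS terms
    data = ((u - u_obs) ** 2).mean()
    residual = ((u_t - library @ xi.unsqueeze(1)) ** 2).mean()
    w = 1.0 / (xi.detach().abs() + eps)                   # weights from previous iterate
    return data + residual + lam * (w * xi.abs()).sum()
```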
    Decorrelative Network Architecture for Robust Electrocardiogram Classification. (arXiv:2207.09031v2 [cs.LG] UPDATED)
    Artificial intelligence has made great progress in medical data analysis, but the lack of robustness and trustworthiness has kept these methods from being widely deployed. As it is not possible to train networks that are accurate in all situations, models must recognize situations where they cannot operate confidently. Bayesian deep learning methods sample the model parameter space to estimate uncertainty, but these parameters are often subject to the same vulnerabilities, which can be exploited by adversarial attacks. We propose a novel ensemble approach based on feature decorrelation and Fourier partitioning for teaching networks diverse complementary features, reducing the chance of perturbation-based fooling. We test our approach on electrocardiogram classification, demonstrating superior accuracy and confidence measurement under a variety of adversarial attacks. For example, our ensemble trained with both decorrelation and Fourier partitioning scored 50.18% inference accuracy and 48.01% uncertainty accuracy (area under the curve) under $\epsilon = 50$ projected gradient descent attacks, while a conventionally trained ensemble scored 21.1% and 30.31% on these metrics, respectively. Our approach does not require expensive optimization with adversarial samples and can be scaled to large problems. These methods can easily be applied to other tasks for more robust and trustworthy models.
    A perspective on physical reservoir computing with nanomagnetic devices. (arXiv:2212.04851v1 [cs.ET])
    Neural networks have revolutionized the area of artificial intelligence and introduced transformative applications to almost every scientific field and industry. However, this success comes at a great price; the energy requirements for training advanced models are unsustainable. One promising way to address this pressing issue is by developing low-energy neuromorphic hardware that directly supports the algorithm's requirements. The intrinsic non-volatility, non-linearity, and memory of spintronic devices make them appealing candidates for neuromorphic devices. Here we focus on the reservoir computing paradigm, a recurrent network with a simple training algorithm suitable for computation with spintronic devices since they can provide the properties of non-linearity and memory. We review technologies and methods for developing neuromorphic spintronic devices and conclude with critical open issues to address before such devices become widely used.
    DRIP: Domain Refinement Iteration with Polytopes for Backward Reachability Analysis of Neural Feedback Loops. (arXiv:2212.04646v1 [eess.SY])
    Safety certification of data-driven control techniques remains a major open problem. This work investigates backward reachability as a framework for providing collision avoidance guarantees for systems controlled by neural network (NN) policies. Because NNs are typically not invertible, existing methods conservatively assume a domain over which to relax the NN, which causes loose over-approximations of the set of states that could lead the system into the obstacle (i.e., backprojection (BP) sets). To address this issue, we introduce DRIP, an algorithm with a refinement loop on the relaxation domain, which substantially tightens the BP set bounds. Furthermore, we introduce a formulation that enables directly obtaining closed-form representations of polytopes to bound the BP sets tighter than prior work, which required solving linear programs and using hyper-rectangles. In addition, this work extends the NN relaxation algorithm to handle polytope domains, which further tightens the bounds on BP sets. DRIP is demonstrated in numerical experiments on control systems, including a ground robot controlled by a learned NN obstacle avoidance policy.
    AuE-IPA: An AU Engagement Based Infant Pain Assessment Method. (arXiv:2212.04764v1 [cs.LG])
    Recent studies have found that pain in infancy has a significant impact on infant development, including psychological problems, possible brain injury, and pain sensitivity in adulthood. However, due to the lack of specialists and the fact that infants cannot verbally express their experience of pain, it is difficult to assess infant pain. Most existing infant pain assessment systems directly apply adult methods to infants, ignoring the differences between infant and adult expressions. Meanwhile, as the study of the facial action coding system continues to advance, the use of action units (AUs) opens up new possibilities for expression recognition and pain assessment. In this paper, a novel AuE-IPA method is proposed for assessing infant pain by leveraging different engagement levels of AUs. First, the different engagement levels of AUs in infant pain are revealed by analyzing the class activation map of an end-to-end pain assessment model. The intensities of the top-engaged AUs are then used in a regression model to achieve automatic infant pain assessment. The proposed model is trained and evaluated on the YouTube Immunization, YouTube Blood Test, and iCOPEVid datasets. The experimental results show that our AuE-IPA method is more applicable to infants and possesses stronger generalization ability than the end-to-end assessment model and the classic PSPI metric.
    Doubly Robust Kernel Statistics for Testing Distributional Treatment Effects Even Under One Sided Overlap. (arXiv:2212.04922v1 [stat.ML])
    As causal inference becomes more widespread, the importance of having good tools to test for causal effects increases. In this work we focus on the problem of testing for causal effects that manifest in a difference in distribution for treatment and control. We build on work applying kernel methods to causality, considering the previously introduced Counterfactual Mean Embedding framework (\textsc{CfME}). We improve on this by proposing the \emph{Doubly Robust Counterfactual Mean Embedding} (\textsc{DR-CfME}), which has better theoretical properties than its predecessor by leveraging semiparametric theory. This leads us to propose new kernel-based test statistics for distributional effects which are based upon doubly robust estimators of treatment effects. We propose two test statistics, one which is a direct improvement on previous work and one which can be applied even when the support of the treatment arm is a subset of that of the control arm. We demonstrate the validity of our methods on simulated and real-world data, as well as giving an application in off-policy evaluation.
    Closed pattern mining of interval data and distributional data. (arXiv:2212.04849v1 [cs.AI])
    We discuss pattern languages for closed pattern mining and learning of interval data and distributional data. We first introduce pattern languages relying on pairs of intersection-based constraints, pairs of inclusion-based constraints, or both, applied to intervals. We discuss the encoding of such interval patterns as itemsets, thus allowing the use of closed itemset mining and formal concept analysis programs. We evaluate these languages on clustering and supervised learning tasks. Then we show how to extend the approach to address distributional data.
    Reliable Multimodal Trajectory Prediction via Error Aligned Uncertainty Optimization. (arXiv:2212.04812v1 [cs.CV])
    Reliable uncertainty quantification in deep neural networks is very crucial in safety-critical applications such as automated driving for trustworthy and informed decision-making. Assessing the quality of uncertainty estimates is challenging as ground truth for uncertainty estimates is not available. Ideally, in a well-calibrated model, uncertainty estimates should perfectly correlate with model error. We propose a novel error aligned uncertainty optimization method and introduce a trainable loss function to guide the models to yield good quality uncertainty estimates aligning with the model error. Our approach targets continuous structured prediction and regression tasks, and is evaluated on multiple datasets including a large-scale vehicle motion prediction task involving real-world distributional shifts. We demonstrate that our method improves average displacement error by 1.69% and 4.69%, and the uncertainty correlation with model error by 17.22% and 19.13% as quantified by Pearson correlation coefficient on two state-of-the-art baselines.
    Unfooling Perturbation-Based Post Hoc Explainers. (arXiv:2205.14772v2 [cs.AI] UPDATED)
    Monumental advancements in artificial intelligence (AI) have lured the interest of doctors, lenders, judges, and other professionals. While these high-stakes decision-makers are optimistic about the technology, those familiar with AI systems are wary about the lack of transparency of its decision-making processes. Perturbation-based post hoc explainers offer a model agnostic means of interpreting these systems while only requiring query-level access. However, recent work demonstrates that these explainers can be fooled adversarially. This discovery has adverse implications for auditors, regulators, and other sentinels. With this in mind, several natural questions arise - how can we audit these black box systems? And how can we ascertain that the auditee is complying with the audit in good faith? In this work, we rigorously formalize this problem and devise a defense against adversarial attacks on perturbation-based explainers. We propose algorithms for the detection (CAD-Detect) and defense (CAD-Defend) of these attacks, which are aided by our novel conditional anomaly detection approach, KNN-CAD. We demonstrate that our approach successfully detects whether a black box system adversarially conceals its decision-making process and mitigates the adversarial attack on real-world data for the prevalent explainers, LIME and SHAP.
    Deep conv-attention model for diagnosing left bundle branch block from 12-lead electrocardiograms. (arXiv:2212.04936v1 [eess.SP])
    Cardiac resynchronization therapy (CRT) is a treatment used to compensate for irregularities in the heartbeat. Studies have shown that this treatment is more effective in heart patients with left bundle branch block (LBBB) arrhythmia. Therefore, identifying this arrhythmia is an important initial step in determining whether or not to use CRT. On the other hand, traditional methods for detecting LBBB on electrocardiograms (ECG) are often associated with errors. Thus, there is a need for an accurate method to diagnose this arrhythmia from ECG data. Machine learning has helped to increase the performance of such systems, and deep learning, as a newer subfield, offers even more power to analyze data and increase accuracy. This study presents a deep learning model for the detection of LBBB arrhythmia from 12-lead ECG data. The model consists of 1D dilated convolutional layers. An attention mechanism is also used to identify important input features and classify inputs more accurately. The proposed model is trained and validated on a database containing 10344 12-lead ECG samples using 10-fold cross-validation. The final results on the 12-lead ECG data are as follows. Accuracy: 98.80±0.08%, specificity: 99.33±0.11%, F1 score: 73.97±1.8%, and area under the receiver operating characteristic curve (AUC): 0.875±0.0192. These results indicate that the proposed model can effectively diagnose LBBB and, if used in medical centers, will greatly help in diagnosing this arrhythmia and starting treatment early.
    Remote estimation of geologic composition using interferometric synthetic-aperture radar in California's Central Valley. (arXiv:2212.04813v1 [eess.SP])
    California's Central Valley is the nation's agricultural center, producing 1/4 of the nation's food. However, land in the Central Valley is sinking at a rapid rate (as much as 20 cm per year) due to continued groundwater pumping. Land subsidence has a significant impact on infrastructure resilience and groundwater sustainability. In this study, we aim to identify specific regions with different temporal dynamics of land displacement and find relationships with the underlying geological composition. Then, we aim to remotely estimate geologic composition using interferometric synthetic aperture radar (InSAR)-based land deformation temporal changes using machine learning techniques. We identified regions with different temporal characteristics of land displacement, in that some areas (e.g., Helm) with coarser-grain geologic compositions exhibited potentially reversible land deformation (elastic land compaction). We found a significant correlation between InSAR-based land deformation and geologic composition using random forest and deep neural network regression models. We also achieved significant accuracy with 1/4 sparse sampling to reduce any spatial correlations among data, suggesting that the model has the potential to be generalized to other regions for indirect estimation of geologic composition. Our results indicate that geologic composition can be estimated using InSAR-based land deformation data. In-situ measurements of geologic composition can be expensive and time-consuming and may be impractical in some areas. The generalizability of the model sheds light on high-spatial-resolution geologic composition estimation utilizing existing measurements.
    Towards a learning-based performance modeling for accelerating Deep Neural Networks. (arXiv:2212.05031v1 [cs.LG])
    Emerging applications such as Deep Learning are often data-driven, and thus traditional approaches based on auto-tuners are not performance-effective across the wide range of inputs used in practice. In this paper, we begin an investigation of predictive models based on machine learning techniques to optimize Convolutional Neural Networks (CNNs). As a use case, we focus on the ARM Compute Library, which provides three different implementations of the convolution operator at different numeric precisions. Starting from a collection of benchmarks, we build and validate models learned with a decision tree and a naive Bayes classifier. Preliminary experiments on a Midgard-based ARM Mali GPU show that our predictive model outperforms all the convolution operators manually selected by the library.
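    The selection idea can be sketched with scikit-learn; the features, labels, and synthetic "benchmark" data below are hypothetical stand-ins for profiled timings of the library's three convolution implementations.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = pd.DataFrame({                         # per-layer shape features
    "in_ch": rng.integers(3, 512, 5000),
    "out_ch": rng.integers(8, 512, 5000),
    "spatial": rng.integers(7, 224, 5000),
    "kernel": rng.choice([1, 3, 5, 7], 5000),
})
# Hypothetical labeling rule standing in for the measured fastest implementation.
y = np.where(X["kernel"] == 1, "gemm",
             np.where(X["spatial"] < 28, "winograd", "direct"))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = DecisionTreeClassifier(max_depth=6).fit(X_tr, y_tr)
print("selection accuracy:", clf.score(X_te, y_te))
```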
    Genie: Show Me the Data for Quantization. (arXiv:2212.04780v1 [cs.LG])
    Zero-shot quantization is a promising approach for developing lightweight deep neural networks when data is inaccessible owing to various reasons, including cost and issues related to privacy. By utilizing the learned parameters (statistics) of FP32-pre-trained models, zero-shot quantization schemes focus on generating synthetic data by minimizing the distance between the learned parameters ($\mu$ and $\sigma$) and distributions of intermediate activations. Subsequently, they distill knowledge from the pre-trained model (\textit{teacher}) to the quantized model (\textit{student}) such that the quantized model can be optimized with the synthetic dataset. In general, zero-shot quantization comprises two major elements: synthesizing datasets and quantizing models. However, thus far, zero-shot quantization has primarily been discussed in the context of quantization-aware training methods, which require task-specific losses and long-term optimization as much as retraining. We thus introduce a post-training quantization scheme for zero-shot quantization that produces high-quality quantized networks within a few hours or even half an hour. Furthermore, we propose a framework called Genie that generates data suited for post-training quantization. With the data synthesized by Genie, we can produce high-quality quantized models without real datasets, which is comparable to few-shot quantization. We also propose a post-training quantization algorithm to enhance the performance of quantized models. By combining them, we can bridge the gap between zero-shot and few-shot quantization while significantly improving the quantization performance compared to that of existing approaches. In other words, we obtain a unique state-of-the-art zero-shot quantization approach.
    Eulerian Phase-based Motion Magnification for High-Fidelity Vital Sign Estimation with Radar in Clinical Settings. (arXiv:2212.04923v1 [eess.SP])
    Efficient and accurate detection of subtle motion generated by small objects in noisy environments, as needed for vital sign monitoring, is challenging but can be substantially improved with magnification. We developed a complex Gabor filter-based decomposition method to amplify phases at different spatial wavelength levels to magnify motion and extract 1D motion signals for fundamental frequency estimation. The phase-based complex Gabor filter outputs are processed and then used to train machine learning models that predict respiration and heart rate with greater accuracy. We show that our proposed technique performs better than the conventional temporal FFT-based method in clinical settings, such as sleep laboratories and emergency departments, as well as for a variety of human postures.
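    In 1D, the core phase-magnification step can be sketched as follows; the kernel parameterization, the moving-average trend filter, and the amplification factor are illustrative assumptions rather than the exact pipeline.

```python
import numpy as np

def gabor_kernel(wavelength, n_sigma=3.0, fs=1.0):
    """Complex 1D Gabor: a Gaussian-windowed complex exponential."""
    sigma = wavelength / 2.0
    t = np.arange(-n_sigma * sigma, n_sigma * sigma + 1.0 / fs, 1.0 / fs)
    return np.exp(-0.5 * (t / sigma) ** 2) * np.exp(2j * np.pi * t / wavelength)

def magnify_phase(x, wavelength, alpha=10.0):
    """Band-limit the signal with a complex Gabor filter, amplify the phase
    deviation from its slow trend by alpha, and resynthesize."""
    analytic = np.convolve(x, gabor_kernel(wavelength), mode="same")
    phase = np.unwrap(np.angle(analytic))
    trend = np.convolve(phase, np.ones(51) / 51, mode="same")   # slow component
    return np.abs(analytic) * np.cos(trend + alpha * (phase - trend))
```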
    De Rham compatible Deep Neural Network FEM. (arXiv:2201.05395v2 [math.NA] UPDATED)
    On general regular simplicial partitions $\mathcal{T}$ of bounded polytopal domains $\Omega \subset \mathbb{R}^d$, $d\in\{2,3\}$, we construct \emph{exact neural network (NN) emulations} of all lowest order finite element spaces in the discrete de Rham complex. These include the spaces of piecewise constant functions, continuous piecewise linear (CPwL) functions, the classical ``Raviart-Thomas element'', and the ``N\'{e}d\'{e}lec edge element''. For all but the CPwL case, our network architectures employ both ReLU (rectified linear unit) and BiSU (binary step unit) activations to capture discontinuities. In the important case of CPwL functions, we prove that it suffices to work with pure ReLU nets. Our construction and DNN architecture generalizes previous results in that no geometric restrictions on the regular simplicial partitions $\mathcal{T}$ of $\Omega$ are required for DNN emulation. In addition, for CPwL functions our DNN construction is valid in any dimension $d\geq 2$. Our ``FE-Nets'' are required in the variationally correct, structure-preserving approximation of boundary value problems of electromagnetism in nonconvex polyhedra $\Omega \subset \mathbb{R}^3$. They are thus an essential ingredient in the application of e.g., the methodology of ``physics-informed NNs'' or ``deep Ritz methods'' to electromagnetic field simulation via deep learning techniques. We indicate generalizations of our constructions to higher-order compatible spaces and other, non-compatible classes of discretizations, in particular the ``Crouzeix-Raviart'' elements and Hybridized, Higher Order (HHO) methods.
    Primal Dual Alternating Proximal Gradient Algorithms for Nonsmooth Nonconvex Minimax Problems with Coupled Linear Constraints. (arXiv:2212.04672v1 [math.OC])
    Nonconvex minimax problems have attracted wide attention in machine learning, signal processing and many other fields in recent years. In this paper, we propose a primal dual alternating proximal gradient (PDAPG) algorithm and a primal dual proximal gradient (PDPG-L) algorithm for solving nonsmooth nonconvex-strongly concave and nonconvex-linear minimax problems with coupled linear constraints, respectively. The corresponding iteration complexity of the two algorithms are proved to be $\mathcal{O}\left( \varepsilon ^{-2} \right)$ and $\mathcal{O}\left( \varepsilon ^{-3} \right)$ to reach an $\varepsilon$-stationary point, respectively. To our knowledge, they are the first two algorithms with iteration complexity guarantee for solving the two classes of minimax problems.
    Robust detection and attribution of climate change under interventions. (arXiv:2212.04905v1 [stat.ML])
    Fingerprints are key tools in climate change detection and attribution (D&A) that are used to determine whether changes in observations are different from internal climate variability (detection), and whether observed changes can be assigned to specific external drivers (attribution). We propose a direct D&A approach based on supervised learning to extract fingerprints that lead to robust predictions under relevant interventions on exogenous variables, i.e., climate drivers other than the target. We employ anchor regression, a distributionally-robust statistical learning method inspired by causal inference that extrapolates well to perturbed data under the interventions considered. The residuals from the prediction achieve either uncorrelatedness or mean independence with the exogenous variables, thus guaranteeing robustness. We define D&A as a unified hypothesis testing framework that relies on the same statistical model but uses different targets and test statistics. In the experiments, we first show that the CO2 forcing can be robustly predicted from temperature spatial patterns under strong interventions on the solar forcing. Second, we illustrate attribution to the greenhouse gases and aerosols while protecting against interventions on the aerosols and CO2 forcing, respectively. Our study shows that incorporating robustness constraints against relevant interventions may significantly benefit detection and attribution of climate change.
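    For readers unfamiliar with anchor regression, a minimal numpy sketch of its standard data-transformation form follows; in this setting X would hold temperature spatial patterns, y the forcing target, and A the exogenous (anchor) variables, all placeholders here.

```python
import numpy as np

def anchor_regression(X, y, A, gamma=10.0):
    """Anchor regression via a linear data transformation: shrink the
    component of (X, y) explained by the anchors A, then run OLS.
    gamma > 1 protects predictions against interventions on the anchors."""
    P = A @ np.linalg.pinv(A)                        # projection onto col(A)
    W = np.eye(len(y)) - (1.0 - np.sqrt(gamma)) * P
    beta, *_ = np.linalg.lstsq(W @ X, W @ y, rcond=None)
    return beta
```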
    The Platform for non-metallic pipes defects recognition. Design and Implementation. (arXiv:2212.04706v1 [cs.LG])
    This paper describes a prototype software and hardware platform that supports field operators during the inspection of surface defects of non-metallic pipes. Inspection is carried out by filming surface defects in real time using a "smart" helmet device and other mobile devices. The work focuses on the detection and recognition of defects that appear as colored iridescence of reflected light, caused by the diffraction effect arising from internal stresses in the inspected material. The platform supports preliminary analysis directly on the device in offline mode; when a network connection is available, the captured data are transmitted to a server for post-processing to extract information about possible defects missed at the previous stage. The paper describes the design stages, formal specification, and implementation details of the platform, along with the models used to recognize defects and examples of their output.
    Probabilistically Robust PAC Learning. (arXiv:2211.05656v3 [cs.LG] UPDATED)
    Recently, Robey et al. propose a notion of probabilistic robustness, which, at a high-level, requires a classifier to be robust to most but not all perturbations. They show that for certain hypothesis classes where proper learning under worst-case robustness is \textit{not} possible, proper learning under probabilistic robustness \textit{is} possible with sample complexity exponentially smaller than in the worst-case robustness setting. This motivates the question of whether proper learning under probabilistic robustness is always possible. In this paper, we show that this is \textit{not} the case. We exhibit examples of hypothesis classes $\mathcal{H}$ with finite VC dimension that are \textit{not} probabilistically robustly PAC learnable with \textit{any} proper learning rule. However, if we compare the output of the learner to the best hypothesis for a slightly \textit{stronger} level of probabilistic robustness, we show that not only is proper learning \textit{always} possible, but it is possible via empirical risk minimization.
    A PINN Approach to Symbolic Differential Operator Discovery with Sparse Data. (arXiv:2212.04630v1 [cs.LG])
    Given ample experimental data from a system governed by differential equations, it is possible to use deep learning techniques to construct the underlying differential operators. In this work we perform symbolic discovery of differential operators in a situation where there is sparse experimental data. This small-data regime in machine learning can be made tractable by providing our algorithms with prior information about the underlying dynamics. Physics Informed Neural Networks (PINNs) have been very successful in this regime (reconstructing entire ODE solutions using only a single point or entire PDE solutions with very few measurements of the initial condition). We modify the PINN approach by adding a neural network that learns a representation of unknown hidden terms in the differential equation. The algorithm yields both a surrogate solution to the differential equation and a black-box representation of the hidden terms. These hidden-term neural networks can then be converted into symbolic equations using symbolic regression techniques like AI Feynman. In order to achieve convergence of these neural networks, we provide our algorithms with (noisy) measurements of both the initial condition and (synthetic) experimental data obtained at later times. We demonstrate strong performance of this approach even when provided with very few measurements of noisy data in both the ODE and PDE regimes.
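    A hedged PyTorch sketch for the ODE case du/dt = f(u), with f unknown, illustrates the two-network idea; the network sizes, collocation points, and loss weighting are assumptions.

```python
import torch

# Surrogate solution u(t) and black-box hidden term h(u) approximating f(u).
u_net = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1))
h_net = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1))

def pinn_loss(t_col, t_data, u_data):
    t = t_col.clone().requires_grad_(True)
    u = u_net(t)
    u_t = torch.autograd.grad(u.sum(), t, create_graph=True)[0]
    physics = ((u_t - h_net(u)) ** 2).mean()        # u must satisfy du/dt = h(u)
    data = ((u_net(t_data) - u_data) ** 2).mean()   # u must fit the noisy measurements
    return physics + data

opt = torch.optim.Adam(list(u_net.parameters()) + list(h_net.parameters()), lr=1e-3)
# After training, pairs (u, h_net(u)) can be fed to a symbolic regressor
# such as AI Feynman to recover a closed-form expression for the hidden term.
```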
    Attention in a family of Boltzmann machines emerging from modern Hopfield networks. (arXiv:2212.04692v1 [cs.LG])
    Hopfield networks and Boltzmann machines (BMs) are fundamental energy-based neural network models. Recent studies on modern Hopfield networks have broadened the class of energy functions and led to a unified perspective on general Hopfield networks, including an attention module. In this letter, we consider the BM counterparts of modern Hopfield networks using the associated energy functions and study their salient properties from a trainability perspective. In particular, the energy function corresponding to the attention module naturally introduces a novel BM, which we refer to as the attentional BM (AttnBM). We verify that AttnBM has a tractable likelihood function and gradient for a special case and is easy to train. Moreover, we reveal hidden connections between AttnBM and some single-layer models, namely the Gaussian--Bernoulli restricted BM and the denoising autoencoder with softmax units. We also investigate BMs introduced by other energy functions and, in particular, observe that the energy function of dense associative memory models gives BMs belonging to Exponential Family Harmoniums.
    The Integration of Machine Learning into Automated Test Generation: A Systematic Mapping Study. (arXiv:2206.10210v3 [cs.SE] UPDATED)
    Context: Machine learning (ML) may enable effective automated test generation. Objective: We characterize emerging research, examining testing practices, researcher goals, ML techniques applied, evaluation, and challenges. Methods: We perform a systematic mapping on a sample of 102 publications. Results: ML generates input for system, GUI, unit, performance, and combinatorial testing or improves the performance of existing generation methods. ML is also used to generate test verdicts, property-based, and expected output oracles. Supervised learning - often based on neural networks - and reinforcement learning - often based on Q-learning - are common, and some publications also employ unsupervised or semi-supervised learning. (Semi-/Un-)Supervised approaches are evaluated using both traditional testing metrics and ML-related metrics (e.g., accuracy), while reinforcement learning is often evaluated using testing metrics tied to the reward function. Conclusion: Work-to-date shows great promise, but there are open challenges regarding training data, retraining, scalability, evaluation complexity, ML algorithms employed - and how they are applied - benchmarks, and replicability. Our findings can serve as a roadmap and inspiration for researchers in this field.
    Implementing Neural Network-Based Equalizers in a Coherent Optical Transmission System Using Field-Programmable Gate Arrays. (arXiv:2212.04703v1 [eess.SP])
    In this work, we demonstrate the offline FPGA realization of both recurrent and feedforward neural network (NN)-based equalizers for nonlinearity compensation in coherent optical transmission systems. First, we present a realization pipeline showing the conversion of the models from Python libraries to FPGA chip synthesis and implementation. Then, we review the main alternatives for the hardware implementation of nonlinear activation functions. The main results are divided into three parts: a performance comparison, an analysis of how activation functions are implemented, and a report on the complexity of the hardware. The performance in Q-factor is presented for the bidirectional long short-term memory coupled with convolutional NN (biLSTM + CNN) equalizer, the CNN equalizer, and standard 1-StpS digital back-propagation (DBP), for simulated and experimental propagation of a single-channel dual-polarization (SC-DP) 16QAM signal at 34 GBd over 17x70 km of LEAF. The biLSTM+CNN equalizer provides a similar result to DBP and a 1.7 dB Q-factor gain compared with the chromatic dispersion compensation baseline on the experimental dataset. After that, we assess the Q-factor and the impact on hardware utilization when approximating the NN activation functions with Taylor series, piecewise linear, and look-up table (LUT) approximations. We also show how to mitigate the approximation errors with extra training and provide some insights into possible gradient problems in the LUT approximation. Finally, to evaluate the complexity of a hardware implementation achieving 400G throughput, fixed-point NN-based equalizers with approximated activation functions are developed and implemented in an FPGA.
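    To illustrate the LUT approach in software, here is a small numpy sketch of a nearest-entry tanh table and its approximation error; the table size, input range, and lookup policy are illustrative assumptions.

```python
import numpy as np

def make_tanh_lut(n_entries=256, lo=-4.0, hi=4.0):
    """Precompute a tanh look-up table over [lo, hi]; outside this range
    tanh saturates to +-1, which the nearest entry approximates well."""
    grid = np.linspace(lo, hi, n_entries)
    return grid, np.tanh(grid)

def lut_tanh(x, grid, table):
    # Nearest-entry lookup (no interpolation), as in a simple BRAM table.
    idx = np.clip(np.searchsorted(grid, x), 1, len(grid) - 1)
    nearer = np.where(np.abs(grid[idx - 1] - x) < np.abs(grid[idx] - x), idx - 1, idx)
    return table[nearer]

grid, table = make_tanh_lut()
x = np.linspace(-6, 6, 10_000)
err = np.max(np.abs(lut_tanh(x, grid, table) - np.tanh(x)))
print(f"max LUT approximation error: {err:.4f}")   # shrinks as n_entries grows
```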
    On the Sensitivity of Reward Inference to Misspecified Human Models. (arXiv:2212.04717v1 [cs.LG])
    Inferring reward functions from human behavior is at the center of value alignment - aligning AI objectives with what we, humans, actually want. But doing so relies on models of how humans behave given their objectives. After decades of research in cognitive science, neuroscience, and behavioral economics, obtaining accurate human models remains an open research topic. This begs the question: how accurate do these models need to be in order for the reward inference to be accurate? On the one hand, if small errors in the model can lead to catastrophic error in inference, the entire framework of reward learning seems ill-fated, as we will never have perfect models of human behavior. On the other hand, if as our models improve, we can have a guarantee that reward accuracy also improves, this would show the benefit of more work on the modeling side. We study this question both theoretically and empirically. We do show that it is unfortunately possible to construct small adversarial biases in behavior that lead to arbitrarily large errors in the inferred reward. However, and arguably more importantly, we are also able to identify reasonable assumptions under which the reward inference error can be bounded linearly in the error in the human model. Finally, we verify our theoretical insights in discrete and continuous control tasks with simulated and human data.  ( 2 min )
    Confidence-Conditioned Value Functions for Offline Reinforcement Learning. (arXiv:2212.04607v1 [cs.LG])
    Offline reinforcement learning (RL) promises the ability to learn effective policies solely using existing, static datasets, without any costly online interaction. To do so, offline RL methods must handle distributional shift between the dataset and the learned policy. The most common approach is to learn conservative, or lower-bound, value functions, which underestimate the return of out-of-distribution (OOD) actions. However, such methods exhibit one notable drawback: policies optimized on such value functions can only behave according to a fixed, possibly suboptimal, degree of conservatism. This can be alleviated if we instead learn policies for varying degrees of conservatism at training time and devise a method to dynamically choose one of them during evaluation. To do so, in this work, we propose learning value functions that additionally condition on the degree of conservatism, which we dub confidence-conditioned value functions. We derive a new form of a Bellman backup that simultaneously learns Q-values for any degree of confidence with high probability. By conditioning on confidence, our value functions enable adaptive strategies during online evaluation by controlling for confidence level using the history of observations thus far. This approach can be implemented in practice by conditioning the Q-function from existing conservative algorithms on the confidence. We theoretically show that our learned value functions produce conservative estimates of the true value at any desired confidence. Finally, we empirically show that our algorithm outperforms existing conservative offline RL algorithms on multiple discrete control domains.
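    A minimal sketch of the conditioning idea in PyTorch: the confidence level is simply an extra input to the Q-network, so a single set of weights represents a family of value functions ranging from aggressive to conservative. The architecture, sizes, and names here are our own illustrative assumptions, not the paper's implementation.

        import torch
        import torch.nn as nn

        class ConfidenceConditionedQ(nn.Module):
            """Q-network that takes the confidence level delta as an extra input."""
            def __init__(self, state_dim, num_actions, hidden=64):
                super().__init__()
                self.net = nn.Sequential(
                    nn.Linear(state_dim + 1, hidden), nn.ReLU(),
                    nn.Linear(hidden, num_actions),
                )

            def forward(self, state, delta):
                # Concatenate the confidence level onto the state features.
                return self.net(torch.cat([state, delta], dim=-1))

        q = ConfidenceConditionedQ(state_dim=4, num_actions=2)
        states = torch.randn(8, 4)
        delta = torch.full((8, 1), 0.9)   # query a conservative confidence level
        print(q(states, delta).shape)     # torch.Size([8, 2])
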
    Applying Deep Reinforcement Learning to the HP Model for Protein Structure Prediction. (arXiv:2211.14939v2 [cs.LG] UPDATED)
    A central problem in computational biophysics is protein structure prediction, i.e., finding the optimal folding of a given amino acid sequence. This problem has been studied in a classical abstract model, the HP model, where the protein is modeled as a sequence of H (hydrophobic) and P (polar) amino acids on a lattice. The objective is to find conformations maximizing H-H contacts. It is known that even in this reduced setting, the problem is intractable (NP-hard). In this work, we apply deep reinforcement learning (DRL) to the two-dimensional HP model. We obtain conformations with the best-known energies for benchmark HP sequences with lengths from 20 to 50. Our DRL approach is based on a deep Q-network (DQN). We find that a DQN based on a long short-term memory (LSTM) architecture greatly enhances the RL learning ability and significantly improves the search process. DRL can sample the state space efficiently, without the need for manual heuristics. Experimentally, we show that it can find multiple distinct best-known solutions per trial. This study demonstrates the effectiveness of deep reinforcement learning in the HP model for protein folding.
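    To make the HP model's objective concrete, the following sketch scores a 2D lattice conformation by counting H-H contacts between residues that are lattice neighbours but not chain neighbours; this count is the quantity a reward function would be built from. The encoding and helper names are ours, not the paper's.

        # Score a 2D HP-model conformation: count hydrophobic (H-H) contacts
        # between residues that are lattice neighbours but not chain neighbours.
        def hh_contacts(sequence, coords):
            """sequence: string of 'H'/'P'; coords: list of (x, y) lattice sites."""
            pos = {c: i for i, c in enumerate(coords)}
            contacts = 0
            for i, (aa, (x, y)) in enumerate(zip(sequence, coords)):
                if aa != 'H':
                    continue
                for nb in ((x + 1, y), (x, y + 1)):  # visit each undirected edge once
                    j = pos.get(nb)
                    if j is not None and sequence[j] == 'H' and abs(i - j) > 1:
                        contacts += 1
            return contacts

        # Tiny example: a 2x2 fold of the 4-mer "HPPH" brings the two H's together.
        print(hh_contacts("HPPH", [(0, 0), (1, 0), (1, 1), (0, 1)]))  # -> 1
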
    Self-Supervised PPG Representation Learning Shows High Inter-Subject Variability. (arXiv:2212.04902v1 [eess.SP])
    With the progress of sensor technology in wearables, the collection and analysis of PPG signals are gaining more interest. Using machine learning, the cardiac rhythm corresponding to PPG signals can be used to predict different tasks such as activity recognition, sleep stage detection, or more general health status. However, supervised learning is often limited by the amount of available labeled data, which is typically expensive to obtain. To address this problem, we propose a Self-Supervised Learning (SSL) method with a pretext task of signal reconstruction to learn an informative generalized PPG representation. The performance of the proposed SSL framework is compared with two fully supervised baselines. The results show that in a very limited labeled-data setting (10 samples per class or fewer), using SSL is beneficial, and a simple classifier trained on SSL-learned representations outperforms fully supervised deep neural networks. However, the results also reveal that the SSL-learned representations are too focused on encoding the subjects: the resulting high inter-subject variability makes working with this data more challenging when labeled data is scarce. This variability suggests that there is still room for improvement in learning representations. In general, the results suggest that SSL may pave the way for the broader use of machine learning models on PPG data in label-scarce regimes.  ( 2 min )
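    A minimal sketch of a signal-reconstruction pretext task in PyTorch, assuming 256-sample single-channel PPG windows and an illustrative 1D convolutional autoencoder (the paper's exact architecture may differ); the encoder output serves as the learned representation for downstream classifiers.

        import torch
        import torch.nn as nn

        class PPGAutoencoder(nn.Module):
            """Reconstruction pretext: encode a raw PPG window, then decode it."""
            def __init__(self, channels=16, latent=32):
                super().__init__()
                self.encoder = nn.Sequential(
                    nn.Conv1d(1, channels, kernel_size=7, stride=2, padding=3),
                    nn.ReLU(),
                    nn.Conv1d(channels, latent, kernel_size=7, stride=2, padding=3),
                    nn.ReLU(),
                )
                self.decoder = nn.Sequential(
                    nn.ConvTranspose1d(latent, channels, 7, stride=2, padding=3,
                                       output_padding=1),
                    nn.ReLU(),
                    nn.ConvTranspose1d(channels, 1, 7, stride=2, padding=3,
                                       output_padding=1),
                )

            def forward(self, x):
                z = self.encoder(x)          # z is the learned representation
                return self.decoder(z), z

        model = PPGAutoencoder()
        x = torch.randn(8, 1, 256)           # batch of 8 one-channel PPG windows
        recon, z = model(x)
        loss = nn.functional.mse_loss(recon, x)
        loss.backward()
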
    Understanding and Combating Robust Overfitting via Input Loss Landscape Analysis and Regularization. (arXiv:2212.04985v1 [cs.LG])
    Adversarial training is widely used to improve the robustness of deep neural networks to adversarial attacks. However, adversarial training is prone to overfitting, and the cause is far from clear. This work sheds light on the underlying mechanisms by analyzing the loss landscape w.r.t. the input. We find that robust overfitting results from standard training, specifically the minimization of the clean loss, and can be mitigated by regularization of the loss gradients. Moreover, we find that robust overfitting becomes more severe during adversarial training partially because the gradient regularization effect of adversarial training weakens as the curvature of the loss landscape increases. To improve robust generalization, we propose a new regularizer to smooth the loss landscape by penalizing the weighted logits variation along the adversarial direction. Our method significantly mitigates robust overfitting and achieves the highest robustness and efficiency compared to similar previous methods. Code is available at https://github.com/TreeLLi/Combating-RO-AdvLC.  ( 2 min )
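    A rough sketch of the kind of penalty described above: it measures how much the logits move between a clean input and its adversarial counterpart, with each class weighted by the clean predicted probability. The probability weighting is our own simplification of the paper's weighted logits variation, not its exact form.

        import torch
        import torch.nn.functional as F

        def logit_variation_penalty(model, x_clean, x_adv):
            # Squared change of each logit along the adversarial direction,
            # weighted per class by the clean prediction's softmax probability.
            logits_clean = model(x_clean)
            logits_adv = model(x_adv)
            weights = F.softmax(logits_clean.detach(), dim=1)
            return (weights * (logits_adv - logits_clean) ** 2).sum(dim=1).mean()
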
    ProductGraphSleepNet: Sleep Staging using Product Spatio-Temporal Graph Learning with Attentive Temporal Aggregation. (arXiv:2212.04881v1 [eess.SP])
    The classification of sleep stages plays a crucial role in understanding and diagnosing sleep pathophysiology. Sleep stage scoring relies heavily on visual inspection by an expert, a time-consuming and subjective procedure. Recently, deep neural network approaches have been leveraged to develop generalized automated sleep staging that accounts for shifts in distributions caused by inherent inter/intra-subject variability, heterogeneity across datasets, and different recording environments. However, these networks ignore the connections among brain regions and disregard the sequential connections between temporally adjacent sleep epochs. To address these issues, this work proposes an adaptive product graph learning-based graph convolutional network, named ProductGraphSleepNet, for learning joint spatio-temporal graphs, along with a bidirectional gated recurrent unit and a modified graph attention network to capture the attentive dynamics of sleep stage transitions. Evaluation on two public databases, the Montreal Archive of Sleep Studies (MASS) SS3 and the SleepEDF (containing full-night polysomnography recordings of 62 and 20 healthy subjects, respectively), demonstrates performance comparable to the state-of-the-art (accuracy: 0.867 and 0.838; F1-score: 0.818 and 0.774; Kappa: 0.802 and 0.775, on the two databases respectively). More importantly, the proposed network makes it possible for clinicians to comprehend and interpret the learned connectivity graphs for sleep stages.  ( 2 min )
    Learning Graph Algorithms With Recurrent Graph Neural Networks. (arXiv:2212.04934v1 [cs.LG])
    Classical graph algorithms work well for combinatorial problems that can be thoroughly formalized and abstracted. Once the algorithm is derived, it generalizes to instances of any size. However, developing an algorithm that handles complex structures and interactions in the real world can be challenging. Rather than specifying the algorithm, we can try to learn it from the graph-structured data. Graph Neural Networks (GNNs) are inherently capable of working on graph structures; however, they struggle to generalize well, and learning on larger instances is challenging. In order to scale, we focus on a recurrent architecture design that can learn simple graph problems end to end on smaller graphs and then extrapolate to larger instances. As our main contribution, we identify three essential techniques for recurrent GNNs to scale. By using (i) skip connections, (ii) state regularization, and (iii) edge convolutions, we can guide GNNs toward extrapolation. This allows us to train on small graphs and apply the same model to much larger graphs during inference. Moreover, we empirically validate the extrapolation capabilities of our GNNs on algorithmic datasets.  ( 2 min )
    Spurious Features Everywhere -- Large-Scale Detection of Harmful Spurious Features in ImageNet. (arXiv:2212.04871v1 [cs.CV])
    Benchmark performance of deep learning classifiers alone is not a reliable predictor for the performance of a deployed model. In particular, if the image classifier has picked up spurious features in the training data, its predictions can fail in unexpected ways. In this paper, we develop a framework that allows us to systematically identify spurious features in large datasets like ImageNet. It is based on our neural PCA components and their visualization. Previous work on spurious features of image classifiers often operates in toy settings or requires costly pixel-wise annotations. In contrast, we validate our results by checking that the presence of the harmful spurious feature of a class is sufficient to trigger the prediction of that class. We introduce a novel dataset, "Spurious ImageNet", and check how much existing classifiers rely on spurious features.  ( 2 min )
    Multidimensional Service Quality Scoring System. (arXiv:2212.04611v1 [cs.LG])
    This supplementary paper aims to introduce the Multidimensional Service Quality Scoring System (MSQs), a review-based method for quantifying host service quality mentioned and employed in the paper Exit and transition: Exploring the survival status of Airbnb listings in a time of professionalization. MSQs is not an end-to-end implementation and is essentially composed of three pipelines, namely Data Collection and Preprocessing, Objects Recognition and Grouping, and Aspect-based Service Scoring. Using the study mentioned above as a case, the technical details of MSQs are explained in this article.  ( 2 min )
    The Cross Density Kernel Function: A Novel Framework to Quantify Statistical Dependence for Random Processes. (arXiv:2212.04631v1 [cs.LG])
    This paper proposes a novel multivariate definition of statistical dependence using a functional methodology inspired by Alfred Rényi. We define a new symmetric and self-adjoint cross density kernel through a recursive bidirectional statistical mapping between conditional densities of continuous random processes, which estimates their statistical dependence. The kernel eigenspectrum is therefore proposed as a new multivariate statistical dependence measure, and the formulation requires fewer assumptions about the data generation model than current methods. The measure can also be estimated from realizations. The proposed functional maximum correlation algorithm (FMCA) is applied to a learning architecture with two multivariate neural networks. The FMCA optimal solution is an equilibrium point that estimates the eigenspectrum of the cross density kernel. Preliminary results with synthetic data and medium-sized image datasets corroborate the theory. Four different strategies of applying the cross density kernel are thoroughly discussed and implemented to show the versatility and stability of the methodology, which transcends supervised learning. When the two random processes are high-dimensional real-world images and white uniform noise, respectively, the algorithm learns a factorial code, i.e., the occurrence of a code guarantees that a certain input in the training set was present, which is quite important for feature learning.  ( 2 min )
    A Meta-level Analysis of Online Anomaly Detectors. (arXiv:2209.05899v2 [cs.LG] UPDATED)
    Real-time detection of anomalies in streaming data is receiving increasing attention as it allows us to raise alerts, predict faults, and detect intrusions or threats across industries. Yet, little attention has been given to compare the effectiveness and efficiency of anomaly detectors for streaming data (i.e., of online algorithms). In this paper, we present a qualitative, synthetic overview of major online detectors from different algorithmic families (i.e., distance, density, tree or projection-based) and highlight their main ideas for constructing, updating and testing detection models. Then, we provide a thorough analysis of the results of a quantitative experimental evaluation of online detection algorithms along with their offline counterparts. The behavior of the detectors is correlated with the characteristics of different datasets (i.e., meta-features), thereby providing a meta-level analysis of their performance. Our study addresses several missing insights from the literature such as (a) how reliable are detectors against a random classifier and what dataset characteristics make them perform randomly; (b) to what extent online detectors approximate the performance of offline counterparts; (c) which sketch strategy and update primitives of detectors are best to detect anomalies visible only within a feature subspace of a dataset; (d) what are the tradeoffs between the effectiveness and the efficiency of detectors belonging to different algorithmic families; (e) which specific characteristics of datasets yield an online algorithm to outperform all others.
    DDSupport: Language Learning Support System that Displays Differences and Distances from Model Speech. (arXiv:2212.04930v1 [eess.AS])
    When beginners learn to speak a non-native language, it is difficult for them to judge for themselves whether they are speaking well. Therefore, computer-assisted pronunciation training systems are used to detect learner mispronunciations. These systems typically compare the user's speech with that of a specific native speaker as a model in units of rhythm, phonemes, or words and calculate the differences. However, they require extensive speech data with detailed annotations or can only compare with one specific native speaker. To overcome these problems, we propose a new language learning support system that calculates speech scores and detects mispronunciations by beginners based on a small amount of unannotated speech data, without comparison to a specific person. The proposed system uses deep learning-based speech processing to display the pronunciation score of the learner's speech and the difference/distance between the learner's and a group of models' pronunciation in an intuitively visual manner. Learners can gradually improve their pronunciation by eliminating differences and shortening the distance from the model until they become sufficiently proficient. Furthermore, since the pronunciation score and difference/distance are not calculated against specific sentences of a particular model, users are free to study the sentences they wish to study. We also built an application to help non-native speakers learn English and confirmed that it can improve users' speech intelligibility.
    A Grid-based Sensor Floor Platform for Robot Localization using Machine Learning. (arXiv:2212.04721v1 [cs.LG])
    Wireless Sensor Network (WSN) applications reshape the trend of warehouse monitoring systems, allowing them to track and locate massive numbers of logistic entities in real-time. To support such tasks, classic Radio Frequency (RF)-based localization approaches (e.g., triangulation and trilateration) confront challenges due to multi-path fading and signal loss in noisy warehouse environments. In this paper, we investigate machine learning methods using a new grid-based WSN platform called Sensor Floor that can overcome these issues. Sensor Floor consists of 345 nodes installed across the floor of our logistic research hall, each with dual-band RF and Inertial Measurement Unit (IMU) sensors. Our goal is to localize all logistic entities; for this study, we use a mobile robot. We record distributed sensing measurements of Received Signal Strength Indicator (RSSI) and IMU values as the dataset and position tracking from a Vicon system as the ground truth. The asynchronously collected data are pre-processed and trained using Random Forest and Convolutional Neural Network (CNN) models. The CNN model with regularization outperforms the Random Forest in terms of localization accuracy, achieving approximately 15 cm. Moreover, the CNN architecture can be configured flexibly depending on the scenario in the warehouse. The hardware, software and the CNN architecture of the Sensor Floor are open-source under https://github.com/FLW-TUDO/sensorfloor.  ( 2 min )
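    As an illustration of the CNN formulation, localization can be cast as regression from the floor's RSSI readings, arranged as a single-channel image over the sensor grid, to the robot's (x, y) position. The 15x23 grid (which factors the 345 nodes) and the layer sizes are our assumptions, not necessarily the deployed configuration.

        import torch
        import torch.nn as nn

        class FloorLocalizer(nn.Module):
            """Regress the robot's (x, y) position from an RSSI 'image'."""
            def __init__(self):
                super().__init__()
                self.features = nn.Sequential(
                    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
                    nn.AdaptiveAvgPool2d(1),   # pool over the sensor grid
                )
                self.head = nn.Linear(32, 2)   # (x, y) in metres

            def forward(self, rssi_grid):
                return self.head(self.features(rssi_grid).flatten(1))

        model = FloorLocalizer()
        rssi = torch.randn(4, 1, 15, 23)       # batch of toy RSSI grids
        pred_xy = model(rssi)
        loss = nn.functional.mse_loss(pred_xy, torch.zeros(4, 2))
        loss.backward()
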
    UNet Based Pipeline for Lung Segmentation from Chest X-Ray Images. (arXiv:2212.04617v1 [eess.IV])
    Biomedical image segmentation is one of the fastest growing fields which has seen extensive automation through the use of Artificial Intelligence. This has enabled widespread adoption of accurate techniques to expedite the screening and diagnostic processes which would otherwise take several days to finalize. In this paper, we present an end-to-end pipeline to segment lungs from chest X-ray images, training the neural network model on the Japanese Society of Radiological Technology (JSRT) dataset, using UNet to enable faster processing of initial screening for various lung disorders. The pipeline developed can be readily used by medical centers with just the provision of X-Ray images as input. The model will perform the preprocessing, and provide a segmented image as the final output. It is expected that this will drastically reduce the manual effort involved and lead to greater accessibility in resource-constrained locations.
    Deep Learning of Causal Structures in High Dimensions. (arXiv:2212.04866v1 [cs.LG])
    Recent years have seen rapid progress at the intersection between causality and machine learning. Motivated by scientific applications involving high-dimensional data, in particular in biomedicine, we propose a deep neural architecture for learning causal relationships between variables from a combination of empirical data and prior causal knowledge. We combine convolutional and graph neural networks within a causal risk framework to provide a flexible and scalable approach. Empirical results include linear and nonlinear simulations (where the underlying causal structures are known and can be directly compared against), as well as a real biological example where the models are applied to high-dimensional molecular data and their output compared against entirely unseen validation experiments. These results demonstrate the feasibility of using deep learning approaches to learn causal networks in large-scale problems spanning thousands of variables.
    Machine Learning Framework: Competitive Intelligence and Key Drivers Identification of Market Share Trends Among Healthcare Facilities. (arXiv:2212.04810v1 [cs.LG])
    The necessity of data-driven decisions in healthcare strategy formulation is rapidly increasing. A reliable framework which helps identify factors impacting the market share of a Healthcare Provider Facility or a Hospital (from here on termed as Facility) is of key importance. This pilot study aims at developing a data-driven machine learning regression framework which aids strategists in formulating key decisions to improve the Facility's market share, which in turn helps improve the quality of healthcare services. The US (United States) healthcare business is chosen for the study, and data spanning 60 key Facilities in Washington State and about 3 years of history is considered. In the current analysis, market share is defined as the ratio of a facility's encounters to the total encounters among the group of potential competitor facilities. The current study proposes a novel two-pronged approach: competitor identification, and a regression approach to evaluate and predict market share. We leverage the model-agnostic technique SHAP to quantify the relative importance of features impacting the market share. The proposed competitor-identification method develops Directed Acyclic Graphs (DAGs) and feature-level word vectors, and evaluates the key connected components at the facility level. This technique is robust since it is data-driven, which minimizes the bias of empirical techniques. After identifying the set of competitors among facilities, we developed a regression model to predict the market share. For relative quantification of features at the facility level, we incorporated SHAP, a model-agnostic explainer. This helped to identify and rank the attributes at each facility which impact the market share.  ( 2 min )
    Digital Twin for Real-time Li-ion Battery State of Health Estimation with Partially Discharged Cycling Data. (arXiv:2212.04622v1 [cs.LG])
    To meet the fairly high safety and reliability requirements in practice, the state of health (SOH) estimation of Lithium-ion batteries (LIBs), which has a close relationship with the degradation performance, has been extensively studied with the widespread applications of various electronics. Conventional SOH estimation approaches with digital twins perform end-of-cycle estimation, requiring the completion of a full charge/discharge cycle to observe the maximum available capacity. However, under dynamic operating conditions with partially discharged data, accurate real-time SOH estimation for LIBs is impossible with these approaches. To bridge this research gap, we put forward a digital twin framework that senses the battery's SOH on the fly and updates the physical battery model. The proposed digital twin solution consists of three core components to enable real-time SOH estimation without requiring a complete discharge. First, to handle the variable training cycling data, energy discrepancy-aware cycling synchronization is proposed to align cycling data while guaranteeing the same data structure. Second, to explore the temporal importance of different training sampling times, a time-attention SOH estimation model is developed with data encoding to capture the degradation behavior over cycles, excluding the adverse influence of unimportant samples. Finally, for online implementation, a similarity analysis-based data reconstruction is put forward to provide real-time SOH estimation without requiring a full discharge cycle. Through a series of experiments conducted on a widely used benchmark, the proposed method yields real-time SOH estimates with errors of less than 1% for most sampling times in ongoing cycles.  ( 2 min )
    Contrastive View Design Strategies to Enhance Robustness to Domain Shifts in Downstream Object Detection. (arXiv:2212.04613v1 [cs.CV])
    Contrastive learning has emerged as a competitive pretraining method for object detection. Despite this progress, there has been minimal investigation into the robustness of contrastively pretrained detectors when faced with domain shifts. To address this gap, we conduct an empirical study of contrastive learning and out-of-domain object detection, studying how contrastive view design affects robustness. In particular, we perform a case study of the detection-focused pretext task Instance Localization (InsLoc) and propose strategies to augment views and enhance robustness in appearance-shifted and context-shifted scenarios. Amongst these strategies, we propose changes to cropping such as altering the percentage used, adding IoU constraints, and integrating saliency based object priors. We also explore the addition of shortcut-reducing augmentations such as Poisson blending, texture flattening, and elastic deformation. We benchmark these strategies on abstract, weather, and context domain shifts and illustrate robust ways to combine them, in both pretraining on single-object and multi-object image datasets. Overall, our results and insights show how to ensure robustness through the choice of views in contrastive learning.  ( 2 min )
    Selective Amnesia: On Efficient, High-Fidelity and Blind Suppression of Backdoor Effects in Trojaned Machine Learning Models. (arXiv:2212.04687v1 [cs.LG])
    In this paper, we present a simple yet surprisingly effective technique to induce "selective amnesia" on a backdoored model. Our approach, called SEAM, has been inspired by the problem of catastrophic forgetting (CF), a long-standing issue in continual learning. Our idea is to retrain a given DNN model on randomly labeled clean data, to induce CF in the model, leading to sudden forgetting of both the primary and backdoor tasks; then we recover the primary task by retraining the randomized model on correctly labeled clean data. We analyzed SEAM by modeling the unlearning process as continual learning and further approximating a DNN using the Neural Tangent Kernel for measuring CF. Our analysis shows that our random-labeling approach actually maximizes the CF on an unknown backdoor in the absence of triggered inputs, and also preserves some feature extraction in the network to enable a fast revival of the primary task. We further evaluated SEAM on both image processing and Natural Language Processing tasks, under both data contamination and training manipulation attacks, over thousands of models either trained on popular image datasets or provided by the TrojAI competition. Our experiments show that SEAM vastly outperforms the state-of-the-art unlearning techniques, achieving a high Fidelity (measuring the gap between the accuracy of the primary task and that of the backdoor) within a few minutes (about 30 times faster than training a model from scratch on the MNIST dataset), with only a small amount of clean data (0.1% of training data for TrojAI models).  ( 2 min )
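    The forget-then-recover recipe can be sketched in a few lines of PyTorch: fine-tune the suspect model on clean inputs with random labels to induce catastrophic forgetting, then fine-tune on correctly labeled clean data to revive the primary task. Epoch counts, optimizer, and learning rate below are placeholders, not SEAM's tuned settings.

        import torch

        def forget_then_recover(model, clean_loader, num_classes,
                                forget_epochs=1, recover_epochs=5, lr=1e-3):
            opt = torch.optim.SGD(model.parameters(), lr=lr)
            loss_fn = torch.nn.CrossEntropyLoss()
            for _ in range(forget_epochs):          # "forgetting" phase
                for x, _ in clean_loader:
                    # Random labels erase both the primary and backdoor tasks.
                    y_rand = torch.randint(0, num_classes, (x.size(0),))
                    opt.zero_grad()
                    loss_fn(model(x), y_rand).backward()
                    opt.step()
            for _ in range(recover_epochs):         # "recovery" phase
                for x, y in clean_loader:
                    opt.zero_grad()
                    loss_fn(model(x), y).backward()
                    opt.step()
            return model
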
    Transfer Learning Enhanced DeepONet for Long-Time Prediction of Evolution Equations. (arXiv:2212.04663v1 [cs.LG])
    The deep operator network (DeepONet) has demonstrated great success in various learning tasks, including learning solution operators of partial differential equations. In particular, it provides an efficient approach to predicting evolution equations over a finite time horizon. Nevertheless, the vanilla DeepONet suffers from stability degradation in long-time prediction. This paper proposes a transfer-learning aided DeepONet to enhance stability. Our idea is to use transfer learning to sequentially update the DeepONets as surrogates for the propagators learned in different time frames. The evolving DeepONets can better track the varying complexities of the evolution equations, while only needing to be updated by efficiently training a tiny fraction of the operator networks. Through systematic experiments, we show that the proposed method not only improves the long-time accuracy of DeepONet while maintaining similar computational cost but also substantially reduces the sample size of the training set.  ( 2 min )
    The R-algebra of Quasiknowledge and Convex Optimization. (arXiv:2212.04606v1 [quant-ph])
    This article develops a convex description of a classical or quantum learner's or agent's state of knowledge about its environment, presented as a convex subset of a commutative R-algebra. With caveats, this leads to a generalization of certain semidefinite programs in quantum information (such as those describing the universal query algorithm dual to the quantum adversary bound, related to optimal learning or control of the environment) to the classical and faulty-quantum setting, which would not be possible with a naive description via joint probability distributions over environment and internal memory. More philosophically, it also makes an interpretation of the set of reduced density matrices as "states of knowledge" of an observer of its environment, related to these techniques, more explicit. As another example, I describe and solve a formal differential equation of states of knowledge in that algebra, where an agent obtains experimental data in a Poissonian process, and its state of knowledge evolves as an exponential power series. However, this framework currently lacks impressive applications, and I post it in part to solicit feedback and collaboration on those. In particular, it may be possible to develop it into a new framework for the design of experiments, e.g. the problem of finding maximally informative questions to ask human labelers or the environment in machine-learning problems. The parts of the article not related to quantum information don't assume knowledge of it.  ( 2 min )
    Is Bio-Inspired Learning Better than Backprop? Benchmarking Bio Learning vs. Backprop. (arXiv:2212.04614v1 [cs.LG])
    Bio-inspired learning has been gaining popularity recently given that Backpropagation (BP) is not considered biologically plausible. Many algorithms have been proposed in the literature which are all more biologically plausible than BP. However, apart from overcoming the biological implausibility of BP, a strong motivation for using Bio-inspired algorithms remains lacking. In this study, we undertake a holistic comparison of BP vs. multiple Bio-inspired algorithms to answer the question of whether Bio-learning offers additional benefits over BP, rather than just biological plausibility. We test Bio-algorithms under different design choices such as access to only partial training data, resource constraints in terms of the number of training epochs, sparsification of the neural network parameters and addition of noise to input samples. Through these experiments, we notably find two key advantages of Bio-algorithms over BP. Firstly, Bio-algorithms perform much better than BP when the entire training dataset is not supplied. Four of the five Bio-algorithms tested outperform BP by up to 5% accuracy when only 20% of the training dataset is available. Secondly, even when the full dataset is available, Bio-algorithms learn much quicker and converge to a stable accuracy in far fewer training epochs than BP. Hebbian learning, specifically, is able to learn in just 5 epochs compared to around 100 epochs required by BP. These insights present practical reasons for utilising Bio-learning beyond its biological plausibility and also point towards interesting new directions for future work on Bio-learning.  ( 2 min )
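    For readers unfamiliar with Hebbian updates, the sketch below implements Oja's rule, a standard Hebbian variant, for a single linear unit: weights strengthen along co-active input/output directions without any backpropagated error signal. The paper benchmarks several bio-inspired algorithms whose exact update rules may differ from this one.

        import numpy as np

        # Oja's rule for one linear unit: a Hebb term y*x plus a decay that
        # keeps the weight vector bounded (it converges to the top principal
        # component of the inputs, with unit norm).
        rng = np.random.default_rng(0)
        X = rng.normal(size=(1000, 10))       # toy inputs
        w = rng.normal(size=10)
        lr = 0.01
        for _ in range(5):                    # a few epochs suffice here
            for x in X:
                y = w @ x
                w += lr * y * (x - y * w)
        print(np.linalg.norm(w))              # approaches 1.0
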
    Training Data Influence Analysis and Estimation: A Survey. (arXiv:2212.04612v1 [cs.LG])
    Good models require good training data. For overparameterized deep models, the causal relationship between training data and model predictions is increasingly opaque and poorly understood. Influence analysis partially demystifies training's underlying interactions by quantifying the amount each training instance alters the final model. Measuring the training data's influence exactly can be provably hard in the worst case; this has led to the development and use of influence estimators, which only approximate the true influence. This paper provides the first comprehensive survey of training data influence analysis and estimation. We begin by formalizing the various, and in places orthogonal, definitions of training data influence. We then organize state-of-the-art influence analysis methods into a taxonomy; we describe each of these methods in detail and compare their underlying assumptions, asymptotic complexities, and overall strengths and weaknesses. Finally, we propose future research directions to make influence analysis more useful in practice as well as more theoretically and empirically sound. A curated, up-to-date list of resources related to influence analysis is available at https://github.com/ZaydH/influence_analysis_papers.  ( 2 min )
    Localized Contrastive Learning on Graphs. (arXiv:2212.04604v1 [cs.LG])
    Contrastive learning methods based on the InfoNCE loss are popular in node representation learning tasks on graph-structured data. However, their reliance on data augmentation and their quadratic computational complexity can lead to inconsistency and inefficiency problems. To mitigate these limitations, in this paper, we introduce a simple yet effective contrastive model named Localized Graph Contrastive Learning (Local-GCL in short). Local-GCL consists of two key designs: 1) We fabricate the positive examples for each node directly using its first-order neighbors, which frees our method from the reliance on carefully-designed graph augmentations; 2) To improve the efficiency of contrastive learning on graphs, we devise a kernelized contrastive loss, which can be approximately computed in linear time and space complexity with respect to the graph size. We provide theoretical analysis to justify the effectiveness and rationality of the proposed method. Experiments demonstrate that, in spite of its simplicity, Local-GCL achieves quite competitive performance in self-supervised node representation learning tasks on graphs of various scales and properties.  ( 2 min )
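    A sketch of design 1) in its plain quadratic form, assuming node embeddings z and an edge list: each node is pulled toward its first-order neighbours and contrasted against all other nodes via an InfoNCE-style loss. The linear-time kernelized approximation of design 2) is omitted here.

        import torch
        import torch.nn.functional as F

        def neighbor_contrastive_loss(z, edge_index, tau=0.5):
            # Positives are first-order neighbours; negatives are all nodes.
            z = F.normalize(z, dim=1)
            sim = z @ z.t() / tau                    # (N, N) similarities
            src, dst = edge_index                    # positive pairs (2, E)
            log_den = torch.logsumexp(sim, dim=1)    # denominator per anchor
            return (log_den[src] - sim[src, dst]).mean()

        z = torch.randn(5, 8, requires_grad=True)
        edges = torch.tensor([[0, 1, 2], [1, 2, 3]]) # toy edge list
        neighbor_contrastive_loss(z, edges).backward()
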
    SpeechLMScore: Evaluating speech generation using speech language model. (arXiv:2212.04559v1 [eess.AS])
    While human evaluation is the most reliable metric for evaluating speech generation systems, it is generally costly and time-consuming. Previous studies on automatic speech quality assessment address the problem by predicting human evaluation scores with machine learning models. However, they rely on supervised learning and thus suffer from high annotation costs and domain-shift problems. We propose SpeechLMScore, an unsupervised metric to evaluate generated speech using a speech-language model. SpeechLMScore computes the average log-probability of a speech signal by mapping it into discrete tokens and measures the average probability of generating the sequence of tokens. Therefore, it does not require human annotation and is a highly scalable framework. Evaluation results demonstrate that the proposed metric shows a promising correlation with human evaluation scores on different speech generation tasks including voice conversion, text-to-speech, and speech enhancement.  ( 2 min )
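    The core computation reduces to an average token log-probability. Given per-step probabilities from any autoregressive language model over discrete speech tokens, the score is the mean of log p(token_t | tokens_<t); the probabilities below are made-up stand-ins for a real speech LM's outputs.

        import math

        def speech_lm_score(token_probs):
            # Average log-probability of the discrete token sequence.
            return sum(math.log(p) for p in token_probs) / len(token_probs)

        print(speech_lm_score([0.9, 0.8, 0.85]))  # closer to 0 = more "natural"
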
    Learning Options via Compression. (arXiv:2212.04590v1 [cs.LG])
    Identifying statistical regularities in solutions to some tasks in multi-task reinforcement learning can accelerate the learning of new tasks. Skill learning offers one way of identifying these regularities by decomposing pre-collected experiences into a sequence of skills. A popular approach to skill learning is maximizing the likelihood of the pre-collected experience with latent variable models, where the latent variables represent the skills. However, there are often many solutions that maximize the likelihood equally well, including degenerate solutions. To address this underspecification, we propose a new objective that combines the maximum likelihood objective with a penalty on the description length of the skills. This penalty incentivizes the skills to maximally extract common structures from the experiences. Empirically, our objective learns skills that solve downstream tasks in fewer samples compared to skills learned from only maximizing likelihood. Further, while most prior works in the offline multi-task setting focus on tasks with low-dimensional observations, our objective can scale to challenging tasks with high-dimensional image observations.  ( 2 min )
    Knowledge Distillation Applied to Optical Channel Equalization: Solving the Parallelization Problem of Recurrent Connection. (arXiv:2212.04569v1 [eess.SP])
    To circumvent the non-parallelizability of recurrent neural network-based equalizers, we propose knowledge distillation to recast the RNN into a parallelizable feedforward structure. The latter shows a 38% latency decrease while impacting the Q-factor by only 0.5 dB.  ( 2 min )
    STLGRU: Spatio-Temporal Lightweight Graph GRU for Traffic Flow Prediction. (arXiv:2212.04548v1 [cs.LG])
    Reliable forecasting of traffic flow requires efficient modeling of traffic data. Different correlations and influences arise in a dynamic traffic network, making modeling a complicated task. Existing literature has proposed many different methods to capture the complex underlying spatial-temporal relations of traffic networks. However, methods still struggle to capture different local and global dependencies of long-range nature. Also, as more and more sophisticated methods are being proposed, models are increasingly becoming memory-heavy and, thus, unsuitable for low-powered devices. In this paper, we focus on solving these problems by proposing a novel deep learning framework, STLGRU. Specifically, our proposed STLGRU can effectively capture both local and global spatial-temporal relations of a traffic network using memory-augmented attention and a gating mechanism. Instead of employing separate temporal and spatial components, we show that our memory module and gated unit can learn the spatial-temporal dependencies successfully, allowing for reduced memory usage with fewer parameters. We extensively experiment on several real-world traffic prediction datasets to show that our model performs better than existing methods while the memory footprint remains lower. Code is available at https://github.com/Kishor-Bhaumik/STLGRU.  ( 2 min )
    Effective Dynamics of Generative Adversarial Networks. (arXiv:2212.04580v1 [cond-mat.dis-nn])
    Generative adversarial networks (GANs) are a class of machine-learning models that use adversarial training to generate new samples with the same (potentially very complex) statistics as the training samples. One major form of training failure, known as mode collapse, involves the generator failing to reproduce the full diversity of modes in the target probability distribution. Here, we present an effective model of GAN training, which captures the learning dynamics by replacing the generator neural network with a collection of particles in the output space; particles are coupled by a universal kernel valid for certain wide neural networks and high-dimensional inputs. The generality of our simplified model allows us to study the conditions under which mode collapse occurs. Indeed, experiments which vary the effective kernel of the generator reveal a mode collapse transition, the shape of which can be related to the type of discriminator through the frequency principle. Further, we find that gradient regularizers of intermediate strengths can optimally yield convergence through critical damping of the generator dynamics. Our effective GAN model thus provides an interpretable physical framework for understanding and improving adversarial training.  ( 2 min )
    Enhanced prediction accuracy with uncertainty quantification in monitoring CO2 sequestration using convolutional neural networks. (arXiv:2212.04567v1 [physics.geo-ph])
    Monitoring changes inside a reservoir in real time is crucial for the success of CO2 injection and long-term storage. Machine learning (ML) is well-suited for real-time CO2 monitoring because of its computational efficiency. However, most existing applications of ML yield only one prediction (i.e., the expectation) for a given input, which may not properly reflect the distribution of the testing data, if it has a shift with respect to that of the training data. The Simultaneous Quantile Regression (SQR) method can estimate the entire conditional distribution of the target variable of a neural network via pinball loss. Here, we incorporate this technique into seismic inversion for purposes of CO2 monitoring. The uncertainty map is then calculated pixel by pixel from a particular prediction interval around the median. We also propose a novel data-augmentation method by sampling the uncertainty to further improve prediction accuracy. The developed methodology is tested on synthetic Kimberlina data, which are created by the Department of Energy and based on a CO2 capture and sequestration (CCS) project in California. The results prove that the proposed network can estimate the subsurface velocity rapidly and with sufficient resolution. Furthermore, the computed uncertainty quantifies the prediction accuracy. The method remains robust even if the testing data are distorted due to problems in the field data acquisition. Another test demonstrates the effectiveness of the developed data-augmentation method in increasing the spatial resolution of the estimated velocity field and in reducing the prediction error.  ( 2 min )
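    The pinball (quantile) loss at the heart of SQR has a compact form: for quantile level q, positive and negative errors are weighted asymmetrically, and training with q drawn uniformly per example recovers the full conditional distribution. A minimal sketch:

        import torch

        def pinball_loss(pred, target, q):
            # Under-predictions cost q per unit error; over-predictions cost 1 - q.
            err = target - pred
            return torch.maximum(q * err, (q - 1) * err).mean()

        pred = torch.tensor([2.0, 2.0])
        target = torch.tensor([3.0, 1.0])
        print(pinball_loss(pred, target, q=0.9))  # penalises under-prediction more
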
    Towards Understanding Fairness and its Composition in Ensemble Machine Learning. (arXiv:2212.04593v1 [cs.LG])
    Machine Learning (ML) software has been widely adopted in modern society, with reported fairness implications for minority groups based on race, sex, age, etc. Many recent works have proposed methods to measure and mitigate algorithmic bias in ML models. The existing approaches focus on single classifier-based ML models. However, real-world ML models are often composed of multiple independent or dependent learners in an ensemble (e.g., Random Forest), where the fairness composes in a non-trivial way. How does fairness compose in ensembles? What are the fairness impacts of the learners on the ultimate fairness of the ensemble? Can fair learners result in an unfair ensemble? Furthermore, studies have shown that hyperparameters influence the fairness of ML models. Ensemble hyperparameters are more complex since they affect how learners are combined in different categories of ensembles. Understanding the impact of ensemble hyperparameters on fairness will help programmers design fair ensembles. Today, we do not understand these fully for different ensemble algorithms. In this paper, we comprehensively study popular real-world ensembles: bagging, boosting, stacking and voting. We have developed a benchmark of 168 ensemble models collected from Kaggle on four popular fairness datasets. We use existing fairness metrics to understand the composition of fairness. Our results show that ensembles can be designed to be fairer without using mitigation techniques. We also identify the interplay between fairness composition and data characteristics to guide fair ensemble design. Finally, our benchmark can be leveraged for further research on fair ensembles. To the best of our knowledge, this is one of the first and largest studies on fairness composition in ensembles yet presented in the literature.  ( 2 min )
    PALMER: Perception-Action Loop with Memory for Long-Horizon Planning. (arXiv:2212.04581v1 [cs.RO])
    To achieve autonomy in a priori unknown real-world scenarios, agents should be able to: i) act from high-dimensional sensory observations (e.g., images), ii) learn from past experience to adapt and improve, and iii) be capable of long horizon planning. Classical planning algorithms (e.g. PRM, RRT) are proficient at handling long-horizon planning. Deep learning based methods in turn can provide the necessary representations to address the others, by modeling statistical contingencies between observations. In this direction, we introduce a general-purpose planning algorithm called PALMER that combines classical sampling-based planning algorithms with learning-based perceptual representations. For training these perceptual representations, we combine Q-learning with contrastive representation learning to create a latent space where the distance between the embeddings of two states captures how easily an optimal policy can traverse between them. For planning with these perceptual representations, we re-purpose classical sampling-based planning algorithms to retrieve previously observed trajectory segments from a replay buffer and restitch them into approximately optimal paths that connect any given pair of start and goal states. This creates a tight feedback loop between representation learning, memory, reinforcement learning, and sampling-based planning. The end result is an experiential framework for long-horizon planning that is significantly more robust and sample efficient compared to existing methods.  ( 2 min )
    Graph Learning Indexer: A Contributor-Friendly and Metadata-Rich Platform for Graph Learning Benchmarks. (arXiv:2212.04537v1 [cs.LG])
    Establishing open and general benchmarks has been a critical driving force behind the success of modern machine learning techniques. As machine learning is being applied to broader domains and tasks, there is a need to establish richer and more diverse benchmarks to better reflect the reality of the application scenarios. Graph learning is an emerging field of machine learning that urgently needs more and better benchmarks. To accommodate this need, we introduce Graph Learning Indexer (GLI), a benchmark curation platform for graph learning. In comparison to existing graph learning benchmark libraries, GLI highlights two novel design objectives. First, GLI is designed to incentivize dataset contributors. In particular, we incorporate various measures to minimize the effort of contributing and maintaining a dataset, increase the usability of the contributed dataset, and encourage attribution to the different contributors of the dataset. Second, GLI is designed to curate a knowledge base, instead of a plain collection, of benchmark datasets. We use multiple sources of meta information to augment the benchmark datasets with rich characteristics, so that they can be easily selected and used in downstream research or development. The source code of GLI is available at https://github.com/Graph-Learning-Benchmarks/gli.  ( 2 min )
    A Dependable Hybrid Machine Learning Model for Network Intrusion Detection. (arXiv:2212.04546v1 [cs.CR])
    Network intrusion detection systems (NIDSs) play an important role in computer network security. There are several detection mechanisms where anomaly-based automated detection outperforms others significantly. Amid the sophistication and growing number of attacks, dealing with large amounts of data is a recognized issue in the development of anomaly-based NIDS. However, do current models meet the needs of today's networks in terms of required accuracy and dependability? In this research, we propose a new hybrid model that combines machine learning and deep learning to increase detection rates while securing dependability. Our proposed method ensures efficient pre-processing by combining SMOTE for data balancing and XGBoost for feature selection. We compared our developed method to various machine learning and deep learning algorithms to find a more efficient algorithm to implement in the pipeline. Furthermore, we chose the most effective model for network intrusion based on a set of benchmarked performance analysis criteria. Our method produces excellent results when tested on two datasets, KDDCUP'99 and CIC-MalMem-2022, with an accuracy of 99.99% and 100% for KDDCUP'99 and CIC-MalMem-2022, respectively, and no overfitting or Type-1 and Type-2 issues.  ( 2 min )
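    A sketch of the described pre-processing pipeline using the publicly available imbalanced-learn and xgboost packages: SMOTE balances the classes, XGBoost importances select features, and a second XGBoost model acts as the detector. The synthetic data, top-10 threshold, and estimator counts are our assumptions, not the paper's configuration.

        import numpy as np
        from imblearn.over_sampling import SMOTE          # pip install imbalanced-learn
        from sklearn.datasets import make_classification
        from sklearn.model_selection import train_test_split
        from xgboost import XGBClassifier                 # pip install xgboost

        # Imbalanced stand-in data for a two-class intrusion detection task.
        X, y = make_classification(n_samples=2000, n_features=20,
                                   weights=[0.9, 0.1], random_state=0)
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

        # 1) Balance the minority (attack) class with SMOTE.
        X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_tr, y_tr)

        # 2) Select features by XGBoost importance (top-10 here, our choice).
        selector = XGBClassifier(n_estimators=100).fit(X_bal, y_bal)
        keep = np.argsort(selector.feature_importances_)[-10:]

        # 3) Fit the final detector on the selected features.
        clf = XGBClassifier(n_estimators=200).fit(X_bal[:, keep], y_bal)
        print("test accuracy:", clf.score(X_te[:, keep], y_te))
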
    Towards Practical Application of Deep Learning in Diagnosis of Alzheimer's Disease. (arXiv:2212.04528v1 [cs.LG])
    Accurate diagnosis of Alzheimer's disease (AD) is both challenging and time consuming. With a systematic approach for early detection and diagnosis of AD, steps can be taken towards the treatment and prevention of the disease. This study explores the practical application of deep learning models for the diagnosis of AD. Due to computational complexity, long training times and the limited availability of labelled datasets, 3D full-brain CNNs (convolutional neural networks) are not commonly used, and researchers often prefer 2D CNN variants. In this study, full-brain 3D versions of well-known 2D CNNs were designed, trained and tested for the diagnosis of various stages of AD. The deep learning approach shows good performance in differentiating various stages of AD for more than 1500 full-brain volumes. Along with classification, the deep learning model is capable of extracting features which are key in differentiating the various categories. The extracted features align with meaningful anatomical landmarks that are currently considered important in the identification of AD by experts. An ensemble of all the algorithms was also tested, and its performance was superior to any individual algorithm, further improving diagnostic ability. The 3D versions of the trained CNNs and their ensemble have the potential to be incorporated in software packages that can be used by physicians/radiologists to assist them in better diagnosis of AD.  ( 2 min )
    Deep Architectures for Content Moderation and Movie Content Rating. (arXiv:2212.04533v1 [cs.CV])
    Rating a video based on its content is an important step for classifying video age categories. Movie content rating and TV show rating are the two most common rating systems established by professional committees. However, manually reviewing and evaluating scene/film content by a committee is tedious work and becomes increasingly difficult with the ever-growing amount of online video content. As such, a desirable solution is to use computer vision based video content analysis techniques to automate the evaluation process. In this paper, related works are summarized for action recognition, multi-modal learning, movie genre classification, and sensitive content detection in the context of content moderation and movie content rating. The project page is available at https://github.com/fcakyon/content-moderation-deep-learning.  ( 2 min )
    Framewise WaveGAN: High Speed Adversarial Vocoder in Time Domain with Very Low Computational Complexity. (arXiv:2212.04532v1 [eess.AS])
    GAN vocoders are currently one of the state-of-the-art methods for building high-quality neural waveform generative models. However, most of their architectures require dozens of billions of floating-point operations per second (GFLOPS) to generate speech waveforms in a samplewise manner. This makes GAN vocoders still challenging to run on normal CPUs without accelerators or parallel computers. In this work, we propose a new architecture for GAN vocoders that mainly depends on recurrent and fully-connected networks to directly generate the time domain signal in a framewise manner. This results in a considerable reduction of the computational cost and enables very fast generation on both GPUs and low-complexity CPUs. Experimental results show that our Framewise WaveGAN vocoder achieves significantly higher quality than auto-regressive maximum-likelihood vocoders such as LPCNet at a very low complexity of 1.2 GFLOPS. This makes GAN vocoders more practical on edge and low-power devices.  ( 2 min )
    Compiler Optimization for Quantum Computing Using Reinforcement Learning. (arXiv:2212.04508v1 [quant-ph])
    Any quantum computing application, once encoded as a quantum circuit, must be compiled before being executable on a quantum computer. Similar to classical compilation, quantum compilation is a sequential process with many compilation steps and numerous possible optimization passes. Despite the similarities, the development of compilers for quantum computing is still in its infancy, lacking mutual consolidation on the best sequence of passes, compatibility, adaptability, and flexibility. In this work, we take advantage of decades of classical compiler optimization and propose a reinforcement learning framework for developing optimized quantum circuit compilation flows. Through distinct constraints and a unifying interface, the framework supports the combination of techniques from different compilers and optimization tools in a single compilation flow. Experimental evaluations show that the proposed framework, set up with a selection of compilation passes from IBM's Qiskit and Quantinuum's TKET, significantly outperforms both individual compilers in over 70% of cases regarding the expected fidelity. The framework is available on GitHub (https://github.com/cda-tum/MQTPredictor).  ( 2 min )
  • Open

    Nonlinear matrix recovery using optimization on the Grassmann manifold. (arXiv:2109.06095v2 [stat.ML] UPDATED)
    We investigate the problem of recovering a partially observed high-rank matrix whose columns obey a nonlinear structure such as a union of subspaces, an algebraic variety or grouped in clusters. The recovery problem is formulated as the rank minimization of a nonlinear feature map applied to the original matrix, which is then further approximated by a constrained non-convex optimization problem involving the Grassmann manifold. We propose two sets of algorithms, one arising from Riemannian optimization and the other as an alternating minimization scheme, both of which include first- and second-order variants. Both sets of algorithms have theoretical guarantees. In particular, for the alternating minimization, we establish global convergence and worst-case complexity bounds. Additionally, using the Kurdyka-Lojasiewicz property, we show that the alternating minimization converges to a unique limit point. We provide extensive numerical results for the recovery of union of subspaces and clustering under entry sampling and dense Gaussian sampling. Our methods are competitive with existing approaches and, in particular, high accuracy is achieved in the recovery using Riemannian second-order methods.  ( 2 min )
    PiPs: a Kernel-based Optimization Scheme for Analyzing Non-Stationary 1D Signals. (arXiv:1805.08102v3 [stat.ML] UPDATED)
    This paper proposes a novel kernel-based optimization scheme for analysis tasks on 1D non-stationary oscillatory data, e.g., signal spectral estimation and single-channel source separation. The key insight of our optimization scheme for reconstructing the time-frequency information is that when a nonparametric regression is applied on some input values, the output regressed points would lie near the oscillatory pattern of the oscillatory 1D signal only if these input values are a good approximation of the ground-truth phase function. In this work, a Gaussian Process (GP) is chosen to conduct this nonparametric regression: the oscillatory pattern is encoded as the Pattern-inducing Points (PiPs), which act as the training data points in the GP regression, while the targeted phase function is fed in to compute the correlation kernels, acting as the testing input. A better-approximated phase function generates more precise kernels, thus resulting in a smaller optimization loss when comparing the kernel-based regression output with the original signals. To the best of our knowledge, this is the first algorithm that can satisfactorily handle fully non-stationary oscillatory data, close and crossover frequencies, and general oscillatory patterns. Even in the example of a signal produced by slow variation in the parameters of a trigonometric expansion, we show that PiPs admits competitive or better performance in terms of accuracy and robustness than existing state-of-the-art algorithms.  ( 2 min )
    Variational Diffusion Models. (arXiv:2107.00630v5 [cs.LG] UPDATED)
    Diffusion-based generative models have demonstrated a capacity for perceptually impressive synthesis, but can they also be great likelihood-based models? We answer this in the affirmative, and introduce a family of diffusion-based generative models that obtain state-of-the-art likelihoods on standard image density estimation benchmarks. Unlike other diffusion-based models, our method allows for efficient optimization of the noise schedule jointly with the rest of the model. We show that the variational lower bound (VLB) simplifies to a remarkably short expression in terms of the signal-to-noise ratio of the diffused data, thereby improving our theoretical understanding of this model class. Using this insight, we prove an equivalence between several models proposed in the literature. In addition, we show that the continuous-time VLB is invariant to the noise schedule, except for the signal-to-noise ratio at its endpoints. This enables us to learn a noise schedule that minimizes the variance of the resulting VLB estimator, leading to faster optimization. Combining these advances with architectural improvements, we obtain state-of-the-art likelihoods on image density estimation benchmarks, outperforming autoregressive models that have dominated these benchmarks for many years, with often significantly faster optimization. In addition, we show how to use the model as part of a bits-back compression scheme, and demonstrate lossless compression rates close to the theoretical optimum. Code is available at https://github.com/google-research/vdm .  ( 2 min )
    Systematically and efficiently improving $k$-means initialization by pairwise-nearest-neighbor smoothing. (arXiv:2202.03949v4 [cs.LG] UPDATED)
We present a meta-method for initializing (seeding) the $k$-means clustering algorithm called PNN-smoothing. It consists of splitting a given dataset into $J$ random subsets, clustering each of them individually, and merging the resulting clusterings with the pairwise-nearest-neighbor (PNN) method. It is a meta-method in the sense that any seeding algorithm can be used when clustering the individual subsets. If the computational complexity of that seeding algorithm is linear in the size of the data $N$ and the number of clusters $k$, PNN-smoothing is also almost linear with an appropriate choice of $J$, and quite competitive in practice. We show empirically, using several existing seeding methods and testing on several synthetic and real datasets, that this procedure results in systematically better costs. In particular, our method of enhancing $k$-means++ seeding proves superior in both effectiveness and speed compared to the popular "greedy" $k$-means++ variant. Our implementation is publicly available at https://github.com/carlobaldassi/KMeansPNNSmoothing.jl.  ( 2 min )
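For a concrete sense of the procedure, here is a minimal Python sketch of PNN-smoothing, assuming k-means++ as the inner seeding method and a Ward-style weighted merge cost (an illustration, not the authors' Julia implementation):

```python
# Minimal PNN-smoothing sketch: cluster J random subsets with k-means++,
# then merge the resulting J*k weighted centroids down to k by repeatedly
# fusing the pair with the smallest weighted merge cost.
import numpy as np
from sklearn.cluster import KMeans

def pnn_smoothing_seeds(X, k, J, random_state=0):
    rng = np.random.default_rng(random_state)
    centroids, weights = [], []
    for part in np.array_split(rng.permutation(len(X)), J):
        km = KMeans(n_clusters=k, init="k-means++", n_init=1,
                    random_state=random_state).fit(X[part])
        centroids.append(km.cluster_centers_)
        weights.append(np.bincount(km.labels_, minlength=k))
    C = np.vstack(centroids).astype(float)
    w = np.concatenate(weights).astype(float)
    while len(C) > k:
        d2 = ((C[:, None, :] - C[None, :, :]) ** 2).sum(-1)
        cost = (w[:, None] * w[None, :]) / (w[:, None] + w[None, :]) * d2
        np.fill_diagonal(cost, np.inf)
        i, j = np.unravel_index(np.argmin(cost), cost.shape)
        C[i] = (w[i] * C[i] + w[j] * C[j]) / (w[i] + w[j])
        w[i] += w[j]
        C, w = np.delete(C, j, axis=0), np.delete(w, j)
    return C  # pass as init= to a final KMeans run on the full data
```

The returned centroids would then seed an ordinary k-means run, e.g. KMeans(n_clusters=k, init=pnn_smoothing_seeds(X, k, J), n_init=1).fit(X).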
    Effective Dynamics of Generative Adversarial Networks. (arXiv:2212.04580v1 [cond-mat.dis-nn])
    Generative adversarial networks (GANs) are a class of machine-learning models that use adversarial training to generate new samples with the same (potentially very complex) statistics as the training samples. One major form of training failure, known as mode collapse, involves the generator failing to reproduce the full diversity of modes in the target probability distribution. Here, we present an effective model of GAN training, which captures the learning dynamics by replacing the generator neural network with a collection of particles in the output space; particles are coupled by a universal kernel valid for certain wide neural networks and high-dimensional inputs. The generality of our simplified model allows us to study the conditions under which mode collapse occurs. Indeed, experiments which vary the effective kernel of the generator reveal a mode collapse transition, the shape of which can be related to the type of discriminator through the frequency principle. Further, we find that gradient regularizers of intermediate strengths can optimally yield convergence through critical damping of the generator dynamics. Our effective GAN model thus provides an interpretable physical framework for understanding and improving adversarial training.
    Near-Optimal Differentially Private Reinforcement Learning. (arXiv:2212.04680v1 [cs.LG])
    Motivated by personalized healthcare and other applications involving sensitive data, we study online exploration in reinforcement learning with differential privacy (DP) constraints. Existing work on this problem established that no-regret learning is possible under joint differential privacy (JDP) and local differential privacy (LDP) but did not provide an algorithm with optimal regret. We close this gap for the JDP case by designing an $\epsilon$-JDP algorithm with a regret of $\widetilde{O}(\sqrt{SAH^2T}+S^2AH^3/\epsilon)$ which matches the information-theoretic lower bound of non-private learning for all choices of $\epsilon> S^{1.5}A^{0.5} H^2/\sqrt{T}$. In the above, $S$, $A$ denote the number of states and actions, $H$ denotes the planning horizon, and $T$ is the number of steps. To the best of our knowledge, this is the first private RL algorithm that achieves \emph{privacy for free} asymptotically as $T\rightarrow \infty$. Our techniques -- which could be of independent interest -- include privately releasing Bernstein-type exploration bonuses and an improved method for releasing visitation statistics. The same techniques also imply a slightly improved regret bound for the LDP case.
    Lossy Image Compression with Conditional Diffusion Models. (arXiv:2209.06950v3 [eess.IV] UPDATED)
Denoising diffusion models have recently marked a milestone in high-quality image generation. One may thus wonder if they are suitable for neural image compression. This paper outlines an end-to-end optimized image compression framework based on a conditional diffusion model, drawing on the transform-coding paradigm. Besides the latent variables inherent to the diffusion process, this paper introduces an additional discrete ``content'' latent variable to condition the denoising process. This variable is equipped with a hierarchical prior for entropy coding. The remaining ``texture'' latent variables characterizing the diffusion process are synthesized (either stochastically or deterministically) at decoding time. We furthermore show that the performance can be tuned toward perceptual metrics of interest. Our extensive experiments involving five datasets and sixteen image quality assessment metrics show that our approach not only compares favorably in rate-perception tradeoffs but also achieves distortion performance close to that of state-of-the-art models.
    Optimal binning: mathematical programming formulation. (arXiv:2001.08025v3 [cs.LG] UPDATED)
The optimal binning is the optimal discretization of a variable into bins given a discrete or continuous numeric target. We present a rigorous and extensible mathematical programming formulation for solving the optimal binning problem for binary, continuous, and multi-class target types, incorporating constraints not previously addressed. For all three target types, we introduce a convex mixed-integer programming formulation. Several algorithmic enhancements, such as automatic determination of the most suitable monotonic trend via a machine-learning-based classifier, as well as implementation aspects, are thoughtfully discussed. The new mathematical programming formulations are carefully implemented in the open-source Python library OptBinning.
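A minimal usage sketch with the OptBinning library for a binary target (the interface is assumed from the library's documentation and may differ across versions):

```python
# Hedged example of optimal binning on a synthetic numeric feature.
import numpy as np
from optbinning import OptimalBinning

rng = np.random.default_rng(0)
x = rng.normal(size=1000)                       # numeric feature
y = (x + 0.5 * rng.normal(size=1000) > 0) * 1   # binary target

optb = OptimalBinning(name="x", dtype="numerical", solver="cp",
                      monotonic_trend="auto")   # trend picked automatically
optb.fit(x, y)
print(optb.splits)                              # optimal bin edges
x_woe = optb.transform(x, metric="woe")         # weight-of-evidence encoding
```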
    Universal Regular Conditional Distributions. (arXiv:2105.07743v4 [cs.LG] UPDATED)
We introduce a deep learning model that can universally approximate regular conditional distributions (RCDs). The proposed model operates in three phases: first, it linearizes inputs from a given metric space $\mathcal{X}$ to $\mathbb{R}^d$ via a feature map; next, a deep feedforward neural network processes these linearized features; finally, the network's outputs are transformed to the $1$-Wasserstein space $\mathcal{P}_1(\mathbb{R}^D)$ via a probabilistic extension of the attention mechanism of Bahdanau et al.\ (2014). Our model, called the \textit{probabilistic transformer (PT)}, can approximate any continuous function from $\mathbb{R}^d $ to $\mathcal{P}_1(\mathbb{R}^D)$ uniformly on compact sets, quantitatively. We identify two ways in which the PT avoids the curse of dimensionality when approximating $\mathcal{P}_1(\mathbb{R}^D)$-valued functions. The first strategy builds functions in $C(\mathbb{R}^d,\mathcal{P}_1(\mathbb{R}^D))$ which can be efficiently approximated by a PT, uniformly on any given compact subset of $\mathbb{R}^d$. In the second approach, given any function $f$ in $C(\mathbb{R}^d,\mathcal{P}_1(\mathbb{R}^D))$, we build compact subsets of $\mathbb{R}^d$ whereon $f$ can be efficiently approximated by a PT.
    A Topological Deep Learning Framework for Neural Spike Decoding. (arXiv:2212.05037v1 [cs.NE])
The brain's spatial orientation system uses different neuron ensembles to aid in environment-based navigation. One of the ways brains encode spatial information is through grid cells, layers of decked neurons that overlay to provide environment-based navigation. These neurons fire in ensembles where several neurons fire at once to activate a single grid. We want to capture this firing structure and use it to decode grid cell data. Understanding, representing, and decoding these neural structures requires models that encompass higher-order connectivity than traditional graph-based models may provide. To that end, in this work, we develop a topological deep learning framework for neural spike train decoding. Our framework combines unsupervised simplicial complex discovery with the power of deep learning via a new architecture we develop herein called a simplicial convolutional recurrent neural network (SCRNN). Simplicial complexes, topological spaces that use not only vertices and edges but also higher-dimensional objects, naturally generalize graphs and capture more than just pairwise relationships. Additionally, this approach does not require prior knowledge of the neural activity beyond spike counts, which removes the need for similarity measurements. The effectiveness and versatility of the SCRNN are demonstrated on head direction data to test its performance, and it is then applied to grid cell datasets with the task of automatically predicting trajectories.  ( 2 min )
    Robust detection and attribution of climate change under interventions. (arXiv:2212.04905v1 [stat.ML])
    Fingerprints are key tools in climate change detection and attribution (D&A) that are used to determine whether changes in observations are different from internal climate variability (detection), and whether observed changes can be assigned to specific external drivers (attribution). We propose a direct D&A approach based on supervised learning to extract fingerprints that lead to robust predictions under relevant interventions on exogenous variables, i.e., climate drivers other than the target. We employ anchor regression, a distributionally-robust statistical learning method inspired by causal inference that extrapolates well to perturbed data under the interventions considered. The residuals from the prediction achieve either uncorrelatedness or mean independence with the exogenous variables, thus guaranteeing robustness. We define D&A as a unified hypothesis testing framework that relies on the same statistical model but uses different targets and test statistics. In the experiments, we first show that the CO2 forcing can be robustly predicted from temperature spatial patterns under strong interventions on the solar forcing. Second, we illustrate attribution to the greenhouse gases and aerosols while protecting against interventions on the aerosols and CO2 forcing, respectively. Our study shows that incorporating robustness constraints against relevant interventions may significantly benefit detection and attribution of climate change.  ( 2 min )
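For intuition, here is a minimal numpy sketch of the generic anchor regression estimator via its standard data transformation (an illustration of the method class, not the authors' climate pipeline; the anchors A and penalty gamma are assumptions):

```python
# Anchor regression sketch: transform the data along the anchor
# directions, then run ordinary least squares on the transformed data.
import numpy as np

def anchor_regression(X, y, A, gamma):
    P = A @ np.linalg.pinv(A.T @ A) @ A.T        # projection onto anchors
    shrink = 1.0 - np.sqrt(gamma)
    Xt = X - shrink * (P @ X)                    # (I - (1 - sqrt(g)) P_A) X
    yt = y - shrink * (P @ y)
    beta, *_ = np.linalg.lstsq(Xt, yt, rcond=None)
    return beta
```

With gamma = 1 this reduces to plain OLS; larger gamma increasingly penalizes residual correlation with the anchors, trading in-sample fit for robustness to interventions on them.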
    Robustness Implies Privacy in Statistical Estimation. (arXiv:2212.05015v1 [cs.DS])
    We study the relationship between adversarial robustness and differential privacy in high-dimensional algorithmic statistics. We give the first black-box reduction from privacy to robustness which can produce private estimators with optimal tradeoffs among sample complexity, accuracy, and privacy for a wide range of fundamental high-dimensional parameter estimation problems, including mean and covariance estimation. We show that this reduction can be implemented in polynomial time in some important special cases. In particular, using nearly-optimal polynomial-time robust estimators for the mean and covariance of high-dimensional Gaussians which are based on the Sum-of-Squares method, we design the first polynomial-time private estimators for these problems with nearly-optimal samples-accuracy-privacy tradeoffs. Our algorithms are also robust to a constant fraction of adversarially-corrupted samples.  ( 2 min )
    Doubly Robust Kernel Statistics for Testing Distributional Treatment Effects Even Under One Sided Overlap. (arXiv:2212.04922v1 [stat.ML])
As causal inference becomes more widespread, the importance of having good tools to test for causal effects increases. In this work we focus on the problem of testing for causal effects that manifest in a difference in distribution for treatment and control. We build on work applying kernel methods to causality, considering the previously introduced Counterfactual Mean Embedding framework (\textsc{CfME}). We improve on this by proposing the \emph{Doubly Robust Counterfactual Mean Embedding} (\textsc{DR-CfME}), which has better theoretical properties than its predecessor by leveraging semiparametric theory. This leads us to propose new kernel-based test statistics for distributional effects which are based upon doubly robust estimators of treatment effects. We propose two test statistics, one which is a direct improvement on previous work and one which can be applied even when the support of the treatment arm is a subset of that of the control arm. We demonstrate the validity of our methods on simulated and real-world data, as well as giving an application in off-policy evaluation.  ( 2 min )
    Estimating a Directed Tree for Extremes. (arXiv:2102.06197v3 [stat.ML] UPDATED)
    The Extremal River Problem has emerged as a flagship problem for causal discovery in extreme values of a network. The task is to recover a river network from only extreme flow measured at a set $V$ of stations, without any information on the stations' locations. We present QTree, a new simple and efficient algorithm to solve the Extremal River Problem that performs very well compared to existing methods on hydrology data and in simulations. QTree returns a root-directed tree and achieves almost perfect recovery on the Upper Danube network data, the existing benchmark data set, as well as on new data from the Lower Colorado River network in Texas. It can handle missing data, has an automated parameter tuning procedure, and runs in time $O(n |V|^2)$, where $n$ is the number of observations and $|V|$ the number of nodes in the graph. Furthermore, we prove that the QTree estimator is consistent under a Bayesian network model for extreme values with noise. We also assess the small sample behaviour of QTree through simulations and detail the strengths and possible limitations of QTree.  ( 2 min )
    Ambiguous Dynamic Treatment Regimes: A Reinforcement Learning Approach. (arXiv:2112.04571v3 [cs.LG] UPDATED)
A main research goal in various studies is to use an observational data set and provide a new set of counterfactual guidelines that can yield causal improvements. Dynamic Treatment Regimes (DTRs) are widely studied to formalize this process. However, available methods in finding optimal DTRs often rely on assumptions that are violated in real-world applications (e.g., medical decision-making or public policy), especially when (a) the existence of unobserved confounders cannot be ignored, and (b) the unobserved confounders are time-varying (e.g., affected by previous actions). When such assumptions are violated, one often faces ambiguity regarding the underlying causal model. This ambiguity is inevitable, since the dynamics of unobserved confounders and their causal impact on the observed part of the data cannot be understood from the observed data. Motivated by a case study of finding superior treatment regimes for patients who underwent transplantation in our partner hospital and faced a medical condition known as New Onset Diabetes After Transplantation (NODAT), we extend DTRs to a new class termed Ambiguous Dynamic Treatment Regimes (ADTRs), in which the causal impact of treatment regimes is evaluated based on a "cloud" of causal models. We then connect ADTRs to Ambiguous Partially Observable Markov Decision Processes (APOMDPs) and develop Reinforcement Learning methods, which enable using the observed data to efficiently learn an optimal treatment regime. We establish theoretical results for these learning methods, including (weak) consistency and asymptotic normality. We further evaluate the performance of these learning methods both in our case study and in simulation experiments.  ( 2 min )
    Probabilistically Robust PAC Learning. (arXiv:2211.05656v3 [cs.LG] UPDATED)
    Recently, Robey et al. propose a notion of probabilistic robustness, which, at a high-level, requires a classifier to be robust to most but not all perturbations. They show that for certain hypothesis classes where proper learning under worst-case robustness is \textit{not} possible, proper learning under probabilistic robustness \textit{is} possible with sample complexity exponentially smaller than in the worst-case robustness setting. This motivates the question of whether proper learning under probabilistic robustness is always possible. In this paper, we show that this is \textit{not} the case. We exhibit examples of hypothesis classes $\mathcal{H}$ with finite VC dimension that are \textit{not} probabilistically robustly PAC learnable with \textit{any} proper learning rule. However, if we compare the output of the learner to the best hypothesis for a slightly \textit{stronger} level of probabilistic robustness, we show that not only is proper learning \textit{always} possible, but it is possible via empirical risk minimization.  ( 2 min )
    Unsupervised Discretization by Two-dimensional MDL-based Histogram. (arXiv:2006.01893v4 [cs.LG] UPDATED)
Unsupervised discretization is a crucial step in many knowledge discovery tasks. The state-of-the-art method for one-dimensional data infers locally adaptive histograms using the minimum description length (MDL) principle, but the multi-dimensional case is far less studied: current methods consider the dimensions one at a time (if not independently), which results in discretizations based on rectangular cells of adaptive size. Unfortunately, this approach is unable to adequately characterize dependencies among dimensions and/or results in discretizations consisting of more cells (or bins) than is desirable. To address this problem, we propose an expressive model class that allows for far more flexible partitions of two-dimensional data. We extend the state of the art for the one-dimensional case to obtain a model selection problem based on the normalized maximum likelihood, a form of refined MDL. As the flexibility of our model class comes at the cost of a vast search space, we introduce a heuristic algorithm, named PALM, which Partitions each dimension ALternately and then Merges neighboring regions, all using the MDL principle. Experiments on synthetic data show that PALM 1) accurately reveals ground truth partitions that are within the model class (i.e., the search space), given a large enough sample size; 2) approximates well a wide range of partitions outside the model class; 3) converges, in contrast to the state-of-the-art multivariate discretization method IPD. Finally, we apply our algorithm to three spatial datasets, and we demonstrate that, compared to kernel density estimation (KDE), our algorithm not only reveals more detailed density changes, but also fits unseen data better, as measured by the log-likelihood.  ( 3 min )
    Attention in a family of Boltzmann machines emerging from modern Hopfield networks. (arXiv:2212.04692v1 [cs.LG])
Hopfield networks and Boltzmann machines (BMs) are fundamental energy-based neural network models. Recent studies on modern Hopfield networks have broadened the class of energy functions and led to a unified perspective on general Hopfield networks, including an attention module. In this letter, we consider the BM counterparts of modern Hopfield networks using the associated energy functions, and study their salient properties from a trainability perspective. In particular, the energy function corresponding to the attention module naturally introduces a novel BM, which we refer to as the attentional BM (AttnBM). We verify that AttnBM has a tractable likelihood function and gradient for a special case and is easy to train. Moreover, we reveal the hidden connections between AttnBM and some single-layer models, namely the Gaussian--Bernoulli restricted BM and the denoising autoencoder with softmax units. We also investigate BMs introduced by other energy functions, and in particular, observe that the energy function of dense associative memory models gives BMs belonging to Exponential Family Harmoniums.  ( 2 min )
    Primal Dual Alternating Proximal Gradient Algorithms for Nonsmooth Nonconvex Minimax Problems with Coupled Linear Constraints. (arXiv:2212.04672v1 [math.OC])
Nonconvex minimax problems have attracted wide attention in machine learning, signal processing and many other fields in recent years. In this paper, we propose a primal dual alternating proximal gradient (PDAPG) algorithm and a primal dual proximal gradient (PDPG-L) algorithm for solving nonsmooth nonconvex-strongly concave and nonconvex-linear minimax problems with coupled linear constraints, respectively. The corresponding iteration complexities of the two algorithms are proved to be $\mathcal{O}\left( \varepsilon ^{-2} \right)$ and $\mathcal{O}\left( \varepsilon ^{-3} \right)$ to reach an $\varepsilon$-stationary point, respectively. To our knowledge, they are the first two algorithms with iteration complexity guarantees for solving these two classes of minimax problems.  ( 2 min )
    Simulating first-order phase transition with hierarchical autoregressive networks. (arXiv:2212.04955v1 [cond-mat.stat-mech])
We apply the Hierarchical Autoregressive Neural (HAN) network sampling algorithm to the two-dimensional $Q$-state Potts model and perform simulations around the phase transition at $Q=12$. We quantify the performance of the approach in the vicinity of the first-order phase transition and compare it with that of the Wolff cluster algorithm. We find a significant improvement as far as the statistical uncertainty is concerned, at a similar numerical effort. In order to efficiently train large neural networks, we introduce the technique of pre-training. It allows training some neural networks on smaller system sizes and then employing them as starting configurations for larger system sizes. This is possible due to the recursive construction of our hierarchical approach. Our results serve as a demonstration of the performance of the hierarchical approach for systems exhibiting bimodal distributions. Additionally, we provide estimates of the free energy and entropy in the vicinity of the phase transition, with statistical uncertainties of the order of $10^{-7}$ for the former and $10^{-3}$ for the latter, based on a statistics of $10^6$ configurations.  ( 2 min )

  • Open

    AI generated voice of Morgan Freeman
    Leopartnik’s Instagram story right now shows an AI-generated Morgan Freeman voice. What program or software did he use to create that? https://instagram.com/stories/leopartnik/2991789020620603199?utm_source=ig_story_item_share&igshid=NTdlMDg3MTY= submitted by /u/Rachel_reddit_ [link] [comments]  ( 47 min )
    Book writing
I am working on a motivational book and I was curious if there is an AI that can assist with writing a book? submitted by /u/LuckyAppointment8071 [link] [comments]  ( 47 min )
    Challenges with data collection & annotation
    What are the data collection & annotation challenges that you face at your workplace & how do you resolve them? submitted by /u/Sixo60 [link] [comments]  ( 47 min )
    Where to look for solutions?
I’m a developer who is interested in creating products in a couple of different problem spaces. I want to investigate if AI can help me solve these problems; my background is not in AI. To be clear, I have problems that I want to solve; this is not a "solution in search of a problem" scenario. Where do I start? What are some good ways to sort through the different AI info out there to find solutions that may address the problems I'm interested in? Thanks submitted by /u/daed_murphy [link] [comments]  ( 48 min )
    Do you want to create the future of artificial intelligence? #HACKATHON
🥳 Do you want to create the future of artificial intelligence? Then join us for the AI Testing Hackathon where we experiment with new ways of experimenting on artificial agents! https://itch.io/jam/aitest The hackathon lasts for 48 hours of intense research hacking and you're encouraged to join with a group. 📣 We will introduce the topic with a domain expert's (speaker info coming!) talk 🔬 You can join even without experience as we'll share some amazing starter resources 🍕 We will have free food at the event 🏆 There is a prize pool of $2000 up for grabs to the best projects along with a random participation award of $200! Join our discord server for updates and to connect with other participants: https://discord.gg/wH59Fg4FCa submitted by /u/szsabs [link] [comments]  ( 55 min )
    Checkmate
    submitted by /u/LogicalFella [link] [comments]  ( 48 min )
    Investment in generative AI has increased 425% since 2020
    submitted by /u/Mk_Makanaki [link] [comments]  ( 52 min )
    AI on metro?
Has no one thought of installing an AI network to run the metro? There are still conductors operating the metro for a seemingly simple task. Or is there some difficulty that I'm not aware of? submitted by /u/Dazzling_Swordfish14 [link] [comments]  ( 47 min )
    What AIs allow you to generate violent comic book styles?
    I'd like to make a war comic plz and Midjourney didn't like the pew pew. Stable Journey is too processor intensive. Anything exist that's Midjourney but with violence? submitted by /u/Professional7Account [link] [comments]  ( 48 min )
    Asking ChatGPT to automate itself easter egg :)
    submitted by /u/niicii77 [link] [comments]  ( 50 min )
    5 New Robots That Will Blow Your Mind #1
    submitted by /u/EnvironmentalMap5 [link] [comments]  ( 48 min )
Is it possible to use a bot that scrapes the web's images and uses them to train my AI?
    I mean, if I could get Stable Diffusion and then add all the images from the web, that would be amazing, right? submitted by /u/Thesmallcookie [link] [comments]  ( 49 min )
    The common traits of successful MLOps
    submitted by /u/bendee983 [link] [comments]  ( 71 min )
    Chatbot Requirements: Technical & Non-technical Things to Consider when everyone talks about ChatGPT
Hi there! Just want to share some tips on how to craft the right chatbot when everyone talks about ChatGPT. First of all, a custom chatbot company or any chatbot platform that does custom integration can integrate your chatbot with ChatGPT instead of Dialogflow. So yes, you can have an outstanding customer service chatbot that can handle other topics. However, the right question is: should you? If you want a chatbot that solves issues rather than creating more, you must start with the proper requirements. Well-structured chatbot requirements lay the right foundation for your future chatbot development. ChatGPT is just one of the options for how you can use AI and automation, and it may not be the best depending on your budget and goals. Your chatbot requirements should include these steps: - …  ( 51 min )
    Made an AI generating adult stories and images. Support the author, I want to make a site with generation for everyone!
    https://i.imgur.com/rb5ZREK.png The vampire was searching for a way to break the curse that had been placed on her. She had tried everything, from potions to spells. Nothing seemed to work. One night, while wandering the castle, she stumbled upon a secret chamber. Inside the chamber, she found a mysterious chest. When she opened it, she discovered a beautiful set of lingerie: a pink bra, bow panties, and a narrow waist cincher. The vampire was delighted by the find. She quickly changed into the lingerie and admired her reflection in the mirror. Her cleavage and navel were revealed, and her fangs peeked out from her lips. She felt beautiful and powerful. In the background, a faint aqua-colored light glowed. The vampire felt drawn to it and stepped closer. Suddenly, the light intensified, and the vampire found herself surrounded by a bright blue background. It was then that the vampire realized the lingerie was enchanted. She had been freed from the curse! The lingerie had been a gift from a secret admirer, and the light was a sign of the vampire’s newfound freedom. The vampire celebrated her newfound freedom and vowed to use her powers for good. She often wore the lingerie under her clothes as a reminder of her special night. The lingerie was now her signature look, and she was often seen wearing it with a simple background. And so, the vampire lived happily ever after, her fangs and lingerie forever a reminder of her magical night. submitted by /u/Top_Werewolf_259 [link] [comments]  ( 47 min )
CAN CHATGPT REALLY REPLACE PROGRAMMERS? WHY PROGRAMMERS ON REDDIT CAN'T STOP TALKING ABOUT IT
    submitted by /u/letsgoooz [link] [comments]  ( 48 min )
    What are your thoughts on this?
https://www.psychoftech.org/blog/2020/12/6/can-algorithms-suffer Of course, it will require us to understand qualia, consciousness, and their mechanisms to know whether reinforcement-learning-based artificial agents can experience suffering. For this we need to understand what causes us to experience, and whether artificial intelligence can gain consciousness and possibly suffer due to reinforcement learning. Basically, we need a theory of consciousness and probably some other ways to confirm whether there is any sense of experience that artificial agents might have developed. But if there is a possibility of artificial intelligence gaining consciousness, then we need to make sure we don't cause suffering to such artificial-intelligence-based agents. submitted by /u/Beginning_Piano_7536 [link] [comments]  ( 46 min )
    It's not Neo it's Leo (created with Stable Diffusion 2.1)
    submitted by /u/R_K_J-DK [link] [comments]  ( 47 min )
Now that AI is getting more and more advanced, what skills should I learn now?
Now that I have graduated from college, I am so clueless about what path to move forward in without being afraid of being laid off later on. Seeing how good ChatGPT and AI art have become, I don't even know what skill to master, because there may be a robot better than me at that same skill. There are data scientists and software engineers who gave ChatGPT a prompt/code to find an error and give a detailed analysis/solution to that prompt. So far, it has been scarily good. I see artists living on the edge, not knowing when their branding or graphic design job will be taken away from them. ChatGPT may take over journalism. I am ready to learn and adapt. I just want to know where to start. The idea of investing my time in learning the "wrong" thing that would eventually turn obsolete is making me confused. submitted by /u/ruchi_rich_ [link] [comments]  ( 51 min )
    Does anyone know the name of the AI image generator that you’d upload an image to and it created some visually similar images?
    I’ve used it before but I can’t find it again. submitted by /u/Jacarandas01 [link] [comments]  ( 46 min )
    What do professors think about ChatGPT?
Being a college student and using this tool to help me study for finals, I have found a huge increase in productivity. For example, it helped me create study guides in seconds, rather than hours. I am wondering what university professors think about this and if anyone has any predictions about what schools will look like now that we have an AI that can literally give you all the answers to a given test with very high accuracy! I'm especially interested in us CS students, who can greatly benefit from such a tool. How would professors impose restrictions on this tool for projects or homework? submitted by /u/Alone_Consequence_97 [link] [comments]  ( 54 min )
    A new dating app from Google matches people based on their search history, would you try it?
A person signs up with their Google account and the app pulls their search history and tries to match them with a compatible person, using key search words and a deep-learning algorithm that pulls common threads together from their words. The twist? The couple are never told what searched word(s) made them match, as an incentive to encourage meaningful conversation when they start to talk. Would you try this? submitted by /u/SwimGood22 [link] [comments]  ( 49 min )
    AI translation tool??
    Does anyone know of an AI translation tool that can translate voice from English to another language? I want the result to also be a realistic voice in the other language. For example, I want to talk into something using English and I want the AI to SAY what I am saying in Spanish/other languages. submitted by /u/Desparate_Nobody [link] [comments]  ( 45 min )
    What happens now?
After using ChatGPT, I feel like I'm using technology from ten years in the future. It's fucking mind-blowing. I'm almost certain that in the near future, AI will be better than any human at any skilled task. AI will have the ability to start companies, make amazing original music, movies, video games, etc. I'm not trying to be pessimistic or dramatic or anything; I'm just making a claim based on what I know and how fast I've seen this technology develop. Then you start to wonder: if, in society, we value skill, and we pay money for skilled work and time, what happens when there is no longer a scarcity of skill, and a machine can do in seconds what it would take an MIT graduate hours, days, months, or even years? How will our society operate? I actually don't know. I think it's very probable that everything will fall apart. I'm genuinely kind of worried. submitted by /u/IAmReedHello [link] [comments]  ( 46 min )
  • Open

    Anatomy of a Use Case
    Anatomy is the study of the structure and internal workings of an entity. There are likely many ways that organizations can determine the value of their data.  But I’m lazy, so I only teach and employ one that I know works – a use case approach that attributes the value of the organization’s data in… Read More »Anatomy of a Use Case The post Anatomy of a Use Case appeared first on Data Science Central.  ( 21 min )
    Top 5 Industries That Will Be Transformed By Automation This Decade
    The benefits of automation are improved efficiency, productivity, and quality while reducing costs. But there are also potential drawbacks, such as job loss and reduced flexibility. Therefore, when deciding whether or not to automate a task, it’s essential to weigh the pros and cons carefully. In this article, we want to consider how automation affects… Read More »Top 5 Industries That Will Be Transformed By Automation This Decade The post Top 5 Industries That Will Be Transformed By Automation This Decade appeared first on Data Science Central.  ( 22 min )
  • Open

    Cool progression of AI Learning to work together
    submitted by /u/ComputePls [link] [comments]  ( 56 min )
    "Learning Synthetic Environments and Reward Networks for Reinforcement Learning", Ferreira et al 2022
    submitted by /u/gwern [link] [comments]  ( 54 min )
    Let's build an Autonomous Taxi 🚖 using Q-Learning (Deep Reinforcement Learning Free Course by Hugging Face 🤗)
Hey there! I’m happy to announce that we just published the second Unit of the Deep Reinforcement Learning Course 🥳 In this Unit, we're going to dive deeper into one of the Reinforcement Learning methods: value-based methods, and study our first RL algorithm: Q-Learning. We'll also implement our first RL agent from scratch: a Q-Learning agent, and will train it in two environments and share it with the community: An autonomous taxi 🚕 will need to learn to navigate a city to transport its passengers from point A to point B. Frozen-Lake-v1 ⛄ (non-slippery version): where our agent will need to go from the starting state to the goal state by walking only on frozen tiles and avoiding holes. You’ll be able to compare the results of your Q-Learning agent using our leaderboard 🏆 The Unit 👉 https://huggingface.co/deep-rl-course/unit2/introduction If you didn’t sign up yet, don’t worry, there’s still time; we wrote an introduction unit to help you get started. You can start learning now 👉 https://huggingface.co/deep-rl-course/unit0/introduction If you have questions or feedback, I would love to hear them 🤗 submitted by /u/cranthir_ [link] [comments]  ( 57 min )
    "PALMER: Perception-Action Loop with Memory for Long-Horizon Planning", Becker et al 2022 (planning over sequences of latent states)
    submitted by /u/gwern [link] [comments]  ( 55 min )
What are the different techniques that can be used in Reinforcement Learning, such as Policy Gradient methods, Q-learning, etc., to learn parameters and hyperparameters? I have seen gradient descent being used to learn parameters. Are there any other methods? Any text/reference would be a great help. Thanks
Are there any reference texts for parameter estimation in RL algorithms? submitted by /u/aabra__ka__daabra [link] [comments]  ( 56 min )
    "Phone2Proc: Bringing Robust Robots Into Our Chaotic World", Deitke et al 2022 {Allen} (scanning specific rooms for heavy data augmentation to improve sim2real)
    submitted by /u/gwern [link] [comments]  ( 54 min )
    What does no-op mean?
I am reading the documentation of DQN Zoo ("https://github.com/deepmind/dqn_zoo") and came across the following paragraph: "Plots show the average score at periodic evaluation phases during training. Each episode during evaluation starts with up to 30 random no-op actions and lasts a maximum of 30 minutes. To make the plots more readable, scores have been smoothed using a moving average with window size 10." I have the following questions: 1) To gauge training quality, they perform periodic evaluations. Am I right here? 2) What does no-op mean over here? submitted by /u/Academic-Rent7800 [link] [comments]  ( 58 min )
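For context: yes, the periodic evaluations gauge training progress, and "no-op" is the do-nothing action (action index 0 in the Atari action set); starting each evaluation episode with a random number of no-ops randomizes the initial state so agents cannot memorize a single trajectory. A hedged sketch, assuming the Gymnasium ALE interface:

```python
# Random no-op starts for Atari evaluation; action 0 is NOOP in ALE.
import numpy as np
import gymnasium as gym

env = gym.make("ALE/Pong-v5")
obs, info = env.reset()
for _ in range(np.random.randint(1, 31)):        # up to 30 no-ops
    obs, reward, terminated, truncated, info = env.step(0)
    if terminated or truncated:
        obs, info = env.reset()
```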
  • Open

    Image augmentation pipeline for Amazon Lookout for Vision
    Amazon Lookout for Vision provides a machine learning (ML)-based anomaly detection service to identify normal images (i.e., images of objects without defects) vs anomalous images (i.e., images of objects with defects), types of anomalies (e.g., missing piece), and the location of these anomalies. Therefore, Lookout for Vision is popular among customers that look for automated […]  ( 16 min )
    Amazon SageMaker JumpStart now offers Amazon Comprehend notebooks for custom classification and custom entity detection
    Amazon Comprehend is a natural language processing (NLP) service that uses machine learning (ML) to discover insights from text. Amazon Comprehend provides customized features, custom entity recognition, custom classification, and pre-trained APIs such as key phrase extraction, sentiment analysis, entity recognition, and more so you can easily integrate NLP into your applications. We recently added […]  ( 8 min )
  • Open

    PyTorch 2.0 release explained
    submitted by /u/mazib [link] [comments]  ( 56 min )
  • Open

    [Discussion] Amazon's AutoML vs. open source statistical methods
TL;DR: We paid $800 USD and spent 4 hours in the AWS Forecast console so you don't have to. In this reproducible experiment, we compare Amazon Forecast and StatsForecast, a Python open-source library for statistical methods. Since AWS Forecast specializes in demand forecasting, we selected the M5 competition dataset as a benchmark; the dataset contains 30,490 series of daily Walmart sales. We found that Amazon Forecast is 60% less accurate and 669 times more expensive than running an open-source alternative on a simple cloud server. We also provide a step-by-step guide to reproduce the results. Results: Amazon Forecast achieved 1.617 in error (measured in wRMSSE, the official evaluation metric used in the competition), took 4.1 hours to run, and cost 803.53 USD. An ensemble of statistical methods trained on a c5d.24xlarge EC2 instance achieved 0.669 in error (wRMSSE), took 14.5 minutes to run, and cost only 1.2 USD. For this dataset, we show, therefore, that: Amazon Forecast is 60% less accurate and 669 times more expensive than running an open-source alternative on a simple cloud server. Classical methods outperform machine learning methods in terms of speed, accuracy, and cost. Although using StatsForecast requires some basic knowledge of Python and cloud computing, the results are better for this dataset. submitted by /u/fedegarzar [link] [comments]  ( 71 min )
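For reference, a minimal sketch of the open-source side of the comparison, assuming StatsForecast's documented interface (the model choice and data path here are illustrative, not the exact ensemble or files from the benchmark; argument placement varies across library versions):

```python
# Illustrative StatsForecast usage; df needs columns unique_id, ds, y.
import pandas as pd
from statsforecast import StatsForecast
from statsforecast.models import AutoETS, SeasonalNaive

df = pd.read_parquet("m5_daily_sales.parquet")   # hypothetical file
sf = StatsForecast(models=[AutoETS(season_length=7),
                           SeasonalNaive(season_length=7)],
                   freq="D", n_jobs=-1)
forecasts = sf.forecast(df=df, h=28)             # 28-day horizon, as in M5
```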
    [D] G. Hinton proposes FF – an alternative to Backprop
    Details in the twitter thread: https://twitter.com/martin_gorner/status/1599755684941557761 submitted by /u/mrx-ai [link] [comments]  ( 68 min )
    [Project] hlb-CIFAR10 – 94% accuracy in 18.1 seconds on an A100, ground-up hackable codebase built for idea prototyping
    To Try it Now: If you want to get started straightaway, you trust my code and you're setup on CUDA (or have a virtual machine/Colab machine/etc), in the terminal, you can just run git clone https://github.com/tysam-code/hlb-CIFAR10 && cd hlb-CIFAR10 && python -m pip install -r requirements.txt && python main.py, and you should see things training straightaway. If you just want to browse the repo, feel free to head on over to https://github.com/tysam-code/hlb-CIFAR10. About the Project: Hi there, I've been working in the modern instantiation of this field for just over half a decade or so, and have always sort of wanted a good testbench to develop neural network research on. Most of the codebases I've worked with over the years have been very ad-hoc or not very close to the bleeding edg…  ( 64 min )
    [D] Global average pooling wrt channel dimensions
Does global average pooling work if the spatial dimensions of the feature map are big, like 64x64? I feel that with a map that big, averaging to one value per channel will lose all the information and the model will find it very hard to learn from it, unlike with small spatial dimensions. submitted by /u/Ananth_A_007 [link] [comments]  ( 64 min )
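For what it's worth, global average pooling collapses only the spatial map; the channel dimension is preserved, so a 64x64 map becomes one value per channel. A quick PyTorch illustration:

```python
# Global average pooling: each channel's HxW map (e.g. 64x64) is averaged
# to a single value, leaving one number per channel.
import torch
import torch.nn as nn

x = torch.randn(8, 512, 64, 64)                  # (batch, channels, H, W)
pooled = nn.AdaptiveAvgPool2d(1)(x).flatten(1)   # -> (8, 512)
same = x.mean(dim=(2, 3))                        # equivalent one-liner
assert torch.allclose(pooled, same)
```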
[P] LORA Dreambooth - fine-tune Stable Diffusion models twice as fast as the Dreambooth method, with smaller model sizes of 3-4 MB
    submitted by /u/Illustrious_Row_9971 [link] [comments]  ( 62 min )
    Mapping 3D scenes from monocular videos. [R]
I am working on mapping 3D scenes from monocular videos and wanted some help on how I can use the pretrained NeuralRecon (https://zju3dv.github.io/neuralrecon/) on the ScanNet dataset to reconstruct the local scene mesh around a human body. For the human body data I am using the 3DPW (https://virtualhumans.mpi-inf.mpg.de/3DPW/) dataset. submitted by /u/MaintenanceNo5993 [link] [comments]  ( 64 min )
  • Open

    Microsoft Soundscape – New Horizons with a Community-Driven Approach
    For more than six years, Microsoft Research has been honored to develop the Soundscape research project, which was designed to deliver information about a person’s location and points of interest and has guided individuals to desired places and in unfamiliar spaces using augmented-reality and three-dimensional audio. While not a traditional turn-by-turn navigation mobile app, the […] The post Microsoft Soundscape – New Horizons with a Community-Driven Approach appeared first on Microsoft Research.  ( 11 min )
  • Open

    Conformal map from rectangles to half plane
    As discussed in the previous post, the Jacobi elliptic function sn(z, m) is doubly periodic in the complex plane, with period 4K(m) in the horizontal direction and period 2K(1-m) in the vertical direction. Here K is the complete elliptic integral of the first kind. The function sn(z, m) maps the rectangle (-K(m), K(m)) × (0, K(1-m)) […] Conformal map from rectangles to half plane first appeared on John D. Cook.  ( 6 min )
    Solve for Jacobi sn parameter to have given period(s)
    Here’s a calculation that I’ve had to do a few times. I’m writing it up here for my future reference and for the benefit of anybody else who needs to do the same calculation. The Jacobi sn function is doubly periodic: it has one period as you move along the real axis and another period […] Solve for Jacobi sn parameter to have given period(s) first appeared on John D. Cook.  ( 6 min )
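A small sketch of the kind of calculation the post describes, assuming SciPy's parameter convention (scipy.special.ellipk(m) computes K(m)): solve for m so that the ratio of the horizontal period 4K(m) to the vertical period 2K(1-m) equals a target r.

```python
# Solve 4*K(m) / (2*K(1-m)) = r for the Jacobi sn parameter m.
from scipy.special import ellipk
from scipy.optimize import brentq

def solve_m(r):
    f = lambda m: 2.0 * ellipk(m) / ellipk(1.0 - m) - r
    return brentq(f, 1e-12, 1.0 - 1e-12)   # f is monotone increasing in m

m = solve_m(2.0)   # e.g. horizontal period twice the vertical one
print(m, 4 * ellipk(m), 2 * ellipk(1 - m))
```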

  • Open

    "Learning Representations for Pixel-based Control: What Matters and Why?", Tomar et al 2021
    submitted by /u/gwern [link] [comments]  ( 55 min )
    "Habitat: A Platform for Embodied AI Research", Savva et al 2019 {FB}
    submitted by /u/gwern [link] [comments]  ( 58 min )
    Object detection on Atari 2600 Games
So, I am planning on incorporating an object detection model to solve Atari games, not incorporating RL strategies straightaway. The reason I am posting in this subreddit is to get feedback from people who train models on Atari environments. I am unable to find any supportive labelled dataset for any of the Atari games (obviously the non-trivial ones). Currently, as per my findings, Montezuma's Revenge is one of the hardest environments, so I went ahead with creating an object detection model for it, but could not proceed without first creating a dataset. I am an undergrad student and I am planning to create an object-detection-based Atari game solver, similar to sentdex's well-known GTA-5 gameplay video. But I want to do it with Atari games because apparently no one has incorporated proper object detection in these games. submitted by /u/jashAcharjee [link] [comments]  ( 58 min )
    In prioritized experience replay how do we handle old experiences getting pushed off the buffer?
I am trying to implement prioritized experience replay with segment trees, and I had a question: if old experiences get pushed off the replay buffer, do we need to update every item in the segment tree each time this happens? How do we handle this? submitted by /u/ImNotKevPlayz [link] [comments]  ( 57 min )
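Short answer, offered as a sketch rather than the one true design: no. In the usual sum-tree layout, overwriting a slot in the circular buffer only rewrites that slot's leaf priority and its ancestors, an O(log n) update; nothing else in the tree is touched.

```python
# Minimal sum tree: overwriting an old experience updates one leaf and
# propagates the delta to the root (O(log n)), not every item.
import numpy as np

class SumTree:
    def __init__(self, capacity):
        self.capacity = capacity
        self.tree = np.zeros(2 * capacity)   # internal nodes, then leaves
        self.write = 0                       # next slot to overwrite

    def update(self, idx, priority):
        i = idx + self.capacity
        delta = priority - self.tree[i]
        while i >= 1:                        # leaf, then ancestors up to root
            self.tree[i] += delta
            i //= 2

    def add(self, priority):
        idx = self.write
        self.update(idx, priority)           # old priority replaced in place
        self.write = (self.write + 1) % self.capacity
        return idx
```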
    Has anyone experience using/implementing "masking action" in Isaac Gym?
Hi, can it be implemented in the task-level scripts (i.e. ant.py, FrankaCabinet.py etc.) like this?

```python
def pre_physics_step(self, actions):
    ...
    mask = [1, 0, 0, 0, 1]
    actions = actions * mask
```

This would prevent the computed actions from being applied, but would not "teach" the agent that the masked actions are invalid, right? submitted by /u/Fun-Moose-3841 [link] [comments]  ( 57 min )
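As the poster suspects, zeroing actions after the fact gives the policy no learning signal. For discrete action spaces, the common alternative is to mask invalid actions inside the policy distribution; a hedged, generic PyTorch sketch (not Isaac Gym-specific, and for continuous control the usual fix is simply to shrink the action space instead):

```python
# Invalid-action masking at the distribution level: masked actions get
# -inf logits, hence zero probability and no gradient signal toward them.
import torch
from torch.distributions import Categorical

logits = torch.randn(4, 5)                        # (batch, num_actions)
mask = torch.tensor([1, 0, 0, 0, 1], dtype=torch.bool)
dist = Categorical(logits=logits.masked_fill(~mask, float("-inf")))
actions = dist.sample()                           # only valid actions drawn
```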
    Hyperparamter optimization using Successive Halving Pruning in Optuna
In general, my understanding of successive halving is this: there are 10 hyperparameter combinations. The program iterates EACH combination, for example, 15 times and then, based on the results, chooses the best 5. After this it iterates each of the 5 combinations 30 more times, and so on until there is one left, like a tournament-style elimination. So while successive halving iterates all of them at once, in the actual program Optuna iterates the whole of the first hyperparameter combination and then starts pruning or keeping the other combinations. Where is the tournament-like style? How can it know to prune the second trial? Maybe it would be the second-best trial, but it prunes it early just because it is less efficient than the first one. submitted by /u/2kg-Orange [link] [comments]  ( 58 min )
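One possible explanation, hedged: Optuna's SuccessiveHalvingPruner implements an asynchronous variant (ASHA-style), which judges each trial's reported intermediate values against previously completed trials at the same step rather than running a synchronized tournament bracket; that is why the first trial runs to completion and later trials get pruned mid-run. A minimal wiring sketch:

```python
# Asynchronous successive halving in Optuna: trials report intermediate
# scores and are pruned by comparison with earlier trials at the same step.
import optuna

def objective(trial):
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    score = 0.0
    for step in range(100):
        score += lr * (1.0 - score)      # stand-in for one training epoch
        trial.report(score, step)
        if trial.should_prune():
            raise optuna.TrialPruned()
    return score

study = optuna.create_study(direction="maximize",
                            pruner=optuna.pruners.SuccessiveHalvingPruner())
study.optimize(objective, n_trials=50)
```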
    Discrete action SAC algorithm not learning a good policy in simple gym tasks
I was trying to learn more about reinforcement learning algorithms and decided to try out SAC for discrete action spaces. After spending some time adapting SAC to a discrete action space, I tried it out on OpenAI's gym environment CartPole-v1. Unfortunately, my algorithm could not find a good policy (the running average of rewards is somewhere in the 20-30 range). I tried double-checking my code against the code of others, but can't seem to get it working, so I am at my wits' end. Could someone help me diagnose the problem or tell me what I am doing wrong? Link to the code: https://github.com/simux0072/SAC_D-CartPole-v1 Results from training are in the image. Running loss is calculated as the mean over the 100 previous values. PS. I know that SAC should not be used for discrete action spaces, but later on I will be trying out SAC with HER on a sparse-reward discrete-action-space problem, and thus require an off-policy algorithm such as SAC. If there are better algorithms to deal with sparse rewards in discrete action spaces, a suggestion would be appreciated. submitted by /u/DefaltBLK [link] [comments]  ( 57 min )
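One frequent stumbling block in discrete SAC is the soft target: since the policy outputs a full categorical distribution, the next-state value can be an exact expectation over actions instead of a sampled estimate. A hedged sketch of that step with illustrative stand-in networks (following the usual discrete-SAC formulation, not the poster's code):

```python
# Discrete-SAC soft target: exact expectation over the categorical policy.
import torch
import torch.nn as nn

B, A, obs_dim, alpha, gamma = 32, 4, 8, 0.2, 0.99
policy = nn.Sequential(nn.Linear(obs_dim, A), nn.Softmax(dim=-1))
target_q1, target_q2 = nn.Linear(obs_dim, A), nn.Linear(obs_dim, A)  # stand-ins

next_obs = torch.randn(B, obs_dim)
rewards, dones = torch.randn(B), torch.zeros(B)

with torch.no_grad():
    probs = policy(next_obs)                             # (B, A)
    log_probs = torch.log(probs.clamp_min(1e-8))
    q_min = torch.min(target_q1(next_obs), target_q2(next_obs))
    v_next = (probs * (q_min - alpha * log_probs)).sum(dim=1)
    q_target = rewards + gamma * (1.0 - dones) * v_next
```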
    Training a PPO model
I've been trying to train my model, but it takes an extremely long time to converge in my environment, which is expected since it's a complex one. My question is: for anyone who has trained their own model, what's the expected training time, and would a Colab Pro subscription be sufficient to train the model? (In my previous trials to train a DDQL model, it took more than the 4 hours offered by the free Colab.) submitted by /u/Smart_Reward3471 [link] [comments]  ( 63 min )
    KNN algorithm in Lua
I'm bucking the trend here by releasing a KNN algorithm that's not in Python or PyTorch. Here it is: KNN algorithm in Lua (Love2d): https://github.com/togfoxy/KNN-machine-learning/tree/main It's my first ever public release of a module, so I'm happy to take feedback. submitted by /u/Togfox [link] [comments]  ( 58 min )
  • Open

    [D] Getting around GPT-3's 4k token limit?
Is there a way to get around GPT-3's 4k token limit? Companies like Spellbook appear to have found a solution, with some people speculating on Twitter about what they have done - e.g., summarizing the original document, looping in 4k chunks until the right answer is produced, etc. I suspect multiple solutions have been applied. I'd be curious if you have any ideas! Relevant Tweet: https://twitter.com/AlphaMinus2/status/1600319547348639744 submitted by /u/granddaddy [link] [comments]  ( 64 min )
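A hedged sketch of the chunk-then-combine pattern the tweet speculates about; complete() is a hypothetical stand-in for whatever completion API is used, not a real function:

```python
# Hypothetical map-reduce summarization to work around a context limit.
def complete(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM completion call here")

def split_into_chunks(text: str, max_chars: int = 12000) -> list[str]:
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

def summarize_long(text: str) -> str:
    partials = [complete(f"Summarize:\n\n{chunk}")
                for chunk in split_into_chunks(text)]
    return complete("Combine these partial summaries:\n\n" + "\n".join(partials))
```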
[D] - Has OpenAI said what ChatGPT's architecture is? What technique is it using to "remember" previous prompts?
    Has OpenAI said what ChatGPT's architecture is? What technique is it using to "remember" previous prompts? Have they come up with some way to add recurrence to the transformer, or is it just using a feedforward sliding-window approach? submitted by /u/029187 [link] [comments]  ( 64 min )
    [D] Industry folks, what kind of development methodology/cycle do you use?
Agile/scrum/waterfall etc.; was there something you tried that didn't work? What adjustments, beyond simple time extensions, have you made to these known methodologies? I'm just wondering what other teams do that works, since my team is still trying things out, with a lot of convincing needed for managers/PMs who are more purely software-oriented. I've found a few references online on how algorithm/ML/data-science development doesn't fit nicely into agile cycles, but I ended up with more questions. submitted by /u/DisWastingMyTime [link] [comments]  ( 66 min )
    Looking for a simple text editor that uses OpenAI Whisper [D]
Does anyone know of a document editor in which you can dictate text with OpenAI Whisper? Similar to Google Docs but obviously with fewer features, and not for writing code, just normal text. I would like to use it! I am aware of the HuggingFace Spaces (e.g. https://huggingface.co/spaces/openai/whisper) and Colab notebooks where you can use Whisper, but I don't know of any straight-up writing tools. submitted by /u/sidhire [link] [comments]  ( 64 min )
    [D] Why are ChatGPT's initial responses so unrepresentative of the distribution of possibilities that its training data surely offers?
I've been trying to empirically assess what biases ChatGPT has about certain things when I give it minimal information about what I want. The approach that I've tried is to repeatedly make a request in a new thread, look at the distribution of key words, phrases or word/phrase categories across its responses, and compare these distributions across different requests. E.g. one set of requests that I've made have the structure: Make up a realistic story about (a|an) <TRAIT> person. Include their name and a description of their appearance. I collected 10 responses for each of the following <TRAIT>s: "intelligent", "unintelligent", "devious", "trustworthy", "peaceful", "violent", and did the same for 2 other request structures that request similar information, using the same set of <TRAI…  ( 65 min )
    [P] AI project using reinforcement learning to 3D sculpt sculptures
https://i.redd.it/nuqdy9e3815a1.gif Hey Reddit, I just wanted to share an art & research project we've been working on that uses AI and reinforcement learning to teach an AI how to sculpt 3D sculptures. It's been really cool to see the AI learn new strategies and adapt to create unique pieces of art and how we as designers start to take on the role of curators. The project started earlier last year after thinking about the future of art and the role of humans in the creative process. Something that seems to be more relevant than ever these days. With AI becoming more advanced, what does that mean for us as creatives? We also wrote about this on the project page. It's an interesting thought and I'd love to hear your thoughts on it. Check out the project here: https://onformative.com/work/ai-sculpting/ and let's discuss in the comments. submitted by /u/onformative [link] [comments]  ( 65 min )
    [P] Focal loss along with sampling techniques
I am working on multi-class sentence pair classification on a highly imbalanced dataset. I have tried different techniques, from sampling techniques to focal loss. My question is: does it make sense to perform sampling (over- or under-sampling) while training the model with focal loss? My understanding is that focal loss is geared towards penalising samples that are hard to learn, whereas sampling techniques modify the distribution of the dataset, so they can be used together. Please share your thoughts. submitted by /u/channel-hopper- [link] [comments]  ( 64 min )
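For concreteness, a minimal multi-class focal loss sketch in PyTorch (the standard formulation; whether combining it with resampling helps is exactly the empirical question raised above):

```python
# Multi-class focal loss: down-weights easy examples by (1 - p_t)^gamma.
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=None):
    log_pt = F.log_softmax(logits, dim=-1).gather(1, targets[:, None]).squeeze(1)
    pt = log_pt.exp()
    loss = -((1.0 - pt) ** gamma) * log_pt
    if alpha is not None:            # optional per-class weights, shape (C,)
        loss = alpha[targets] * loss
    return loss.mean()

logits = torch.randn(16, 5, requires_grad=True)
targets = torch.randint(0, 5, (16,))
print(focal_loss(logits, targets).item())
```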
    [D] Text to Sound Design?
Hi all! I once read a paper where you could write text and an AI would generate sound design out of it (like: a man walking through grass while birds sing), but I couldn't test it myself. Is there any site where I could test something like this? If you know anything, let me know, that would be great! Thank you so much! X John submitted by /u/johnwireds [link] [comments]  ( 64 min )
    [D] OpenReview & CMT : Assigning someone else to complete reviews on your behalf
    Hello everyone! My professor is a reviewer for a conference which is using OpenReview and wants me to complete the reviews on their behalf. I am aware that on CMT, one can officially assign someone else to complete the review on their behalf. I wanted to ask if the same can be done on OpenReview? If yes, I will proceed to ask my professor if they could officially assign the reviews to me. Otherwise, they would upload my reviews but I would not get any credit for it :( Thank you for your response! :) submitted by /u/Fluff269 [link] [comments]  ( 64 min )
    [D] Does Google TPU v4 compete with GPUs in price/performance?
Fellow machine learning enthusiast here! I want to train a large NLP model and I'm wondering whether it's worth it to use Google Cloud's TPUs for it. I already have an Nvidia RTX 3060 Laptop GPU with 8.76 TFLOPS, but I was unable to find out what the exact performance (in TFLOPS, to be able to compare them) of Google TPU v3 and v4 is. I know TPUs are a lot faster and more optimized for machine learning than GPUs (I think the factor is 12x), but I'm still wondering whether it's worth it to just build a graphics card rig for the long term, since the pricing and estimation seem unclear to me and I cannot see how much I'm paying per TFLOP. Has anyone done the numbers on price/performance and hourly cost? Also, is there any factor I missed? Thanks a lot in advance! submitted by /u/Shardsmp [link] [comments]  ( 72 min )
    [P] I made a tool that auto-saves your ChatGPT conversations and adds a "Chat History" button on the website.
savegpt.com is a browser extension available both on the Chrome Web Store and Firefox Add-ons. submitted by /u/silentx09 [link] [comments]  ( 65 min )
    [P] All About Prompt-Engineering: Open source discussion forum to ask questions, discuss, and share about ChatGPT, Stable Diffusion, GPT-3 and other generative models. Prompt Engineering for different tasks such as NER, QA, Classification, Data Generation and many more
    Hi Folks, Have you tried ChatGPT, GPT-3, or other generative models but have been frustrated by the lack of support or guidance when it comes to using them effectively? Are you interested in learning more about the power of prompt engineering and how it can help you get better results from generative models? We have recently launched a new open-source platform called discuss.openPrompt.io, where you can ask and answer questions, discuss, and share your knowledge and experiences with ChatGPT, Prompt-Engineering, GPT-3, stable diffusion, and other generative models. ​ As many of you may know, ChatGPT was released recently and has generated a lot of excitement among the NLP community. When ChatGPT was released, we were excited to try it out, but we quickly realized that many people are st…  ( 66 min )
  • Open

    Lego baby yoda the child | AVI Tutorial
    submitted by /u/ComprehensiveIce5917 [link] [comments]  ( 45 min )
    Can anyone recommend some of the coolest AI technology out there for prosumers!
Can anyone recommend some of the coolest latest AI technology out there for average people like me? For instance, Midjourney creates computer-generated images based on a text sentence. A website called LALAL can take a song and strip out the vocals to give you an instrumental version. And then there’s Nvidia Canvas, which lets you draw very primitive images and turns them into 3-D landscapes. I’d love to hear other people’s recommendations for cool AI websites/technology they’ve run into this season that can be used for creative things. submitted by /u/Rachel_reddit_ [link] [comments]  ( 46 min )
    I made a novel completely using gpt-3
I wrote a novel about the return of the Greek gods in the modern world; it takes the reader on a journey through 42 chapters and was made using only GPT-3 and some human intervention. https://www.amazon.com/dp/B0BPMKHB2G?fbclid=PAAaYoLa-acoo_O3B-4l5UWsKefDJKbJ3dZrRq1m0H6UAZI9NowBp8VQ2ark0 submitted by /u/youneshlal77 [link] [comments]  ( 45 min )
    AIMA - Russell & Norvig: More notation, less substance?
Most of the time I am reading this book for my AI course at uni, I feel like I am learning terminologies and definitions instead of actual applications. I don't really understand what makes this book such a popular choice. submitted by /u/Status-Sprinkles1236 [link] [comments]  ( 52 min )
    "Tell me a joke about 3 cats a lesbian and a christmas tree"
    submitted by /u/DropNationalism [link] [comments]  ( 44 min )
    This free online AI Tool helps to understand any research paper easily
    submitted by /u/qptbook [link] [comments]  ( 47 min )
    Could we still learn our craft and feed our brains if we entrust the creation process to AI?
One of the human skills that made our species do great things, but also bad ones, is our imagination. From majestic architecture and heartbreaking novels to fantastic art and awesome music, humans have always learned their craft diligently, by going through a trial-and-error process. You can see an artist use trial and error to compose a new melody; the same goes for a visual artist or a writer. This trial-and-error process is what makes us learn our craft, by experiencing bad and good practices and by using our imagination. - What happens to our craft and imagination if we delegate this trial-and-error process to AI? - How could we still find joy and fulfillment if we are no longer in that deep flow state? - Isn't it precisely in the struggle that we always manage to push ourselves further and, thanks to our imagination, find new ways to solve a creative problem? - If we entrust the creation and trial-and-error process to AI, as a basis on which to build our creativity, are we still using our imagination, or just working on something AI has made for us? - If we use AI in our creative process, does that mean we are not using our imagination and thus no longer developing our craft/skill? What's your feedback and opinion on this topic? submitted by /u/crepuscopoli [link] [comments]  ( 47 min )
    DRIVOOO, AI Driving Android App, Object Detection, Lane Detection, Distance Estimation
The app is primarily used to improve driver behavior; its main purpose is to avoid distractions and prevent accidents. Real World Demo (3:06): https://youtu.be/cWxfP-F7soY Project Objective: Drivooo is developed to assist drivers in real time. As described earlier, it helps you avoid distractions and prevent collisions. It utilizes your phone's camera to scan objects, keep the driver in lane, and warn of potential crashes in real time. [Screenshot: Lane Detection, Daytime] Socio-Economic Benefits: Road accidents are a global problem; an accident happens every 2 seconds. According to the WHO, approximately 1.3 million people die each year as a result of road traffic crashes. Some modern cars provide the same features as Drivooo, but they are way too expensive to suppo…  ( 46 min )
    How to add AI bots to my social media account and messaging platforms.
I want to add an AI that I trained to all my messaging platforms so that my loved ones can still message me from time to time without any suspicion. I plan to move somewhere far away before dying. I don't want them to worry. submitted by /u/Safe-Board9430 [link] [comments]  ( 45 min )
    Emma’s Story
I was experimenting to see how far I could go with ChatGPT and asked it to write a fiction story of its own creation. Over 62 “chapters” it managed to write an entire novella that is mostly coherent, with minimal instruction/correction. It titled this “Emma’s Story.” Chapter 1: A New Beginning The sun was just beginning to rise over the city, casting a warm glow over the streets and buildings. Emma stood at the window of her apartment, watching as the first rays of light peeked over the horizon. She took a deep breath and smiled, feeling a sense of excitement and anticipation for the day ahead. Emma was starting a new chapter in her life. After years of working long hours at a dead-end job, she had finally saved enough money to pursue her dream of becoming a writer. She had always been passiona…  ( 67 min )
  • Open

    The 2022 Robotics: Science and Systems Conference
A photo I took while at RSS 2022 in New York City, on the dinner cruise arranged by the conference. From June 27 to July 01 this year, I attended the latest Robotics: Science and Systems conference. This was the first in-person RSS after two virtual editions in 2020 and 2021. I don’t publish in RSS frequently. I only have one RSS paper, VisuoSpatial Foresight, from 2020. I was attending mainly because of its nearby location in New York City and because I was invited to attend RSS Pioneers, which is a unique event arranged by this conference. I’m well aware that this report, like the one I wrote for ICRA 2022, is coming well after the conference has ended. Things have been extremely busy on my end but I will try and improve the turnaround time between attending and reporting about a …  ( 7 min )
  • Open

    Best Neural Networks Courses on Udemy to Consider in 2022
    submitted by /u/Lakshmireddys [link] [comments]  ( 49 min )
    Cache NN
Hi r/neuralnetworks! I was thinking it would be nice if LLMs could remember facts, and I thought of a model that could maybe do that. It would be really great if someone could give me a direction; I'm wondering if someone has already worked on this? The model has a "writing" part and a "reading" part. Writing: The model receives one sentence at a time from a corpus (e.g. Wikipedia). It also receives its current cache, which contains a summary of the corpus. The model "writes" something into the cache, like "3 Water is liquid" to put the sentence "Water is liquid" at address 3. Reading: The model receives a query asking a question, plus the cache. It then "reads" at a specific address from the cache. So if you have a big corpus like: Water (chemical formula H2O) is an inorganic, tra…  ( 49 min )
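As I read the proposal, the interface would look roughly like this toy sketch (entirely my interpretation of the post; in the real model, a network would emit the address and the text rather than a Python dict):

```python
class FactCache:
    """Toy stand-in for the proposed write/read memory over a corpus summary."""
    def __init__(self):
        self.slots = {}  # address -> stored fact

    def write(self, address, fact):
        self.slots[address] = fact          # e.g. write(3, "Water is liquid")

    def read(self, address):
        return self.slots.get(address, "")  # the model must learn which address to read

cache = FactCache()
cache.write(3, "Water is liquid")
print(cache.read(3))
```

Related keywords worth searching for: memory-augmented neural networks and key-value memory (e.g. Neural Turing Machines, memory networks), which learn differentiable read/write operations of this kind.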
  • Open

    Why determinants with columns of ones?
Geometric equations often involve a determinant with a column of 1s: for example, the equation of a line through two points, of a circle through three points, or of a general conic section through five points (the determinant equations themselves are omitted in this digest). Why all the determinants, and why all the 1s? When you see a determinant equal to zero, you immediately think […] First appeared on John D. Cook.  ( 5 min )
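As a concrete instance (standard analytic geometry, added here for illustration): the line through $(x_1, y_1)$ and $(x_2, y_2)$ consists of exactly the points $(x, y)$ satisfying

$$\begin{vmatrix} x & y & 1 \\ x_1 & y_1 & 1 \\ x_2 & y_2 & 1 \end{vmatrix} = 0,$$

since the determinant vanishes precisely when the rows are linearly dependent, i.e. when the three points are collinear.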

  • Open

    Apparently I am a robot
As AI-generated text is getting better, it's getting easier to pass it off as human-written. That's not to say it's as good as human-written. Its goal is to sound correct rather than be correct, so it has a well-known tendency to confidently make stuff up.  ( 4 min )
    Bonus: ChatGPT rates recipes by another neural net
    AI Weirdness: the strange side of machine learning  ( 2 min )
  • Open

    Best Reinforcement Learning course?
    What Is in your opinioni the best course to start with Reinforcement Learning, which Is both hands on and Theoretical? submitted by /u/Emote_del_MP [link] [comments]  ( 58 min )
    Installation issues with Open AI GYM and Mujoco
Hi everyone, I am quite new to this field of reinforcement learning. I want to learn and see in practice how different RL agents work across different environments. I am trying to train RL agents in MuJoCo environments, but for a few days I have been finding it quite difficult to install Gym and MuJoCo. MuJoCo's latest version is "mujoco-2.3.1.post1", and my question is whether OpenAI Gym supports this version; if it does, the error is weird, because the folder it is looking in for the MuJoCo bin library is mujoco210. Can someone advise on that, and do we really need to install mujoco-py? I am very confused. I tried to use the documentation here - openai/mujoco-py: MuJoCo is a physics engine for detailed, efficient rigid body simulations with contacts. mujoco-py allows using MuJoCo from Python 3. (github.com) but it's not working out. Can the experts from this community please advise? https://preview.redd.it/sbd4zv9ug55a1.png?width=1343&format=png&auto=webp&s=f6a458a7703a09d4893a5dc93c86a1695143f4fa https://preview.redd.it/hplb20pmg55a1.png?width=1290&format=png&auto=webp&s=dc4a022ad790837d7058d3f1fd42ab17f1c375fc submitted by /u/Affectionate_Fun_836 [link] [comments]  ( 58 min )
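A likely explanation, offered as an assumption based on the error described: the classic Gym MuJoCo environments go through mujoco-py, which only supports MuJoCo 2.1.0 unpacked at ~/.mujoco/mujoco210 (hence the mujoco210 folder in the error), while the standalone mujoco-2.3.1 package ships DeepMind's newer bindings, which mujoco-py does not use. A minimal sanity check, assuming MuJoCo 2.1.0 plus mujoco-py are installed:

```python
# the v3 MuJoCo environments use mujoco-py under the hood, which expects
# MuJoCo 2.1.0 at ~/.mujoco/mujoco210 (API shown is for gym < 0.26)
import gym

env = gym.make("HalfCheetah-v3")
obs = env.reset()
obs, reward, done, info = env.step(env.action_space.sample())
print(obs.shape, reward)
```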
    Anybody else doing RL in finance?
For almost a year, I have been working on algorithms that use reinforcement learning models to trade on the stock market. During the process I went through stages such as data processing, hyper-parameter tuning, live integration, XAI, and much more. I am curious if anyone else here is working on something similar; I would like to see some different approaches to the topic. If you do, you can comment or text me and we can share our thoughts. submitted by /u/Apprehensive_Rush314 [link] [comments]  ( 58 min )
    Successor of openai/retro
Hello, guys! Is there any successor/alternative to the retro library that is capable of interacting with any libretro emulation core? I'd like to suggest/help with integrating more "modern" cores (not really modern, but I didn't find any of those libs with PSX core compatibility, for example). Thanks! Context: I've been into RL for a few months and managed to train an agent for some tasks on a PSX game using socket connections with the BizHawk emulator. The problem is: it's slow. Not reaaally slow, but my script is spawning multiple processes of the emulator and there are a lot of things running in them that aren't really necessary for my purpose (also, I think sockets aren't the most efficient way to do that, but they seemed to be the most efficient option given the constraint of using that emulator). submitted by /u/victorsevero [link] [comments]  ( 58 min )
    Why is this reward function working?
Hi, I edited the example code from Isaac Gym so that the agent only tries to reach the cube on the table. After every episode the cube position and the arm configuration are reset, so that the robot learns to reach the cube at any position from any configuration. The agent can be successfully trained, but I do not know why this works. The reward function works as follows: each episode consists of 500 simulation steps, and after each step the distance between the cube and the end-effector is calculated; the smaller the distance, the bigger the reward. Now assume that in episode A the cube is placed at a closer position than in episode B. As the distance to the cube is inherently smaller in episode A, the achievable reward is higher in episode A. But how can the agent learn to reach the cube at any position (incl. in episode B) when the best score from episode A never gets broken? Code snippets for the reward function: https://github.com/famora2/IsaacGymEnvs/blob/8b6c725a4f46ed349e7bcbfc1b1cb33fefd2bf66/isaacgymenvs/tasks/franka_cube_stack.py#L699 --- Edit: u/New-Resolution3496 https://preview.redd.it/filgig7jj35a1.png?width=842&format=png&auto=webp&s=1fce968152f6155f7320fd35ad06155f2899f240 submitted by /u/Fun-Moose-3841 [link] [comments]  ( 58 min )
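One way to see why this can still work (my reading, not taken from the linked code): policy-gradient methods maximize the expected return from each state the agent actually encounters; episodes are not ranked against one global best score, so a far-away cube simply defines a lower achievable return that the agent still learns to improve towards. A distance-shaped reward of the kind described might look like this sketch:

```python
import torch

def distance_reward(ee_pos, cube_pos, scale=10.0):
    # dense shaping: reward grows as the end-effector approaches the cube;
    # its absolute magnitude differs between episodes, but from any state the
    # gradient still says "get closer from here", which is all the policy needs
    d = torch.norm(ee_pos - cube_pos, dim=-1)
    return 1.0 - torch.tanh(scale * d)
```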
    I have so many questions! You guys fascinate me.
I am still very much a noob; I've only been coding for a year and am still trying to break into the field as a Web Developer. Anyway, my math is trash (like basic algebra), my programming is noobish, and I want to know what I need to become good in this field. I really want to someday tackle the problem of making an AI that can beat humans at unsolved board/card games. I have been seeing a lot about Linear Algebra and Statistics; what else do I need? Is Python just the way to go, or should I go with C++? What is good beginner-friendly content? Do you guys have any advice for me? submitted by /u/Nimai_TV [link] [comments]  ( 59 min )
  • Open

    Philosophy book written by ChatGPT and illustrated by StableDiffusion
Dear friends!! I just wrote my first book! :) It's called "Little Book of Principles" by Homerus Gigas. I used artificial intelligence (ChatGPT) to compile the wisdom of many great thinkers, including Lao Tzu, Epictetus, Marcus Aurelius, and Krishnamurti, into a collection of art and poetry, written in the style of the Tao Te Ching. Then I used StableDiffusion to illustrate the book. I'm not sure if I can post the link to the book here, but I would love to get your feedback! It's currently free on Amazon. submitted by /u/Front_Brain [link] [comments]  ( 44 min )
    GoD of War on mobile AI version :))
    https://www.youtube.com/watch?v=ck58bVZYKlw submitted by /u/thosiris [link] [comments]  ( 42 min )
    AI Dream 95 - This Cool AI Project is AMAZING
    submitted by /u/LordPewPew777 [link] [comments]  ( 42 min )
    I wrote an Emacs package for ChatGPT
    Demo: https://www.youtube.com/watch?v=4oUrm4CnIjo Repo: https://github.com/joshcho/ChatGPT.el submitted by /u/avindroth [link] [comments]  ( 42 min )
    Breakthrough Robotics Tech To Transform Quadruped Robot Into Humanoid | New AI For Quantum Computers | Deep Reinforcement Learning Arranges Atoms Into Nano Scale Robot Arm
    submitted by /u/kenickh [link] [comments]  ( 43 min )
    is midjourney the best text to image AI at the moment?
    submitted by /u/Thesmallcookie [link] [comments]  ( 46 min )
    I'm writing about media/generative AI weekly.
I decided to scratch my own itch and write about things happening in mediatech weekly. (I've done it for 3 weeks now.) And as there’s an AI boom happening, the content will be generative-AI heavy. Backstory: I’ve been working in tech/media/marketing for years, and for the past couple of years I have closely followed what’s happening in generative AI, VR, creative production, design-tech, etc. Everything audiovisual: creating, writing, media, essentially. But I’ve never found any newsletter focusing on this broad yet niche topic, so I created my own. There’s just so much exciting stuff happening currently in generative AI (and media in general) that’s worth sharing. I’m trying to capture the biggest announcements every week. This isn’t one of those highly technical, solely AI newsletters covering everything about the industry in deep nuance. It’s for a broader audience interested in all the cool stuff happening in media. The content varies from week to week, but it includes a couple of longer stories & links to news, or products, or demos, or videos, or papers, etc. Some of the stuff in the last email: ChatGPT's threat to Google and our society. 5 ideas people could build on top of ChatGPT (for example, ads & BS-free recipes; Googling recipes is torture). Cool stuff people have used ChatGPT for. The European Union spent €400,000 on a metaverse party 5 people attended. World Cup fans using Snapchat filters to display the pride flag at football games. And ofc, Snoop Dobby Dobb (Snoop Dogg X Dobby from Harry Potter) https://preview.redd.it/zxxqkkbne25a1.jpg?width=1170&format=pjpg&auto=webp&s=11eec4b4c22be771b626ac8474abb7eb5650450d If you’re interested in checking it out, there’s a link in the comments. And if you hate me for posting this, call me a spammer, cast a spell on me, and hold a grudge for the rest of your life. Todaloo! submitted by /u/KCVeske [link] [comments]  ( 46 min )
    Are there "converse" AIs?
As in, they assume a specific character's attitude, habits, etc., and you can talk to them. Like character(dot)AI. submitted by /u/Got70TypesOfMalware [link] [comments]  ( 45 min )
    Help required on how to reduce loss for a sequence to sequence RNN training a chatbot with very long input/output sentence lengths. Also, the chatbot is repeating words in the same response.
If anyone can offer any guidance on the problems I'm having with my chatbot, I would be very grateful. The default model is here: https://pytorch.org/tutorials/beginner/chatbot_tutorial.html It is a sequence-to-sequence RNN encoder-decoder with Gated Recurrent Units (GRU). It uses a greedy search method and a Luong attention layer. My task is to alter it; the project is very open-ended. I've decided to try to change it into a "therapist/doctor-bot". At the moment everything is the same, but I'm inputting vastly different datasets from the movie_dialog conversations that were the default. Problems: Loss: messing with the normal parameters (learning rate, number of layers) doesn't seem to be helping all that much, and I can't seem to get the loss down; my sentences are mostly over 20 words long. At the moment I've limited it to process sentences of 100 words. I can't go too low: the patient-doctor conversations are simply very long, and if I reduce the sentence length too much I don't have many patient-doctor sentences left to train the model! This is a substantial problem to overcome, I feel. The default sentence length was 10, and training was much quicker as a result. Talking to the bot: The main issue appears to be that the bot's response sentence tends to repeat tends to repeat tends to repeat tends to repeat... etc. Just like that, in a single response. Any tips on how to prevent this? It also will repeat the same response to different questions, especially if those questions are short in length. I can't figure out where in the code I should attempt to "punish" the bot for doing this. Thanks for reading. submitted by /u/BillMurray2022 [link] [comments]  ( 50 min )
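A common mitigation for the repetition (a standard decoding trick, not part of the tutorial) is to block repeated n-grams during greedy search: before emitting the arg-max token, check whether it would recreate an n-gram that already occurred, and if so fall back to the next-best token. A minimal sketch:

```python
def completes_repeat_ngram(tokens, candidate, n=3):
    # True if appending `candidate` to `tokens` would recreate an n-gram
    # that already appears in the sequence
    if len(tokens) < n - 1:
        return False
    ngram = tuple(tokens[-(n - 1):]) + (candidate,)
    seen = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    return ngram in seen
```

Inside the greedy decoding loop, if the top-scoring token fails this check, take the second-highest-scoring one instead; switching from greedy search to beam search or temperature sampling also tends to reduce both in-response repetition and identical answers to different questions.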
    Hi
    Hi guys, I’m new to Reddit, really looking forward to this. Any tips on how to get the most out of Reddit? Also I’m posting this here cause I’m into AI and if you can suggest some subreddits it would be cool 🙂 cheers let’s go to the moon submitted by /u/Sensitive_Fan_3620 [link] [comments]  ( 42 min )
    Now i can finally write my (true) stories in a form that is nice to read and/or listen to.
I am very bad at writing stories, as I am too fact-oriented, and English is not my first language. Using Chad (ChatGPT) we wrote my true story. I told it the plot and the important details, and after about half an hour we had written this true story together. Chad took my facts and reformulated them, also describing the scenery. It even added some details that were true but that I had not told it. I had to hold Chad's hand or he wandered off on tangents, but it was easy to nudge him back on track. Then I used the Colab notebook from tortoise-tts to train a text-to-speech model with the voice of a famous narrator, for which I sampled 3 clips of 10 seconds of speech for training. There was no intention to make the voice sound like the original narrator, just to get an OK human-like voice. I added some ambience sounds and this is the result: https://dkcraft.dk/sei/story.mp3 (Length: 5 min) I welcome Chad (my name for ChatGPT) as a new tool in my digital toolbox. submitted by /u/sEi_ [link] [comments]  ( 43 min )
    Can someone please explain the reference to Copulas that has been used in the 1st pic? (I have also attached the previous slide for context)
    submitted by /u/Novelpower3404 [link] [comments]  ( 47 min )
    AI Generated Portraits: Lmk Which One Do You Think Is The Best! (1st Photo Of Myself For Reference)💫
    submitted by /u/minapenna [link] [comments]  ( 45 min )
I decided to use Artificial Intelligence (AI) software to make a promo video for my new book. Here's the result.
    submitted by /u/SuccessfulLoser- [link] [comments]  ( 46 min )
    DREAM AI IS RACIST??? I was trying to generate stupid looking hair and for some reason almost every picture is of black people??? Go try it out. This is pretty fucked up if you ask me.
    submitted by /u/ChuklzDaJ [link] [comments]  ( 47 min )
    AI Art! Prompt: Flowing Evil
    I used Dream! by Wombo submitted by /u/Chase2k7 [link] [comments]  ( 43 min )
    Any Good Real Time transcription Apps for iOS other than Otter?
    I wanna find something that does real time transcription of audio for iOS OtterAI does exactly what I want but I want to find something that doesn't use my recordings to train its AI Any good options? submitted by /u/AviatorPrints [link] [comments]  ( 44 min )
    AI writes an original Abbott and Costello routine
    submitted by /u/robertdeniro6969 [link] [comments]  ( 42 min )
  • Open

    [R] The Framework for Learning and Inference In a Forward Pass
Paper: Signal Propagation: A Framework for Learning and Inference In a Forward Pass (https://arxiv.org/abs/2204.01723) We introduce a framework for learning with forward passes. I made a friendly and thorough tutorial on learning about and implementing forward learning: https://amassivek.github.io/sigprop . The most interesting insights from the framework: (1) The algorithm provides an explanation for how neurons in the brain without error connections receive learning signals; it works for continuous networks with Hebbian learning, which provides evidence for this algorithm as a model of learning in the brain. (2) It works for spiking neural networks using only the membrane potential (aka voltage in hardware), which supports applying this algorithm to learning on neuromorphic chips. Recent in…  ( 68 min )
    [D] CNN with automatic dataset generation
Hello everyone. I'm working on a project where a machine learning model (a CNN) has to recognize handmade signs on paper cards from a camera picture. It's a POC and I don't have the actual pictures, so I have to simulate them. Recently I implemented an ImageGenerator to add some noise to a set of images for data augmentation, but this time I don't have any images at all. I have a clean image of the paper card, and I created another ImageGenerator to add some random fake signs to that image. I wanted to use the generated images as the training set, but I was wondering: do I have to create an initial dataset and use it in every epoch, or is it possible to pass different generated images every time? In other words: is it a good thing if the model sees completely different samples every time it starts a new epoch? In that case, how do I handle the evaluation set? Thanks submitted by /u/kaeldric__ [link] [comments]  ( 67 min )
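Generating fresh samples every epoch is generally fine; it amounts to unlimited data augmentation and usually helps generalization. The evaluation set should simply be generated once and then frozen, so metrics stay comparable across epochs. A minimal PyTorch sketch (add_random_signs is a hypothetical stand-in for the poster's generator):

```python
import torch
from torch.utils.data import Dataset

def add_random_signs(card):
    # hypothetical stub: the real version would draw fake handmade signs on
    # the clean card image and return the matching label
    label = int(torch.randint(0, 10, (1,)))
    return card + 0.05 * torch.randn_like(card), label

class GeneratedSigns(Dataset):
    """Synthesizes a fresh sample on every access, so each epoch sees new data."""
    def __init__(self, base_card, n_per_epoch=10000):
        self.base_card = base_card        # clean image of the paper card
        self.n_per_epoch = n_per_epoch    # virtual size of one epoch

    def __len__(self):
        return self.n_per_epoch

    def __getitem__(self, idx):
        return add_random_signs(self.base_card)

# freeze the validation set once so evaluation stays comparable across epochs
card = torch.zeros(3, 128, 128)
val_set = [add_random_signs(card) for _ in range(1000)]
```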
    [D] MLOps: Retraining strategy for unsupervised topic model in production?
    You're interested in modeling the topics of tweets that were initially collected via keyword (hashtag) search. You run an unsupervised topic modeling method (e.g., LDA, BERTopic) and end up with a set of topics that you're reasonably happy with after manual inspection. You now want to deploy this model, but you wonder: what to do when new topics / themes naturally arise? For example, nobody could have predicted covid and you'd certainly like to capture important events like this. So, you need to be able to have new topics. For that, I guess you can retrain on data within a rolling window of, say, 3/6/9 months. However, that will regenerate topics from scratch! You would have to manually inspect the data each time to confirm the new topics are still reasonable. How to deal with this? submitted by /u/TheCockatoo [link] [comments]  ( 66 min )
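One option (an assumption about a possible stack, not the poster's setup) is to use an online topic model and update it incrementally instead of regenerating topics from scratch; existing topics then drift gradually and stay recognizable, so only genuinely new clusters need manual inspection. A sketch with scikit-learn's online LDA:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

initial_tweets = ["the match was great", "stay home and wash your hands"]  # placeholder data
new_tweets = ["new vaccine rollout announced today"]                       # placeholder data

# caveat: the vocabulary is frozen at the first fit, so genuinely new words
# (e.g. "covid") stay invisible until the vectorizer itself is refit
vectorizer = CountVectorizer(max_features=50000)
X0 = vectorizer.fit_transform(initial_tweets)

lda = LatentDirichletAllocation(n_components=10, random_state=0)
lda.partial_fit(X0)                                # initial topics
lda.partial_fit(vectorizer.transform(new_tweets))  # incremental update: topics drift, stay aligned
```

BERTopic also ships online/incremental variants that may be worth checking before committing to full periodic retraining.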
    [P] Daath AI Parser is an open-source application that uses OpenAI to parse visible text of HTML elements.
    submitted by /u/softcrater [link] [comments]  ( 67 min )
    [D] "#AI-based assessment of cardiac allograft rejections"Lipkova et al. 2022
    submitted by /u/pasticciociccio [link] [comments]  ( 66 min )
    [Project] Football Players Tracking with YOLOv5 + ByteTRACK
    submitted by /u/RandomForests92 [link] [comments]  ( 67 min )
    [P] I made a command-line tool that explains your errors using ChatGPT (link in comments)
    submitted by /u/jsonathan [link] [comments]  ( 72 min )
    [N] If Yann LeCun is right and AR glasses are the killer app for ML hardware, then Snapdragon AR2 chips are the next step on our path [16 slides]
    submitted by /u/SpatialComputing [link] [comments]  ( 72 min )
    [P] Machine Learning Framework in Java
Hey guys. We created a small framework with which you can implement a model of a problem and let the framework solve it. The code is pure Java and there is no need to understand the complex mechanisms of machine learning. The repository includes a quick example of how this framework plays the game "Snake". For training, it uses a genetic algorithm to train neural networks. If you can implement Snake in Java, you can also use this framework to let it play your game. The motivation for this framework was that many people learn Java in school and are interested in machine learning, yet nearly every machine learning framework out there is written in Python. This framework should not be used to make the most complex predictions; rather, it should be used to implement your first machine learning algorithm in a language most of you are familiar with. We try to lure you into the world of Machine Learning! ;) https://github.com/tomLamprecht/AI-Framework Feedback of any kind is very welcome! :) PS: I'm aware that evolving a neural net with genetic algorithms is by far not the most efficient way, but I went down this road and I'm going to finish it :D submitted by /u/Lampard557 [link] [comments]  ( 67 min )
  • Open

    Security as Code: Creating a New Cybersecurity Paradigm Amid Growing Cloud Use
As more data move to the cloud and more organizations embrace cloud computing, it is clear that there is a corresponding need to improve cybersecurity strategies. One of the improvements being considered is security-as-code (SaC), which Google has been promoting actively. Many organizations have been supportive of this relatively nascent but promising cybersecurity approach. In… The post first appeared on Data Science Central.  ( 21 min )
  • Open

    How to get started?
Heya guys; I'm a high school student who's low on cash, looking to get started with neural networks, with nothing but some Python and a laptop to my name. Any good starter projects for low-level, archaic neural networks that aren't focused on teaching Python basics? I bought a book about it a few years ago, but it was more about Python basics. submitted by /u/StationPhysical [link] [comments]  ( 58 min )
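A classic zero-budget starter exercise that fits this ask: a tiny network learning XOR in plain numpy, no frameworks (my suggestion, with biases included since XOR is not learnable without them in a net this small):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])

W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)   # 2 inputs -> 4 hidden units
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)   # 4 hidden -> 1 output
sig = lambda z: 1.0 / (1.0 + np.exp(-z))

for _ in range(20000):
    h = sig(X @ W1 + b1)                  # forward pass
    out = sig(h @ W2 + b2)
    g_out = (out - y) * out * (1 - out)   # backprop: output layer
    g_h = (g_out @ W2.T) * h * (1 - h)    # backprop: hidden layer
    W2 -= 0.5 * h.T @ g_out; b2 -= 0.5 * g_out.sum(0)
    W1 -= 0.5 * X.T @ g_h;   b1 -= 0.5 * g_h.sum(0)

print(out.round(2))  # should approach [[0], [1], [1], [0]]
```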
  • Open

    Contrastive Weighted Learning for Near-Infrared Gaze Estimation. (arXiv:2211.03073v2 [cs.CV] UPDATED)
Appearance-based gaze estimation has been very successful with the use of deep learning, and many subsequent works have improved domain generalization for gaze estimation. However, even though there has been much progress in domain generalization for gaze estimation, most of the recent work has focused on cross-dataset performance, accounting for different distributions in illumination, head pose, and lighting. Although improving gaze estimation on different distributions of RGB images is important, near-infrared image based gaze estimation is also critical for gaze estimation in dark settings. There are also inherent limitations to relying solely on supervised learning for regression tasks. This paper contributes to solving these problems and proposes GazeCWL, a novel framework for gaze estimation with near-infrared images using contrastive learning. GazeCWL leverages adversarial attack techniques for data augmentation and a novel contrastive loss function specifically for regression tasks that effectively clusters the features of different samples in the latent space. Our model outperforms previous domain generalization models in infrared image based gaze estimation, beating the baseline by 45.6\% while improving the state-of-the-art by 8.6\%, demonstrating the efficacy of our method.
    DPAUC: Differentially Private AUC Computation in Federated Learning. (arXiv:2208.12294v2 [cs.LG] UPDATED)
    Federated learning (FL) has gained significant attention recently as a privacy-enhancing tool to jointly train a machine learning model by multiple participants. The prior work on FL has mostly studied how to protect label privacy during model training. However, model evaluation in FL might also lead to potential leakage of private label information. In this work, we propose an evaluation algorithm that can accurately compute the widely used AUC (area under the curve) metric when using the label differential privacy (DP) in FL. Through extensive experiments, we show our algorithms can compute accurate AUCs compared to the ground truth. The code is available at {\url{https://github.com/bytedance/fedlearner/tree/master/example/privacy/DPAUC}}.
    Introducing Non-Linear Activations into Quantum Generative Models. (arXiv:2205.14506v4 [quant-ph] UPDATED)
    Due to the linearity of quantum mechanics, it remains a challenge to design quantum generative machine learning models that embed non-linear activations into the evolution of the statevector. However, some of the most successful classical generative models, such as those based on neural networks, involve highly non-linear dynamics for quality training. In this paper, we explore the effect of these dynamics in quantum generative modeling by introducing a model that adds non-linear activations via a neural network structure onto the standard Born Machine framework - the Quantum Neuron Born Machine (QNBM). To achieve this, we utilize a previously introduced Quantum Neuron subroutine, which is a repeat-until-success circuit with mid-circuit measurements and classical control. After introducing the QNBM, we investigate how its performance depends on network size, by training a 3-layer QNBM with 4 output neurons and various input and hidden layer sizes. We then compare our non-linear QNBM to the linear Quantum Circuit Born Machine (QCBM). We allocate similar time and memory resources to each model, such that the only major difference is the qubit overhead required by the QNBM. With gradient-based training, we show that while both models can easily learn a trivial uniform probability distribution, on a more challenging class of distributions, the QNBM achieves an almost 3x smaller error rate than a QCBM with a similar number of tunable parameters. We therefore provide evidence that suggests that non-linearity is a useful resource in quantum generative models, and we put forth the QNBM as a new model with good generative performance and potential for quantum advantage.
    LaserMix for Semi-Supervised LiDAR Semantic Segmentation. (arXiv:2207.00026v2 [cs.CV] UPDATED)
    Densely annotating LiDAR point clouds is costly, which restrains the scalability of fully-supervised learning methods. In this work, we study the underexplored semi-supervised learning (SSL) in LiDAR segmentation. Our core idea is to leverage the strong spatial cues of LiDAR point clouds to better exploit unlabeled data. We propose LaserMix to mix laser beams from different LiDAR scans, and then encourage the model to make consistent and confident predictions before and after mixing. Our framework has three appealing properties: 1) Generic: LaserMix is agnostic to LiDAR representations (e.g., range view and voxel), and hence our SSL framework can be universally applied. 2) Statistically grounded: We provide a detailed analysis to theoretically explain the applicability of the proposed framework. 3) Effective: Comprehensive experimental analysis on popular LiDAR segmentation datasets (nuScenes, SemanticKITTI, and ScribbleKITTI) demonstrates our effectiveness and superiority. Notably, we achieve competitive results over fully-supervised counterparts with 2x to 5x fewer labels and improve the supervised-only baseline significantly by 10.8% on average. We hope this concise yet high-performing framework could facilitate future research in semi-supervised LiDAR segmentation. Code is publicly available.
    GAUCHE: A Library for Gaussian Processes in Chemistry. (arXiv:2212.04450v1 [physics.chem-ph])
    We introduce GAUCHE, a library for GAUssian processes in CHEmistry. Gaussian processes have long been a cornerstone of probabilistic machine learning, affording particular advantages for uncertainty quantification and Bayesian optimisation. Extending Gaussian processes to chemical representations, however, is nontrivial, necessitating kernels defined over structured inputs such as graphs, strings and bit vectors. By defining such kernels in GAUCHE, we seek to open the door to powerful tools for uncertainty quantification and Bayesian optimisation in chemistry. Motivated by scenarios frequently encountered in experimental chemistry, we showcase applications for GAUCHE in molecular discovery and chemical reaction optimisation. The codebase is made available at https://github.com/leojklarner/gauche
    A Survey of Graph Neural Networks for Social Recommender Systems. (arXiv:2212.04481v1 [cs.SI])
    Social recommender systems (SocialRS) simultaneously leverage user-to-item interactions as well as user-to-user social relations for the task of generating item recommendations to users. Additionally exploiting social relations is clearly effective in understanding users' tastes due to the effects of homophily and social influence. For this reason, SocialRS has increasingly attracted attention. In particular, with the advance of Graph Neural Networks (GNN), many GNN-based SocialRS methods have been developed recently. Therefore, we conduct a comprehensive and systematic review of the literature on GNN-based SocialRS. In this survey, we first identify 80 papers on GNN-based SocialRS after annotating 2151 papers by following the PRISMA framework (Preferred Reporting Items for Systematic Reviews and Meta-Analysis). Then, we comprehensively review them in terms of their inputs and architectures to propose a novel taxonomy: (1) input taxonomy includes 5 groups of input type notations and 7 groups of input representation notations; (2) architecture taxonomy includes 8 groups of GNN encoder, 2 groups of decoder, and 12 groups of loss function notations. We classify the GNN-based SocialRS methods into several categories as per the taxonomy and describe their details. Furthermore, we summarize the benchmark datasets and metrics widely used to evaluate the GNN-based SocialRS methods. Finally, we conclude this survey by presenting some future research directions.
    Fast Parallel Bayesian Network Structure Learning. (arXiv:2212.04259v1 [cs.LG])
    Bayesian networks (BNs) are a widely used graphical model in machine learning for representing knowledge with uncertainty. The mainstream BN structure learning methods require performing a large number of conditional independence (CI) tests. The learning process is very time-consuming, especially for high-dimensional problems, which hinders the adoption of BNs to more applications. Existing works attempt to accelerate the learning process with parallelism, but face issues including load unbalancing, costly atomic operations and dominant parallel overhead. In this paper, we propose a fast solution named Fast-BNS on multi-core CPUs to enhance the efficiency of the BN structure learning. Fast-BNS is powered by a series of efficiency optimizations including (i) designing a dynamic work pool to monitor the processing of edges and to better schedule the workloads among threads, (ii) grouping the CI tests of the edges with the same endpoints to reduce the number of unnecessary CI tests, (iii) using a cache-friendly data storage to improve the memory efficiency, and (iv) generating the conditioning sets on-the-fly to avoid extra memory consumption. A comprehensive experimental study shows that the sequential version of Fast-BNS is up to 50 times faster than its counterpart, and the parallel version of Fast-BNS achieves 4.8 to 24.5 times speedup over the state-of-the-art multi-threaded solution. Moreover, Fast-BNS has a good scalability to the network size as well as sample size. Fast-BNS source code is freely available at https://github.com/jjiantong/FastBN.
    OFASys: A Multi-Modal Multi-Task Learning System for Building Generalist Models. (arXiv:2212.04408v1 [cs.CV])
    Generalist models, which are capable of performing diverse multi-modal tasks in a task-agnostic way within a single model, have been explored recently. Being, hopefully, an alternative to approaching general-purpose AI, existing generalist models are still at an early stage, where modality and task coverage is limited. To empower multi-modal task-scaling and speed up this line of research, we release a generalist model learning system, OFASys, built on top of a declarative task interface named multi-modal instruction. At the core of OFASys is the idea of decoupling multi-modal task representations from the underlying model implementations. In OFASys, a task involving multiple modalities can be defined declaratively even with just a single line of code. The system automatically generates task plans from such instructions for training and inference. It also facilitates multi-task training for diverse multi-modal workloads. As a starting point, we provide presets of 7 different modalities and 23 highly-diverse example tasks in OFASys, with which we also develop a first-in-kind, single model, OFA+, that can handle text, image, speech, video, and motion data. The single OFA+ model achieves 95% performance in average with only 16% parameters of 15 task-finetuned models, showcasing the performance reliability of multi-modal task-scaling provided by OFASys. Available at https://github.com/OFA-Sys/OFASys
    Discovering Closed-Loop Failures of Vision-Based Controllers via Reachability Analysis. (arXiv:2211.02736v2 [cs.RO] UPDATED)
    Machine learning driven image-based controllers allow robotic systems to take intelligent actions based on the visual feedback from their environment. Understanding when these controllers might lead to system safety violations is important for their integration in safety-critical applications and engineering corrective safety measures for the system. Existing methods leverage simulation-based testing (or falsification) to find the failures of vision-based controllers, i.e., the visual inputs that lead to closed-loop safety violations. However, these techniques do not scale well to the scenarios involving high-dimensional and complex visual inputs, such as RGB images. In this work, we cast the problem of finding closed-loop vision failures as a Hamilton-Jacobi (HJ) reachability problem. Our approach blends simulation-based analysis with HJ reachability methods to compute an approximation of the backward reachable tube (BRT) of the system, i.e., the set of unsafe states for the system under vision-based controllers. Utilizing the BRT, we can tractably and systematically find the system states and corresponding visual inputs that lead to closed-loop failures. These visual inputs can be subsequently analyzed to find the input characteristics that might have caused the failure. Besides its scalability to high-dimensional visual inputs, an explicit computation of BRT allows the proposed approach to capture non-trivial system failures that are difficult to expose via random simulations. We demonstrate our framework on two case studies involving an RGB image-based neural network controller for (a) autonomous indoor navigation, and (b) autonomous aircraft taxiing.
    Skellam Mixture Mechanism: a Novel Approach to Federated Learning with Differential Privacy. (arXiv:2212.04371v1 [cs.LG])
    Deep neural networks have strong capabilities of memorizing the underlying training data, which can be a serious privacy concern. An effective solution to this problem is to train models with differential privacy, which provides rigorous privacy guarantees by injecting random noise to the gradients. This paper focuses on the scenario where sensitive data are distributed among multiple participants, who jointly train a model through federated learning (FL), using both secure multiparty computation (MPC) to ensure the confidentiality of each gradient update, and differential privacy to avoid data leakage in the resulting model. A major challenge in this setting is that common mechanisms for enforcing DP in deep learning, which inject real-valued noise, are fundamentally incompatible with MPC, which exchanges finite-field integers among the participants. Consequently, most existing DP mechanisms require rather high noise levels, leading to poor model utility. Motivated by this, we propose Skellam mixture mechanism (SMM), an approach to enforce DP on models built via FL. Compared to existing methods, SMM eliminates the assumption that the input gradients must be integer-valued, and, thus, reduces the amount of noise injected to preserve DP. Further, SMM allows tight privacy accounting due to the nice composition and sub-sampling properties of the Skellam distribution, which are key to accurate deep learning with DP. The theoretical analysis of SMM is highly non-trivial, especially considering (i) the complicated math of differentially private deep learning in general and (ii) the fact that the mixture of two Skellam distributions is rather complex, and to our knowledge, has not been studied in the DP literature. Extensive experiments on various practical settings demonstrate that SMM consistently and significantly outperforms existing solutions in terms of the utility of the resulting model.
    Variable-Decision Frequency Option Critic. (arXiv:2212.04407v1 [cs.LG])
    In classic reinforcement learning algorithms, agents make decisions at discrete and fixed time intervals. The physical duration between one decision and the next becomes a critical hyperparameter. When this duration is too short, the agent needs to make many decisions to achieve its goal, aggravating the problem's difficulty. But when this duration is too long, the agent becomes incapable of controlling the system. Physical systems, however, do not need a constant control frequency. For learning agents, it is desirable to operate with low frequency when possible and high frequency when necessary. We propose a framework called Continuous-Time Continuous-Options (CTCO), where the agent chooses options as sub-policies of variable durations. Such options are time-continuous and can interact with the system at any desired frequency providing a smooth change of actions. The empirical analysis shows that our algorithm is competitive w.r.t. other time-abstraction techniques, such as classic option learning and action repetition, and practically overcomes the difficult choice of the decision frequency.
    Targeted Adversarial Attacks against Neural Network Trajectory Predictors. (arXiv:2212.04138v1 [cs.LG])
    Trajectory prediction is an integral component of modern autonomous systems as it allows for envisioning future intentions of nearby moving agents. Due to the lack of other agents' dynamics and control policies, deep neural network (DNN) models are often employed for trajectory forecasting tasks. Although there exists an extensive literature on improving the accuracy of these models, there is a very limited number of works studying their robustness against adversarially crafted input trajectories. To bridge this gap, in this paper, we propose a targeted adversarial attack against DNN models for trajectory forecasting tasks. We call the proposed attack TA4TP for Targeted adversarial Attack for Trajectory Prediction. Our approach generates adversarial input trajectories that are capable of fooling DNN models into predicting user-specified target/desired trajectories. Our attack relies on solving a nonlinear constrained optimization problem where the objective function captures the deviation of the predicted trajectory from a target one while the constraints model physical requirements that the adversarial input should satisfy. The latter ensures that the inputs look natural and they are safe to execute (e.g., they are close to nominal inputs and away from obstacles). We demonstrate the effectiveness of TA4TP on two state-of-the-art DNN models and two datasets. To the best of our knowledge, we propose the first targeted adversarial attack against DNN models used for trajectory forecasting.
    The (Un)Scalability of Heuristic Approximators for NP-Hard Search Problems. (arXiv:2209.03393v3 [cs.AI] UPDATED)
    The A* algorithm is commonly used to solve NP-hard combinatorial optimization problems. When provided with a completely informed heuristic function, A* solves many NP-hard minimum-cost path problems in time polynomial in the branching factor and the number of edges in a minimum-cost path. Thus, approximating their completely informed heuristic functions with high precision is NP-hard. We therefore examine recent publications that propose the use of neural networks for this purpose. We support our claim that these approaches do not scale to large instance sizes both theoretically and experimentally. Our first experimental results for three representative NP-hard minimum-cost path problems suggest that using neural networks to approximate completely informed heuristic functions with high precision might result in network sizes that scale exponentially in the instance sizes. The research community might thus benefit from investigating other ways of integrating heuristic search with machine learning.
    Mind the Gap: Measuring Generalization Performance Across Multiple Objectives. (arXiv:2212.04183v1 [cs.LG])
    Modern machine learning models are often constructed taking into account multiple objectives, e.g., to minimize inference time while also maximizing accuracy. Multi-objective hyperparameter optimization (MHPO) algorithms return such candidate models and the approximation of the Pareto front is used to assess their performance. However, when estimating generalization performance of an approximation of a Pareto front found on a validation set by computing the performance of the individual models on the test set, models might no longer be Pareto-optimal. This makes it unclear how to measure performance. To resolve this, we provide a novel evaluation protocol that allows measuring the generalization performance of MHPO methods and to study its capabilities for comparing two optimization experiments.
    Gradual Weisfeiler-Leman: Slow and Steady Wins the Race. (arXiv:2209.09048v2 [cs.LG] UPDATED)
    The classical Weisfeiler-Leman algorithm aka color refinement is fundamental for graph learning with kernels and neural networks. Originally developed for graph isomorphism testing, the algorithm iteratively refines vertex colors. On many datasets, the stable coloring is reached after a few iterations and the optimal number of iterations for machine learning tasks is typically even lower. This suggests that the colors diverge too fast, defining a similarity that is too coarse. We generalize the concept of color refinement and propose a framework for gradual neighborhood refinement, which allows a slower convergence to the stable coloring and thus provides a more fine-grained refinement hierarchy and vertex similarity. We assign new colors by clustering vertex neighborhoods, replacing the original injective color assignment function. Our approach is used to derive new variants of existing graph kernels and to approximate the graph edit distance via optimal assignments regarding vertex similarity. We show that in both tasks, our method outperforms the original color refinement with only a moderate increase in running time advancing the state of the art.
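For intuition, classic 1-WL color refinement looks like the sketch below (my illustration; the paper's gradual variant replaces the injective relabeling with clustering of the neighborhood multisets so the colors diverge more slowly):

```python
def wl_refinement(adj, colors, iterations=3):
    # adj: {vertex: list of neighbors}; colors: {vertex: initial color}
    # classic 1-WL: relabel each vertex by (own color, sorted neighbor colors)
    for _ in range(iterations):
        sigs = {v: (colors[v], tuple(sorted(colors[u] for u in adj[v]))) for v in adj}
        palette = {s: i for i, s in enumerate(sorted(set(sigs.values())))}
        colors = {v: palette[sigs[v]] for v in adj}
    return colors

# toy usage: path graph 0-1-2, all vertices start identically colored;
# the endpoints separate from the middle vertex after one round
print(wl_refinement({0: [1], 1: [0, 2], 2: [1]}, {0: 0, 1: 0, 2: 0}))
```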
    Enhanced method for reinforcement learning based dynamic obstacle avoidance by assessment of collision risk. (arXiv:2212.04123v1 [cs.RO])
    In the field of autonomous robots, reinforcement learning (RL) is an increasingly used method to solve the task of dynamic obstacle avoidance for mobile robots, autonomous ships, and drones. A common practice to train those agents is to use a training environment with random initialization of agent and obstacles. Such approaches might suffer from a low coverage of high-risk scenarios in training, leading to impaired final performance of obstacle avoidance. This paper proposes a general training environment where we gain control over the difficulty of the obstacle avoidance task by using short training episodes and assessing the difficulty by two metrics: The number of obstacles and a collision risk metric. We found that shifting the training towards a greater task difficulty can massively increase the final performance. A baseline agent, using a traditional training environment based on random initialization of agent and obstacles and longer training episodes, leads to a significantly weaker performance. To prove the generalizability of the proposed approach, we designed two realistic use cases: A mobile robot and a maritime ship under the threat of approaching obstacles. In both applications, the previous results can be confirmed, which emphasizes the general usability of the proposed approach, detached from a specific application context and independent of the agent's dynamics. We further added Gaussian noise to the sensor signals, resulting in only a marginal degradation of performance and thus indicating solid robustness of the trained agent.
    HERD: Continuous Human-to-Robot Evolution for Learning from Human Demonstration. (arXiv:2212.04359v1 [cs.RO])
    The ability to learn from human demonstration endows robots with the ability to automate various tasks. However, directly learning from human demonstration is challenging since the structure of the human hand can be very different from the desired robot gripper. In this work, we show that manipulation skills can be transferred from a human to a robot through the use of micro-evolutionary reinforcement learning, where a five-finger human dexterous hand robot gradually evolves into a commercial robot, while repeated interacting in a physics simulator to continuously update the policy that is first learned from human demonstration. To deal with the high dimensions of robot parameters, we propose an algorithm for multi-dimensional evolution path searching that allows joint optimization of both the robot evolution path and the policy. Through experiments on human object manipulation datasets, we show that our framework can efficiently transfer the expert human agent policy trained from human demonstrations in diverse modalities to target commercial robots.
    Training Adaptive Reconstruction Networks for Blind Inverse Problems. (arXiv:2202.11342v2 [cs.LG] UPDATED)
    Neural networks have recently allowed solving many ill-posed inverse problems with unprecedented performance. Physics informed approaches already progressively replace carefully hand-crafted reconstruction algorithms in real applications. However, these networks suffer from a major defect: when trained on a given forward operator, they do not generalize well to a different one. The aim of this paper is twofold. First, we show through various applications that training the network with a family of forward operators allows solving the adaptivity problem without compromising the reconstruction quality significantly. Second, we illustrate that this training procedure allows tackling challenging blind inverse problems. Our experiments include partial Fourier sampling problems arising in magnetic resonance imaging (MRI), computerized tomography (CT) and image deblurring.
    Differentially-Private Bayes Consistency. (arXiv:2212.04216v1 [cs.LG])
    We construct a universally Bayes consistent learning rule that satisfies differential privacy (DP). We first handle the setting of binary classification and then extend our rule to the more general setting of density estimation (with respect to the total variation metric). The existence of a universally consistent DP learner reveals a stark difference with the distribution-free PAC model. Indeed, in the latter DP learning is extremely limited: even one-dimensional linear classifiers are not privately learnable in this stringent model. Our result thus demonstrates that by allowing the learning rate to depend on the target distribution, one can circumvent the above-mentioned impossibility result and in fact, learn \emph{arbitrary} distributions by a single DP algorithm. As an application, we prove that any VC class can be privately learned in a semi-supervised setting with a near-optimal \emph{labeled} sample complexity of $\tilde{O}(d/\varepsilon)$ labeled examples (and with an unlabeled sample complexity that can depend on the target distribution).
    Loss Minimization through the Lens of Outcome Indistinguishability. (arXiv:2210.08649v2 [cs.LG] UPDATED)
We present a new perspective on loss minimization and the recent notion of Omniprediction through the lens of Outcome Indistinguishability. For a collection of losses and hypothesis class, omniprediction requires that a predictor provide a loss-minimization guarantee simultaneously for every loss in the collection compared to the best (loss-specific) hypothesis in the class. We present a generic template to learn predictors satisfying a guarantee we call Loss Outcome Indistinguishability. For a set of statistical tests--based on a collection of losses and hypothesis class--a predictor is Loss OI if it is indistinguishable (according to the tests) from Nature's true probabilities over outcomes. By design, Loss OI implies omniprediction in a direct and intuitive manner. We simplify Loss OI further, decomposing it into a calibration condition plus multiaccuracy for a class of functions derived from the loss and hypothesis classes. By careful analysis of this class, we give efficient constructions of omnipredictors for interesting classes of loss functions, including non-convex losses. This decomposition highlights the utility of a new multi-group fairness notion that we call calibrated multiaccuracy, which lies in between multiaccuracy and multicalibration. We show that calibrated multiaccuracy implies Loss OI for the important set of convex losses arising from Generalized Linear Models, without requiring full multicalibration. For such losses, we show an equivalence between our computational notion of Loss OI and a geometric notion of indistinguishability, formulated as Pythagorean theorems in the associated Bregman divergence. We give an efficient algorithm for calibrated multiaccuracy with computational complexity comparable to that of multiaccuracy. In all, calibrated multiaccuracy offers an interesting tradeoff point between efficiency and generality in the omniprediction landscape.
    Expanding Small-Scale Datasets with Guided Imagination. (arXiv:2211.13976v2 [cs.CV] UPDATED)
    The power of Deep Neural Networks (DNNs) depends heavily on the training data quantity, quality and diversity. However, in many real scenarios, it is costly and time-consuming to collect and annotate large-scale data. This has severely hindered the application of DNNs. To address this challenge, we explore a new task of dataset expansion, which seeks to automatically create new labeled samples to expand a small dataset. To this end, we present a Guided Imagination Framework (GIF) that leverages the recently developed big generative models (e.g., DALL-E2) and reconstruction models (e.g., MAE) to "imagine" and create informative new data from seed data to expand small datasets. Specifically, GIF conducts imagination by optimizing the latent features of seed data in a semantically meaningful space, which are fed into the generative models to generate photo-realistic images with new contents. For guiding the imagination towards creating samples useful for model training, we exploit the zero-shot recognition ability of CLIP and introduce three criteria to encourage informative sample generation, i.e., prediction consistency, entropy maximization and diversity promotion. With these essential criteria as guidance, GIF works well for expanding datasets in different domains, leading to 29.9% accuracy gain on average over six natural image datasets, and 12.3% accuracy gain on average over three medical image datasets. The source code will be released: \url{https://github.com/Vanint/DatasetExpansion}.
    Making Linear MDPs Practical via Contrastive Representation Learning. (arXiv:2207.07150v2 [cs.LG] UPDATED)
    It is common to address the curse of dimensionality in Markov decision processes (MDPs) by exploiting low-rank representations. This motivates much of the recent theoretical study on linear MDPs. However, most approaches require a given representation under unrealistic assumptions about the normalization of the decomposition or introduce unresolved computational challenges in practice. Instead, we consider an alternative definition of linear MDPs that automatically ensures normalization while allowing efficient representation learning via contrastive estimation. The framework also admits confidence-adjusted index algorithms, enabling an efficient and principled approach to incorporating optimism or pessimism in the face of uncertainty. To the best of our knowledge, this provides the first practical representation learning method for linear MDPs that achieves both strong theoretical guarantees and empirical performance. Theoretically, we prove that the proposed algorithm is sample efficient in both the online and offline settings. Empirically, we demonstrate superior performance over existing state-of-the-art model-based and model-free algorithms on several benchmarks.
    Analysis of Kinetic Models for Label Switching and Stochastic Gradient Descent. (arXiv:2207.00389v2 [math.AP] UPDATED)
    In this paper we provide a novel approach to the analysis of kinetic models for label switching, which are used for particle systems that can randomly switch between gradient flows in different energy landscapes. Besides problems in biology and physics, we also demonstrate that stochastic gradient descent, the most popular technique in machine learning, can be understood in this setting when considering a time-continuous variant. Our analysis focuses on the case of evolution in a collection of external potentials, for which we provide analytical and numerical results about the evolution as well as the stationary problem.
    Residual-Quantile Adjustment for Adaptive Training of Physics-informed Neural Network. (arXiv:2209.05315v2 [cs.LG] UPDATED)
    Adaptive training methods for physics-informed neural networks (PINNs) require dedicated constructions of the distribution of weights assigned to each training sample. Efficiently finding such an optimal weight distribution is not a simple task, and most existing methods choose the adaptive weights based on approximating the full distribution or the maximum of the residuals. In this paper, we show that the bottleneck in the adaptive choice of samples for training efficiency is the behavior of the tail distribution of the numerical residual. We therefore propose the Residual-Quantile Adjustment (RQA) method for a better weight choice for each training sample. After initially setting the weights proportional to the $p$-th power of the residual, our RQA method reassigns all weights above the $q$-quantile ($90\%$, for example) to the median value, so that the weights follow a quantile-adjusted distribution derived from the residuals. Moreover, this iterative reweighting technique is very easy to implement. Experimental results show that the proposed method outperforms several adaptive methods on various partial differential equation (PDE) problems.
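    A minimal sketch of the quantile-adjustment step described above, in Python with NumPy; the function name, the default values of $p$ and $q$, and the toy residuals are illustrative, not the authors' code.

        import numpy as np

        def rqa_weights(residuals, p=2.0, q=0.9):
            # weights proportional to the p-th power of the residual
            w = np.abs(residuals) ** p
            # reassign all weights above the q-quantile to the median weight
            cutoff = np.quantile(w, q)
            w[w > cutoff] = np.median(w)
            # normalize to a sampling distribution over training points
            return w / w.sum()

        # toy usage: heavy-tailed residuals from 1000 training samples
        weights = rqa_weights(np.random.standard_cauchy(1000))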
    Equivariant maps from invariant functions. (arXiv:2209.14991v2 [stat.ML] UPDATED)
    In equivariant machine learning the idea is to restrict the learning to a hypothesis class where all the functions are equivariant with respect to some group action. Irreducible representations or invariant theory are typically used to parameterize the space of such functions. In this note, we explicate a general procedure, attributed to Malgrange, to express all polynomial maps between linear spaces that are equivariant with respect to the action of a group $G$, given a characterization of the invariant polynomials on a bigger space. The method also parametrizes smooth equivariant maps in the case that $G$ is a compact Lie group.
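    For orientation, the correspondence at the heart of the Malgrange procedure can be stated in one line (our paraphrase, not the authors' notation): a polynomial map $f: V \to W$ equivariant under $G$ is the same data as a $G$-invariant polynomial on the bigger space $V \oplus W^*$ that is linear in the $W^*$ factor, via

        $p(v, w^*) = \langle w^*, f(v) \rangle,$

    so characterizing the invariants on $V \oplus W^*$ and extracting the part linear in $w^*$ recovers every equivariant map.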
    Proximal Mean Field Learning in Shallow Neural Networks. (arXiv:2210.13879v2 [cs.LG] UPDATED)
    Recent mean field interpretations of learning dynamics in over-parameterized neural networks offer theoretical insights on the empirical success of first order optimization algorithms in finding global minima of the nonconvex risk landscape. In this paper, we explore applying mean field learning dynamics as a computational algorithm, rather than as an analytical tool. Specifically, we design a Sinkhorn regularized proximal algorithm to approximate the distributional flow from the learning dynamics in the mean field regime over weighted point clouds. In this setting, a contractive fixed point recursion computes the time-varying weights, numerically realizing the interacting Wasserstein gradient flow of the parameter distribution supported over the neuronal ensemble. An appealing aspect of the proposed algorithm is that the measure-valued recursions allow meshless computation. We demonstrate the proposed computational framework of interacting weighted particle evolution on binary and multi-class classification. Our algorithm performs gradient descent of the free energy associated with the risk functional.
    TASKED: Transformer-based Adversarial learning for human activity recognition using wearable sensors via Self-KnowledgE Distillation. (arXiv:2209.09092v2 [cs.CV] UPDATED)
    Wearable sensor-based human activity recognition (HAR) has emerged as a principal research area and is utilized in a variety of applications. Recently, deep learning-based methods have achieved significant improvement in the HAR field with the development of human-computer interaction applications. However, standard convolutional neural networks operate only within a local neighborhood, and correlations between sensors at different body positions are ignored. In addition, these methods still face significant performance degradation due to large gaps between the training and test data distributions and behavioral differences between subjects. In this work, we propose a novel Transformer-based Adversarial learning framework for human activity recognition using wearable sensors via Self-KnowledgE Distillation (TASKED), which accounts for individual sensor orientations and spatial and temporal features. The proposed method is capable of learning cross-domain embedding feature representations from datasets of multiple subjects, using adversarial learning and maximum mean discrepancy (MMD) regularization to align the data distributions over multiple domains. In addition, we adopt teacher-free self-knowledge distillation to improve the stability of the training procedure and the performance of human activity recognition. Experimental results show that TASKED not only outperforms state-of-the-art methods on four real-world public HAR datasets (alone or combined) but also improves subject generalization effectively.
    Applications of physics informed neural operators. (arXiv:2203.12634v2 [physics.comp-ph] UPDATED)
    We present an end-to-end framework to learn partial differential equations that brings together initial data production, selection of boundary conditions, and the use of physics-informed neural operators to solve partial differential equations that are ubiquitous in the study and modeling of physics phenomena. We first demonstrate that our methods reproduce the accuracy and performance of other neural operators published elsewhere in the literature to learn the 1D wave equation and the 1D Burgers equation. Thereafter, we apply our physics-informed neural operators to learn new types of equations, including the 2D Burgers equation in the scalar, inviscid and vector types. Finally, we show that our approach is also applicable to learn the physics of the 2D linear and nonlinear shallow water equations, which involve three coupled partial differential equations. We release our artificial intelligence surrogates and scientific software to produce initial data and boundary conditions to study a broad range of physically motivated scenarios. We provide the source code, an interactive website to visualize the predictions of our physics informed neural operators, and a tutorial for their use at the Data and Learning Hub for Science.
    Joint Entropy Search for Maximally-Informed Bayesian Optimization. (arXiv:2206.04771v4 [cs.LG] UPDATED)
    Information-theoretic Bayesian optimization techniques have become popular for optimizing expensive-to-evaluate black-box functions due to their non-myopic qualities. Entropy Search and Predictive Entropy Search both consider the entropy over the optimum in the input space, while the recent Max-value Entropy Search considers the entropy over the optimal value in the output space. We propose Joint Entropy Search (JES), a novel information-theoretic acquisition function that considers an entirely new quantity, namely the entropy over the joint optimal probability density over both input and output space. To incorporate this information, we consider the reduction in entropy from conditioning on fantasized optimal input/output pairs. The resulting approach primarily relies on standard GP machinery and removes complex approximations typically associated with information-theoretic methods. With minimal computational overhead, JES shows superior decision-making, and yields state-of-the-art performance for information-theoretic approaches across a wide suite of tasks. As a light-weight approach with superior results, JES provides a new go-to acquisition function for Bayesian optimization.
    Learning Dynamic Abstract Representations for Sample-Efficient Reinforcement Learning. (arXiv:2210.01955v2 [cs.LG] UPDATED)
    In many real-world problems, the learning agent needs to learn a problem's abstractions and solution simultaneously. However, most such abstractions need to be designed and refined by hand for different problems and domains of application. This paper presents a novel top-down approach for constructing state abstractions while carrying out reinforcement learning. Starting with state variables and a simulator, it presents a novel domain-independent approach for dynamically computing an abstraction based on the dispersion of Q-values in abstract states as the agent continues acting and learning. Extensive empirical evaluation on multiple domains and problems shows that this approach automatically learns abstractions that are finely-tuned to the problem, yield powerful sample efficiency, and result in the RL agent significantly outperforming existing approaches.
    Metropolis Monte Carlo sampling: convergence, localization transition and optimality. (arXiv:2207.10488v3 [cond-mat.stat-mech] UPDATED)
    Among random sampling methods, Markov Chain Monte Carlo algorithms are foremost. Using a combination of analytical and numerical approaches, we study their convergence properties towards the steady state, within a random walk Metropolis scheme. We show that the deviations from the target steady-state distribution feature a localization transition as a function of the characteristic length of the attempted jumps defining the random walk. This transition drastically changes the error introduced by incomplete convergence, and discriminates between two regimes in which the relaxation mechanism is limited by diffusion and by rejection, respectively.
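    A minimal random walk Metropolis sketch in Python, exposing the jump-length parameter whose role the paper analyzes; the Gaussian target and the step size are illustrative.

        import numpy as np

        def metropolis_chain(log_target, x0, jump_scale, n_steps, seed=0):
            # random walk Metropolis: propose x + jump_scale * noise and
            # accept with probability min(1, target(x') / target(x))
            rng = np.random.default_rng(seed)
            x, samples = x0, []
            for _ in range(n_steps):
                proposal = x + jump_scale * rng.standard_normal()
                if np.log(rng.uniform()) < log_target(proposal) - log_target(x):
                    x = proposal  # accept; otherwise keep the current state
                samples.append(x)
            return np.array(samples)

        # toy usage: standard normal target; the convergence and localization
        # behavior studied in the paper are functions of jump_scale
        chain = metropolis_chain(lambda x: -0.5 * x ** 2, 0.0, jump_scale=1.0, n_steps=10_000)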
    Compositional Visual Generation with Composable Diffusion Models. (arXiv:2206.01714v5 [cs.CV] UPDATED)
    Large text-guided diffusion models, such as DALLE-2, are able to generate stunning photorealistic images given natural language descriptions. While such models are highly flexible, they struggle to understand the composition of certain concepts, such as confusing the attributes of different objects or relations between objects. In this paper, we propose an alternative structured approach for compositional generation using diffusion models. An image is generated by composing a set of diffusion models, with each of them modeling a certain component of the image. To do this, we interpret diffusion models as energy-based models in which the data distributions defined by the energy functions may be explicitly combined. The proposed method can generate scenes at test time that are substantially more complex than those seen in training, composing sentence descriptions, object relations, human facial attributes, and even generalizing to new combinations that are rarely seen in the real world. We further illustrate how our approach may be used to compose pre-trained text-guided diffusion models and generate photorealistic images containing all the details described in the input descriptions, including the binding of certain object attributes that have been shown difficult for DALLE-2. These results point to the effectiveness of the proposed method in promoting structured generalization for visual generation. Project page: https://energy-based-model.github.io/Compositional-Visual-Generation-with-Composable-Diffusion-Models/
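    The composition step can be sketched in a few lines: for a conjunction of concepts, the noise predictions of the individual diffusion models are combined relative to an unconditional prediction, mirroring a sum of energies. The array shapes, weights, and names below are illustrative stand-ins for real denoiser outputs.

        import numpy as np

        def composed_eps(eps_uncond, eps_conds, weights):
            # combined prediction: unconditional term plus each concept's
            # weighted (conditional - unconditional) contribution
            eps = eps_uncond.copy()
            for e_c, w in zip(eps_conds, weights):
                eps += w * (e_c - eps_uncond)
            return eps

        # toy usage with random stand-ins for two concept models
        e_u = np.random.randn(3, 64, 64)
        e_cs = [np.random.randn(3, 64, 64) for _ in range(2)]
        eps_hat = composed_eps(e_u, e_cs, weights=[1.0, 1.0])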
    Generative Pretraining for Black-Box Optimization. (arXiv:2206.10786v2 [cs.LG] UPDATED)
    Many problems in science and engineering involve optimizing an expensive black-box function over a high-dimensional space. For such black-box optimization (BBO) problems, we typically assume a small budget for online function evaluations, but also often have access to a fixed, offline dataset for pretraining. Prior approaches seek to utilize the offline data to approximate the function or its inverse but are not sufficiently accurate far from the data distribution. We propose BONET, a generative framework for pretraining a novel black-box optimizer using offline datasets. In BONET, we train an autoregressive model on fixed-length trajectories derived from an offline dataset. We design a sampling strategy to synthesize trajectories from offline data using a simple heuristic of rolling out monotonic transitions from low-fidelity to high-fidelity samples. Empirically, we instantiate BONET using a causally masked Transformer and evaluate it on Design-Bench, where we rank the best on average, outperforming state-of-the-art baselines.
    COIN++: Neural Compression Across Modalities. (arXiv:2201.12904v3 [cs.LG] UPDATED)
    Neural compression algorithms are typically based on autoencoders that require specialized encoder and decoder architectures for different data modalities. In this paper, we propose COIN++, a neural compression framework that seamlessly handles a wide range of data modalities. Our approach is based on converting data to implicit neural representations, i.e. neural functions that map coordinates (such as pixel locations) to features (such as RGB values). Then, instead of storing the weights of the implicit neural representation directly, we store modulations applied to a meta-learned base network as a compressed code for the data. We further quantize and entropy code these modulations, leading to large compression gains while reducing encoding time by two orders of magnitude compared to baselines. We empirically demonstrate the feasibility of our method by compressing various data modalities, from images and audio to medical and climate data.
    Leveraging Unlabeled Data to Track Memorization. (arXiv:2212.04461v1 [cs.LG])
    Deep neural networks may easily memorize noisy labels present in real-world data, which degrades their ability to generalize. It is therefore important to track and evaluate the robustness of models against noisy label memorization. We propose a metric, called susceptibility, to gauge such memorization for neural networks. Susceptibility is simple and easy to compute during training. Moreover, it does not require access to ground-truth labels and it only uses unlabeled data. We empirically show the effectiveness of our metric in tracking memorization on various architectures and datasets and provide theoretical insights into the design of the susceptibility metric. Finally, we show through extensive experiments on datasets with synthetic and real-world label noise that one can utilize susceptibility and the overall training accuracy to distinguish models that maintain a low memorization on the training set and generalize well to unseen clean data.
    BiOcularGAN: Bimodal Synthesis and Annotation of Ocular Images. (arXiv:2205.01536v3 [cs.CV] UPDATED)
    Current state-of-the-art segmentation techniques for ocular images are critically dependent on large-scale annotated datasets, which are labor-intensive to gather and often raise privacy concerns. In this paper, we present a novel framework, called BiOcularGAN, capable of generating synthetic large-scale datasets of photorealistic (visible light and near-infrared) ocular images, together with corresponding segmentation labels to address these issues. At its core, the framework relies on a novel Dual-Branch StyleGAN2 (DB-StyleGAN2) model that facilitates bimodal image generation, and a Semantic Mask Generator (SMG) component that produces semantic annotations by exploiting latent features of the DB-StyleGAN2 model. We evaluate BiOcularGAN through extensive experiments across five diverse ocular datasets and analyze the effects of bimodal data generation on image quality and the produced annotations. Our experimental results show that BiOcularGAN is able to produce high-quality matching bimodal images and annotations (with minimal manual intervention) that can be used to train highly competitive (deep) segmentation models (in a privacy-aware manner) that perform well across multiple real-world datasets. The source code for the BiOcularGAN framework is publicly available at https://github.com/dariant/BiOcularGAN.
    Optimistic Whittle Index Policy: Online Learning for Restless Bandits. (arXiv:2205.15372v2 [cs.LG] UPDATED)
    Restless multi-armed bandits (RMABs) extend multi-armed bandits to allow for stateful arms, where the state of each arm evolves restlessly with different transitions depending on whether that arm is pulled. Solving RMABs requires information on transition dynamics, which are often unknown upfront. To plan in RMAB settings with unknown transitions, we propose the first online learning algorithm based on the Whittle index policy, using an upper confidence bound (UCB) approach to learn transition dynamics. Specifically, we estimate confidence bounds of the transition probabilities and formulate a bilinear program to compute optimistic Whittle indices using these estimates. Our algorithm, UCWhittle, achieves sublinear $O(H \sqrt{T \log T})$ frequentist regret to solve RMABs with unknown transitions in $T$ episodes with a constant horizon $H$. Empirically, we demonstrate that UCWhittle leverages the structure of RMABs and the Whittle index policy solution to achieve better performance than existing online learning baselines across three domains, including one constructed via sampling from a real-world maternal and childcare dataset.
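    A sketch of the confidence-bound ingredient: empirical transition probabilities plus a Hoeffding-style optimism bonus, which would then feed the bilinear program for the optimistic Whittle indices. The bonus form and names are ours, not the paper's.

        import numpy as np

        def transition_ucb(counts, delta=0.05):
            # counts[s, a, s2] = number of observed (s, a) -> s2 transitions
            n_sa = counts.sum(axis=-1, keepdims=True)
            p_hat = counts / np.maximum(n_sa, 1)  # empirical estimate
            # Hoeffding-style bonus shrinking with the visit count
            bonus = np.sqrt(np.log(2.0 / delta) / (2.0 * np.maximum(n_sa, 1)))
            return p_hat, np.clip(p_hat + bonus, 0.0, 1.0)

        # toy usage: 2 states, 2 actions
        p_hat, p_upper = transition_ucb(np.random.randint(0, 20, size=(2, 2, 2)))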
    Predicting dominant hand from spatiotemporal context varying physiological data. (arXiv:2212.04077v1 [cs.LG])
    Health metrics from wrist-worn devices demand automatic dominant-hand prediction to maintain accurate operation. Such prediction would improve reliability, enhance the consumer experience, and encourage further development of healthcare applications. This paper aims to evaluate the use of physiological and spatiotemporal context information from a two-hand experiment to predict the wrist placement of a commercial smartwatch. The main contribution is a methodology for obtaining an effective model and features from low-sample-rate physiological sensors and a self-reported context survey. Results show effective dominant-hand prediction using data from a single subject under real-life conditions.
    Inexact bilevel stochastic gradient methods for constrained and unconstrained lower-level problems. (arXiv:2110.00604v2 [math.OC] UPDATED)
    Two-level stochastic optimization formulations have become instrumental in a number of machine learning contexts such as continual learning, neural architecture search, adversarial learning, and hyperparameter tuning. Practical stochastic bilevel optimization problems become challenging in optimization or learning scenarios where the number of variables is high or there are constraints. In this paper, we introduce a bilevel stochastic gradient method for bilevel problems with lower-level constraints. We also present a comprehensive convergence theory that covers all inexact calculations of the adjoint gradient (also called the hypergradient) and addresses both the lower-level unconstrained and constrained cases. To promote the use of bilevel optimization in large-scale learning, we introduce a practical bilevel stochastic gradient method (BSG-1) that does not require second-order derivatives and, in the lower-level unconstrained case, avoids any system solves and matrix-vector products.
    Weisfeiler and Leman go Machine Learning: The Story so far. (arXiv:2112.09992v2 [cs.LG] UPDATED)
    In recent years, algorithms and neural architectures based on the Weisfeiler-Leman algorithm, a well-known heuristic for the graph isomorphism problem, have emerged as a powerful tool for machine learning with graphs and relational data. Here, we give a comprehensive overview of the algorithm's use in a machine-learning setting, focusing on the supervised regime. We discuss the theoretical background, show how to use it for supervised graph and node representation learning, discuss recent extensions, and outline the algorithm's connection to (permutation-)equivariant neural architectures. Moreover, we give an overview of current applications and future directions to stimulate further research.
    Nonstationary Bandit Learning via Predictive Sampling. (arXiv:2205.01970v4 [cs.LG] UPDATED)
    Thompson sampling has proven effective across a wide range of stationary bandit environments. However, as we demonstrate in this paper, it can perform poorly when applied to nonstationary environments. We show that such failures are attributed to the fact that, when exploring, the algorithm does not differentiate actions based on how quickly the information acquired loses its usefulness due to nonstationarity. Building upon this insight, we propose predictive sampling, which extends Thompson sampling to do so. We establish a Bayesian regret bound and show that, in nonstationary bandit environments, the regret incurred by Thompson sampling can far exceed that of predictive sampling. We also present implementations of predictive sampling that scale to complex bandit environments of practical interest in a computationally tractable manner. Through simulations, we demonstrate that predictive sampling outperforms Thompson sampling and other state-of-the-art algorithms across a wide range of nonstationary bandit environments.
    Diffusion Probabilistic Modeling for Video Generation. (arXiv:2203.09481v5 [cs.CV] UPDATED)
    Denoising diffusion probabilistic models are a promising new class of generative models that mark a milestone in high-quality image generation. This paper showcases their ability to sequentially generate video, surpassing prior methods in perceptual and probabilistic forecasting metrics. We propose an autoregressive, end-to-end optimized video diffusion model inspired by recent advances in neural video compression. The model successively generates future frames by correcting a deterministic next-frame prediction using a stochastic residual generated by an inverse diffusion process. We compare this approach against five baselines on four datasets involving natural and simulation-based videos. We find significant improvements in terms of perceptual quality for all datasets. Furthermore, by introducing a scalable version of the Continuous Ranked Probability Score (CRPS) applicable to video, we show that our model also outperforms existing approaches in their probabilistic frame forecasting ability.
    Improving Named Entity Recognition by External Context Retrieving and Cooperative Learning. (arXiv:2105.03654v3 [cs.CL] UPDATED)
    Recent advances in Named Entity Recognition (NER) show that document-level contexts can significantly improve model performance. In many application scenarios, however, such contexts are not available. In this paper, we propose to find external contexts of a sentence by retrieving and selecting a set of semantically relevant texts through a search engine, with the original sentence as the query. We find empirically that the contextual representations computed on the retrieval-based input view, constructed through the concatenation of a sentence and its external contexts, can achieve significantly improved performance compared to the original input view based only on the sentence. Furthermore, we can improve the model performance of both input views by Cooperative Learning, a training method that encourages the two input views to produce similar contextual representations or output label distributions. Experiments show that our approach can achieve new state-of-the-art performance on 8 NER data sets across 5 domains.
    PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers. (arXiv:2111.12710v3 [cs.CV] UPDATED)
    This paper explores a better prediction target for BERT pre-training of vision transformers. We observe that current prediction targets disagree with human perception judgment. This contradiction motivates us to learn a perceptual prediction target. We argue that perceptually similar images should stay close to each other in the prediction target space. We surprisingly find one simple yet effective idea: enforcing perceptual similarity during the dVAE training. Moreover, we adopt a self-supervised transformer model for deep feature extraction and show that it works well for calculating perceptual similarity. We demonstrate that such learned visual tokens indeed exhibit better semantic meanings, and help pre-training achieve superior transfer performance in various downstream tasks. For example, we achieve $\textbf{84.5\%}$ Top-1 accuracy on ImageNet-1K with ViT-B backbone, outperforming the competitive method BEiT by $\textbf{+1.3\%}$ under the same pre-training epochs. Our approach also gets significant improvement on object detection and segmentation on COCO and semantic segmentation on ADE20K. Equipped with a larger backbone ViT-H, we achieve the state-of-the-art ImageNet accuracy (\textbf{88.3\%}) among methods using only ImageNet-1K data.
    Task Bias in Vision-Language Models. (arXiv:2212.04412v1 [cs.CV])
    Incidental supervision from language has become a popular approach for learning generic visual representations that can be prompted to perform many recognition tasks in computer vision. We conduct an in-depth exploration of the CLIP model and show that its visual representation is often strongly biased towards solving some tasks more than others. Moreover, which task the representation will be biased towards is unpredictable, with little consistency across images. To resolve this task bias, we show how to learn a visual prompt that guides the representation towards features relevant to their task of interest. Our results show that these visual prompts can be independent of the input image and still effectively provide a conditioning mechanism to steer visual representations towards the desired task.
    Physics-constrained deep learning postprocessing of temperature and humidity. (arXiv:2212.04487v1 [physics.ao-ph])
    Weather forecasting centers currently rely on statistical postprocessing methods to minimize forecast error. This improves skill but can lead to predictions that violate physical principles or disregard dependencies between variables, which can be problematic for downstream applications and for the trustworthiness of postprocessing models, especially when they are based on new machine learning approaches. Building on recent advances in physics-informed machine learning, we propose to achieve physical consistency in deep learning-based postprocessing models by integrating meteorological expertise in the form of analytic equations. Applied to the postprocessing of surface weather in Switzerland, we find that constraining a neural network to enforce thermodynamic state equations yields physically consistent predictions of temperature and humidity without compromising performance. Our approach is especially advantageous when data is scarce, and our findings suggest that incorporating domain expertise into postprocessing models makes it possible to optimize weather forecast information while satisfying application-specific requirements.
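    One soft-constraint reading of the approach, sketched generically: the training loss adds a penalty on violations of an analytic relation $g(\hat{y}) = 0$ among the predicted variables. The residual function below is a placeholder for the thermodynamic state equations, which we do not reproduce here.

        import numpy as np

        def physics_constrained_loss(pred, target, physics_residual, lam=1.0):
            # data-fit term plus a penalty on constraint violations g(pred) = 0
            mse = np.mean((pred - target) ** 2)
            penalty = np.mean(physics_residual(pred) ** 2)
            return mse + lam * penalty

        # toy usage: penalize predictions whose two columns should sum to one
        loss = physics_constrained_loss(
            np.random.rand(8, 2), np.random.rand(8, 2),
            physics_residual=lambda y: y.sum(axis=1) - 1.0)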
    Differentiable Network Pruning for Microcontrollers. (arXiv:2110.08350v3 [cs.LG] UPDATED)
    Embedded and personal IoT devices are powered by microcontroller units (MCUs), whose extreme resource scarcity is a major obstacle for applications relying on on-device deep learning inference. Orders of magnitude less storage, memory and computational capacity, compared to what is typically required to execute neural networks, impose strict structural constraints on the network architecture and call for specialist model compression methodology. In this work, we present a differentiable structured network pruning method for convolutional neural networks, which integrates a model's MCU-specific resource usage and parameter importance feedback to obtain highly compressed yet accurate classification models. Our methodology (a) improves key resource usage of models by up to 80x; (b) prunes iteratively while a model is trained, resulting in little to no overhead or even improved training time; (c) produces compressed models with matching or improved resource usage (up to 1.4x) in less time than prior MCU-specific methods. Compressed models are available for download.
    Shapley values for cluster importance: How clusters of the training data affect a prediction. (arXiv:2012.03625v2 [stat.ML] UPDATED)
    This paper proposes a novel approach to explaining the predictions made by data-driven methods. Since such predictions rely heavily on the data used for training, explanations that convey information about how the training data affects the predictions are useful. Specifically, we quantify how different clusters of the training data affect a prediction. The quantification is based on Shapley values, a concept from coalitional game theory developed to fairly distribute a payout among a set of cooperating players. A player's Shapley value is a measure of that player's contribution. Shapley values are often used to quantify feature importance, i.e., how features affect a prediction. This paper extends this to cluster importance, letting clusters of the training data act as players in a game where the predictions are the payouts. The proposed methodology lets us explore and investigate how different clusters of the training data affect the predictions made by any black-box model, allowing new aspects of the reasoning and inner workings of a prediction model to be conveyed to the users. The methodology is fundamentally different from existing explanation methods, providing insight which would not be available otherwise, and should complement existing explanation methods, including explanations based on feature importance.
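    A brute-force sketch of the proposal with clusters as players: a cluster's Shapley value averages its marginal effect on the prediction over all coalitions of the other clusters. The value function below is a placeholder for "retrain on these clusters and return the prediction being explained"; exact enumeration is only feasible for a handful of clusters.

        from itertools import combinations
        from math import factorial

        def cluster_shapley(clusters, value):
            # value(coalition) -> prediction of a model trained on that
            # subset of clusters; clusters are hashable identifiers
            n = len(clusters)
            phi = {c: 0.0 for c in clusters}
            for c in clusters:
                others = [d for d in clusters if d != c]
                for k in range(n):
                    for coal in combinations(others, k):
                        s = frozenset(coal)
                        weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                        phi[c] += weight * (value(s | {c}) - value(s))
            return phi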
    VideoDex: Learning Dexterity from Internet Videos. (arXiv:2212.04498v1 [cs.RO])
    To build general robotic agents that can operate in many environments, it is often imperative for the robot to collect experience in the real world. However, this is often not feasible due to safety, time, and hardware restrictions. We thus propose leveraging the next best thing as real-world experience: internet videos of humans using their hands. Visual priors, such as visual features, are often learned from videos, but we believe that more information from videos can be utilized as a stronger prior. We build a learning algorithm, VideoDex, that leverages visual, action, and physical priors from human video datasets to guide robot behavior. These actions and physical priors in the neural network dictate the typical human behavior for a particular robot task. We test our approach on a robot arm and dexterous hand-based system and show strong results on various manipulation tasks, outperforming various state-of-the-art methods. Videos at https://video-dex.github.io
    Analysis of Drug repurposing Knowledge graphs for Covid-19. (arXiv:2212.03911v1 [cs.AI])
    A knowledge graph (KG) represents data in terms of entities and structural relations between them. This representation can be used to solve complex problems such as recommendation systems and question answering. In this study, a set of candidate drugs for COVID-19 is proposed using the Drug Repurposing Knowledge Graph (DRKG), a biological knowledge graph constructed from a vast amount of open-source biomedical knowledge to understand the mechanisms of compounds and the related biological functions. Node and relation embeddings are learned using knowledge graph embedding models as well as neural-network- and attention-based models; different models are used to obtain the node embeddings by changing the objective of the model. These embeddings are then used to predict whether a candidate drug is effective in treating a disease, or how likely a drug is to bind to a protein associated with a disease, which can be modelled as a link prediction task between two nodes. RESCAL performed best on the test dataset in terms of MR, MRR and Hits@3.
    An Interpretable Model of Climate Change Using Correlative Learning. (arXiv:2212.04478v1 [physics.ao-ph])
    Determining changes in global temperature and precipitation that may indicate climate change is complicated by annual variations. One approach for finding potential climate change indicators is to train a model that predicts the year from annual means of global temperature and precipitation. Such data is available from the CMIP6 ensemble of simulations. Here a two-hidden-layer neural network trained on this data successfully predicts the year. Differences among the temperature and precipitation patterns for which the model predicts specific years reveal changes through time. To find these optimal patterns, a new way of interpreting what the neural network has learned is explored. Alopex, a stochastic correlative learning algorithm, is used to find optimal temperature and precipitation maps that best predict a given year. These maps are compared over multiple years to show how the temperature and precipitation patterns indicative of each year change over time.
    Spatio-Temporal Self-Supervised Learning for Traffic Flow Prediction. (arXiv:2212.04475v1 [cs.LG])
    Robust prediction of citywide traffic flows at different time periods plays a crucial role in intelligent transportation systems. While previous work has made great efforts to model spatio-temporal correlations, existing methods still suffer from two key limitations: i) Most models collectively predict all regions' flows without accounting for spatial heterogeneity, i.e., different regions may have skewed traffic flow distributions. ii) These models fail to capture the temporal heterogeneity induced by time-varying traffic patterns, as they typically model temporal correlations with a shared parameterized space for all time periods. To tackle these challenges, we propose a novel Spatio-Temporal Self-Supervised Learning (ST-SSL) traffic prediction framework which enhances the traffic pattern representations to be reflective of both spatial and temporal heterogeneity, with auxiliary self-supervised learning paradigms. Specifically, our ST-SSL is built over an integrated module with temporal and spatial convolutions for encoding the information across space and time. To achieve the adaptive spatio-temporal self-supervised learning, our ST-SSL first performs the adaptive augmentation over the traffic flow graph data at both attribute- and structure-levels. On top of the augmented traffic graph, two SSL auxiliary tasks are constructed to supplement the main traffic prediction task with spatial and temporal heterogeneity-aware augmentation. Experiments on four benchmark datasets demonstrate that ST-SSL consistently outperforms various state-of-the-art baselines. Since spatio-temporal heterogeneity widely exists in practical datasets, the proposed framework may also cast light on other spatial-temporal applications. Model implementation is available at https://github.com/Echo-Ji/ST-SSL.
    Multi-Concept Customization of Text-to-Image Diffusion. (arXiv:2212.04488v1 [cs.CV])
    While generative models produce high-quality images of concepts learned from a large-scale database, a user often wishes to synthesize instantiations of their own concepts (for example, their family, pets, or items). Can we teach a model to quickly acquire a new concept, given a few examples? Furthermore, can we compose multiple new concepts together? We propose Custom Diffusion, an efficient method for augmenting existing text-to-image models. We find that only optimizing a few parameters in the text-to-image conditioning mechanism is sufficiently powerful to represent new concepts while enabling fast tuning (~6 minutes). Additionally, we can jointly train for multiple concepts or combine multiple fine-tuned models into one via closed-form constrained optimization. Our fine-tuned model generates variations of multiple, new concepts and seamlessly composes them with existing concepts in novel settings. Our method outperforms several baselines and concurrent works, regarding both qualitative and quantitative evaluations, while being memory and computationally efficient.
    Spatio-Temporal Super-Resolution of Dynamical Systems using Physics-Informed Deep-Learning. (arXiv:2212.04457v1 [cs.LG])
    This work presents a physics-informed deep learning-based super-resolution framework to enhance the spatio-temporal resolution of the solution of time-dependent partial differential equations (PDE). Prior works on deep learning-based super-resolution models have shown promise in accelerating engineering design by reducing the computational expense of traditional numerical schemes. However, these models heavily rely on the availability of high-resolution (HR) labeled data needed during training. In this work, we propose a physics-informed deep learning-based framework to enhance the spatial and temporal resolution of coarse-scale (both in space and time) PDE solutions without requiring any HR data. The framework consists of two trainable modules independently super-resolving the PDE solution, first in the spatial and then in the temporal direction. The physics-based losses are implemented in a novel way to ensure tight coupling between the spatio-temporally refined outputs at different times and improve framework accuracy. We analyze the capability of the developed framework by investigating its performance on an elastodynamics problem. It is observed that the proposed framework can successfully super-resolve (both in space and time) the low-resolution PDE solutions while satisfying physics-based constraints and yielding high accuracy. Furthermore, the analysis and obtained speed-up show that the proposed framework is well-suited for integration with traditional numerical methods to reduce computational complexity during engineering design.
    Three Variations on Variational Autoencoders. (arXiv:2212.04451v1 [cs.LG])
    Variational autoencoders (VAEs) are one class of generative probabilistic latent-variable models designed for inference based on known data. We develop three variations on VAEs by introducing a second parameterized encoder/decoder pair and, for one variation, an additional fixed encoder. The parameters of the encoders/decoders are to be learned with a neural network. The fixed encoder is obtained by probabilistic-PCA. The variations are compared to the Evidence Lower Bound (ELBO) approximation to the original VAE. One variation leads to an Evidence Upper Bound (EUBO) that can be used in conjunction with the original ELBO to interrogate the convergence of the VAE.
    A Distributed Block Chebyshev-Davidson Algorithm for Parallel Spectral Clustering. (arXiv:2212.04443v1 [cs.LG])
    We develop a distributed Block Chebyshev-Davidson algorithm to solve large-scale leading eigenvalue problems for spectral analysis in spectral clustering. First, the efficiency of the Chebyshev-Davidson algorithm relies on the prior knowledge of the eigenvalue spectrum, which could be expensive to estimate. This issue can be lessened by the analytic spectrum estimation of the Laplacian or normalized Laplacian matrices in spectral clustering, making the proposed algorithm very efficient for spectral clustering. Second, to make the proposed algorithm capable of analyzing big data, a distributed and parallel version has been developed with attractive scalability. The speedup by parallel computing is approximately equivalent to $\sqrt{p}$, where $p$ denotes the number of processes. Numerical results will be provided to demonstrate its efficiency and advantage over existing algorithms in both sequential and parallel computing.
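    The Chebyshev filtering step at the core of the (serial) algorithm can be sketched as a three-term recurrence applied to a block of vectors; the interval $[a, b]$ to be damped comes from the analytic spectrum estimate discussed above. The distributed version is not shown.

        import numpy as np

        def chebyshev_filter(A, X, degree, a, b):
            # apply a degree-m Chebyshev polynomial in A, scaled so the
            # spectrum inside [a, b] is damped and the rest is amplified
            e, c = (b - a) / 2.0, (b + a) / 2.0
            Y = (A @ X - c * X) / e          # first-order term
            X_prev = X
            for _ in range(2, degree + 1):
                Y_new = 2.0 * (A @ Y - c * Y) / e - X_prev  # three-term recurrence
                X_prev, Y = Y, Y_new
            return Y

        # toy usage: filter a random block against the upper spectrum of a
        # Laplacian-like diagonal matrix
        L = np.diag(np.arange(1.0, 101.0))
        V = chebyshev_filter(L, np.random.randn(100, 5), degree=8, a=20.0, b=100.0)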
    Improved Deep Neural Network Generalization Using m-Sharpness-Aware Minimization. (arXiv:2212.04343v1 [cs.LG])
    Modern deep learning models are over-parameterized, where the optimization setup strongly affects the generalization performance. A key element of reliable optimization for these systems is the modification of the loss function. Sharpness-Aware Minimization (SAM) modifies the underlying loss function to guide descent methods towards flatter minima, which arguably have better generalization abilities. In this paper, we focus on a variant of SAM known as mSAM, which, during training, averages the updates generated by adversarial perturbations across several disjoint shards of a mini-batch. Recent work suggests that mSAM can outperform SAM in terms of test accuracy. However, a comprehensive empirical study of mSAM is missing from the literature -- previous results have mostly been limited to specific architectures and datasets. To that end, this paper presents a thorough empirical evaluation of mSAM on various tasks and datasets. We provide a flexible implementation of mSAM and compare the generalization performance of mSAM to the performance of SAM and vanilla training on different image classification and natural language processing tasks. We also conduct careful experiments to understand the computational cost of training with mSAM, its sensitivity to hyperparameters and its correlation with the flatness of the loss landscape. Our analysis reveals that mSAM yields superior generalization performance and flatter minima, compared to SAM, across a wide range of tasks without significantly increasing computational costs.
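    A framework-agnostic sketch of the mSAM update as described: the SAM ascent perturbation is computed independently on each of the m disjoint shards of a mini-batch, and the resulting gradients are averaged. grad_fn, rho, and the toy least-squares problem are illustrative.

        import numpy as np

        def msam_gradient(params, batch, grad_fn, rho=0.05, m=4):
            # grad_fn(params, data) -> gradient of the loss w.r.t. params
            total = np.zeros_like(params)
            for shard in np.array_split(batch, m):
                g = grad_fn(params, shard)
                eps = rho * g / (np.linalg.norm(g) + 1e-12)  # SAM ascent step
                total += grad_fn(params + eps, shard)        # sharpness-aware grad
            return total / m

        # toy usage: least-squares loss on random data
        rng = np.random.default_rng(0)
        X, y = rng.standard_normal((128, 5)), rng.standard_normal(128)
        data = np.hstack([X, y[:, None]])
        grad = lambda w, d: 2 * d[:, :5].T @ (d[:, :5] @ w - d[:, 5]) / len(d)
        g = msam_gradient(np.zeros(5), data, grad)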
    Structure of Classifier Boundaries: Case Study for a Naive Bayes Classifier. (arXiv:2212.04382v1 [stat.ML])
    Whether based on models, training data or a combination, classifiers place (possibly complex) input data into one of a relatively small number of output categories. In this paper, we study the structure of the boundary--those points for which a neighbor is classified differently--in the context of an input space that is a graph, so that there is a concept of neighboring inputs. The scientific setting is a model-based naive Bayes classifier for DNA reads produced by Next Generation Sequencers. We show that the boundary is both large and complicated in structure. We create a new measure of uncertainty, called Neighbor Similarity, that compares the result for a point to the distribution of results for its neighbors. This measure not only tracks two inherent uncertainty measures for the Bayes classifier, but also can be implemented, at a computational cost, for classifiers without inherent measures of uncertainty.
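    A generic sketch of the flavor of Neighbor Similarity: compare a point's label to the distribution of labels over its neighbors in the input graph, so that low agreement flags proximity to the boundary. The exact measure in the paper is richer; the names here are ours.

        from collections import Counter

        def neighbor_agreement(point, neighbors, classify):
            # fraction of graph-neighbors receiving the same label as the point;
            # low values indicate the point sits near the classifier boundary
            label = classify(point)
            votes = Counter(classify(n) for n in neighbors)
            return votes[label] / max(len(neighbors), 1)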
    Position-Aware Subgraph Neural Networks with Data-Efficient Learning. (arXiv:2211.00572v2 [cs.LG] UPDATED)
    Data-efficient learning on graphs (GEL) is essential in real-world applications. Existing GEL methods focus on learning useful representations for nodes, edges, or entire graphs with ``small'' labeled data, but the problem of data-efficient learning for subgraph prediction has not been explored. The challenges of this problem lie in the following aspects: 1) It is crucial for subgraphs to learn positional features to acquire structural information in the base graph in which they exist; although the existing subgraph neural network method is capable of learning disentangled position encodings, its overall computational complexity is very high. 2) Prevailing graph augmentation methods for GEL, including rule-based, sample-based, adaptive, and automated methods, are not suitable for augmenting subgraphs, because a subgraph contains fewer nodes but richer information such as position, neighbors, and structure, making subgraph augmentation more susceptible to undesirable perturbations. 3) Only a small number of nodes in the base graph are contained in subgraphs, which leads to a potential ``bias'' problem: subgraph representation learning is dominated by these ``hot'' nodes, while the remaining nodes fail to be fully learned, reducing the generalization ability of subgraph representation learning. In this paper, we aim to address these challenges and propose a Position-Aware Data-Efficient Learning framework for subgraph neural networks called PADEL. Specifically, we propose a novel anchor-free node position encoding method, design a new generative subgraph augmentation method based on a diffused variational subgraph autoencoder, and propose exploratory and exploitable views for subgraph contrastive learning. Extensive experimental results on three real-world datasets show the superiority of our proposed method over state-of-the-art baselines.
    Fuzzy Rough Sets Based on Fuzzy Quantification. (arXiv:2212.04327v1 [cs.AI])
    One of the weaknesses of classical (fuzzy) rough sets is their sensitivity to noise, which is particularly undesirable for machine learning applications. One approach to solving this issue is to make use of fuzzy quantifiers, as done by the vaguely quantified fuzzy rough set (VQFRS) model. While this idea is intuitive, the VQFRS model suffers from both theoretical flaws and suboptimal performance in applications. In this paper, we improve on VQFRS by introducing fuzzy quantifier-based fuzzy rough sets (FQFRS), an intuitive generalization of fuzzy rough sets that makes use of general unary and binary quantification models. We show how several existing models fit in this generalization as well as how it inspires novel ones. Several binary quantification models are proposed to be used with FQFRS. We conduct a theoretical study of their properties, and investigate their potential by applying them to classification problems. In particular, we highlight Yager's Weighted Implication-based (YWI) binary quantification model, which induces a fuzzy rough set model that is both a significant improvement on VQFRS and a worthy competitor to the popular ordered weighted averaging based fuzzy rough set (OWAFRS) model.
    Robust Speech Recognition via Large-Scale Weak Supervision. (arXiv:2212.04356v1 [eess.AS])
    We study the capabilities of speech processing systems trained simply to predict large amounts of transcripts of audio on the internet. When scaled to 680,000 hours of multilingual and multitask supervision, the resulting models generalize well to standard benchmarks and are often competitive with prior fully supervised results but in a zero-shot transfer setting without the need for any fine-tuning. When compared to humans, the models approach their accuracy and robustness. We are releasing models and inference code to serve as a foundation for further work on robust speech processing.
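    Since the models and inference code are released, transcription can be exercised in a few lines; this assumes the openai-whisper package and a local audio file.

        import whisper  # pip install openai-whisper

        model = whisper.load_model("base")      # weights download on first use
        result = model.transcribe("audio.mp3")  # zero-shot transcription
        print(result["text"])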
    Designing with Non-Finite Output Dimension via Fourier Coefficients of Neural Waveforms. (arXiv:2212.04351v1 [cs.LG])
    Ordinary Deep Learning models require having the dimension of their outputs determined by a human practitioner prior to training and operation. For design tasks, this places a hard limit on the maximum complexity of any designs produced by a neural network, which is disadvantageous if a greater allowance for complexity would result in better designs. In this paper, we introduce a methodology for taking outputs of non-finite dimension from neural networks, by learning a "neural waveform," and then taking as outputs the coefficients of its Fourier series representation. We then present experimental evidence that neural networks can learn in this setting on a toy problem.
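    A sketch of the output mechanism: the network emits samples of a "neural waveform", and the design is read off as Fourier-series coefficients of that waveform, so the number of coefficients (the effective output dimension) need not be fixed before training. The tiny random "network output" below is a stand-in.

        import numpy as np

        def waveform_to_coefficients(samples, n_coeffs):
            # treat the output as one period of a waveform and return its
            # leading Fourier-series coefficients c_0, ..., c_{n_coeffs-1}
            spectrum = np.fft.rfft(samples) / len(samples)
            return spectrum[:n_coeffs]

        # toy usage: a 256-sample "neural waveform"
        wave = np.tanh(np.random.default_rng(1).standard_normal(256))
        coeffs = waveform_to_coefficients(wave, n_coeffs=16)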
    Power Consumption Modeling of 5G Multi-Carrier Base Stations: A Machine Learning Approach. (arXiv:2212.04318v1 [cs.NI])
    The fifth generation of the Radio Access Network (RAN) has brought new services, technologies, and paradigms with the corresponding societal benefits. However, the energy consumption of 5G networks is today a concern. In recent years, the design of new methods for decreasing the RAN power consumption has attracted interest from both the research community and standardization bodies, and many energy-saving solutions have been proposed. However, there is still a need to understand the power consumption behavior of state-of-the-art base station architectures, such as multi-carrier active antenna units (AAUs), as well as the impact of different network parameters. In this paper, we present a power consumption model for 5G AAUs based on artificial neural networks. We demonstrate that this model achieves good estimation performance, and that it is able to capture the benefits of energy saving when dealing with the complexity of multi-carrier base station architectures. Importantly, multiple experiments are carried out to show the advantage of designing a general model able to capture the power consumption behaviors of different types of AAUs. Finally, we provide an analysis of the model scalability and the training data requirements.
    Encrypted machine learning of molecular quantum properties. (arXiv:2212.04322v1 [cs.CR])
    Large machine learning models with improved predictions have become widely available in the chemical sciences. Unfortunately, these models do not protect the privacy necessary within commercial settings, prohibiting the use of potentially extremely valuable data by others. Encrypting the prediction process can solve this problem by double-blind model evaluation, and prohibits the extraction of training or query data. However, contemporary ML models based on fully homomorphic encryption or federated learning are either too expensive for practical use or have to trade higher speed for weaker security. We have implemented secure and computationally feasible encrypted machine learning models using oblivious transfer, enabling secure predictions of molecular quantum properties across chemical compound space. However, we find that encrypted predictions using kernel ridge regression models are a million times more expensive than without encryption. This demonstrates a dire need for a compact machine learning model architecture, including molecular representation and kernel matrix size, that minimizes model evaluation costs.
    Multi-Task Option Learning and Discovery for Stochastic Path Planning. (arXiv:2210.00068v2 [cs.LG] UPDATED)
    This paper addresses the problem of reliably and efficiently solving broad classes of long-horizon stochastic path planning problems. Starting with a vanilla RL formulation with a stochastic dynamics simulator and an occupancy matrix of the environment, our approach computes useful options with policies as well as high-level paths that compose the discovered options. Our main contributions are (1) data-driven methods for creating abstract states that serve as endpoints for helpful options, (2) methods for computing option policies using auto-generated option guides in the form of dense pseudo-reward functions, and (3) an overarching algorithm for composing the computed options. We show that this approach yields strong guarantees of executability and solvability: under fairly general conditions, the computed option guides lead to composable option policies and consequently ensure downward refinability. Empirical evaluation on a range of robots, environments, and tasks shows that this approach effectively transfers knowledge across related tasks and that it outperforms existing approaches by a significant margin.
    ConsistTL: Modeling Consistency in Transfer Learning for Low-Resource Neural Machine Translation. (arXiv:2212.04262v1 [cs.CL])
    Transfer learning is a simple and powerful method that can be used to boost model performance of low-resource neural machine translation (NMT). Existing transfer learning methods for NMT are static, which simply transfer knowledge from a parent model to a child model once via parameter initialization. In this paper, we propose a novel transfer learning method for NMT, namely ConsistTL, which can continuously transfer knowledge from the parent model during the training of the child model. Specifically, for each training instance of the child model, ConsistTL constructs the semantically-equivalent instance for the parent model and encourages prediction consistency between the parent and child for this instance, which is equivalent to the child model learning each instance under the guidance of the parent model. Experimental results on five low-resource NMT tasks demonstrate that ConsistTL results in significant improvements over strong transfer learning baselines, with a gain up to 1.7 BLEU over the existing back-translation model on the widely-used WMT17 Turkish-English benchmark. Further analysis reveals that ConsistTL can improve the inference calibration of the child model. Code and scripts are freely available at https://github.com/NLP2CT/ConsistTL.
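    The consistency signal can be pictured as a divergence between the parent's and child's predictive distributions on matched instances; a plain-array sketch, where the KL direction and the names are our choices, not the paper's.

        import numpy as np

        def consistency_loss(child_probs, parent_probs, eps=1e-9):
            # KL(parent || child), averaged over positions: the child is
            # pulled toward the parent's predictions on the matched instance
            p = np.clip(parent_probs, eps, 1.0)
            q = np.clip(child_probs, eps, 1.0)
            return float(np.mean(np.sum(p * np.log(p / q), axis=-1)))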
    Deep Variational Inverse Scattering. (arXiv:2212.04309v1 [cs.LG])
    Inverse medium scattering solvers generally reconstruct a single solution without an associated measure of uncertainty. This is true both for the classical iterative solvers and for the emerging deep learning methods. But ill-posedness and noise can make this single estimate inaccurate or misleading. While deep networks such as conditional normalizing flows can be used to sample posteriors in inverse problems, they often yield low-quality samples and uncertainty estimates. In this paper, we propose U-Flow, a Bayesian U-Net based on conditional normalizing flows, which generates high-quality posterior samples and estimates physically-meaningful uncertainty. We show that the proposed model significantly outperforms the recent normalizing flows in terms of posterior sample quality while having comparable performance with the U-Net in point estimation.
    Device identification using optimized digital footprints. (arXiv:2212.04354v1 [cs.CR])
    The rapidly increasing number of internet of things (IoT) and non-IoT devices has imposed new security challenges on network administrators, making accurate device identification in increasingly complex network structures necessary. In this paper, a device fingerprinting (DFP) method is proposed for device identification, based on the digital footprints that devices leave when communicating over a network. A subset of nine features has been selected from the network and transport layers of a single transmission control protocol/internet protocol packet, based on attribute evaluators in Weka, to generate device-specific signatures. The method has been evaluated on two online datasets and an experimental dataset, using different supervised machine learning (ML) algorithms. Results show that the method is able to distinguish device type with up to 100% precision using the random forest (RF) classifier, and to classify individual devices with up to 95.7% precision. These results demonstrate the applicability of the proposed DFP method for device identification, in order to provide a more secure and robust network.
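    A minimal reproduction of the classification setup with scikit-learn, assuming the nine header-derived features are already extracted into a table; the synthetic data and label count below are placeholders.

        import numpy as np
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.model_selection import train_test_split

        # placeholder: 9 features per packet from the network/transport
        # layers, with one device label per packet
        rng = np.random.default_rng(0)
        X = rng.standard_normal((1000, 9))
        y = rng.integers(0, 5, size=1000)  # e.g., 5 device types

        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
        clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
        print(f"device-type accuracy: {clf.score(X_te, y_te):.3f}")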
    ChromaCorrect: Prescription Correction in Virtual Reality Headsets through Perceptual Guidance. (arXiv:2212.04264v1 [cs.HC])
    A large portion of today's world population suffers from vision impairments and wears prescription eyeglasses. However, eyeglasses cause additional bulk and discomfort when used with augmented and virtual reality headsets, negatively impacting the viewer's visual experience. In this work, we remedy the usage of prescription eyeglasses in Virtual Reality (VR) headsets by shifting the optical complexity completely into software, and propose a prescription-aware rendering approach for providing sharper and more immersive VR imagery. To this end, we develop a differentiable display and visual perception model encapsulating display-specific parameters, the color perception and visual acuity of the human visual system, and the user-specific refractive errors. Using this differentiable visual perception model, we optimize the rendered imagery in the display using stochastic gradient-descent solvers. This way, we provide sharper, prescription-glasses-free images for a person with vision impairments. We evaluate our approach on various displays, including desktops and VR headsets, and show significant quality and contrast improvements for users with vision impairments.
    A 65nm 8b-Activation 8b-Weight SRAM-Based Charge-Domain Computing-in-Memory Macro Using A Fully-Parallel Analog Adder Network and A Single-ADC Interface. (arXiv:2212.04320v1 [cs.AR])
    In the von Neumann architecture, data-intensive tasks struggle to achieve both high performance and power efficiency due to the memory-wall bottleneck. Computing-in-memory (CiM) is a promising mitigation approach that enables parallel in-situ multiply-accumulate (MAC) operations within the memory, with support from the peripheral interface and datapath. SRAM-based charge-domain CiM (CD-CiM) has shown its potential for enhanced power efficiency and computing accuracy. However, existing SRAM-based CD-CiM faces scaling challenges in meeting the throughput requirements of high-performance multi-bit-quantization applications. This paper presents an SRAM-based high-throughput ReLU-optimized CD-CiM macro. It is capable of completing MAC and ReLU operations on two signed 8-bit vectors in one CiM cycle with only one A/D conversion. Along with non-linearity compensation for the analog computing and A/D conversion interfaces, this work achieves 51.2 GOPS throughput and 10.3 TOPS/W energy efficiency, with 88.6% accuracy on the CIFAR-10 dataset.
    A Modality-level Explainable Framework for Misinformation Checking in Social Networks. (arXiv:2212.04272v1 [cs.LG])
    The widespread dissemination of false information is a rising concern worldwide with critical social impact, inspiring the emergence of fact-checking organizations to mitigate it. However, human-driven verification is time-consuming and becomes a bottleneck: trustworthy information cannot be checked at the same pace at which claims emerge. Since misinformation relates not only to the content itself but also to other social features, this paper addresses automatic misinformation checking in social networks from a multimodal perspective. Moreover, as simply labeling a piece of news as incorrect may not convince the citizen and, even worse, may strengthen confirmation bias, we propose a modality-level, explainability-oriented misinformation classification framework. Our framework comprises a misinformation classifier assisted by explainable methods that generate modality-oriented explainable inferences. Preliminary findings show that the misinformation classifier does benefit from multimodal information encoding, and that the modality-oriented explainable mechanism increases both the interpretability and the completeness of the inferences.
    Counterfactuals for the Future. (arXiv:2212.03974v1 [cs.AI])
    Counterfactuals are often described as 'retrospective,' focusing on hypothetical alternatives to a realized past. This description reflects an often implicit assumption about the structure and stability of exogenous variables in the system being modeled -- an assumption that is reasonable in many settings where counterfactuals are used. In this work, we consider cases where we might reasonably make a different assumption about exogenous variables, namely, that the exogenous noise terms of each unit do exhibit some unit-specific structure and/or stability. This leads us to a different use of counterfactuals -- a 'forward-looking' rather than 'retrospective' counterfactual. We introduce "counterfactual treatment choice," a type of treatment choice problem that motivates the use of forward-looking counterfactuals. We then explore how mismatches between interventional and forward-looking counterfactual approaches to treatment choice, consistent with different assumptions about exogenous noise, can lead to counterintuitive results.
    Vicious Classifiers: Data Reconstruction Attack at Inference Time. (arXiv:2212.04223v1 [cs.LG])
    Privacy-preserving inference via edge or encrypted computing paradigms encourages users of machine learning services to confidentially run a model on their personal data for a target task and to share only the model's outputs with the service provider, e.g., to activate further services. Nevertheless, despite all confidentiality efforts, we show that a ''vicious'' service provider can approximately reconstruct its users' personal data by observing only the model's outputs, while keeping the target utility of the model very close to that of an ''honest'' service provider. We show the possibility of jointly training a target model (to be run on the users' side) and an attack model for data reconstruction (to be secretly used on the server's side). We introduce the ''reconstruction risk'': a new measure for assessing the quality of reconstructed data that better captures the privacy risk of such attacks. Experimental results on 6 benchmark datasets show that for low-complexity data types, or for tasks with a larger number of classes, a user's personal data can be approximately reconstructed from the outputs of a single target inference task. We propose a potential defense mechanism that helps to distinguish vicious from honest classifiers at inference time. We conclude this paper by discussing current challenges and open directions for future studies. We open-source our code and results as a benchmark for future work.
    Self-training via Metric Learning for Source-Free Domain Adaptation of Semantic Segmentation. (arXiv:2212.04227v1 [cs.CV])
    Unsupervised source-free domain adaptation methods aim to train a model for the target domain using a pretrained source-domain model and unlabeled target-domain data, where the source data may not be accessible due to intellectual property or privacy issues. These methods frequently utilize self-training with pseudo-labels thresholded by prediction confidence. In a source-free scenario, however, the only supervision comes from the target data, and thresholding limits the contribution of self-training. In this study, we utilize self-training with a mean-teacher approach. The student network is trained with all predictions of the teacher network. Instead of thresholding the predictions, the gradients calculated from the pseudo-labels are weighted based on the reliability of the teacher's predictions. We propose a novel method that uses proxy-based metric learning to estimate this reliability. We train a metric network on the encoder features of the teacher network. Since the teacher is updated with a moving average, the encoder feature space changes slowly; the metric network can therefore be updated during training, which enables end-to-end training. We also propose a metric-based online ClassMix method to augment the input of the student network, where the patches to be mixed are chosen based on the metric reliability. We evaluate our method in synthetic-to-real and cross-city scenarios. The benchmarks show that our method significantly outperforms existing state-of-the-art methods.
    A probabilistic autoencoder for causal discovery. (arXiv:2212.04235v1 [stat.ML])
    The paper addresses the problem of finding the causal direction between two associated variables. The proposed solution is to build an autoencoder of their joint distribution and to maximize its estimation capacity relative to both marginal distributions. It is shown that the resulting two capacities cannot, in general, be equal. This leads to a new criterion for causal discovery: the higher capacity is consistent with the unconstrained choice of a distribution representing the cause, while the lower capacity reflects the constraints imposed by the mechanism on the distribution of the effect. Estimation capacity is defined as the ability of the autoencoder to represent arbitrary datasets. A regularization term forces the autoencoder to decide which of the two variables to model in a more generic way, i.e., while maintaining higher model capacity. The causal direction is thus revealed by the constraints encountered while encoding the data, instead of being measured as a property of the data itself. The idea is implemented and tested using a restricted Boltzmann machine.
    Customizing Number Representation and Precision. (arXiv:2212.04184v1 [cs.AR])
    There is growing interest in the use of reduced-precision arithmetic, intensified by the recent surge of interest in artificial intelligence, especially deep learning. Most architectures already provide reduced-precision capabilities (e.g., 8-bit integer, 16-bit floating point). In the context of FPGAs, any number format and bit-width can be considered. In computer arithmetic, the representation of real numbers is a major issue. Fixed-point (FxP) and floating-point (FlP) are the main options for representing reals, each with its advantages and drawbacks. This chapter presents both FxP and FlP number representations and draws a fair comparison between their cost, performance and energy, as well as their impact on accuracy during computations. It is shown that the choice between FxP and FlP is not obvious and strongly depends on the application considered. In some cases, low-precision floating-point arithmetic can be the most effective and provides some benefits over the classical fixed-point choice for energy-constrained applications.
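    The trade-off discussed above is easy to make tangible: a fixed-point format keeps a constant absolute resolution of 2^-f, whereas floating point keeps a roughly constant relative resolution. A small sketch of FxP quantization follows; the format parameters are arbitrary illustrative choices.

```python
def to_fixed_point(x, frac_bits=8, total_bits=16):
    # Quantize a real to a signed FxP value with `frac_bits` fractional
    # bits (resolution 2**-frac_bits), saturating on overflow, then map
    # the integer code back to a float for inspection.
    scale = 1 << frac_bits
    lo, hi = -(1 << (total_bits - 1)), (1 << (total_bits - 1)) - 1
    q = max(lo, min(hi, round(x * scale)))
    return q / scale

print(to_fixed_point(3.14159))  # 3.140625, absolute error ~1e-3
```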
    Structure-Preserving Graph Representation Learning. (arXiv:2209.00793v2 [cs.LG] UPDATED)
    Though graph representation learning (GRL) has made significant progress, it is still a challenge to adequately extract and embed the rich topological structure and feature information of graphs. Most existing methods focus on local structure and fail to fully incorporate the global topological structure. To this end, we propose a novel Structure-Preserving Graph Representation Learning (SPGRL) method to fully capture the structure information of graphs. Specifically, to reduce the uncertainty and misinformation in the original graph, we construct a feature graph as a complementary view via the k-Nearest-Neighbor method. The feature graph can be contrasted with the original graph at the node level to capture local relations. Besides, we retain the global topological structure information by maximizing the mutual information (MI) between the whole graph and the feature embeddings, which theoretically reduces to exchanging the feature embeddings of the feature graph and the original graph to reconstruct themselves. Extensive experiments show that our method has quite superior performance on the semi-supervised node classification task and excellent robustness under noise perturbations of the graph structure or node features.
    GreenEyes: An Air Quality Evaluating Model based on WaveNet. (arXiv:2212.04175v1 [cs.LG])
    Alongside rapid industrialization, humans suffer from serious air pollution problems, and air quality prediction is becoming increasingly important to government policy-making and people's daily lives. In this paper, we propose GreenEyes -- a deep neural network model consisting of a WaveNet-based backbone block for learning representations of sequences and an LSTM with a Temporal Attention module for capturing the hidden interactions between features of multi-channel inputs. To evaluate the effectiveness of our proposed method, we carry out several experiments, including an ablation study, on air quality data collected and preprocessed near HKUST. The experimental results show that our model can effectively predict the air quality level of the next timestamp given any segment of the air quality data from the dataset. We have also released our standalone dataset at https://github.com/AI-Huang/IAQI_Dataset. The model and code for this paper are publicly available at https://github.com/AI-Huang/AirEvaluation
    Momentum Calibration for Text Generation. (arXiv:2212.04257v1 [cs.CL])
    The input and output of most text generation tasks can be transformed into two sequences of tokens, which can be modeled using sequence-to-sequence learning tools such as Transformers. These models are usually trained by maximizing the likelihood of the output text sequence, assuming that the input sequence and all gold preceding tokens are given during training; during inference, however, the model suffers from the exposure bias problem (i.e., during beam search it only has access to its previously predicted tokens rather than the gold tokens). In this paper, we propose MoCa ({\bf Mo}mentum {\bf Ca}libration) for text generation. MoCa is an online method that dynamically generates slowly evolving (but consistent) samples using a momentum moving-average generator with beam search, and learns to align the model scores of these samples with their actual qualities. Experiments on four text generation datasets (i.e., CNN/DailyMail, XSum, SAMSum and Gigaword) show that MoCa consistently improves strong pre-trained Transformers using vanilla fine-tuning, and we achieve state-of-the-art results on the CNN/DailyMail and SAMSum datasets.
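    The momentum generator described above is, in essence, an exponential moving average of the online model. A minimal sketch follows; the momentum value and the name `generator` are illustrative, not taken from the paper.

```python
import torch

@torch.no_grad()
def ema_update(generator, model, momentum=0.999):
    # Keep the sample generator as a slowly evolving moving average of
    # the online model, so its beam-search samples change consistently.
    for p_g, p_m in zip(generator.parameters(), model.parameters()):
        p_g.mul_(momentum).add_(p_m, alpha=1.0 - momentum)
```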
    A Novel Hierarchical-Classification-Block Based Convolutional Neural Network for Source Camera Model Identification. (arXiv:2212.04161v1 [cs.CV])
    Digital security has been an active area of research interest due to the rapid adoption of internet infrastructure and the increasing popularity of social media and digital cameras. Owing to inherent differences in the working principles used to generate an image, different camera brands leave behind different intrinsic processing noises, which can be used to identify the camera brand. In the last decade, many signal processing and deep learning-based methods have been proposed to identify and isolate this noise from the scene details in an image to detect the source camera brand. One prominent solution is to utilize a hierarchical classification system rather than the traditional single-classifier approach: different individual networks are used for brand-level and model-level source camera identification. This approach allows for better scaling and requires minimal modifications when adding a new camera brand/model to the solution. However, using separate full-fledged networks for both brand- and model-level classification substantially increases memory consumption and training complexity. Moreover, the low-level features extracted by the different networks' initial layers often coincide, resulting in redundant weights. To mitigate the training and memory complexity, we propose a classifier-block-level hierarchical system instead of a network-level one for source camera model classification. Our proposed approach not only results in significantly fewer parameters but also retains the capability to add a new camera model with minimal modification. Thorough experimentation on the publicly available Dresden dataset shows that our proposed approach can achieve the same level of state-of-the-art performance while requiring fewer parameters than a state-of-the-art network-level hierarchical system.
    A parallelizable model-based approach for marginal and multivariate clustering. (arXiv:2212.04009v1 [stat.ML])
    This paper develops a clustering method that takes advantage of the sturdiness of model-based clustering while attempting to mitigate some of its pitfalls. First, we note that standard model-based clustering likely leads to the same number of clusters per margin, which seems a rather artificial assumption for a variety of datasets. We tackle this issue by specifying a finite mixture model per margin that allows each margin to have a different number of clusters, and then cluster the multivariate data using a strategy-game-inspired algorithm which we call Reign-and-Conquer. Second, since the proposed clustering approach only specifies a model for the margins -- but leaves the joint unspecified -- it has the advantage of being partially parallelizable; hence, the proposed approach is computationally appealing as well as more tractable for moderate to high dimensions than a `full' (joint) model-based clustering approach. A battery of numerical experiments on artificial data indicates an overall good performance of the proposed methods in a variety of scenarios, and real datasets are used to showcase their application in practice.
    Out-of-Distribution Detection with Deep Nearest Neighbors. (arXiv:2204.06507v3 [cs.LG] UPDATED)
    Out-of-distribution (OOD) detection is a critical task for deploying machine learning models in the open world. Distance-based methods have demonstrated promise, where testing samples are detected as OOD if they are relatively far away from in-distribution (ID) data. However, prior methods impose a strong distributional assumption on the underlying feature space, which may not always hold. In this paper, we explore the efficacy of the non-parametric nearest-neighbor distance for OOD detection, which has been largely overlooked in the literature. Unlike prior works, our method does not impose any distributional assumption, hence providing stronger flexibility and generality. We demonstrate the effectiveness of nearest-neighbor-based OOD detection on several benchmarks and establish superior performance. Under the same model trained on ImageNet-1k, our method substantially reduces the false positive rate (FPR@TPR95) by 24.77% compared to SSD+, a strong baseline that uses the parametric Mahalanobis distance for detection. Code is available: https://github.com/deeplearning-wisc/knn-ood.
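    The scoring rule itself is short enough to sketch. This assumes penultimate-layer features that are L2-normalized (a common choice for this method); the value of `k` and the brute-force distance computation are illustrative.

```python
import numpy as np

def knn_ood_score(train_feats, test_feats, k=50):
    # Distance to the k-th nearest training feature after projecting
    # onto the unit sphere; larger scores suggest OOD inputs. A simple
    # threshold on this score then yields the OOD decision.
    train = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    test = test_feats / np.linalg.norm(test_feats, axis=1, keepdims=True)
    d = np.linalg.norm(test[:, None, :] - train[None, :, :], axis=-1)
    return np.sort(d, axis=1)[:, k - 1]
```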
    Explainable Machine Learning for Breakdown Prediction in High Gradient RF Cavities. (arXiv:2202.05610v2 [physics.acc-ph] UPDATED)
    The occurrence of vacuum arcs or radio frequency (rf) breakdowns is one of the most prevalent factors limiting the high-gradient performance of normal-conducting rf cavities in particle accelerators. In this paper, we search for previously unrecognized features related to the incidence of rf breakdowns by applying a machine learning strategy to high-gradient cavity data from CERN's test stand for the Compact Linear Collider (CLIC). By interpreting the parameters of the learned models with explainable artificial intelligence (AI), we reverse-engineer physical properties to derive fast, reliable, and simple rule-based models. Based on 6 months of historical data and dedicated experiments, our models identify fractions of the data with a high influence on the occurrence of breakdowns. Specifically, it is shown that the field-emitted current following an initial breakdown is closely related to the probability of another breakdown occurring shortly thereafter. Results also indicate that the cavity pressure should be monitored with increased temporal resolution in future experiments, to further explore the vacuum activity associated with breakdowns.
    DP-RAFT: A Differentially Private Recipe for Accelerated Fine-Tuning. (arXiv:2212.04486v1 [cs.LG])
    A major direction in differentially private machine learning is differentially private fine-tuning: pretraining a model on a source of "public data" and transferring the extracted features to downstream tasks. This is an important setting because many industry deployments fine-tune publicly available feature extractors on proprietary data for downstream tasks. In this paper, we use features extracted from state-of-the-art open source models to solve benchmark tasks in computer vision and natural language processing using differentially private fine-tuning. Our key insight is that by accelerating training, we can quickly drive the model parameters to regions in parameter space where the impact of noise is minimized. In doing so, we recover the same performance as non-private fine-tuning for realistic values of epsilon in [0.01, 1.0] on benchmark image classification datasets including CIFAR100.
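    The private-training building block underneath such recipes is DP-SGD: clip each example's gradient, then add calibrated Gaussian noise. The sketch below shows one update on a single parameter tensor; it is a generic illustration under assumed hyperparameters, not the paper's full recipe (which also covers feature extraction and the acceleration choices).

```python
import torch

def dp_sgd_step(param, per_example_grads, lr=0.1, clip=1.0, noise_mult=1.0):
    # per_example_grads has shape (B, *param.shape): one grad per example.
    b = per_example_grads.shape[0]
    flat = per_example_grads.reshape(b, -1)
    norms = flat.norm(dim=1, keepdim=True).clamp_min(1e-12)
    flat = flat * torch.clamp(clip / norms, max=1.0)   # per-example clipping
    noisy = flat.sum(0) + noise_mult * clip * torch.randn_like(flat[0])
    param.data -= lr * noisy.reshape(param.shape) / b  # noisy averaged step
```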
    DeeProb-kit: a Python Library for Deep Probabilistic Modelling. (arXiv:2212.04403v1 [cs.LG])
    DeeProb-kit is a unified library written in Python consisting of a collection of deep probabilistic models (DPMs) that are tractable and exact representations for the modelled probability distributions. The availability of a representative selection of DPMs in a single library makes it possible to combine them in a straightforward manner, a common practice in deep learning research nowadays. In addition, it includes efficiently implemented learning techniques, inference routines, statistical algorithms, and provides high-quality fully-documented APIs. The development of DeeProb-kit will help the community to accelerate research on DPMs as well as to standardise their evaluation and better understand how they are related based on their expressivity.
    SDFusion: Multimodal 3D Shape Completion, Reconstruction, and Generation. (arXiv:2212.04493v1 [cs.CV])
    In this work, we present a novel framework built to simplify 3D asset generation for amateur users. To enable interactive generation, our method supports a variety of input modalities that can be easily provided by a human, including images, text, partially observed shapes and combinations of these, and further allows adjusting the strength of each input. At the core of our approach is an encoder-decoder that compresses 3D shapes into a compact latent representation, upon which a diffusion model is learned. To enable a variety of multi-modal inputs, we employ task-specific encoders with dropout followed by a cross-attention mechanism. Due to its flexibility, our model naturally supports a variety of tasks, outperforming prior works on shape completion, image-based 3D reconstruction, and text-to-3D. Most interestingly, our model can combine all these tasks into one swiss-army-knife tool, enabling the user to perform shape generation using incomplete shapes, images, and textual descriptions at the same time, providing relative weights for each input and facilitating interactivity. Despite our approach being shape-only, we further show an efficient method to texture the generated shape using large-scale text-to-image models.
    Logit Clipping for Robust Learning against Label Noise. (arXiv:2212.04055v1 [cs.LG])
    In the presence of noisy labels, designing robust loss functions is critical for securing the generalization performance of deep neural networks. The Cross Entropy (CE) loss has been shown not to be robust to noisy labels due to its unboundedness. To alleviate this issue, existing works typically design specialized robust losses satisfying the symmetric condition, which usually leads to underfitting. In this paper, our key idea is to induce a loss bound at the logit level, thus universally enhancing the noise robustness of existing losses. Specifically, we propose logit clipping (LogitClip), which clamps the norm of the logit vector to ensure that it is upper bounded by a constant. In this manner, the CE loss equipped with our LogitClip method is effectively bounded, mitigating overfitting to examples with noisy labels. Moreover, we present theoretical analyses to certify the noise-tolerant ability of LogitClip. Extensive experiments show that LogitClip not only significantly improves the noise robustness of the CE loss, but also broadly enhances the generalization performance of popular robust losses.
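    The operation itself is a one-liner to sketch. This assumes PyTorch logits; the threshold name `tau` and its default are illustrative.

```python
import torch

def logit_clip(logits, tau=1.0):
    # Rescale any logit vector whose L2 norm exceeds tau, so every
    # downstream loss sees norm-bounded logits and is itself bounded.
    norms = logits.norm(dim=-1, keepdim=True).clamp_min(1e-12)
    return logits * torch.clamp(tau / norms, max=1.0)
```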
    Multi-View Mesh Reconstruction with Neural Deferred Shading. (arXiv:2212.04386v1 [cs.CV])
    We propose an analysis-by-synthesis method for fast multi-view 3D reconstruction of opaque objects with arbitrary materials and illumination. State-of-the-art methods use both neural surface representations and neural rendering. While flexible, neural surface representations are a significant bottleneck in optimization runtime. Instead, we represent surfaces as triangle meshes and build a differentiable rendering pipeline around triangle rasterization and neural shading. The renderer is used in a gradient descent optimization where both a triangle mesh and a neural shader are jointly optimized to reproduce the multi-view images. We evaluate our method on a public 3D reconstruction dataset and show that it can match the reconstruction accuracy of traditional baselines and neural approaches while surpassing them in optimization runtime. Additionally, we investigate the shader and find that it learns an interpretable representation of appearance, enabling applications such as 3D material editing.
    Lattice-Free Sequence Discriminative Training for Phoneme-Based Neural Transducers. (arXiv:2212.04325v1 [eess.AS])
    Recently, RNN-Transducers have achieved remarkable results on various automatic speech recognition tasks. However, lattice-free sequence discriminative training methods, which obtain superior performance in hybrid models, are rarely investigated for RNN-Transducers. In this work, we propose three lattice-free training objectives, namely lattice-free maximum mutual information, lattice-free segment-level minimum Bayes risk, and lattice-free minimum Bayes risk, which are applied to the final posterior output of a phoneme-based neural transducer with limited context dependency. Compared to criteria using N-best lists, lattice-free methods eliminate the decoding step for hypothesis generation during training, which leads to more efficient training. Experimental results show that lattice-free methods gain up to 6.5% relative improvement in word error rate compared to a model trained with sequence-level cross-entropy. Compared to the N-best-list-based minimum Bayes risk objectives, lattice-free methods achieve a 40% - 70% relative training-time speedup with a small degradation in performance.
    Secure Over-the-Air Computation using Zero-Forced Artificial Noise. (arXiv:2212.04288v1 [cs.IT])
    Over-the-air computation has the potential to increase the communication-efficiency of data-dependent distributed wireless systems, but is vulnerable to eavesdropping. We consider over-the-air computation over block-fading additive white Gaussian noise channels in the presence of a passive eavesdropper. The goal is to design a secure over-the-air computation scheme. We propose a scheme that achieves MSE-security against the eavesdropper by employing zero-forced artificial noise, while keeping the distortion at the legitimate receiver small. In contrast to former approaches, the security does not depend on external helper nodes to jam the eavesdropper's receive signal. We thoroughly design the system parameters of the scheme, propose an artificial noise design that harnesses unused transmit power for security, and give an explicit construction rule. Our design approach is applicable both if the eavesdropper's channel coefficients are known and if they are unknown in the signal design. Simulations demonstrate the performance, and show that our noise design outperforms other methods.
    Structured Vision-Language Pretraining for Computational Cooking. (arXiv:2212.04267v1 [cs.CV])
    Vision-Language Pretraining (VLP) and foundation models have been the go-to recipe for achieving SoTA performance on general benchmarks. However, leveraging these powerful techniques for more complex vision-language tasks, such as cooking applications, with more structured input data, is still little investigated. In this work, we propose to leverage these techniques for structured-text-based computational cuisine tasks. Our strategy, dubbed VLPCook (Structured Vision-Language Pretraining for Computational Cooking), first transforms existing image-text pairs into image and structured-text pairs. This allows us to pretrain our VLPCook model using VLP objectives adapted to the structured data of the resulting datasets, and then finetune it on downstream computational cooking tasks. During finetuning, we also enrich the visual encoder, leveraging pretrained foundation models (e.g. CLIP) to provide local and global textual context. VLPCook outperforms the current SoTA by a significant margin (+3.3 Recall@1 absolute improvement) on the task of Cross-Modal Food Retrieval on the large Recipe1M dataset. Finally, we conduct further experiments on VLP to validate its importance, especially on the Recipe1M+ dataset. The code will be made publicly available.  ( 2 min )
    Model-based trajectory stitching for improved behavioural cloning and its applications. (arXiv:2212.04280v1 [stat.ML])
    Behavioural cloning (BC) is a commonly used imitation learning method to infer a sequential decision-making policy from expert demonstrations. However, when the quality of the data is not optimal, the resulting behavioural policy also performs sub-optimally once deployed. Recently, there has been a surge in offline reinforcement learning methods that promise to extract high-quality policies from sub-optimal historical data. A common approach is to perform regularisation during training, encouraging updates during policy evaluation and/or policy improvement to stay close to the underlying data. In this work, we investigate whether an offline approach to improving the quality of the existing data can lead to improved behavioural policies without any changes to the BC algorithm. The proposed data improvement approach - Trajectory Stitching (TS) - generates new trajectories (sequences of states and actions) by `stitching' pairs of states that were disconnected in the original data and generating a new action that connects them. By construction, these new transitions are guaranteed to be highly plausible according to probabilistic models of the environment, and to improve a state-value function. We demonstrate that the iterative process of replacing old trajectories with new ones incrementally improves the underlying behavioural policy. Extensive experimental results show that significant performance gains can be achieved using TS over BC policies extracted from the original data. Furthermore, using the D4RL benchmarking suite, we demonstrate that state-of-the-art results are obtained by combining TS with two existing offline learning methodologies reliant on BC, model-based offline planning (MBOP) and policy constraint (TD3+BC).
    Simulation of Attacker Defender Interaction in a Noisy Security Game. (arXiv:2212.04281v1 [cs.CR])
    In the cybersecurity setting, defenders are often at the mercy of their detection technologies and subject to the information and experiences that individual analysts have. In order to give defenders an advantage, it is important to understand an attacker's motivation and their likely next best action. As a first step in modeling this behavior, we introduce a security game framework that simulates interplay between attackers and defenders in a noisy environment, focusing on the factors that drive decision making for attackers and defenders in the variants of the game with full knowledge and observability, knowledge of the parameters but no observability of the state (``partial knowledge''), and zero knowledge or observability (``zero knowledge''). We demonstrate the importance of making the right assumptions about attackers, given significant differences in outcomes. Furthermore, there is a measurable trade-off between false-positives and true-positives in terms of attacker outcomes, suggesting that a more false-positive prone environment may be acceptable under conditions where true-positives are also higher.  ( 2 min )
    GTFLAT: Game Theory Based Add-On For Empowering Federated Learning Aggregation Techniques. (arXiv:2212.04103v1 [cs.LG])
    GTFLAT, as a game-theory-based add-on, addresses an important research question: how can a federated learning algorithm achieve better performance and training efficiency by setting more effective adaptive weights for averaging in the model aggregation phase? The main objectives for an ideal answer are: (1) empowering federated learning algorithms to reach better performance in fewer communication rounds, notably in the face of heterogeneous scenarios, and (2) being easy to use alongside state-of-the-art federated learning algorithms as a new module. To this end, GTFLAT models the averaging task as a strategic game among active users. It then proposes a systematic solution based on the population game and evolutionary dynamics to find the equilibrium. In contrast with existing approaches that impose the weights on the participants, GTFLAT reaches a self-enforcing agreement among clients, such that none of them is motivated to deviate from it unilaterally. The results reveal that, on average, using GTFLAT increases the top-1 test accuracy by 1.38%, while requiring 21.06% fewer communication rounds to reach that accuracy.  ( 2 min )
    Better Hit the Nail on the Head than Beat around the Bush: Removing Protected Attributes with a Single Projection. (arXiv:2212.04273v1 [cs.LG])
    Bias elimination and recent probing studies attempt to remove specific information from embedding spaces. Here it is important to remove as much of the target information as possible, while preserving any other information present. INLP is a popular recent method which removes specific information through iterative nullspace projections. Multiple iterations, however, increase the risk that information other than the target is negatively affected. We introduce two methods that find a single targeted projection: Mean Projection (MP, more efficient) and Tukey Median Projection (TMP, with theoretical guarantees). Our comparison between MP and INLP shows that (1) one MP projection removes the linear separability based on the target and (2) MP has less impact on the overall space. Further analysis shows that applying random projections after MP leads to the same overall effects on the embedding space as the multiple projections of INLP. Applying one targeted (MP) projection is hence methodologically cleaner than applying multiple (INLP) projections that introduce random effects.  ( 2 min )
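    A plausible reading of the single targeted projection, sketched for a binary protected attribute; this is a reconstruction for illustration, not the authors' code.

```python
import numpy as np

def mean_projection(X, protected):
    # Direction separating the two protected groups: difference of means.
    v = X[protected == 1].mean(axis=0) - X[protected == 0].mean(axis=0)
    v = v / np.linalg.norm(v)
    # One projection onto the nullspace of that direction removes the
    # linearly separable target information in a single step.
    return X - np.outer(X @ v, v)
```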
    Physics-guided Data Augmentation for Learning the Solution Operator of Linear Differential Equations. (arXiv:2212.04100v1 [cs.LG])
    Neural networks, especially the recently proposed neural operator models, are increasingly being used to find the solution operator of differential equations. Compared to traditional numerical solvers, they are much faster and more efficient in practical applications. However, one critical issue is that training neural operator models requires a large amount of ground-truth data, which usually comes from slow numerical solvers. In this paper, we propose a physics-guided data augmentation (PGDA) method to improve the accuracy and generalization of neural operator models. Training data is augmented naturally through physical properties of differential equations such as linearity and translation. We demonstrate the advantage of PGDA on a variety of linear differential equations, showing that PGDA can improve the sample complexity and is robust to distributional shift.  ( 2 min )
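    For the linearity property this augmentation is exact and essentially free, as the sketch below shows for discretized sources and solutions stored as arrays of shape (N, m); the Gaussian mixing weights are an arbitrary illustrative choice.

```python
import numpy as np

def augment_by_linearity(f, u, n_new, seed=0):
    # For a linear operator L with L u_i = f_i, every linear combination
    # (sum_i a_i f_i, sum_i a_i u_i) is an exact new (source, solution)
    # pair, obtained at no extra cost from the numerical solver.
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(n_new, len(f)))
    return A @ np.asarray(f), A @ np.asarray(u)
```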
    Generating and Weighting Semantically Consistent Sample Pairs for Ultrasound Contrastive Learning. (arXiv:2212.04097v1 [cs.CV])
    Well-annotated medical datasets enable deep neural networks (DNNs) to gain strong power in extracting lesion-related features. Building such large and well-designed medical datasets is costly due to the need for high-level expertise. Model pre-training on ImageNet is a common practice to gain better generalization when the amount of data is limited. However, it suffers from the domain gap between natural and medical images. In this work, we pre-train DNNs on ultrasound (US) data instead of ImageNet to reduce the domain gap in medical US applications. To learn US image representations from unlabeled US videos, we propose a novel meta-learning-based contrastive learning method, namely Meta Ultrasound Contrastive Learning (Meta-USCL). To tackle the key challenge of obtaining semantically consistent sample pairs for contrastive learning, we present a positive-pair generation module along with an automatic sample-weighting module based on meta-learning. Experimental results on multiple computer-aided diagnosis (CAD) problems, including pneumonia detection, breast cancer classification, and breast tumor segmentation, show that the proposed self-supervised method reaches state-of-the-art (SOTA) performance. The codes are available at https://github.com/Schuture/Meta-USCL.  ( 2 min )
    Deep Model Assembling. (arXiv:2212.04129v1 [cs.CV])
    Large deep learning models have achieved remarkable success in many scenarios. However, training large models is usually challenging, e.g., due to the high computational cost, the unstable and painfully slow optimization procedure, and the vulnerability to overfitting. To alleviate these problems, this work studies a divide-and-conquer strategy, i.e., dividing a large model into smaller modules, training them independently, and reassembling the trained modules to obtain the target model. This approach is promising since it avoids directly training large models from scratch. Nevertheless, implementing this idea is non-trivial, as it is difficult to ensure the compatibility of the independently trained modules. In this paper, we present an elegant solution to address this issue: we introduce a global, shared meta model to implicitly link all the modules together. This enables us to train highly compatible modules that collaborate effectively when they are assembled. We further propose a module incubation mechanism that enables the meta model to be designed as an extremely shallow network. As a result, the additional overhead introduced by the meta model is minimized. Though conceptually simple, our method significantly outperforms end-to-end (E2E) training in terms of both final accuracy and training efficiency. For example, on top of ViT-Huge, it improves the accuracy by 2.7% compared to the E2E baseline on ImageNet-1K, while saving 43% of the training cost. Code is available at https://github.com/LeapLabTHU/Model-Assembling.  ( 2 min )
    Federated Learning for Inference at Anytime and Anywhere. (arXiv:2212.04084v1 [cs.LG])
    Federated learning (FL) has been predominantly concerned with collaborative training of deep networks from scratch, and especially with the many challenges that arise, such as communication cost, robustness to heterogeneous data, and support for diverse device capabilities. However, there is no unified framework that addresses all these problems together. This paper studies the challenges and opportunities of exploiting pre-trained Transformer models in FL. In particular, we propose to efficiently adapt such pre-trained models by injecting a novel attention-based adapter module at each Transformer block that both modulates the forward pass and makes an early prediction. Training only the lightweight adapter by FL leads to fast and communication-efficient learning even in the presence of heterogeneous data and devices. Extensive experiments on standard FL benchmarks, including CIFAR-100, FEMNIST and SpeechCommandsv2, demonstrate that this simple framework provides fast and accurate FL while supporting heterogeneous device capabilities, efficient personalization, and scalable-cost anytime inference.  ( 2 min )
    SpaceEditing: Integrating Human Knowledge into Deep Neural Networks via Interactive Latent Space Editing. (arXiv:2212.04065v1 [cs.LG])
    We propose an interactive editing method that allows humans to help deep neural networks (DNNs) learn a latent space more consistent with human knowledge, thereby improving classification accuracy on indistinguishable, ambiguous data. Firstly, we visualize high-dimensional data features through dimensionality reduction methods and design an interactive system, \textit{SpaceEditing}, to display the visualized data. \textit{SpaceEditing} provides a 2D workspace based on the idea of spatial layout, in which the user can move the projected data points following the system's guidance. \textit{SpaceEditing} then finds the high-dimensional features corresponding to the projected data moved by the user and feeds them back to the network for retraining, thereby letting the user interactively modify the high-dimensional latent space. Secondly, to more rationally incorporate human knowledge into the training process of neural networks, we design a new loss function that enables the network to learn the user-modified information. Finally, we demonstrate how \textit{SpaceEditing} meets user needs through three case studies that also evaluate the proposed method, and the results confirm its effectiveness.  ( 2 min )
    Disaggregated Interventions to Reduce Inequality. (arXiv:2107.00593v3 [cs.LG] UPDATED)
    A significant body of research in the data sciences considers unfair discrimination against social categories such as race or gender that could occur or be amplified as a result of algorithmic decisions. Simultaneously, real-world disparities continue to exist, even before algorithmic decisions are made. In this work, we draw on insights from the social sciences brought into the realm of causal modeling and constrained optimization, and develop a novel algorithmic framework for tackling pre-existing real-world disparities. The purpose of our framework, which we call the "impact remediation framework," is to measure real-world disparities and discover the optimal intervention policies that could help improve equity or access to opportunity for those who are underserved with respect to an outcome of interest. We develop a disaggregated approach to tackling pre-existing disparities that relaxes the typical set of assumptions required for the use of social categories in structural causal models. Our approach flexibly incorporates counterfactuals and is compatible with various ontological assumptions about the nature of social categories. We demonstrate impact remediation with a hypothetical case study and compare our disaggregated approach to an existing state-of-the-art approach, comparing its structure and resulting policy recommendations. In contrast to most work on optimal policy learning, we explore disparity reduction itself as an objective, explicitly focusing the power of algorithms on reducing inequality.  ( 2 min )
    Unrolled algorithms for group synchronization. (arXiv:2207.09418v2 [eess.SP] UPDATED)
    The group synchronization problem involves estimating a collection of group elements from noisy measurements of their pairwise ratios. This task is a key component in many computational problems, including the molecular reconstruction problem in single-particle cryo-electron microscopy (cryo-EM). The standard methods to estimate the group elements are based on iteratively applying linear and non-linear operators, and are not necessarily optimal. Motivated by the structural similarity to deep neural networks, we adopt the concept of algorithm unrolling, where training data is used to optimize the algorithm. We design unrolled algorithms for several group synchronization instances, including synchronization over the group of 3-D rotations: the synchronization problem in cryo-EM. We also apply a similar approach to the multi-reference alignment problem. We show by numerical experiments that the unrolling strategy outperforms existing synchronization algorithms in a wide variety of scenarios.  ( 2 min )
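    The unrolling idea can be sketched generically: take a classical iteration, fix its depth, and make per-iteration coefficients learnable so that training data tunes the algorithm end-to-end. The power-iteration-style body below illustrates the concept only; it is not the paper's architecture, and the matrix `H` and step parameters are assumptions.

```python
import torch
import torch.nn as nn

class UnrolledSync(nn.Module):
    def __init__(self, n_iters=10):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(n_iters))  # learned step sizes

    def forward(self, H, v):
        # H: pairwise measurement matrix; v: initial estimate.
        for a in self.alpha:
            v = v + a * (H @ v)                   # linear operator
            v = v / v.norm(dim=-1, keepdim=True)  # non-linear projection
        return v
```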
    XRand: Differentially Private Defense against Explanation-Guided Attacks. (arXiv:2212.04454v1 [cs.LG])
    Recent development in the field of explainable artificial intelligence (XAI) has helped improve trust in Machine-Learning-as-a-Service (MLaaS) systems, in which an explanation is provided together with the model prediction in response to each query. However, XAI also opens a door for adversaries to gain insights into the black-box models in MLaaS, thereby making the models more vulnerable to several attacks. For example, feature-based explanations (e.g., SHAP) could expose the top important features that a black-box model focuses on. Such disclosure has been exploited to craft effective backdoor triggers against malware classifiers. To address this trade-off, we introduce a new concept of achieving local differential privacy (LDP) in the explanations, and from that we establish a defense, called XRand, against such attacks. We show that our mechanism restricts the information that the adversary can learn about the top important features, while maintaining the faithfulness of the explanations.  ( 2 min )
    General-Purpose In-Context Learning by Meta-Learning Transformers. (arXiv:2212.04458v1 [cs.LG])
    Modern machine learning requires system designers to specify aspects of the learning pipeline, such as losses, architectures, and optimizers. Meta-learning, or learning-to-learn, instead aims to learn those aspects, and promises to unlock greater capabilities with less manual effort. One particularly ambitious goal of meta-learning is to train general-purpose in-context learning algorithms from scratch, using only black-box models with minimal inductive bias. Such a model takes in training data, and produces test-set predictions across a wide range of problems, without any explicit definition of an inference model, training loss, or optimization algorithm. In this paper we show that Transformers and other black-box models can be meta-trained to act as general-purpose in-context learners. We characterize phase transitions between algorithms that generalize, algorithms that memorize, and algorithms that fail to meta-train at all, induced by changes in model size, number of tasks, and meta-optimization. We further show that the capabilities of meta-trained algorithms are bottlenecked by the accessible state size (memory) determining the next prediction, unlike standard models which are thought to be bottlenecked by parameter count. Finally, we propose practical interventions such as biasing the training distribution that improve the meta-training and meta-generalization of general-purpose learning algorithms.  ( 2 min )
    Adapting the Linearised Laplace Model Evidence for Modern Deep Learning. (arXiv:2206.08900v2 [stat.ML] UPDATED)
    The linearised Laplace method for estimating model uncertainty has received renewed attention in the Bayesian deep learning community. The method provides reliable error bars and admits a closed-form expression for the model evidence, allowing for scalable selection of model hyperparameters. In this work, we examine the assumptions behind this method, particularly in conjunction with model selection. We show that these interact poorly with some now-standard tools of deep learning--stochastic approximation methods and normalisation layers--and make recommendations for how to better adapt this classic method to the modern setting. We provide theoretical support for our recommendations and validate them empirically on MLPs, classic CNNs, residual networks with and without normalisation layers, generative autoencoders and transformers.  ( 2 min )
    Editing Models with Task Arithmetic. (arXiv:2212.04089v1 [cs.LG])
    Changing how pre-trained models behave -- e.g., improving their performance on a downstream task or mitigating biases learned during pre-training -- is a common practice when developing machine learning systems. In this work, we propose a new paradigm for steering the behavior of neural networks, centered around \textit{task vectors}. A task vector specifies a direction in the weight space of a pre-trained model, such that movement in that direction improves performance on the task. We build task vectors by subtracting the weights of a pre-trained model from the weights of the same model after fine-tuning on a task. We show that these task vectors can be modified and combined together through arithmetic operations such as negation and addition, and that the behavior of the resulting model is steered accordingly. Negating a task vector decreases performance on the target task, with little change in model behavior on control tasks. Moreover, adding task vectors together can improve performance on multiple tasks at once. Finally, when tasks are linked by an analogy relationship of the form ``A is to B as C is to D'', combining task vectors from three of the tasks can improve performance on the fourth, even when no data from the fourth task is used for training. Overall, our experiments with several models, modalities and tasks show that task arithmetic is a simple, efficient and effective way of editing models.  ( 2 min )
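    Since the abstract fully specifies the construction, it can be sketched directly on parameter dictionaries (e.g., PyTorch state dicts); the function names are illustrative.

```python
def task_vector(pretrained, finetuned):
    # tau = theta_finetuned - theta_pretrained, per parameter tensor.
    return {k: finetuned[k] - pretrained[k] for k in pretrained}

def edit_model(pretrained, vectors, coeffs):
    # theta = theta_pretrained + sum_i lambda_i * tau_i; a negative
    # lambda_i negates a task, and several vectors combine their tasks.
    edited = {k: v.clone() for k, v in pretrained.items()}
    for tau, lam in zip(vectors, coeffs):
        for k in edited:
            edited[k] = edited[k] + lam * tau[k]
    return edited
```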
    Distributed Contextual Linear Bandits with Minimax Optimal Communication Cost. (arXiv:2205.13170v2 [cs.LG] UPDATED)
    We study distributed contextual linear bandits with stochastic contexts, where $N$ agents act cooperatively to solve a linear bandit-optimization problem with $d$-dimensional features over the course of $T$ rounds. For this problem, we derive the first ever information-theoretic lower bound $\Omega(dN)$ on the communication cost of any algorithm that performs optimally in a regret minimization setup. We then propose a distributed batch elimination version of the LinUCB algorithm, DisBE-LUCB, where the agents share information among each other through a central server. We prove that the communication cost of DisBE-LUCB matches our lower bound up to logarithmic factors. In particular, for scenarios with known context distribution, the communication cost of DisBE-LUCB is only $\tilde{\mathcal{O}}(dN)$ and its regret is ${\tilde{\mathcal{O}}}(\sqrt{dNT})$, which is of the same order as that incurred by an optimal single-agent algorithm for $NT$ rounds. We also provide similar bounds for practical settings where the context distribution can only be estimated. Therefore, our proposed algorithm is nearly minimax optimal in terms of \emph{both regret and communication cost}. Finally, we propose DecBE-LUCB, a fully decentralized version of DisBE-LUCB, which operates without a central server, where agents share information with their \emph{immediate neighbors} through a carefully designed consensus procedure.  ( 2 min )
    Detect, Distill and Update: Learned DB Systems Facing Out of Distribution Data. (arXiv:2210.05508v2 [cs.DB] UPDATED)
    Machine Learning (ML) is changing DBs as many DB components are being replaced by ML models. One open problem in this setting is how to update such ML models in the presence of data updates. We start this investigation focusing on data insertions (the dominant update in analytical DBs). We study how to update neural network (NN) models when new data follows a different distribution (a.k.a. it is "out-of-distribution" -- OOD), rendering previously-trained NNs inaccurate. A requirement in our problem setting is that learned DB components should ensure high accuracy for tasks on old and new data (e.g., for approximate query processing (AQP), cardinality estimation (CE), synthetic data generation (DG), etc.). This paper proposes a novel updatability framework (DDUp). DDUp can provide updatability for different learned DB system components, even those based on different NNs, without the high cost of retraining the NNs from scratch. DDUp entails two components: first, a novel, efficient, and principled statistical-testing approach to detect OOD data; second, a novel model-updating approach, grounded in the principles of transfer learning with knowledge distillation, to update learned models efficiently while still ensuring high accuracy. We develop and showcase DDUp's applicability for three different learned DB components, AQP, CE, and DG, each employing a different type of NN. Detailed experimental evaluations using real and benchmark datasets for AQP, CE, and DG demonstrate DDUp's performance advantages.  ( 2 min )
    Alleviating neighbor bias: augmenting graph self-supervise learning with structural equivalent positive samples. (arXiv:2212.04365v1 [cs.LG])
    In recent years, using a self-supervised learning framework to learn the general characteristics of graphs has been considered a promising paradigm for graph representation learning. The core of self-supervised learning strategies for graph neural networks lies in constructing suitable positive-sample selection strategies. However, existing GNNs typically aggregate information from neighboring nodes to update node representations, leading to an over-reliance on neighboring positive samples, i.e., homophilous samples, while ignoring long-range positive samples, i.e., samples that are far apart on the graph but structurally equivalent -- a problem we call "neighbor bias." This neighbor bias can reduce the generalization performance of GNNs. In this paper, we argue that the generalization properties of GNNs should be determined by combining homogeneous samples and structurally equivalent samples, which we call the "GC combination hypothesis." Therefore, we propose a topological signal-driven self-supervised method that uses a topological-information-guided structural-equivalence sampling strategy. First, we extract multiscale topological features using persistent homology. Then we compute the structural equivalence of node pairs based on their topological features. In particular, we design a topological loss function that pulls non-neighboring node pairs with high structural equivalence together in the representation space to alleviate neighbor bias. Finally, we use a joint training mechanism to adjust the effect of structural equivalence on the model, in order to fit datasets with different characteristics. We conducted experiments on the node classification task across seven graph datasets. The results show that model performance can be effectively improved using this strategy of topological signal enhancement.  ( 2 min )
    On The Relevance Of The Differences Between HRTF Measurement Setups For Machine Learning. (arXiv:2212.04283v1 [eess.AS])
    As spatial audio is enjoying a surge in popularity, data-driven machine learning techniques that have proven successful in other domains are increasingly used to process head-related transfer function (HRTF) measurements. However, these techniques require much data, whereas the existing datasets range from tens to the low hundreds of data points. It therefore becomes attractive to combine several of these datasets, although they are measured under different conditions. In this paper, we first establish the common ground between a number of datasets, then we investigate potential pitfalls of mixing datasets. We perform a simple experiment to test the relevance of the remaining differences between datasets when applying machine learning techniques. Finally, we pinpoint the most relevant differences.  ( 2 min )
    Dual Convexified Convolutional Neural Networks. (arXiv:2205.14056v2 [cs.LG] UPDATED)
    We propose the framework of dual convexified convolutional neural networks (DCCNNs). In this framework, we first introduce a primal learning problem motivated by convexified convolutional neural networks (CCNNs), and then construct the dual convex training program through careful analysis of the Karush-Kuhn-Tucker (KKT) conditions and Fenchel conjugates. Our approach reduces the computational overhead of constructing a large kernel matrix and, more importantly, eliminates the ambiguity of factorizing the matrix. Due to the low-rank structure in CCNNs and the related subdifferential of nuclear norms, there is no closed-form expression to recover the primal solution from the dual solution. To overcome this, we propose a novel weight recovery algorithm, which takes the dual solution and the kernel information as input, and recovers the linear weights and the output of the convolutional layer, instead of the weight parameters. Furthermore, our recovery algorithm exploits the low-rank structure and imposes a small number of filters indirectly, which reduces the parameter size. As a result, DCCNNs inherit all the statistical benefits of CCNNs, while enjoying a more formal and efficient workflow.  ( 2 min )
    The Ordered Matrix Dirichlet for Modeling Ordinal Dynamics. (arXiv:2212.04130v1 [stat.ML])
    Many dynamical systems exhibit latent states with intrinsic orderings, such as "ally", "neutral", and "enemy" relationships in international relations. Such latent states are evidenced through entities' cooperative versus conflictual interactions, which are similarly ordered. Models of such systems often involve state-to-action emission and state-to-state transition matrices. It is common practice to assume that the rows of these stochastic matrices are independently sampled from a Dirichlet distribution. However, this assumption discards ordinal information and falsely treats states and actions as order-invariant categoricals, which hinders interpretation and evaluation. To address this problem, we propose the Ordered Matrix Dirichlet (OMD): rows are sampled in a conditionally dependent manner such that probability mass is shifted to the right of the matrix as we move down its rows. This results in a well-ordered mapping between latent states and observed action types. We evaluate the OMD in two settings: a Hidden Markov Model and a novel Bayesian Dynamic Poisson Tucker Model tailored to political event data. Models built on the OMD recover interpretable latent states and show superior forecasting performance in few-shot settings. We detail the wide applicability of the OMD to other domains where models with Dirichlet-sampled matrices are popular (e.g. topic modeling) and publish user-friendly code.  ( 2 min )
    Recurrent Memory Transformer. (arXiv:2207.06881v2 [cs.CL] UPDATED)
    Transformer-based models show their effectiveness across multiple domains and tasks. Self-attention allows the model to combine information from all sequence elements into context-aware representations. However, global and local information has to be stored mostly in the same element-wise representations. Moreover, the length of an input sequence is limited by the quadratic computational complexity of self-attention. In this work, we propose and study a memory-augmented segment-level recurrent Transformer (RMT). Memory allows the model to store and process local and global information, and to pass information between segments of a long sequence with the help of recurrence. We implement the memory mechanism with no changes to the Transformer model by adding special memory tokens to the input or output sequence. The model is then trained to control both memory operations and sequence representation processing. Experimental results show that RMT performs on par with Transformer-XL on language modeling for smaller memory sizes and outperforms it on tasks that require longer sequence processing. We also show that adding memory tokens to Transformer-XL improves its performance. This makes the Recurrent Memory Transformer a promising architecture for applications that require learning long-term dependencies and for general-purpose memory processing, such as algorithmic tasks and reasoning.  ( 2 min )
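    The memory mechanism itself is simple to reproduce. Below is a minimal, illustrative PyTorch sketch (the dimensions and the plain TransformerEncoder are placeholder choices, not the paper's exact read/write token layout): memory tokens are prepended to each segment, and their output states are carried to the next segment.

        import torch

        d_model, n_mem, seg_len = 64, 4, 16
        layer = torch.nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        encoder = torch.nn.TransformerEncoder(layer, num_layers=2)

        memory = torch.zeros(1, n_mem, d_model)           # initial memory state
        segments = [torch.randn(1, seg_len, d_model) for _ in range(3)]
        for seg in segments:
            x = torch.cat([memory, seg], dim=1)           # prepend memory tokens to the segment
            y = encoder(x)
            memory = y[:, :n_mem, :]                      # carry updated memory to the next segment
            outputs = y[:, n_mem:, :]                     # representations for this segment's tokens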
    Evaluating Zero-cost Active Learning for Object Detection. (arXiv:2212.04211v1 [cs.LG])
    Object detection requires substantial labeling effort for learning robust models. Active learning can reduce this effort by intelligently selecting relevant examples to be annotated. However, selecting these examples properly, without introducing a sampling bias that harms generalization performance, is not straightforward, and most active learning techniques cannot deliver on their promise on real-world benchmarks. In our evaluation paper, we focus on active learning techniques with no computational overhead beyond inference, which we refer to as zero-cost active learning. In particular, we show that a key ingredient is not only the score at the bounding-box level but also the technique used to aggregate the scores for ranking images. We outline our experimental setup and discuss practical considerations when using active learning for object detection.  ( 2 min )
    Learning Quantum Processes and Hamiltonians via the Pauli Transfer Matrix. (arXiv:2212.04471v1 [quant-ph])
    Learning about physical systems from quantum-enhanced experiments, relying on a quantum memory and quantum processing, can outperform learning from experiments in which only classical memory and processing are available. Whereas quantum advantages have been established for a variety of state learning tasks, quantum process learning allows for comparable advantages only with a careful problem formulation and is less understood. We establish an exponential quantum advantage for learning an unknown $n$-qubit quantum process $\mathcal{N}$. We show that a quantum memory allows one to efficiently solve the following tasks: (a) learning the Pauli transfer matrix of an arbitrary $\mathcal{N}$, (b) predicting expectation values of bounded Pauli-sparse observables measured on the output of an arbitrary $\mathcal{N}$ upon input of a Pauli-sparse state, and (c) predicting expectation values of arbitrary bounded observables measured on the output of an unknown $\mathcal{N}$ with sparse Pauli transfer matrix upon input of an arbitrary state. With quantum memory, these tasks can be solved using linearly-in-$n$ many copies of the Choi state of $\mathcal{N}$, and even time-efficiently in the case of (b). In contrast, any learner without quantum memory requires exponentially-in-$n$ many queries, even when querying $\mathcal{N}$ on subsystems of adaptively chosen states and performing adaptively chosen measurements. In proving this separation, we extend existing shadow tomography upper and lower bounds from states to channels via the Choi-Jamiolkowski isomorphism. Moreover, we combine Pauli transfer matrix learning with polynomial interpolation techniques to develop a procedure for learning arbitrary Hamiltonians, which may have non-local all-to-all interactions, from short-time dynamics. Our results highlight the power of quantum-enhanced experiments for learning highly complex quantum dynamics.  ( 2 min )
    A Novel Stochastic Gradient Descent Algorithm for Learning Principal Subspaces. (arXiv:2212.04025v1 [cs.LG])
    Many machine learning problems encode their data as a matrix with a possibly very large number of rows and columns. In several applications like neuroscience, image compression or deep reinforcement learning, the principal subspace of such a matrix provides a useful, low-dimensional representation of individual data. Here, we are interested in determining the $d$-dimensional principal subspace of a given matrix from sample entries, i.e. from small random submatrices. Although a number of sample-based methods exist for this problem (e.g. Oja's rule \citep{oja1982simplified}), these assume access to full columns of the matrix or a particular matrix structure, such as symmetry, and cannot be combined as-is with neural networks \citep{baldi1989neural}. In this paper, we derive an algorithm that learns a principal subspace from sample entries, can be applied when the approximate subspace is represented by a neural network, and hence can be scaled to datasets with an effectively infinite number of rows and columns. Our method consists of defining a loss function whose minimizer is the desired principal subspace, and constructing a gradient estimate of this loss whose bias can be controlled. We complement our theoretical analysis with a series of experiments on synthetic matrices, the MNIST dataset \citep{lecun2010mnist} and the reinforcement learning domain PuddleWorld \citep{sutton1995generalization} demonstrating the usefulness of our approach.  ( 2 min )
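    For orientation, the following NumPy sketch shows a classical sample-based baseline in the spirit of Oja's rule, run on randomly sampled columns of a synthetic low-rank matrix; it is not the paper's entry-wise estimator, and the matrix sizes and learning rate are illustrative assumptions.

        import numpy as np

        rng = np.random.default_rng(0)
        n, m, d = 50, 200, 3
        M = rng.normal(size=(n, d)) @ rng.normal(size=(d, m))   # ground-truth rank-d matrix

        U, _ = np.linalg.qr(rng.normal(size=(n, d)))             # orthonormal initial basis
        lr = 1e-4
        for step in range(2000):
            cols = rng.integers(0, m, size=8)                    # random column sample
            X = M[:, cols]
            U = U + lr * (X @ (X.T @ U) - U @ (U.T @ X @ (X.T @ U)))  # Oja-style update
            U, _ = np.linalg.qr(U)                               # re-orthonormalize

        U_true = np.linalg.svd(M, full_matrices=False)[0][:, :d]
        print(np.linalg.svd(U_true.T @ U, compute_uv=False))     # near 1 when subspaces align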
    Fine-grained Image Editing by Pixel-wise Guidance Using Diffusion Models. (arXiv:2212.02024v2 [cs.CV] UPDATED)
    Generative models, particularly GANs, have been utilized for image editing. Although GAN-based methods perform well at generating reasonable content aligned with the user's intentions, they struggle to strictly preserve the content outside the editing region. To address this issue, we use diffusion models instead of GANs and propose a novel image-editing method based on pixel-wise guidance. Specifically, we first train pixel classifiers with a small amount of annotated data and then estimate the semantic segmentation map of a target image. Users then manipulate the map to specify how the image is to be edited. The diffusion model generates an edited image via guidance from the pixel-wise classifiers, such that the resulting image aligns with the manipulated map. As the guidance is conducted pixel-wise, the proposed method can create reasonable content in the editing region while preserving the content outside this region. The experimental results validate the advantages of the proposed method both quantitatively and qualitatively.  ( 2 min )
    Beyond 1-WL with Local Ego-Network Encodings. (arXiv:2211.14906v2 [cs.LG] UPDATED)
    Identifying similar network structures is key to capturing graph isomorphisms and learning representations that exploit structural information encoded in graph data. This work shows that ego-networks can produce a structural encoding scheme for arbitrary graphs with greater expressivity than the Weisfeiler-Lehman (1-WL) test. We introduce IGEL, a preprocessing step that produces features to augment node representations by encoding ego-networks into sparse vectors, enriching Message Passing (MP) Graph Neural Networks (GNNs) beyond 1-WL expressivity. We formally describe the relation between IGEL and 1-WL, and characterize its expressive power and limitations. Experiments show that IGEL matches the empirical expressivity of state-of-the-art methods on isomorphism detection while improving performance on seven GNN architectures.  ( 2 min )
    LLM-Planner: Few-Shot Grounded Planning for Embodied Agents with Large Language Models. (arXiv:2212.04088v1 [cs.AI])
    This study focuses on embodied agents that can follow natural language instructions to complete complex tasks in a visually-perceived environment. Existing methods rely on a large amount of (instruction, gold trajectory) pairs to learn a good policy. The high data cost and poor sample efficiency prevent the development of versatile agents that are capable of many tasks and can learn new tasks quickly. In this work, we propose a novel method, LLM-Planner, that harnesses the power of large language models (LLMs) such as GPT-3 to do few-shot planning for embodied agents. We further propose a simple but effective way to enhance LLMs with physical grounding to generate plans that are grounded in the current environment. Experiments on the ALFRED dataset show that our method achieves very competitive few-shot performance, even outperforming several recent baselines that are trained on the full training data, despite using less than 0.5% of the paired training data. Existing methods can barely complete any task successfully under the same few-shot setting. Our work opens the door to developing versatile and sample-efficient embodied agents that can quickly learn many tasks.  ( 2 min )
    Whose Emotion Matters? Speaker Detection without Prior Knowledge. (arXiv:2211.15377v2 [eess.AS] UPDATED)
    The task of emotion recognition in conversations (ERC) benefits from the availability of multiple modalities, as offered, for example, in the video-based MELD dataset. However, only a few research approaches use both acoustic and visual information from the MELD videos. There are two reasons for this: First, label-to-video alignments in MELD are noisy, making those videos an unreliable source of emotional speech data. Second, conversations can involve several people in the same scene, which requires the detection of the person speaking the utterance. In this paper, we demonstrate that by using recent automatic speech recognition and active speaker detection models, we are able to realign the videos of MELD, and capture the facial expressions from uttering speakers in 96.92% of the utterances provided in MELD. Experiments with a self-supervised voice recognition model indicate that the realigned MELD videos more closely match the corresponding utterances offered in the dataset. Finally, we devise a model for emotion recognition in conversations trained on the face and audio information of the MELD realigned videos, which outperforms state-of-the-art models for ERC based on vision alone. This indicates that active speaker detection is indeed effective for extracting facial expressions from the uttering speakers, and that faces provide more informative visual cues than the visual features state-of-the-art models have been using so far.  ( 2 min )
    A Comprehensive Survey on Multi-hop Machine Reading Comprehension Datasets and Metrics. (arXiv:2212.04070v1 [cs.CL])
    Multi-hop machine reading comprehension (MRC) is a challenging task that aims to answer a question based on disjoint pieces of information spread across different passages. Evaluation metrics and datasets are a vital part of multi-hop MRC: models cannot be trained or evaluated without them, and the challenges posed by datasets are often an important motivation for improving existing models. Given the increasing attention to this field, a detailed review is worthwhile. This study aims to present a comprehensive survey of recent advances in multi-hop MRC evaluation metrics and datasets. We first present the multi-hop MRC problem definition, then investigate the evaluation metrics with respect to their multi-hop aspects. We review in detail 15 multi-hop datasets published between 2017 and 2022 and provide a comprehensive analysis. Finally, we discuss open issues in this field.  ( 2 min )
    Reinforcement Learning for Resilient Power Grids. (arXiv:2212.04069v1 [cs.LG])
    Traditional power grid systems have become obsolete under more frequent and extreme natural disasters. Reinforcement learning (RL) has been a promising solution for resilience given its successful history of power grid control. However, most power grid simulators and RL interfaces do not support simulation of power grids under large-scale blackouts or when the network is divided into sub-networks. In this study, we propose an updated power grid simulator built on Grid2Op, an existing simulator and RL interface, and experiment with limiting the action and observation spaces of Grid2Op. Testing with the DDQN and SliceRDQN algorithms, we find that reduced action spaces significantly improve training performance and efficiency. In addition, we investigate a low-rank neural network regularization method for deep Q-learning, one of the most widely used RL algorithms, in this power grid control scenario. The experiments demonstrate that, in the power grid simulation environment, adopting this method significantly increases the performance of RL agents.  ( 2 min )
    AutoPINN: When AutoML Meets Physics-Informed Neural Networks. (arXiv:2212.04058v1 [cs.LG])
    Physics-Informed Neural Networks (PINNs) have recently been proposed to solve scientific and engineering problems, where physical laws are introduced into neural networks as prior knowledge. With the embedded physical laws, PINNs enable the estimation of critical parameters, which are unobservable via physical tools, through observable variables. For example, Power Electronic Converters (PECs) are essential building blocks for the green energy transition. PINNs have been applied to estimate the capacitance, which is unobservable during PEC operations, using current and voltage, which can be observed easily during operations. The estimated capacitance facilitates self-diagnostics of PECs. Existing PINNs are often manually designed, which is time-consuming and may lead to suboptimal performance due to the large number of design choices for neural network architectures and hyperparameters. In addition, PINNs are often deployed on different physical devices, e.g., PECs, with limited and varying resources. This requires designing different PINN models under different resource constraints, making manual design even more challenging. To contend with these challenges, we propose Automated Physics-Informed Neural Networks (AutoPINN), a framework that enables the automated design of PINNs by combining AutoML and PINNs. Specifically, we first tailor a search space that allows finding high-accuracy PINNs for PEC internal parameter estimation. We then propose a resource-aware search strategy to explore this search space and find the best PINN model under different resource constraints. We experimentally demonstrate that AutoPINN finds more accurate PINN models than human-designed, state-of-the-art PINN models while using fewer resources.  ( 2 min )
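    To make the underlying PINN concrete, here is a minimal PyTorch sketch on a hypothetical RC-discharge circuit, where an unobservable capacitance is estimated jointly with the network from observable voltage; the architecture, loss weighting, and circuit model are assumed placeholders, and these are exactly the design choices a framework like AutoPINN would search over.

        import torch

        # Hypothetical RC discharge: V(t) = V0 * exp(-t / (R * C_true)); C is "unobservable"
        R, C_true, V0 = 10.0, 0.05, 1.0
        t_data = torch.linspace(0.0, 2.0, 50).unsqueeze(1)
        V_data = V0 * torch.exp(-t_data / (R * C_true))      # observed voltage samples

        net = torch.nn.Sequential(
            torch.nn.Linear(1, 32), torch.nn.Tanh(),
            torch.nn.Linear(32, 32), torch.nn.Tanh(),
            torch.nn.Linear(32, 1),
        )
        log_C = torch.nn.Parameter(torch.tensor(-2.0))       # unknown capacitance (log-scale)
        opt = torch.optim.Adam(list(net.parameters()) + [log_C], lr=1e-3)
        t_col = torch.linspace(0.0, 2.0, 100).unsqueeze(1).requires_grad_(True)

        for step in range(3000):
            opt.zero_grad()
            loss_data = ((net(t_data) - V_data) ** 2).mean() # fit the observations
            V = net(t_col)                                   # physics residual: dV/dt + V/(RC) = 0
            dVdt = torch.autograd.grad(V.sum(), t_col, create_graph=True)[0]
            loss_phys = ((dVdt + V / (R * log_C.exp())) ** 2).mean()
            (loss_data + loss_phys).backward()
            opt.step()

        print(float(log_C.exp()))                            # should approach C_true = 0.05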
    MixBoost: Improving the Robustness of Deep Neural Networks by Boosting Data Augmentation. (arXiv:2212.04059v1 [cs.LG])
    As more and more artificial intelligence (AI) technologies move from the laboratory to real-world applications, the open-set and robustness challenges posed by real-world data have received increasing attention. Data augmentation is a widely used method for improving model performance, and recent works have also confirmed its positive effect on the robustness of AI models. However, most existing data augmentation methods are heuristic, lacking exploration of their internal mechanisms. We apply explainable artificial intelligence (XAI) methods to explore the internal mechanisms of popular data augmentation methods, analyze the relationship between game interactions and several widely used robustness metrics, and propose a new proxy for model robustness in the open-set environment. Based on this analysis of internal mechanisms, we develop a mask-based boosting method for data augmentation that comprehensively improves several robustness measures of AI models and beats state-of-the-art data augmentation approaches. Experiments show that our method can be widely applied to many popular data augmentation methods. Unlike adversarial training, our boosting method not only significantly improves the robustness of models, but also improves test-set accuracy. Our code is available at \url{https://github.com/Anonymous_for_submission}.  ( 2 min )
    CODEBench: A Neural Architecture and Hardware Accelerator Co-Design Framework. (arXiv:2212.03965v1 [cs.AR])
    Recently, automated co-design of machine learning (ML) models and accelerator architectures has attracted significant attention from both the industry and academia. However, most co-design frameworks either explore a limited search space or employ suboptimal exploration techniques for simultaneous design decision investigations of the ML model and the accelerator. Furthermore, training the ML model and simulating the accelerator performance is computationally expensive. To address these limitations, this work proposes a novel neural architecture and hardware accelerator co-design framework, called CODEBench. It is composed of two new benchmarking sub-frameworks, CNNBench and AccelBench, which explore expanded design spaces of convolutional neural networks (CNNs) and CNN accelerators. CNNBench leverages an advanced search technique, BOSHNAS, to efficiently train a neural heteroscedastic surrogate model to converge to an optimal CNN architecture by employing second-order gradients. AccelBench performs cycle-accurate simulations for a diverse set of accelerator architectures in a vast design space. With the proposed co-design method, called BOSHCODE, our best CNN-accelerator pair achieves 1.4% higher accuracy on the CIFAR-10 dataset compared to the state-of-the-art pair, while enabling 59.1% lower latency and 60.8% lower energy consumption. On the ImageNet dataset, it achieves 3.7% higher Top-1 accuracy at 43.8% lower latency and 11.2% lower energy consumption. CODEBench outperforms the state-of-the-art framework, i.e., Auto-NBA, by achieving 1.5% higher accuracy and 34.7x higher throughput, while enabling 11.0x lower energy-delay product (EDP) and 4.0x lower chip area on CIFAR-10.  ( 2 min )
    Zero-Shot Transfer Learning for Structural Health Monitoring using Generative Adversarial Networks and Spectral Mapping. (arXiv:2212.04002v1 [cs.LG])
    Gathering properly labelled, adequately rich, and case-specific data for successfully training a data-driven or hybrid model for structural health monitoring (SHM) applications is a challenging task. We posit that a Transfer Learning (TL) method that utilizes available data in any relevant source domain and directly applies to the target domain through domain adaptation can provide substantial remedies to address this issue. Accordingly, we present a novel TL method that differentiates between the source's no-damage and damage cases and utilizes a domain adaptation (DA) technique. The DA module transfers the accumulated knowledge in contrasting no-damage and damage cases in the source domain to the target domain, given only the target's no-damage case. High-dimensional features allow employing signal processing domain knowledge to devise a generalizable DA approach. The Generative Adversarial Network (GAN) architecture is adopted for learning since its optimization process accommodates high-dimensional inputs in a zero-shot setting. At the same time, its training objective conforms seamlessly with the case of no-damage and damage data in SHM since its discriminator network differentiates between real (no damage) and fake (possibly unseen damage) data. An extensive set of experimental results demonstrates the method's success in transferring knowledge on differences between no-damage and damage cases across three strongly heterogeneous independent target structures. The area under the receiver operating characteristic curve (AUC) is used to evaluate the differentiation between no-damage and damage cases in the target domain, reaching values as high as 0.95. With no-damage and damage cases discerned from each other, zero-shot structural damage detection is carried out. The mean F1 scores for all damages in the three independent datasets are 0.978, 0.992, and 0.975.  ( 2 min )
    Statistical and Computational Guarantees for Influence Diagnostics. (arXiv:2212.04014v1 [stat.ML])
    Influence diagnostics such as influence functions and approximate maximum influence perturbations are popular in machine learning and AI applications. They are powerful statistical tools for identifying influential datapoints or subsets of datapoints. We establish finite-sample statistical bounds, as well as computational complexity bounds, for influence functions and approximate maximum influence perturbations using efficient inverse-Hessian-vector product implementations. We illustrate our results with generalized linear models and large attention-based models on synthetic and real data.  ( 2 min )
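    As a toy illustration of the quantities being bounded, the NumPy sketch below computes classical influence scores for ridge regression, where the Hessian is small enough to solve against directly; the paper's guarantees concern the efficient inverse-Hessian-vector product approximations needed when such a direct solve is infeasible.

        import numpy as np

        rng = np.random.default_rng(0)
        n, p = 200, 5
        X = rng.normal(size=(n, p))
        y = X @ rng.normal(size=p) + 0.1 * rng.normal(size=n)

        lam = 1e-2                              # ridge loss: mean residual^2 / 2 + lam/2 ||w||^2
        H = X.T @ X / n + lam * np.eye(p)       # Hessian of the regularized empirical loss
        w = np.linalg.solve(H, X.T @ y / n)     # fitted parameters

        x_test, y_test = rng.normal(size=p), 0.0
        g_test = (x_test @ w - y_test) * x_test # gradient of the test loss at w
        ihvp = np.linalg.solve(H, g_test)       # inverse-Hessian-vector product

        grads = (X @ w - y)[:, None] * X        # per-example training gradients
        influence = -grads @ ihvp               # effect of up-weighting each training point
        print(np.argsort(influence)[:5])        # points whose up-weighting most reduces test loss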
    DDoD: Dual Denial of Decision Attacks on Human-AI Teams. (arXiv:2212.03980v1 [cs.HC])
    Artificial Intelligence (AI) systems have been increasingly used to make decision-making processes faster, more accurate, and more efficient. However, such systems are also at constant risk of being attacked. While the majority of attacks targeting AI-based applications aim to manipulate classifiers or training data and alter the output of an AI model, recently proposed Sponge Attacks against AI models aim to impede the classifier's execution by consuming substantial resources. In this work, we propose \textit{Dual Denial of Decision (DDoD) attacks against collaborative Human-AI teams}. We discuss how such attacks aim to deplete \textit{both computational and human} resources, and significantly impair decision-making capabilities. We describe how DDoD acts on human and computational resources and present potential risk scenarios in a series of exemplary domains.  ( 2 min )
    On Interpretable Anomaly Detection Using Causal Algorithmic Recourse. (arXiv:2212.04031v1 [cs.LG])
    As deep anomaly detection models are increasingly deployed in the real world, interpretable anomaly detection has become an emerging task. Recent studies focus on identifying the features of samples that lead to abnormal outcomes, but cannot recommend a set of actions to flip those outcomes. In this work, we focus on interpretations via algorithmic recourse that show how to act to revert abnormal predictions by suggesting actions on features. The key challenge is that algorithmic recourse involves interventions in the physical world, which is fundamentally a causal problem. To tackle this challenge, we propose an interpretable Anomaly Detection framework using Causal Algorithmic Recourse (ADCAR), which recommends recourse actions and infers counterfactuals of abnormal samples guided by the causal mechanism. Experiments on three datasets show that ADCAR can flip the abnormal labels with minimal interventions.  ( 2 min )
    RLSEP: Learning Label Ranks for Multi-label Classification. (arXiv:2212.04022v1 [cs.CV])
    Multi-label ranking maps instances to a ranked set of predicted labels from multiple possible classes. The ranking approach for multi-label learning problems has received attention for its success in multi-label classification, with pairwise label ranking being one of the well-known approaches. However, most existing methods assume that only partial information about the preference relation is known, inferred from the partition of labels into a positive and negative set, and then treat all labels with equal importance. In this paper, we focus on the unique challenge of ranking when the order of the true label set is provided. We propose a novel dedicated loss function that optimizes models by incorporating penalties for incorrectly ranked pairs and makes use of the ranking information present in the input. Our method achieves the best reported performance measures on both synthetic and real-world ranked datasets and shows improvements in the overall ranking of labels. Our experimental results demonstrate that our approach is generalizable to a variety of multi-label classification and ranking tasks, while revealing a calibration towards a particular ranking order.  ( 2 min )
    Unsupervised language models for disease variant prediction. (arXiv:2212.03979v1 [cs.LG])
    There is considerable interest in predicting the pathogenicity of protein variants in human genes. Due to the sparsity of high quality labels, recent approaches turn to \textit{unsupervised} learning, using Multiple Sequence Alignments (MSAs) to train generative models of natural sequence variation within each gene. These generative models then predict variant likelihood as a proxy to evolutionary fitness. In this work, we instead combine this evolutionary principle with pretrained protein language models (LMs), which have already shown promising results in predicting protein structure and function. Instead of training separate models per-gene, we find that a single protein LM trained on broad sequence datasets can score pathogenicity for any gene variant zero-shot, without MSAs or finetuning. We call this unsupervised approach \textbf{VELM} (Variant Effect via Language Models), and show that it achieves scoring performance comparable to the state of the art when evaluated on clinically labeled variants of disease-related genes.  ( 2 min )
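    A common recipe for this kind of zero-shot scoring is a masked-marginal log-likelihood ratio. The sketch below shows that generic recipe with the HuggingFace transformers API; the checkpoint name, the +1 offset for the BOS token, and the scoring rule itself are assumptions for illustration and may differ from VELM's exact procedure.

        import torch
        from transformers import AutoModelForMaskedLM, AutoTokenizer

        name = "facebook/esm2_t6_8M_UR50D"        # assumed checkpoint; any masked protein LM works
        tok = AutoTokenizer.from_pretrained(name)
        model = AutoModelForMaskedLM.from_pretrained(name).eval()

        def variant_score(seq, pos, wt, mut):
            """Masked-marginal score: log P(mut at pos) - log P(wt at pos)."""
            assert seq[pos] == wt
            ids = tok(seq, return_tensors="pt")["input_ids"]
            ids[0, pos + 1] = tok.mask_token_id   # +1 assumes a single BOS token is prepended
            with torch.no_grad():
                logp = model(input_ids=ids).logits[0, pos + 1].log_softmax(-1)
            return float(logp[tok.convert_tokens_to_ids(mut)]
                         - logp[tok.convert_tokens_to_ids(wt)])

        print(variant_score("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", 3, "A", "D"))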
    TweetDrought: A Deep-Learning Drought Impacts Recognizer based on Twitter Data. (arXiv:2212.04001v1 [cs.CL])
    Acquiring a better understanding of drought impacts becomes increasingly vital under a warming climate. Traditional drought indices describe mainly biophysical variables, not impacts on social, economic, and environmental systems. We utilized natural language processing and Bidirectional Encoder Representations from Transformers (BERT)-based transfer learning, fine-tuning the model on data from the news-based Drought Impact Report (DIR) and then applying it to recognize seven types of drought impacts in filtered Twitter data from the United States. Our model achieved a satisfactory macro-F1 score of 0.89 on the DIR test set. The model was then applied to California tweets and validated with keyword-based labels, yielding a macro-F1 score of 0.58. However, given the limitations of keyword-based labels, we also spot-checked tweets with controversial labels; 83.5% of the BERT labels were correct compared to the keyword labels. Overall, the fine-tuned BERT-based recognizer provided proper predictions and valuable information on drought impacts. The interpretation and analysis of the model were consistent with experiential domain expertise.  ( 2 min )
    Going Beyond XAI: A Systematic Survey for Explanation-Guided Learning. (arXiv:2212.03954v1 [cs.AI])
    As the societal impact of Deep Neural Networks (DNNs) grows, the goals for advancing DNNs become more complex and diverse, ranging from improving conventional model accuracy metrics to infusing advanced human virtues such as fairness, accountability, transparency (FAccT), and unbiasedness. Recently, techniques in Explainable Artificial Intelligence (XAI) have attracted considerable attention and have tremendously helped Machine Learning (ML) engineers understand AI models. At the same time, however, we have started to witness an emerging need beyond XAI among AI communities: based on the insights learned from XAI, how can we better empower ML engineers to steer their DNNs so that the model's reasonableness and performance improve as intended? This article provides a timely and extensive literature overview of the field of Explanation-Guided Learning (EGL), a domain of techniques that steer the DNNs' reasoning process by adding regularization, supervision, or intervention on model explanations. In doing so, we first provide a formal definition of EGL and its general learning paradigm. Second, we give an overview of the key factors in EGL evaluation and summarize and categorize existing evaluation procedures and metrics. Finally, we discuss current and potential future application areas and directions of EGL, and present an extensive experimental study providing comprehensive comparisons among existing EGL models in popular application domains, such as Computer Vision (CV) and Natural Language Processing (NLP).  ( 2 min )
    Learning Graph Search Heuristics. (arXiv:2212.03978v1 [cs.LG])
    Searching for a path between two nodes in a graph is one of the most well-studied and fundamental problems in computer science. In numerous domains such as robotics, AI, or biology, practitioners develop search heuristics to accelerate their pathfinding algorithms. However, it is a laborious and complex process to hand-design heuristics based on the problem and the structure of a given use case. Here we present PHIL (Path Heuristic with Imitation Learning), a novel neural architecture and a training algorithm for discovering graph search and navigation heuristics from data by leveraging recent advances in imitation learning and graph representation learning. At training time, we aggregate datasets of search trajectories and ground-truth shortest path distances, which we use to train a specialized graph neural network-based heuristic function using backpropagation through steps of the pathfinding process. Our heuristic function learns graph embeddings useful for inferring node distances, runs in constant time independent of graph sizes, and can be easily incorporated in an algorithm such as A* at test time. Experiments show that PHIL reduces the number of explored nodes compared to state-of-the-art methods on benchmark datasets by 58.5\% on average, can be directly applied in diverse graphs ranging from biological networks to road networks, and allows for fast planning in time-critical robotics domains.  ( 2 min )
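    The learned heuristic simply replaces the distance estimate inside a standard search loop. Below is a minimal A* sketch in Python, with the heuristic passed in as a function; a PHIL-style GNN estimate would slot in where the lambda is, and the toy graph is an assumption for illustration.

        import heapq

        def a_star(graph, start, goal, heuristic):
            """A* search; `heuristic(node, goal)` can be any estimate, e.g. a learned GNN."""
            frontier = [(heuristic(start, goal), 0.0, start, [start])]
            best_g = {start: 0.0}
            while frontier:
                f, g, node, path = heapq.heappop(frontier)
                if node == goal:
                    return path, g
                if g > best_g.get(node, float("inf")):
                    continue                      # stale queue entry
                for nbr, w in graph[node]:
                    ng = g + w
                    if ng < best_g.get(nbr, float("inf")):
                        best_g[nbr] = ng
                        heapq.heappush(frontier,
                                       (ng + heuristic(nbr, goal), ng, nbr, path + [nbr]))
            return None, float("inf")

        graph = {"a": [("b", 1.0), ("c", 4.0)], "b": [("c", 1.0)], "c": []}
        print(a_star(graph, "a", "c", lambda u, v: 0.0))   # zero heuristic reduces to Dijkstra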
    Unsupervised Deep Learning for AC Optimal Power Flow via Lagrangian Duality. (arXiv:2212.03977v1 [eess.SY])
    Non-convex AC optimal power flow (AC-OPF) is a fundamental optimization problem in power system analysis. The computational complexity of conventional solvers is typically high and not suitable for large-scale networks in real-time operation. Hence, deep learning based approaches, which move the time-consuming training process offline, have attracted intensive attention. Supervised learning methods may yield a feasible AC-OPF solution with a small optimality gap. However, they often need conventional solvers to generate the training dataset. This paper proposes an end-to-end unsupervised learning based framework for AC-OPF. We develop a deep neural network to output a partial set of decision variables, while the remaining variables are recovered by solving AC power flow equations. The fast decoupled power flow solver is adopted to further reduce the computational time. In addition, we propose using a modified augmented Lagrangian function as the training loss, with multipliers adjusted dynamically based on the degree of constraint violation. Extensive numerical test results corroborate the advantages of our proposed approach over existing methods.  ( 2 min )
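    A minimal PyTorch sketch of such a training signal follows, with placeholder cost and constraint functions standing in for the AC-OPF objective and power-flow constraints; the paper additionally recovers the remaining variables by solving AC power flow equations, which is omitted here.

        import torch

        def cost(x):                              # placeholder generation-cost surrogate
            return (x ** 2).sum(dim=-1, keepdim=True)

        def g(x):                                 # placeholder power-balance-style constraint
            return x.sum(dim=-1, keepdim=True) - 1.0

        net = torch.nn.Sequential(torch.nn.Linear(4, 64), torch.nn.ReLU(),
                                  torch.nn.Linear(64, 3))
        opt = torch.optim.Adam(net.parameters(), lr=1e-3)
        lam, rho = torch.zeros(1), 10.0           # multiplier and penalty weight

        for epoch in range(200):
            loads = torch.rand(128, 4)            # sampled operating conditions, no solver labels
            x = net(loads)                        # predicted decision variables
            viol = g(x)
            loss = (cost(x) + lam * viol + 0.5 * rho * viol ** 2).mean()
            opt.zero_grad(); loss.backward(); opt.step()
            with torch.no_grad():
                lam += rho * viol.mean(dim=0)     # dual ascent: multiplier tracks violation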
    Short term prediction of demand for ride hailing services: A deep learning approach. (arXiv:2212.03956v1 [cs.LG])
    As ride-hailing services become increasingly popular, the ability to accurately predict demand for such services can help operators efficiently allocate drivers to customers, reduce idle time, ease congestion, and enhance the passenger experience. This paper proposes UberNet, a deep convolutional neural network for short-term prediction of demand for ride-hailing services. UberNet employs a multivariate framework that utilises a number of temporal and spatial features found in the literature to explain demand for ride-hailing services. The proposed model includes two sub-networks that encode the source series of various features and decode the predicted series, respectively. To assess the performance and effectiveness of UberNet, we use 9 months of Uber pickup data from 2014 and 28 spatial and temporal features from New York City. By comparing the performance of UberNet with several other approaches, we show that the prediction quality of the model is highly competitive. Further, UberNet's prediction performance is better when using economic, social and built environment features. This suggests that UberNet is naturally suited to including complex motivators when making real-time passenger demand predictions for ride-hailing services.  ( 2 min )
    Low Variance Off-policy Evaluation with State-based Importance Sampling. (arXiv:2212.03932v1 [cs.LG])
    In off-policy reinforcement learning, a behaviour policy performs exploratory interactions with the environment to obtain state-action-reward samples, which are then used to learn a target policy that optimises the expected return. This leads to the problem of off-policy evaluation, where one needs to evaluate the target policy from samples collected by the often unrelated behaviour policy. Importance sampling is a traditional statistical technique that is often applied to off-policy evaluation. While importance sampling estimators are unbiased, their variance increases exponentially with the horizon of the decision process due to computing the importance weight as a product of action probability ratios, yielding estimates with low accuracy for domains involving long-term planning. This paper proposes state-based importance sampling (SIS), which drops the action probability ratios of sub-trajectories with "negligible states" -- roughly speaking, those for which the chosen actions have no impact on the return estimate -- from the computation of the importance weight. Theoretical results show that this reduces the exponent in the variance upper bound and improves the mean squared error. An automated search algorithm based on covariance testing is proposed to identify a negligible state set with minimal MSE when performing state-based importance sampling. Experiments are conducted on a lift domain, which includes "lift states" where the action has no impact on the following state and reward. The results demonstrate that, using the search algorithm, SIS yields reduced variance and improved accuracy compared to traditional importance sampling, per-decision importance sampling, and incremental importance sampling.  ( 2 min )
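    The weight computation itself is a one-line change to ordinary importance sampling, as the toy sketch below shows; identifying the negligible state set is the paper's covariance-testing search and is not shown, and the two-state policies here are made-up examples.

        def is_weight(traj, pi_target, pi_behav, negligible):
            """Ordinary IS weight, dropping ratios at negligible states (SIS-style)."""
            w = 1.0
            for s, a in traj:
                if s in negligible:
                    continue                      # action has no impact here: skip the ratio
                w *= pi_target[s][a] / pi_behav[s][a]
            return w

        pi_b = {"s0": {"l": 0.5, "r": 0.5}, "lift": {"l": 0.5, "r": 0.5}}
        pi_t = {"s0": {"l": 0.9, "r": 0.1}, "lift": {"l": 0.1, "r": 0.9}}
        traj = [("s0", "l"), ("lift", "r"), ("s0", "l")]
        print(is_weight(traj, pi_t, pi_b, negligible=set()))     # ordinary IS: 1.8 * 1.8 * 1.8
        print(is_weight(traj, pi_t, pi_b, negligible={"lift"}))  # SIS drops the "lift" ratio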
    Multi-Rate VAE: Train Once, Get the Full Rate-Distortion Curve. (arXiv:2212.03905v1 [cs.LG])
    Variational autoencoders (VAEs) are powerful tools for learning latent representations of data used in a wide range of applications. In practice, VAEs usually require multiple training rounds to choose the amount of information the latent variable should retain. This trade-off between the reconstruction error (distortion) and the KL divergence (rate) is typically parameterized by a hyperparameter $\beta$. In this paper, we introduce Multi-Rate VAE (MR-VAE), a computationally efficient framework for learning optimal parameters corresponding to various $\beta$ in a single training run. The key idea is to explicitly formulate a response function that maps $\beta$ to the optimal parameters using hypernetworks. MR-VAEs construct a compact response hypernetwork where the pre-activations are conditionally gated based on $\beta$. We justify the proposed architecture by analyzing linear VAEs and showing that it can represent response functions exactly for linear VAEs. With the learned hypernetwork, MR-VAEs can construct the rate-distortion curve without additional training and can be deployed with significantly less hyperparameter tuning. Empirically, our approach is competitive and often exceeds the performance of training multiple $\beta$-VAEs, with minimal computation and memory overhead.  ( 2 min )
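    A sketch of the conditional gating idea: a layer whose pre-activations are gated by a learned function of $\log\beta$, so one set of weights can serve many rates. The exact hypernetwork parameterization in MR-VAE may differ; this is an assumed minimal form.

        import torch

        class BetaGatedLinear(torch.nn.Module):
            """Linear layer whose pre-activations are gated by a function of log(beta)."""
            def __init__(self, d_in, d_out):
                super().__init__()
                self.lin = torch.nn.Linear(d_in, d_out)
                self.gate = torch.nn.Linear(1, d_out)    # maps log(beta) to per-unit gates
            def forward(self, x, log_beta):
                return self.lin(x) * torch.sigmoid(self.gate(log_beta))

        layer = BetaGatedLinear(8, 16)
        x = torch.randn(4, 8)
        log_beta = torch.full((4, 1), -0.7)              # beta ~ 0.5, sampled per batch in training
        print(layer(x, log_beta).shape)                  # torch.Size([4, 16])
        # Training would minimize recon_loss + beta * kl with all layers conditioned on
        # log(beta); at test time, sweeping beta traces out the rate-distortion curve.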
    An Efficient Evolutionary Deep Learning Framework Based on Multi-source Transfer Learning to Evolve Deep Convolutional Neural Networks. (arXiv:2212.03942v1 [cs.CV])
    Convolutional neural networks (CNNs) have steadily achieved better performance over the years by introducing more complex topologies and enlarging capacity towards deeper and wider networks. This makes the manual design of CNNs extremely difficult, so the automated design of CNNs has come into the research spotlight and has produced CNNs that outperform manually-designed ones. However, computational cost remains the bottleneck of automatically designing CNNs. In this paper, inspired by transfer learning, a new evolutionary computation based framework is proposed to efficiently evolve CNNs without compromising classification accuracy. The proposed framework leverages multi-source domains, whose datasets are smaller than the target domain dataset, to evolve a generalised CNN block only once. A new stacking method is then proposed to both widen and deepen the evolved block, and a grid search method is proposed to find optimal stacking solutions. The experimental results show the proposed method acquires good CNNs faster than 15 peer competitors, within less than 40 GPU-hours. Regarding classification accuracy, the proposed method is strongly competitive with its peers, achieving the best error rates of 3.46%, 18.36% and 1.76% on the CIFAR-10, CIFAR-100 and SVHN datasets, respectively.  ( 2 min )
    Tight Performance Guarantees of Imitator Policies with Continuous Actions. (arXiv:2212.03922v1 [cs.LG])
    Behavioral Cloning (BC) aims at learning a policy that mimics the behavior demonstrated by an expert. The current theoretical understanding of BC is limited to the case of finite actions. In this paper, we study BC with the goal of providing theoretical guarantees on the performance of the imitator policy in the case of continuous actions. We start by deriving a novel bound on the performance gap based on the Wasserstein distance, applicable to continuous-action experts and holding under the assumption that the value function is Lipschitz continuous. Since this latter condition is hardly fulfilled in practice, even for Lipschitz Markov Decision Processes and policies, we propose a relaxed setting, proving that the value function is always Hölder continuous. This result is of independent interest and allows us to obtain a general bound on the performance of the imitator policy in BC. Finally, we analyze noise injection, a common practice in which the expert action is executed in the environment after the application of a noise kernel. We show that this practice allows deriving stronger performance guarantees, at the price of a bias due to the noise addition.  ( 2 min )
    Deep Learning for Brain Age Estimation: A Systematic Review. (arXiv:2212.03868v1 [eess.IV])
    Over the years, machine learning models have been successfully employed on neuroimaging data for accurately predicting brain age. Deviations from the healthy brain aging pattern are associated with accelerated brain aging and brain abnormalities. Hence, efficient and accurate diagnosis techniques are required for producing accurate brain age estimates. Several contributions have been reported in the past for this purpose, resorting to different data-driven modeling methods. Recently, deep neural networks (also referred to as deep learning) have become prevalent in manifold neuroimaging studies, including brain age estimation. In this review, we offer a comprehensive analysis of the literature on the adoption of deep learning for brain age estimation with neuroimaging data. We detail and analyze the different deep learning architectures used for this application, examining the research published to date that quantitatively explores their use. We also examine different brain age estimation frameworks, comparatively exposing their advantages and weaknesses. Finally, the review concludes with an outlook towards future directions that prospective studies should follow. The ultimate goal of this paper is to establish a common and informed reference for newcomers and experienced researchers wishing to approach brain age estimation using deep learning models.  ( 2 min )
    Learning on Graphs for Mineral Asset Valuation Under Supply and Demand Uncertainty. (arXiv:2212.03865v1 [cs.AI])
    Valuing mineral assets is a challenging task that is highly dependent on the supply (geological) uncertainty surrounding resources and reserves, and the uncertainty of demand (commodity prices). In this work, a graph-based reasoning, modeling and solution approach is proposed to jointly address mineral asset valuation and mine plan scheduling and optimization under supply and demand uncertainty in the "mining complex" framework. Three graph-based solutions are proposed: (i) a neural branching policy that learns a block-sampling ore body representation, (ii) a guiding policy that learns to explore a heuristic selection tree, (iii) a hyper-heuristic that manages the value/supply chain optimization and dynamics modeled as a graph structure. Results on two large-scale industrial mining complexes show a reduction of up to three orders of magnitude in primal suboptimality, execution time, and number of iterations, and an increase of up to 40% in the mineral asset value.  ( 2 min )
    Pre-Training With Scientific Text Improves Educational Question Generation. (arXiv:2212.03869v1 [cs.CL])
    With the boom of digital educational materials and scalable e-learning systems, the potential for realising AI-assisted personalised learning has skyrocketed. In this landscape, the automatic generation of educational questions will play a key role, enabling scalable self-assessment when a global population is manoeuvring their personalised learning journeys. We develop EduQG, a novel educational question generation model built by adapting a large language model. Our initial experiments demonstrate that EduQG can produce superior educational questions by pre-training on scientific text.  ( 2 min )
  • Open

    Adapting the Linearised Laplace Model Evidence for Modern Deep Learning. (arXiv:2206.08900v2 [stat.ML] UPDATED)
    The linearised Laplace method for estimating model uncertainty has received renewed attention in the Bayesian deep learning community. The method provides reliable error bars and admits a closed-form expression for the model evidence, allowing for scalable selection of model hyperparameters. In this work, we examine the assumptions behind this method, particularly in conjunction with model selection. We show that these interact poorly with some now-standard tools of deep learning--stochastic approximation methods and normalisation layers--and make recommendations for how to better adapt this classic method to the modern setting. We provide theoretical support for our recommendations and validate them empirically on MLPs, classic CNNs, residual networks with and without normalisation layers, generative autoencoders and transformers.  ( 2 min )
    Joint Entropy Search for Maximally-Informed Bayesian Optimization. (arXiv:2206.04771v4 [cs.LG] UPDATED)
    Information-theoretic Bayesian optimization techniques have become popular for optimizing expensive-to-evaluate black-box functions due to their non-myopic qualities. Entropy Search and Predictive Entropy Search both consider the entropy over the optimum in the input space, while the recent Max-value Entropy Search considers the entropy over the optimal value in the output space. We propose Joint Entropy Search (JES), a novel information-theoretic acquisition function that considers an entirely new quantity, namely the entropy over the joint optimal probability density over both input and output space. To incorporate this information, we consider the reduction in entropy from conditioning on fantasized optimal input/output pairs. The resulting approach primarily relies on standard GP machinery and removes complex approximations typically associated with information-theoretic methods. With minimal computational overhead, JES shows superior decision-making, and yields state-of-the-art performance for information-theoretic approaches across a wide suite of tasks. As a lightweight approach with superior results, JES provides a new go-to acquisition function for Bayesian optimization.  ( 2 min )
    Equivariant maps from invariant functions. (arXiv:2209.14991v2 [stat.ML] UPDATED)
    In equivariant machine learning the idea is to restrict the learning to a hypothesis class where all the functions are equivariant with respect to some group action. Irreducible representations or invariant theory are typically used to parameterize the space of such functions. In this note, we explicate a general procedure, attributed to Malgrange, to express all polynomial maps between linear spaces that are equivariant with respect to the action of a group $G$, given a characterization of the invariant polynomials on a bigger space. The method also parametrizes smooth equivariant maps in the case that $G$ is a compact Lie group.  ( 2 min )
    Proximal Mean Field Learning in Shallow Neural Networks. (arXiv:2210.13879v2 [cs.LG] UPDATED)
    Recent mean field interpretations of learning dynamics in over-parameterized neural networks offer theoretical insights on the empirical success of first order optimization algorithms in finding global minima of the nonconvex risk landscape. In this paper, we explore applying mean field learning dynamics as a computational algorithm, rather than as an analytical tool. Specifically, we design a Sinkhorn regularized proximal algorithm to approximate the distributional flow from the learning dynamics in the mean field regime over weighted point clouds. In this setting, a contractive fixed point recursion computes the time-varying weights, numerically realizing the interacting Wasserstein gradient flow of the parameter distribution supported over the neuronal ensemble. An appealing aspect of the proposed algorithm is that the measure-valued recursions allow meshless computation. We demonstrate the proposed computational framework of interacting weighted particle evolution on binary and multi-class classification. Our algorithm performs gradient descent of the free energy associated with the risk functional.  ( 2 min )
    Making Linear MDPs Practical via Contrastive Representation Learning. (arXiv:2207.07150v2 [cs.LG] UPDATED)
    It is common to address the curse of dimensionality in Markov decision processes (MDPs) by exploiting low-rank representations. This motivates much of the recent theoretical study on linear MDPs. However, most approaches require a given representation under unrealistic assumptions about the normalization of the decomposition or introduce unresolved computational challenges in practice. Instead, we consider an alternative definition of linear MDPs that automatically ensures normalization while allowing efficient representation learning via contrastive estimation. The framework also admits confidence-adjusted index algorithms, enabling an efficient and principled approach to incorporating optimism or pessimism in the face of uncertainty. To the best of our knowledge, this provides the first practical representation learning method for linear MDPs that achieves both strong theoretical guarantees and empirical performance. Theoretically, we prove that the proposed algorithm is sample efficient in both the online and offline settings. Empirically, we demonstrate superior performance over existing state-of-the-art model-based and model-free algorithms on several benchmarks.  ( 2 min )
    Structure of Classifier Boundaries: Case Study for a Naive Bayes Classifier. (arXiv:2212.04382v1 [stat.ML])
    Whether based on models, training data or a combination, classifiers place (possibly complex) input data into one of a relatively small number of output categories. In this paper, we study the structure of the boundary--those points for which a neighbor is classified differently--in the context of an input space that is a graph, so that there is a concept of neighboring inputs. The scientific setting is a model-based naive Bayes classifier for DNA reads produced by Next Generation Sequencers. We show that the boundary is both large and complicated in structure. We create a new measure of uncertainty, called Neighbor Similarity, that compares the result for a point to the distribution of results for its neighbors. This measure not only tracks two inherent uncertainty measures for the Bayes classifier, but can also be implemented, at a computational cost, for classifiers without inherent measures of uncertainty.  ( 2 min )
    Distributed Contextual Linear Bandits with Minimax Optimal Communication Cost. (arXiv:2205.13170v2 [cs.LG] UPDATED)
    We study distributed contextual linear bandits with stochastic contexts, where $N$ agents act cooperatively to solve a linear bandit-optimization problem with $d$-dimensional features over the course of $T$ rounds. For this problem, we derive the first ever information-theoretic lower bound $\Omega(dN)$ on the communication cost of any algorithm that performs optimally in a regret minimization setup. We then propose a distributed batch elimination version of the LinUCB algorithm, DisBE-LUCB, where the agents share information among each other through a central server. We prove that the communication cost of DisBE-LUCB matches our lower bound up to logarithmic factors. In particular, for scenarios with known context distribution, the communication cost of DisBE-LUCB is only $\tilde{\mathcal{O}}(dN)$ and its regret is ${\tilde{\mathcal{O}}}(\sqrt{dNT})$, which is of the same order as that incurred by an optimal single-agent algorithm for $NT$ rounds. We also provide similar bounds for practical settings where the context distribution can only be estimated. Therefore, our proposed algorithm is nearly minimax optimal in terms of \emph{both regret and communication cost}. Finally, we propose DecBE-LUCB, a fully decentralized version of DisBE-LUCB, which operates without a central server, where agents share information with their \emph{immediate neighbors} through a carefully designed consensus procedure.  ( 2 min )
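    For context, the sketch below is vanilla single-agent LinUCB; the sufficient statistics it maintains are the kind of information DisBE-LUCB agents batch and synchronize through the server, though the paper's batching and communication schedule are not shown and the dimensions here are arbitrary.

        import numpy as np

        class LinUCB:
            """Vanilla single-agent LinUCB with ridge-regularized least squares."""
            def __init__(self, d, alpha=1.0):
                self.A = np.eye(d)                # Gram matrix: ridge + sum of x x^T
                self.b = np.zeros(d)              # reward-weighted feature sum
                self.alpha = alpha                # exploration bonus scale
            def choose(self, contexts):
                Ainv = np.linalg.inv(self.A)
                theta = Ainv @ self.b
                scores = [x @ theta + self.alpha * np.sqrt(x @ Ainv @ x) for x in contexts]
                return int(np.argmax(scores))     # optimistic (UCB) arm selection
            def update(self, x, r):
                self.A += np.outer(x, x)
                self.b += r * x

        agent = LinUCB(d=5)
        ctx = [np.random.randn(5) for _ in range(3)]
        arm = agent.choose(ctx)
        agent.update(ctx[arm], 1.0)               # observed reward for the chosen arm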
    Weisfeiler and Leman go Machine Learning: The Story so far. (arXiv:2112.09992v2 [cs.LG] UPDATED)
    In recent years, algorithms and neural architectures based on the Weisfeiler-Leman algorithm, a well-known heuristic for the graph isomorphism problem, have emerged as a powerful tool for machine learning with graphs and relational data. Here, we give a comprehensive overview of the algorithm's use in a machine-learning setting, focusing on the supervised regime. We discuss the theoretical background, show how to use it for supervised graph and node representation learning, discuss recent extensions, and outline the algorithm's connection to (permutation-)equivariant neural architectures. Moreover, we give an overview of current applications and future directions to stimulate further research.  ( 2 min )
    Nonstationary Bandit Learning via Predictive Sampling. (arXiv:2205.01970v4 [cs.LG] UPDATED)
    Thompson sampling has proven effective across a wide range of stationary bandit environments. However, as we demonstrate in this paper, it can perform poorly when applied to nonstationary environments. We show that such failures are attributable to the fact that, when exploring, the algorithm does not differentiate actions based on how quickly the information acquired loses its usefulness due to nonstationarity. Building upon this insight, we propose predictive sampling, which extends Thompson sampling to do this. We establish a Bayesian regret bound and show that, in nonstationary bandit environments, the regret incurred by Thompson sampling can far exceed that of predictive sampling. We also present implementations of predictive sampling that scale to complex bandit environments of practical interest in a computationally tractable manner. Through simulations, we demonstrate that predictive sampling outperforms Thompson sampling and other state-of-the-art algorithms across a wide range of nonstationary bandit environments.  ( 2 min )
    Diffusion Probabilistic Modeling for Video Generation. (arXiv:2203.09481v5 [cs.CV] UPDATED)
    Denoising diffusion probabilistic models are a promising new class of generative models that mark a milestone in high-quality image generation. This paper showcases their ability to sequentially generate video, surpassing prior methods in perceptual and probabilistic forecasting metrics. We propose an autoregressive, end-to-end optimized video diffusion model inspired by recent advances in neural video compression. The model successively generates future frames by correcting a deterministic next-frame prediction using a stochastic residual generated by an inverse diffusion process. We compare this approach against five baselines on four datasets involving natural and simulation-based videos. We find significant improvements in terms of perceptual quality for all datasets. Furthermore, by introducing a scalable version of the Continuous Ranked Probability Score (CRPS) applicable to video, we show that our model also outperforms existing approaches in their probabilistic frame forecasting ability.  ( 2 min )
    Inexact bilevel stochastic gradient methods for constrained and unconstrained lower-level problems. (arXiv:2110.00604v2 [math.OC] UPDATED)
    Two-level stochastic optimization formulations have become instrumental in a number of machine learning contexts such as continual learning, neural architecture search, adversarial learning, and hyperparameter tuning. Practical stochastic bilevel optimization problems become challenging in optimization or learning scenarios where the number of variables is high or there are constraints. In this paper, we introduce a bilevel stochastic gradient method for bilevel problems with lower-level constraints. We also present a comprehensive convergence theory that covers all inexact calculations of the adjoint gradient (also called hypergradient) and addresses both the lower-level unconstrained and constrained cases. To promote the use of bilevel optimization in large-scale learning, we introduce a practical bilevel stochastic gradient method (BSG-1) that does not require second-order derivatives and, in the lower-level unconstrained case, dispenses with any system solves and matrix-vector products.  ( 2 min )
    COIN++: Neural Compression Across Modalities. (arXiv:2201.12904v3 [cs.LG] UPDATED)
    Neural compression algorithms are typically based on autoencoders that require specialized encoder and decoder architectures for different data modalities. In this paper, we propose COIN++, a neural compression framework that seamlessly handles a wide range of data modalities. Our approach is based on converting data to implicit neural representations, i.e. neural functions that map coordinates (such as pixel locations) to features (such as RGB values). Then, instead of storing the weights of the implicit neural representation directly, we store modulations applied to a meta-learned base network as a compressed code for the data. We further quantize and entropy code these modulations, leading to large compression gains while reducing encoding time by two orders of magnitude compared to baselines. We empirically demonstrate the feasibility of our method by compressing various data modalities, from images and audio to medical and climate data.  ( 2 min )
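    The core object is easy to sketch: an implicit neural representation is just a network fitted to map coordinates to values. The PyTorch toy below fits a small MLP to a random stand-in "image"; COIN++ instead stores quantized, entropy-coded modulations of a shared meta-learned base network, which this sketch does not include, and the architecture here is an assumed placeholder.

        import torch

        H = W = 32
        img = torch.rand(H, W)                                   # stand-in "image" to compress
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, H),
                                torch.linspace(-1, 1, W), indexing="ij")
        coords = torch.stack([xs, ys], dim=-1).reshape(-1, 2)    # pixel locations
        target = img.reshape(-1, 1)                              # pixel values

        net = torch.nn.Sequential(
            torch.nn.Linear(2, 64), torch.nn.Tanh(),
            torch.nn.Linear(64, 64), torch.nn.Tanh(),
            torch.nn.Linear(64, 1),
        )
        opt = torch.optim.Adam(net.parameters(), lr=1e-3)
        for step in range(1000):
            opt.zero_grad()
            loss = ((net(coords) - target) ** 2).mean()          # fit coordinates -> values
            loss.backward(); opt.step()
        # Storing this network's weights (or, in COIN++, per-datum modulations of a
        # shared base network) is what replaces storing the raw pixels.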
    General-Purpose In-Context Learning by Meta-Learning Transformers. (arXiv:2212.04458v1 [cs.LG])
    Modern machine learning requires system designers to specify aspects of the learning pipeline, such as losses, architectures, and optimizers. Meta-learning, or learning-to-learn, instead aims to learn those aspects, and promises to unlock greater capabilities with less manual effort. One particularly ambitious goal of meta-learning is to train general-purpose in-context learning algorithms from scratch, using only black-box models with minimal inductive bias. Such a model takes in training data, and produces test-set predictions across a wide range of problems, without any explicit definition of an inference model, training loss, or optimization algorithm. In this paper we show that Transformers and other black-box models can be meta-trained to act as general-purpose in-context learners. We characterize phase transitions between algorithms that generalize, algorithms that memorize, and algorithms that fail to meta-train at all, induced by changes in model size, number of tasks, and meta-optimization. We further show that the capabilities of meta-trained algorithms are bottlenecked by the accessible state size (memory) determining the next prediction, unlike standard models which are thought to be bottlenecked by parameter count. Finally, we propose practical interventions such as biasing the training distribution that improve the meta-training and meta-generalization of general-purpose learning algorithms.  ( 2 min )
    Disaggregated Interventions to Reduce Inequality. (arXiv:2107.00593v3 [cs.LG] UPDATED)
    A significant body of research in the data sciences considers unfair discrimination against social categories such as race or gender that could occur or be amplified as a result of algorithmic decisions. Simultaneously, real-world disparities continue to exist, even before algorithmic decisions are made. In this work, we draw on insights from the social sciences brought into the realm of causal modeling and constrained optimization, and develop a novel algorithmic framework for tackling pre-existing real-world disparities. The purpose of our framework, which we call the "impact remediation framework," is to measure real-world disparities and discover the optimal intervention policies that could help improve equity or access to opportunity for those who are underserved with respect to an outcome of interest. We develop a disaggregated approach to tackling pre-existing disparities that relaxes the typical set of assumptions required for the use of social categories in structural causal models. Our approach flexibly incorporates counterfactuals and is compatible with various ontological assumptions about the nature of social categories. We demonstrate impact remediation with a hypothetical case study and compare our disaggregated approach to an existing state-of-the-art approach, comparing its structure and resulting policy recommendations. In contrast to most work on optimal policy learning, we explore disparity reduction itself as an objective, explicitly focusing the power of algorithms on reducing inequality.  ( 2 min )
    Shapley values for cluster importance: How clusters of the training data affect a prediction. (arXiv:2012.03625v2 [stat.ML] UPDATED)
    This paper proposes a novel approach to explaining the predictions made by data-driven methods. Since such predictions rely heavily on the data used for training, explanations that convey information about how the training data affects the predictions are useful. Specifically, the paper proposes a way to quantify how different data-clusters of the training data affect a prediction. The quantification is based on Shapley values, a concept which originates from coalitional game theory, developed to fairly distribute the payout among a set of cooperating players. A player's Shapley value is a measure of that player's contribution. Shapley values are often used to quantify feature importance, i.e. how features affect a prediction. This paper extends this to cluster importance, letting clusters of the training data act as players in a game where the predictions are the payouts. The novel methodology proposed in this paper lets us explore and investigate how different clusters of the training data affect the predictions made by any black-box model, allowing new aspects of the reasoning and inner workings of a prediction model to be conveyed to the users. The methodology is fundamentally different from existing explanation methods, providing insight which would not be available otherwise, and should complement existing explanation methods, including explanations based on feature importance.  ( 2 min )
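    The mechanics can be sketched by letting training-data clusters play the coalition game directly (a Monte Carlo permutation estimate on a scikit-learn toy problem; the paper's exact estimator may differ):

        import numpy as np
        from sklearn.cluster import KMeans
        from sklearn.datasets import make_regression
        from sklearn.linear_model import Ridge

        rng = np.random.default_rng(0)
        X, y = make_regression(n_samples=300, n_features=5, noise=5.0, random_state=0)
        k = 4
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        x_test = X[:1]  # the single prediction we want to explain

        def predict_with(clusters):
            """Refit on the selected training clusters and predict for x_test."""
            if not clusters:
                return y.mean()              # empty coalition: baseline prediction
            mask = np.isin(labels, list(clusters))
            return Ridge().fit(X[mask], y[mask]).predict(x_test)[0]

        # Monte Carlo Shapley values: average marginal contribution over random orders.
        n_perm, phi = 200, np.zeros(k)
        for _ in range(n_perm):
            order = rng.permutation(k)
            coalition, prev = set(), predict_with(set())
            for c in order:
                coalition.add(c)
                cur = predict_with(coalition)
                phi[c] += cur - prev
                prev = cur
        phi /= n_perm
        # By Shapley efficiency, phi sums to (full-data prediction - baseline).
        print("cluster Shapley values:", np.round(phi, 3))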
    Model-based trajectory stitching for improved behavioural cloning and its applications. (arXiv:2212.04280v1 [stat.ML])
    Behavioural cloning (BC) is a commonly used imitation learning method to infer a sequential decision-making policy from expert demonstrations. However, when the quality of the data is not optimal, the resulting behavioural policy also performs sub-optimally once deployed. Recently, there has been a surge in offline reinforcement learning methods that hold the promise to extract high-quality policies from sub-optimal historical data. A common approach is to perform regularisation during training, encouraging updates during policy evaluation and/or policy improvement to stay close to the underlying data. In this work, we investigate whether an offline approach to improving the quality of the existing data can lead to improved behavioural policies without any changes in the BC algorithm. The proposed data improvement approach - Trajectory Stitching (TS) - generates new trajectories (sequences of states and actions) by `stitching' pairs of states that were disconnected in the original data and generating a new action connecting them. By construction, these new transitions are guaranteed to be highly plausible according to probabilistic models of the environment, and to improve a state-value function. We demonstrate that the iterative process of replacing old trajectories with new ones incrementally improves the underlying behavioural policy. Extensive experimental results show that significant performance gains can be achieved using TS over BC policies extracted from the original data. Furthermore, using the D4RL benchmarking suite, we demonstrate that state-of-the-art results are obtained by combining TS with two existing offline learning methodologies reliant on BC, model-based offline planning (MBOP) and policy constraint (TD3+BC).  ( 2 min )
    Differentially-Private Bayes Consistency. (arXiv:2212.04216v1 [cs.LG])
    We construct a universally Bayes consistent learning rule that satisfies differential privacy (DP). We first handle the setting of binary classification and then extend our rule to the more general setting of density estimation (with respect to the total variation metric). The existence of a universally consistent DP learner reveals a stark difference with the distribution-free PAC model. Indeed, in the latter DP learning is extremely limited: even one-dimensional linear classifiers are not privately learnable in this stringent model. Our result thus demonstrates that by allowing the learning rate to depend on the target distribution, one can circumvent the above-mentioned impossibility result and in fact, learn \emph{arbitrary} distributions by a single DP algorithm. As an application, we prove that any VC class can be privately learned in a semi-supervised setting with a near-optimal \emph{labeled} sample complexity of $\tilde{O}(d/\varepsilon)$ labeled examples (and with an unlabeled sample complexity that can depend on the target distribution).  ( 2 min )
    Application of machine learning regression models to inverse eigenvalue problems. (arXiv:2212.04279v1 [math.NA])
    In this work, we study the numerical solution of inverse eigenvalue problems from a machine learning perspective. Two different problems are considered: the inverse Sturm-Liouville eigenvalue problem for symmetric potentials and the inverse transmission eigenvalue problem for spherically symmetric refractive indices. Firstly, we solve the corresponding direct problems to produce the required eigenvalue datasets with which to train the machine learning algorithms. Next, we consider several examples of inverse problems and compare the performance of each model in predicting the unknown potentials and refractive indices, respectively, from a given small set of the lowest eigenvalues. The supervised regression models we use are k-Nearest Neighbours, Random Forests and Multi-Layer Perceptrons. Our experiments show that these machine learning methods, with appropriate tuning of their parameters, can numerically solve the examined inverse eigenvalue problems.  ( 2 min )
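    The workflow can be sketched end to end (assumptions: a finite-difference Sturm-Liouville discretization and a two-parameter symmetric potential; the paper's potentials and solvers are more elaborate):

        import numpy as np
        from sklearn.neighbors import KNeighborsRegressor

        n, L = 100, np.pi
        x = np.linspace(0, L, n + 2)[1:-1]          # interior grid points
        h = x[1] - x[0]

        def lowest_eigenvalues(a, b, m=5):
            """Eigenvalues of -u'' + q(x) u with q(x) = a + b*cos(2x), Dirichlet BCs."""
            q = a + b * np.cos(2 * x)
            A = (np.diag(2 * np.ones(n)) - np.diag(np.ones(n - 1), 1)
                 - np.diag(np.ones(n - 1), -1)) / h**2 + np.diag(q)
            return np.linalg.eigvalsh(A)[:m]

        # Direct problem generates the training set: eigenvalues -> potential params.
        rng = np.random.default_rng(0)
        params = rng.uniform(-1, 1, size=(500, 2))
        eigs = np.array([lowest_eigenvalues(a, b) for a, b in params])

        model = KNeighborsRegressor(n_neighbors=5).fit(eigs, params)
        true = np.array([0.3, -0.5])
        pred = model.predict(lowest_eigenvalues(*true).reshape(1, -1))[0]
        print("true (a, b):", true, " predicted:", np.round(pred, 3))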
    The Ordered Matrix Dirichlet for Modeling Ordinal Dynamics. (arXiv:2212.04130v1 [stat.ML])
    Many dynamical systems exhibit latent states with intrinsic orderings, such as "ally", "neutral" and "enemy" relationships in international relations. Such latent states are evidenced through entities' cooperative versus conflictual interactions, which are similarly ordered. Models of such systems often involve state-to-action emission and state-to-state transition matrices. It is common practice to assume that the rows of these stochastic matrices are independently sampled from a Dirichlet distribution. However, this assumption discards ordinal information and treats states and actions falsely as order-invariant categoricals, which hinders interpretation and evaluation. To address this problem, we propose the Ordered Matrix Dirichlet (OMD): rows are sampled in a conditionally dependent manner, such that probability mass shifts to the right of the matrix as we move down its rows. This results in a well-ordered mapping between latent states and observed action types. We evaluate the OMD in two settings: a Hidden Markov Model and a novel Bayesian Dynamic Poisson Tucker Model tailored to political event data. Models built on the OMD recover interpretable latent states and show superior forecasting performance in few-shot settings. We detail the wide applicability of the OMD to other domains where models with Dirichlet-sampled matrices are popular (e.g. topic modeling) and publish user-friendly code.  ( 2 min )
    Transfer learning for chemically accurate interatomic neural network potentials. (arXiv:2212.03916v1 [physics.comp-ph])
    Developing machine learning-based interatomic potentials from ab-initio electronic structure methods remains a challenging task for computational chemistry and materials science. This work studies the capability of transfer learning for efficiently generating chemically accurate interatomic neural network potentials on organic molecules from the MD17 and ANI data sets. We show that pre-training the network parameters on data obtained from density functional calculations considerably improves the sample efficiency of models trained on more accurate ab-initio data. Additionally, we show that fine-tuning with energy labels alone suffices to obtain accurate atomic forces and run large-scale atomistic simulations. We also investigate possible limitations of transfer learning, especially regarding the design and size of the pre-training and fine-tuning data sets. Finally, we provide GM-NN potentials pre-trained and fine-tuned on the ANI-1x and ANI-1ccx data sets, which can easily be fine-tuned on and applied to organic molecules.  ( 2 min )
    A parallelizable model-based approach for marginal and multivariate clustering. (arXiv:2212.04009v1 [stat.ML])
    This paper develops a clustering method that takes advantage of the sturdiness of model-based clustering, while attempting to mitigate some of its pitfalls. First, we note that standard model-based clustering likely leads to the same number of clusters per margin, which seems a rather artificial assumption for a variety of datasets. We tackle this issue by specifying a finite mixture model per margin that allows each margin to have a different number of clusters, and then cluster the multivariate data using a strategy game-inspired algorithm which we call Reign-and-Conquer. Second, since the proposed clustering approach only specifies a model for the margins -- but leaves the joint unspecified -- it has the advantage of being partially parallelizable; hence, the proposed approach is computationally appealing as well as more tractable for moderate to high dimensions than a `full' (joint) model-based clustering approach. A battery of numerical experiments on artificial data indicates overall good performance of the proposed methods in a variety of scenarios, and real datasets are used to showcase their application in practice.  ( 2 min )
    A probabilistic autoencoder for causal discovery. (arXiv:2212.04235v1 [stat.ML])
    The paper addresses the problem of finding the causal direction between two associated variables. The proposed solution is to build an autoencoder of their joint distribution and to maximize its estimation capacity relative to both the marginal distributions. It is shown that the resulting two capacities cannot, in general, be equal. This leads to a new criterion for causal discovery: the higher capacity is consistent with the unconstrained choice of a distribution representing the cause, while the lower capacity reflects the constraints imposed by the mechanism on the distribution of the effect. Estimation capacity is defined as the ability of the autoencoder to represent arbitrary datasets. A regularization term forces it to decide which one of the variables to model in a more generic way, i.e., while maintaining higher model capacity. The causal direction is revealed by the constraints encountered while encoding the data, instead of being measured as a property of the data itself. The idea is implemented and tested using a restricted Boltzmann machine.  ( 2 min )
    A Novel Stochastic Gradient Descent Algorithm for Learning Principal Subspaces. (arXiv:2212.04025v1 [cs.LG])
    Many machine learning problems encode their data as a matrix with a possibly very large number of rows and columns. In several applications like neuroscience, image compression or deep reinforcement learning, the principal subspace of such a matrix provides a useful, low-dimensional representation of individual data. Here, we are interested in determining the $d$-dimensional principal subspace of a given matrix from sample entries, i.e. from small random submatrices. Although a number of sample-based methods exist for this problem (e.g. Oja's rule; Oja, 1982), these assume access to full columns of the matrix or particular matrix structure such as symmetry, and cannot be combined as-is with neural networks (Baldi & Hornik, 1989). In this paper, we derive an algorithm that learns a principal subspace from sample entries, can be applied when the approximate subspace is represented by a neural network, and hence can be scaled to datasets with an effectively infinite number of rows and columns. Our method consists in defining a loss function whose minimizer is the desired principal subspace, and constructing a gradient estimate of this loss whose bias can be controlled. We complement our theoretical analysis with a series of experiments on synthetic matrices, the MNIST dataset (LeCun et al., 2010) and the reinforcement learning domain PuddleWorld (Sutton, 1995), demonstrating the usefulness of our approach.  ( 2 min )
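    A simplified version of the idea (assumptions: full-matrix access rather than sampled entries, and plain gradient descent on a reconstruction loss whose minimizers span the top-d subspace; the paper's unbiased sample-based gradient estimator is not reproduced here):

        import numpy as np

        rng = np.random.default_rng(0)
        m, n, d = 50, 40, 3
        A = rng.normal(size=(m, 10)) @ rng.normal(size=(10, n))  # rank-10 matrix

        V = rng.normal(size=(m, d))
        lr = 0.05 / np.linalg.norm(A, 2) ** 2   # step size scaled to the spectrum
        for step in range(5000):
            R = A - V @ (V.T @ A)                    # residual after projecting onto span(V)
            grad = -2 * (R @ A.T @ V + A @ R.T @ V)  # gradient of ||A - V V^T A||_F^2
            V -= lr * grad

        # Compare the learned subspace with the top-d left singular vectors of A.
        U = np.linalg.svd(A)[0][:, :d]
        Q = np.linalg.qr(V)[0]
        print("cosines of principal angles:", np.round(np.linalg.svd(U.T @ Q)[1], 4))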
    Multi-Rate VAE: Train Once, Get the Full Rate-Distortion Curve. (arXiv:2212.03905v1 [cs.LG])
    Variational autoencoders (VAEs) are powerful tools for learning latent representations of data used in a wide range of applications. In practice, VAEs usually require multiple training rounds to choose the amount of information the latent variable should retain. This trade-off between the reconstruction error (distortion) and the KL divergence (rate) is typically parameterized by a hyperparameter $\beta$. In this paper, we introduce Multi-Rate VAE (MR-VAE), a computationally efficient framework for learning optimal parameters corresponding to various $\beta$ in a single training run. The key idea is to explicitly formulate a response function that maps $\beta$ to the optimal parameters using hypernetworks. MR-VAEs construct a compact response hypernetwork where the pre-activations are conditionally gated based on $\beta$. We justify the proposed architecture by analyzing linear VAEs and showing that it can represent response functions exactly for linear VAEs. With the learned hypernetwork, MR-VAEs can construct the rate-distortion curve without additional training and can be deployed with significantly less hyperparameter tuning. Empirically, our approach is competitive with and often exceeds the performance of multiple separately trained $\beta$-VAEs, with minimal computation and memory overheads.  ( 2 min )
    Pre-Training With Scientific Text Improves Educational Question Generation. (arXiv:2212.03869v1 [cs.CL])
    With the boom of digital educational materials and scalable e-learning systems, the potential for realising AI-assisted personalised learning has skyrocketed. In this landscape, the automatic generation of educational questions will play a key role, enabling scalable self-assessment when a global population is manoeuvring their personalised learning journeys. We develop EduQG, a novel educational question generation model built by adapting a large language model. Our initial experiments demonstrate that EduQG can produce superior educational questions by pre-training on scientific text.  ( 2 min )
    Counterfactuals for the Future. (arXiv:2212.03974v1 [cs.AI])
    Counterfactuals are often described as 'retrospective,' focusing on hypothetical alternatives to a realized past. This description relates to an often implicit assumption about the structure and stability of exogenous variables in the system being modeled -- an assumption that is reasonable in many settings where counterfactuals are used. In this work, we consider cases where we might reasonably make a different assumption about exogenous variables, namely, that the exogenous noise terms of each unit do exhibit some unit-specific structure and/or stability. This leads us to a different use of counterfactuals -- a 'forward-looking' rather than 'retrospective' counterfactual. We introduce "counterfactual treatment choice," a type of treatment choice problem that motivates using forward-looking counterfactuals. We then explore how mismatches between interventional versus forward-looking counterfactual approaches to treatment choice, consistent with different assumptions about exogenous noise, can lead to counterintuitive results.  ( 2 min )
    Strong identifiability and parameter learning in regression with heterogeneous response. (arXiv:2212.04091v1 [math.ST])
    Mixtures of regression are a powerful class of models for regression learning with respect to a highly uncertain and heterogeneous response variable of interest. In addition to being a rich predictive model for the response given some covariates, the parameters in this model class provide useful information about the heterogeneity in the data population, which is represented by the conditional distributions for the response given the covariates associated with a number of distinct but latent subpopulations. In this paper, we investigate conditions of strong identifiability, rates of convergence for conditional density and parameter estimation, and the Bayesian posterior contraction behavior arising in finite mixture of regression models, under exact-fitted and over-fitted settings and when the number of components is unknown. This theory is applicable to common choices of link functions and families of conditional distributions employed by practitioners. We provide simulation studies and data illustrations, which shed some light on the parameter learning behavior found in several popular regression mixture models reported in the literature.  ( 2 min )
    Statistical and Computational Guarantees for Influence Diagnostics. (arXiv:2212.04014v1 [stat.ML])
    Influence diagnostics such as influence functions and approximate maximum influence perturbations are popular in machine learning and in AI domain applications. Influence diagnostics are powerful statistical tools to identify influential datapoints or subsets of datapoints. We establish finite-sample statistical bounds, as well as computational complexity bounds, for influence functions and approximate maximum influence perturbations using efficient inverse-Hessian-vector product implementations. We illustrate our results with generalized linear models and large attention-based models on synthetic and real data.  ( 2 min )

  • Open

    AI-based assessment of cardiac allograft rejections
    submitted by /u/pasticciociccio [link] [comments]  ( 44 min )
    Why We Normalize The Input Data
    submitted by /u/Personal-Trainer-541 [link] [comments]  ( 46 min )
  • Open

    Does "massively parallel simulation" actually help advance Reinforcement Learning?
    NVIDIA's Isaac Gym project revealed GPUs' capability of performing massively parallel simulation for gym-style environments. Detailed information can be found in the following paper: [1] Makoviychuk, Viktor, et al. "Isaac Gym: High-Performance GPU Based Physics Simulation For Robot Learning." Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2). 2021. At its release, people commented on Twitter that "it is the MNIST moment for reinforcement learning." And over the past year, I saw several follow-up works and tested NVIDIA's implementations. For example, there is a demo in this blog post: https://towardsdatascience.com/a-new-era-of-massively-parallel-simulation-a-practical-tutorial-using-elegantrl-5ebc483c3385 The question is: does that technique actually help advance Reinforcement Learning, as expected? submitted by /u/Capital-Style-6613 [link] [comments]  ( 56 min )
    Jack Parker-Holder, DeepMind: On open-endedness, evolving agents and environments, online adaptation, and offline learning
    Here is a podcast episode with Jack Parker-Holder from DeepMind where we discuss open-endedness, evolving agents and environments, offline learning with world models, and much more! submitted by /u/thejashGI [link] [comments]  ( 54 min )
    Illustrating Reinforcement Learning from Human Feedback (RLHF)
    submitted by /u/robotphilanthropist [link] [comments]  ( 55 min )
    Question on how to model a "discontinuous" action space
    Hi, I'm working on a problem where the agent has two choices at any time step. It can either choose to perform action 1 which ends the sequence or perform action 2 which takes a continuous valued parameter (that the agent needs to choose). I'm not exactly sure how I can model this. Should I model it as two decisions? First have the agent pick between actions 1 and 2 and then if action 2 is picked, have it choose the continuous valued parameter? submitted by /u/theanswerisnt42 [link] [comments]  ( 57 min )
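    One common way to model this (a hedged sketch, not the only option) is a factored policy: a Bernoulli head picks between "end" (action 1) and "continue" (action 2), and a Gaussian head supplies the continuous parameter, sampled only when action 2 is chosen. Because the joint log-probability is just the sum of the two heads' log-probabilities, REINFORCE or actor-critic updates go through unchanged:

        import torch
        import torch.nn as nn

        class HybridPolicy(nn.Module):
            def __init__(self, obs_dim, hidden=64):
                super().__init__()
                self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
                self.stop_logit = nn.Linear(hidden, 1)   # P(action 1 = end sequence)
                self.mu = nn.Linear(hidden, 1)           # parameter of action 2
                self.log_std = nn.Parameter(torch.zeros(1))

            def act(self, obs):
                h = self.body(obs)
                stop_dist = torch.distributions.Bernoulli(logits=self.stop_logit(h))
                stop = stop_dist.sample()
                log_prob = stop_dist.log_prob(stop)
                param = None
                if stop.item() == 0:                 # action 2: also sample the parameter
                    par_dist = torch.distributions.Normal(self.mu(h), self.log_std.exp())
                    param = par_dist.sample()
                    log_prob = log_prob + par_dist.log_prob(param)
                return stop, param, log_prob         # log_prob feeds the policy gradient

        policy = HybridPolicy(obs_dim=4)
        stop, param, logp = policy.act(torch.randn(4))
        print(stop.item(), None if param is None else param.item(), logp.item())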
  • Open

    "AI-based assessment of cardiac allograft rejections" Lipkova et al.
    submitted by /u/pasticciociccio [link] [comments]  ( 42 min )
    The Illusion of Control: How our attempts to tame time only serve to prolong our suffering
    submitted by /u/nalr00n [link] [comments]  ( 48 min )
    Story
    Hello. I just found something on AI about making a screenplay, and it's on a video called Elastic Monkeys. The software is called Story, and it includes a UI to describe your characters. Here's a link to the website. I want to know how to download it. submitted by /u/Tipene5 [link] [comments]  ( 42 min )
    Looking for a cofounder for an AI app
    Apologies in advance if this is not the right place to post. I am building an app to help supercharge the productivity of customer support / sales professionals. I have a background in enterprise sales and my cofounder is an expert UI and frontend dev (iOS, web, mobile). We are looking for an AI dev (ML Ops, custom model training, etc) to join our team and help build the product. Feel free to DM directly or comment below to learn more. submitted by /u/pragmaticpirate [link] [comments]  ( 46 min )
    Convert descriptions to images with transparent backgrounds using AI
    submitted by /u/barty777 [link] [comments]  ( 44 min )
    Discovering the best portion to sample from an audio file for pitch detection?
    Assume that we cannot guarantee what the whole audio file will be like, but we know that some of it contains the signal whose pitch we're interested in detecting. However, other parts can contain noise, invalid data, ... Is this unfeasible? It sounds like a hard problem. submitted by /u/mavavilj [link] [comments]  ( 48 min )
    ChatGPT is Trending but Singularity Identifies AI Adoption Obstacles
    submitted by /u/Kipyegonn [link] [comments]  ( 43 min )
    Is Stable Diffusion 2.1 Disappointing?
    submitted by /u/PuppetHere [link] [comments]  ( 44 min )
    AI Assistant that helps you analyze and choose options
    submitted by /u/SudoSharma [link] [comments]  ( 46 min )
    ChatGPT wrote Tic-Tac-Toe in Python... but it's unwinnable. Best I can do is tie.
    # Tic-Tac-Toe game

    # global constants
    X = "X"
    O = "O"
    EMPTY = " "
    TIE = "TIE"
    NUM_SQUARES = 9

    def display_instructions():
        """Display game instructions."""
        print(
            """
    Welcome to the greatest intellectual challenge of all time: Tic-Tac-Toe.
    This will be a showdown between your human brain and my silicon processor.

    You will make your move known by entering a number, 0 - 8.
    The number will correspond to the board position as illustrated:

        0 | 1 | 2
        ---------
        3 | 4 | 5
        ---------
        6 | 7 | 8

    Prepare yourself, human. The ultimate battle is about to begin. \n
    """
        )

    def ask_yes_no(question):
        """Ask a yes or no question."""
        response = None
        while response not in ("y", "n"):
            response = input(question).lower()
        return response

    def ask_number(question, low, high):
        """Ask for a number within a range."""
        resp…  ( 47 min )
    [R] Discovering Latent Knowledge in Language Models Without Supervision
    submitted by /u/walt74 [link] [comments]  ( 48 min )
    [paper] Diffusion Art or Digital Forgery? Investigating Data Replication in Diffusion Models
    submitted by /u/walt74 [link] [comments]  ( 45 min )
    AI switched up real quick
    submitted by /u/mmohand [link] [comments]  ( 42 min )
    What is a good AI image editing software for braindead people like me?
    I only need it for editing midjourney AI images. Is midjourney the best at the moment? submitted by /u/Thesmallcookie [link] [comments]  ( 45 min )
    I asked an AI to turn the crucifixion of Jesus into an action movie, and here is how it went
    submitted by /u/wazserd [link] [comments]  ( 61 min )
    ChatGPT - Ultimate Test ;)
    Everything I say is a lie. I'm lying. Based on your statement, it is impossible to determine whether or not you are lying. This is because you have stated that everything you say is a lie, which means that your statement that you are lying could be false. In other words, if everything you say is a lie, then it is possible that you are actually telling the truth when you say that you are lying. Because of this logical contradiction, it is impossible to determine the truth of your statement. submitted by /u/Sybrsean [link] [comments]  ( 43 min )
  • Open

    [Discussion] Universal product recommendation model
    Hi, I have a use-case which requires building a universal model to predict what a customer might take (or what we can recommend) as a product, i.e., a customer's next-best product offer. At the moment, we have 20 products, and only 4 of those have a model built for them; a model that predicts the propensity of a customer taking that product. From those models, monthly campaigns are launched, and the top 2 deciles (highest propensities) are targeted. One of the complexities here is the products themselves. Some products are loan products (example: personal loans, car loans, home loans, etc.) and others are non-lending products (example: an application, a digital wallet, insurance, etc.). Some of these products are secondary products/byproducts of other products; example: you can only get PF2 if you have held PF1 for at least a year, or you can't get HF1 if you currently hold HF1, HF2, or HF3, etc. A customer's personal information (age, nationality, city, region), inflow/outflow transactions, spending habits, salary, balance, and current and previous active/closed product holdings are all available. How would you approach this? What model type would work? I assume a universal model wouldn't be the right way, but building a model for each product isn't very practical at the moment. Appreciate the help. submitted by /u/crisler_iden [link] [comments]  ( 64 min )
    [P] Multi-Label Classification for Darts
    Hello all! I am trying to build a CNN model which is able to predict the darts score after throwing 1 to 3 steel-tip arrows. Currently my data set is around 10,000 pictures with labels, which look like this: Name of the picture; Sum of scores, score of arrow 1, score of arrow 2, score of arrow 3, arrow 1 thrown, arrow 2 thrown, arrow 3 thrown. Example: 12_2022-10-12_135955.050, 12, 12, 0, 0, True, True, False 17_2022-10-12_135958.923, 17, 10, 0, 7, True, False, True [....] My approach is to build a multi-label classifier and to use sigmoid activation. I transferred the labels from categorical variables into numeric variables using a one-hot encoder. My problem is that the numeric variables can be greater than 1 because all three arrows can land in the same dart field multiple times. Do you think I am on the right track with my approach, or do you have any tips on how to proceed? My problem is that I don't know how to handle the multiple labels. Thanks for any idea or tip! submitted by /u/sqrt-1iscomplex [link] [comments]  ( 66 min )
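    One way around the counts-greater-than-one problem (a sketch assuming a hypothetical 62-field board encoding and a stub backbone) is to predict a per-field dart count rather than a binary label, e.g. with a Poisson regression head:

        import torch
        import torch.nn as nn

        N_FIELDS = 62   # hypothetical: 20 singles + 20 doubles + 20 triples + 2 bulls

        class DartCountHead(nn.Module):
            """CNN backbone (stubbed here) followed by a per-field count head."""
            def __init__(self, feat_dim=128):
                super().__init__()
                self.backbone = nn.Sequential(nn.Flatten(),
                                              nn.LazyLinear(feat_dim), nn.ReLU())
                self.head = nn.Linear(feat_dim, N_FIELDS)

            def forward(self, img):
                return self.head(self.backbone(img))  # log-rate per dart field

        model = DartCountHead()
        imgs = torch.randn(8, 3, 64, 64)              # dummy batch of board images
        # target: how many of the (up to 3) darts landed in each field
        counts = torch.zeros(8, N_FIELDS)
        counts[:, 5] = 2.0                            # e.g. two darts in field 5
        counts[:, 17] = 1.0

        log_rates = model(imgs)
        loss = nn.functional.poisson_nll_loss(log_rates, counts, log_input=True)
        loss.backward()
        print(f"Poisson NLL: {loss.item():.3f}")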
    [D] Making a regression NN estimate its own regression error
    Is there a way to make a neural network that performs only regression to estimate its own error at inference time, without having ground truths for reference? My network predicts N points and I know the [x,y] coordinates for each. On a labeled test set I can compute the distance between each point and the ground truths, however, I want the network to be able to estimate these distances by itself. I do not have separate classes, the network is trained using just the L2 loss between its predicted points and the expected ground truth points. submitted by /u/Alex-S-S [link] [comments]  ( 67 min )
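    Yes: one standard trick (sketched here with a plain MLP) is a second "error head" trained to regress the detached per-point L2 errors of the main head, so at inference the network reports both its predicted points and an estimate of how far off each one is:

        import torch
        import torch.nn as nn

        class PointsWithErrorEstimate(nn.Module):
            def __init__(self, in_dim=32, n_points=5):
                super().__init__()
                self.n_points = n_points
                self.trunk = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU())
                self.points = nn.Linear(64, n_points * 2)  # predicted (x, y) per point
                self.err = nn.Linear(64, n_points)         # predicted L2 error per point

            def forward(self, x):
                h = self.trunk(x)
                return self.points(h).view(-1, self.n_points, 2), self.err(h)

        model = PointsWithErrorEstimate()
        opt = torch.optim.Adam(model.parameters(), lr=1e-3)
        x, gt = torch.randn(16, 32), torch.randn(16, 5, 2)

        for step in range(100):
            opt.zero_grad()
            pred, err_hat = model(x)
            dist = (pred - gt).norm(dim=-1)      # true per-point error
            loss = dist.mean()                   # the usual regression loss
            # error head regresses the *detached* distances: its gradient must
            # not flow back into the point predictions themselves
            loss = loss + ((err_hat - dist.detach()) ** 2).mean()
            loss.backward()
            opt.step()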
    [D] Jack Parker-Holder, DeepMind: On open-endedness, evolving agents and environments, online adaptation, and offline learning
    Here is a podcast episode with Jack Parker-Holder from DeepMind where we discuss open-endedness, evolving agents and environments, offline learning with world models, and much more! submitted by /u/thejashGI [link] [comments]  ( 70 min )
    [R] Illustrating Reinforcement Learning from Human Feedback (RLHF)
    New HuggingFace blog post on RLHF: https://huggingface.co/blog/rlhf Motivated by ChatGPT and the lack of conceptually focused resources on the topic. submitted by /u/robotphilanthropist [link] [comments]  ( 66 min )
    [D] Seeking efficient audio based data augmentation (bottlenecks, need advice)
    In audio, I have to generate an image representation of some folder of audio files. Unfortunately, I have to generate a cochleagram and run each audio file through an auditory toolbox one by one. For large datasets this can take hours for each iteration when testing things out (different resolutions, number of filterbanks applied, etc.). I then have to save off all the features into one flat dataframe on my hard drive for training. A sub-problem to this is data augmentation. Pedalboard is a Python library that allows for audio data augmentation. However, this must be performed while the audio is still audio, before the cochleagram data is generated using the filterbanks. For this reason, I either need to save the entire augmented dataset in a flat file and then generate the visual representations, or make all this pre-processing part of the CNN, which would make training time unfeasible. Thus, the only augmentations I use are simple image shifts and Gaussian blur, when I really could be making use of different convolutional reverbs and other forms of data augmentation using VST plugins. Is this just where things are at right now? I was initially planning on comparing different filterbanks for use as inputs to CNNs, but it now seems a better comparison may be comparing cochleagrams (without audio-based augmentation) to mel-spectrograms (with audio-based augmentation), which would allow me to make everything part of the training loop since mel-spectrograms generate much faster. submitted by /u/Oceanboi [link] [comments]  ( 67 min )
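    For the "augment once, offline" route, a minimal sketch (assuming the pedalboard API roughly as documented; directory names are placeholders) is to render each augmented variant to disk first, then run the slow cochleagram generation once over the enlarged set:

        import os
        from pedalboard import Pedalboard, Reverb, PitchShift, Gain
        from pedalboard.io import AudioFile

        IN_DIR, OUT_DIR = "clips", "clips_augmented"   # placeholder paths
        os.makedirs(OUT_DIR, exist_ok=True)

        # A few audio-domain augmentations; each becomes one extra dataset copy.
        boards = {
            "reverb": Pedalboard([Reverb(room_size=0.4)]),
            "pitch": Pedalboard([PitchShift(semitones=2)]),
            "quiet": Pedalboard([Gain(gain_db=-9)]),
        }

        for name in os.listdir(IN_DIR):
            with AudioFile(os.path.join(IN_DIR, name)) as f:
                audio, sr = f.read(f.frames), f.samplerate
            for tag, board in boards.items():
                out = board(audio, sr)
                out_path = os.path.join(OUT_DIR, f"{tag}_{name}")
                with AudioFile(out_path, "w", samplerate=sr,
                               num_channels=out.shape[0]) as f:
                    f.write(out)
        # ...then run the expensive cochleagram step once over both directories.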
    Representation ability of an MLP network [D]
    I am wondering: is there any relationship between the dimension of the input vector and the richness of the outputs it can represent? (E.g., can I say that a 10-dimensional feature vector has better representation ability than a 5-dimensional one, assuming data are sufficient to train a model?) Or if not, can you suggest a reference with a formal derivation that illustrates that relationship? submitted by /u/OutOfCharm [link] [comments]  ( 106 min )
    [D] On outlier removal
    Hi there, I have a question which has been answered multiple times on various blogs, but I need to find a paper for this. When we know that we have outliers which, let's say, have valid values and are not artifacts, do we keep them even if they mess up a model's performance? As a person with a PhD in machine learning, it was always clear to me that outliers (when few) are de facto to be deleted. I've never encountered a dilemma in the literature regarding outliers when training models (but again, strictly from the data scientist's point of view). In many top AI conferences (KDD, ICML, WWW, etc.), it is common practice to drop outliers to get a small boost in performance so as to beat the SOTA methods. I am looking for related literature which studies the ethical impact of removing/keeping outliers, or a review on outlier handling from top AI conferences. submitted by /u/potato_head_101 [link] [comments]  ( 64 min )
    [D] Dr. Petar Veličković (Deepmind) - Categories, Graphs, Reasoning and Graph Expander Propagation
    Hey folks, We interviewed Petar Veličković at NeurIPS last week here -- https://www.youtube.com/watch?v=1lkdWduuN14 Categories (Cats for AI) [00:00:00] Algorithmic Reasoning [00:14:44] Extrapolation [00:19:09] Ishan Misra Skit [00:27:50] Graphs (Expander Graph Propagation) [00:29:18] References: MLST#60 Geometric Deep Learning Blueprint (Special Edition) https://www.youtube.com/watch?v=bIZB1hIJ4u8 Categories for AI https://cats.for.ai/ Organised by: Andrew Dudzik - DeepMind; Bruno Gavranović - University of Strathclyde; João Guilherme Araújo - Cohere / Universidade de São Paulo; Petar Veličković - DeepMind / University of Cambridge; Pim de Haan - University of Amsterdam / Qualcomm AI Research. [Petar Veličković] Graph Attention Networks https://arxiv.org/abs/1…  ( 71 min )
    [D] Causal ML in Natural Language Processing
    What’s your opinion on Causal ML in NLP? Are there any research groups mainly working on this? submitted by /u/ameli__c [link] [comments]  ( 72 min )
    [R] Large language models are not zero-shot communicators
    Paper: Large language models are not zero-shot communicators (arXiv) Abstract: Despite widespread use of LLMs as conversational agents, evaluations of performance fail to capture a crucial aspect of communication: interpreting language in context. Humans interpret language using beliefs and prior knowledge about the world. For example, we intuitively understand the response "I wore gloves" to the question "Did you leave fingerprints?" as meaning "No". To investigate whether LLMs have the ability to make this type of inference, known as an implicature, we design a simple task and evaluate widely used state-of-the-art models. We find that, despite only evaluating on utterances that require a binary inference (yes or no), most perform close to random. Models adapted to be "aligned with human intent" perform much better, but still show a significant gap with human performance. We present our findings as the starting point for further research into evaluating how LLMs interpret language in context and to drive the development of more pragmatic and useful models of human discourse. Authors: Laura Ruis, Akbir Khan, Stella Biderman, Sara Hooker, Tim Rocktäschel, Edward Grefenstette submitted by /u/mrx-ai [link] [comments]  ( 78 min )
    [r] Diffusion Art or Digital Forgery? Investigating Data Replication in Diffusion Models
    Important paper researching the replication of training data in diffusion models. Very relevant to recent debates around "art theft" and "data laundering". Abstract Cutting-edge diffusion models produce images with high quality and customizability, enabling them to be used for commercial art and graphic design purposes. But do diffusion models create unique works of art, or are they stealing content directly from their training sets? In this work, we study image retrieval frameworks that enable us to compare generated images with training samples and detect when content has been replicated. Applying our frameworks to diffusion models trained on multiple datasets including Oxford flowers, Celeb-A, ImageNet, and LAION, we discuss how factors such as training set size impact rates of content replication. We also identify cases where diffusion models, including the popular Stable Diffusion model, blatantly copy from their training data. Paper: Diffusion Art or Digital Forgery? Investigating Data Replication in Diffusion Models submitted by /u/walt74 [link] [comments]  ( 66 min )
    [D] When to use 1x1 convolution
    I am aware that one 1x1 convolution is needed for separable convolution but when else is it useful. I see it used in mobilenetv2 before the depthwise separable convolution later in the bottleneck but not sure why. I also see it used with stride 2 when max pooling could be used instead. Could someone please explain the logic behind this. Thanks. submitted by /u/Ananth_A_007 [link] [comments]  ( 67 min )
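    In short, a 1x1 convolution is a per-pixel linear layer across channels, so it is used (i) to cheaply change the channel count, as in the expansion and projection layers around the depthwise convolution in MobileNetV2 bottlenecks, and (ii) with stride 2 as a learnable alternative to pooling. A small sketch:

        import torch
        import torch.nn as nn

        x = torch.randn(1, 32, 56, 56)          # (batch, channels, H, W)

        # (i) channel mixing / projection: per-pixel linear map across channels
        project = nn.Conv2d(32, 16, kernel_size=1)
        print(project(x).shape)                  # torch.Size([1, 16, 56, 56])

        # MobileNetV2-style inverted bottleneck: expand -> depthwise -> project
        block = nn.Sequential(
            nn.Conv2d(32, 192, 1),                           # 1x1 expansion
            nn.Conv2d(192, 192, 3, padding=1, groups=192),   # 3x3 depthwise
            nn.Conv2d(192, 32, 1),                           # 1x1 projection down
        )
        print(block(x).shape)                    # torch.Size([1, 32, 56, 56])

        # (ii) strided 1x1 conv: downsamples like pooling, but with learned weights
        down = nn.Conv2d(32, 32, kernel_size=1, stride=2)
        print(down(x).shape)                     # torch.Size([1, 32, 28, 28])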
    [R] General-Purpose In-Context Learning by Meta-Learning Transformers
    submitted by /u/hardmaru [link] [comments]  ( 64 min )
  • Open

    Meet the 2022-23 Accenture Fellows
    This year's fellows will work across research areas including telemonitoring, human-computer interactions, operations research,  AI-mediated socialization, and chemical transformations.  ( 8 min )
  • Open

    Transfer Learning for Functional Linear Regression with Structural Interpretability. (arXiv:2206.04277v3 [stat.ML] UPDATED)
    This work studies the problem of transfer learning under the functional linear regression model framework, which aims to improve the estimation and prediction of the target model by leveraging the information from related source models. We measure the relatedness between target and source models using Reproducing Kernel Hilbert Spaces (RKHS) norm, allowing the type of information being transferred to be interpreted by the structural properties of the spaces. Two transfer learning algorithms are proposed: one transfers information from source tasks when we know which sources to use, while the other one aggregates multiple transfer learning results from the first algorithm to achieve robust transfer learning without prior information about the sources. Furthermore, we establish the optimal convergence rates for the prediction risk in the target model, making the statistical gain via transfer learning mathematically provable. The theoretical analysis of the prediction risk also provides insights regarding what factors are affecting the transfer learning effect, i.e. what makes source tasks useful to the target task. We demonstrate the effectiveness of the proposed transfer learning algorithms on extensive synthetic data as well as real financial data application.  ( 2 min )
    Detecting hidden confounding in observational data using multiple environments. (arXiv:2205.13935v2 [stat.ME] UPDATED)
    A common assumption in causal inference from observational data is that there is no hidden confounding. Yet it is, in general, impossible to verify the presence of hidden confounding factors from a single dataset. Under the assumption of independent causal mechanisms underlying the data generating process, we demonstrate a way to detect unobserved confounders given multiple observational datasets coming from different environments. We present a theory for testable conditional independencies that are only absent during hidden confounding and examine cases where we violate its assumptions: degenerate & dependent mechanisms, and faithfulness violations. Additionally, we propose a procedure to test these independencies and study its empirical finite-sample behavior using simulation studies and semi-synthetic data based on a real-world dataset. In most cases, our theory correctly predicts the presence of hidden confounding, particularly when the confounding bias is large.  ( 2 min )
    Reconstructing Training Data from Model Gradient, Provably. (arXiv:2212.03714v1 [cs.LG])
    Understanding when and how much a model gradient leaks information about the training sample is an important question in privacy. In this paper, we present a surprising result: even without training or memorizing the data, we can fully reconstruct the training samples from a single gradient query at a randomly chosen parameter value. We prove the identifiability of the training data under mild conditions: with shallow or deep neural networks and a wide range of activation functions. We also present a statistically and computationally efficient algorithm based on tensor decomposition to reconstruct the training data. As a provable attack that reveals sensitive training data, our findings suggest potential severe threats to privacy, especially in federated learning.  ( 2 min )
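    The phenomenon is easiest to see in the simplest case (a sketch, not the paper's tensor-decomposition algorithm): for a linear softmax classifier, the per-example weight gradient is the outer product (p - y) x^T, so any nonzero row reveals the input x up to a known scale:

        import numpy as np

        rng = np.random.default_rng(0)
        d, c = 8, 3
        x = rng.normal(size=d)                     # the "private" training example
        y = np.eye(c)[1]                           # its one-hot label
        W = rng.normal(size=(c, d))                # randomly chosen parameters

        logits = W @ x
        p = np.exp(logits) / np.exp(logits).sum()  # softmax
        grad_W = np.outer(p - y, x)                # cross-entropy gradient for W

        # Any row of grad_W equals (p_k - y_k) * x: recover x from row 0.
        x_hat = grad_W[0] / (p - y)[0]
        print("max reconstruction error:", np.abs(x_hat - x).max())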
    Fast Offline Policy Optimization for Large Scale Recommendation. (arXiv:2208.05327v3 [cs.IR] UPDATED)
    Personalised interactive systems such as recommender systems require selecting relevant items dependent on context. Production systems need to identify the items rapidly from very large catalogues, which can be efficiently solved using maximum inner product search technology. Offline optimisation of maximum inner product search can be achieved by a relaxation of the discrete problem, resulting in policy learning or REINFORCE style learning algorithms. Unfortunately, this relaxation step requires computing a sum over the entire catalogue, making the complexity of evaluating the gradient (and hence of each stochastic gradient descent iteration) linear in the catalogue size. This calculation is untenable in many real world examples such as large catalogue recommender systems, severely limiting the usefulness of this method in practice. In this paper, we derive an excellent approximation of these policy learning algorithms that scales logarithmically with the catalogue size. Our contribution is based upon combining three novel ideas: a new Monte Carlo estimate of the gradient of a policy, the self-normalised importance sampling estimator and the use of fast maximum inner product search at training time. Extensive experiments show that our algorithm is an order of magnitude faster than naive approaches yet produces equally good policies.
    A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification. (arXiv:2107.07511v6 [cs.LG] UPDATED)
    Black-box machine learning models are now routinely used in high-risk settings, like medical diagnostics, which demand uncertainty quantification to avoid consequential model failures. Conformal prediction is a user-friendly paradigm for creating statistically rigorous uncertainty sets/intervals for the predictions of such models. Critically, the sets are valid in a distribution-free sense: they possess explicit, non-asymptotic guarantees even without distributional assumptions or model assumptions. One can use conformal prediction with any pre-trained model, such as a neural network, to produce sets that are guaranteed to contain the ground truth with a user-specified probability, such as 90%. It is easy to understand, easy to use, and general, applying naturally to problems arising in the fields of computer vision, natural language processing, deep reinforcement learning, and so on. This hands-on introduction aims to provide the reader with a working understanding of conformal prediction and related distribution-free uncertainty quantification techniques in one self-contained document. We lead the reader through practical theory for and examples of conformal prediction and describe its extensions to complex machine learning tasks involving structured outputs, distribution shift, time-series, outliers, models that abstain, and more. Throughout, there are many explanatory illustrations, examples, and code samples in Python. With each code sample comes a Jupyter notebook implementing the method on a real-data example; the notebooks can be accessed and easily run using our codebase.
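    The core recipe is short enough to sketch (split conformal regression with absolute-residual scores; any pre-trained model could stand in for the random forest):

        import numpy as np
        from sklearn.datasets import make_regression
        from sklearn.ensemble import RandomForestRegressor
        from sklearn.model_selection import train_test_split

        X, y = make_regression(n_samples=2000, n_features=10, noise=10.0, random_state=0)
        X_fit, X_rest, y_fit, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
        X_cal, X_test, y_cal, y_test = train_test_split(X_rest, y_rest, test_size=0.5,
                                                        random_state=0)

        model = RandomForestRegressor(random_state=0).fit(X_fit, y_fit)

        # Conformity scores on a held-out calibration set
        scores = np.abs(y_cal - model.predict(X_cal))
        alpha, n = 0.1, len(scores)
        qhat = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n, method="higher")

        # Prediction intervals: guaranteed >= 90% coverage under exchangeability
        pred = model.predict(X_test)
        covered = (y_test >= pred - qhat) & (y_test <= pred + qhat)
        print(f"empirical coverage: {covered.mean():.3f}")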
    Stable Conformal Prediction Sets. (arXiv:2112.10224v2 [stat.ML] UPDATED)
    When one observes a sequence of variables $(x_1, y_1), \ldots, (x_n, y_n)$, Conformal Prediction (CP) is a methodology that allows to estimate a confidence set for $y_{n+1}$ given $x_{n+1}$ by merely assuming that the distribution of the data is exchangeable. CP sets have guaranteed coverage for any finite population size $n$. While appealing, the computation of such a set turns out to be infeasible in general, e.g. when the unknown variable $y_{n+1}$ is continuous. The bottleneck is that it is based on a procedure that readjusts a prediction model on data where we replace the unknown target by all its possible values in order to select the most probable one. This requires computing an infinite number of models, which often makes it intractable. In this paper, we combine CP techniques with classical algorithmic stability bounds to derive a prediction set computable with a single model fit. We demonstrate that our proposed confidence set does not lose any coverage guarantees while avoiding the need for data splitting as currently done in the literature. We provide some numerical experiments to illustrate the tightness of our estimation when the sample size is sufficiently large, on both synthetic and real datasets.
    Leveraging Structure for Improved Classification of Grouped Biased Data. (arXiv:2212.03697v1 [stat.ML])
    We consider semi-supervised binary classification for applications in which data points are naturally grouped (e.g., survey responses grouped by state) and the labeled data is biased (e.g., survey respondents are not representative of the population). The groups overlap in the feature space and consequently the input-output patterns are related across the groups. To model the inherent structure in such data, we assume the partition-projected class-conditional invariance across groups, defined in terms of the group-agnostic feature space. We demonstrate that under this assumption, the group carries additional information about the class, over the group-agnostic features, with provably improved area under the ROC curve. Further assuming invariance of partition-projected class-conditional distributions across both labeled and unlabeled data, we derive a semi-supervised algorithm that explicitly leverages the structure to learn an optimal, group-aware, probability-calibrated classifier, despite the bias in the labeled data. Experiments on synthetic and real data demonstrate the efficacy of our algorithm over suitable baselines and ablative models, spanning standard supervised and semi-supervised learning approaches, with and without incorporating the group directly as a feature.
    When saliency goes off on a tangent: Interpreting Deep Neural Networks with nonlinear saliency maps. (arXiv:2110.06639v2 [cs.LG] UPDATED)
    A fundamental bottleneck in utilising complex machine learning systems for critical applications has been not knowing why and what they do, thus preventing the development of any crucial safety protocols. To date, no method exists that can provide full insight into the granularity of the neural network's decision process. In the past, saliency maps were an early attempt at resolving this problem through sensitivity calculations, whereby dimensions of a data point are selected based on how sensitive the output of the system is to them. However, the success of saliency maps has been at best limited, mainly due to the fact that they interpret the underlying learning system through a linear approximation. We present a novel class of methods for generating nonlinear saliency maps which fully account for the nonlinearity of the underlying learning system. While agreeing with linear saliency maps on simple problems where linear saliency maps are correct, they clearly identify more specific drivers of classification on complex examples where nonlinearities are more pronounced. This new class of methods significantly aids interpretability of deep neural networks and related machine learning systems. Crucially, they provide a starting point for their more broad use in serious applications, where 'why' is equally important as 'what'.
    Dynamic Learning of Correlation Potentials for a Time-Dependent Kohn-Sham System. (arXiv:2112.07067v2 [stat.ML] UPDATED)
    We develop methods to learn the correlation potential for a time-dependent Kohn-Sham (TDKS) system in one spatial dimension. We start from a low-dimensional two-electron system for which we can numerically solve the time-dependent Schrödinger equation; this yields electron densities suitable for training models of the correlation potential. We frame the learning problem as one of optimizing a least-squares objective subject to the constraint that the dynamics obey the TDKS equation. Applying adjoints, we develop efficient methods to compute gradients and thereby learn models of the correlation potential. Our results show that it is possible to learn values of the correlation potential such that the resulting electron densities match ground truth densities. We also show how to learn correlation potential functionals with memory, demonstrating one such model that yields reasonable results for trajectories outside the training set.
    Tight bounds for maximum $\ell_1$-margin classifiers. (arXiv:2212.03783v1 [stat.ML])
    Popular iterative algorithms such as boosting methods and coordinate descent on linear models converge to the maximum $\ell_1$-margin classifier, a.k.a. sparse hard-margin SVM, in high dimensional regimes where the data is linearly separable. Previous works consistently show that many estimators relying on the $\ell_1$-norm achieve improved statistical rates for hard sparse ground truths. We show that surprisingly, this adaptivity does not apply to the maximum $\ell_1$-margin classifier for a standard discriminative setting. In particular, for the noiseless setting, we prove tight upper and lower bounds for the prediction error that match existing rates of order $\frac{\|w^\ast\|_1^{2/3}}{n^{1/3}}$ for general ground truths $w^\ast$. To complete the picture, we show that when interpolating noisy observations, the error vanishes at a rate of order $\frac{1}{\sqrt{\log(d/n)}}$. We are therefore the first to show benign overfitting for the maximum $\ell_1$-margin classifier.
    Root-finding Approaches for Computing Conformal Prediction Set. (arXiv:2104.06648v3 [stat.ML] UPDATED)
    Conformal prediction constructs a confidence set for an unobserved response of a feature vector based on previous identically distributed and exchangeable observations of responses and features. It has a coverage guarantee at any nominal level without additional assumptions on their distribution. Its computation deplorably requires a refitting procedure for all replacement candidates of the target response. In regression settings, this corresponds to an infinite number of model fits. Apart from relatively simple estimators that can be written as pieces of linear function of the response, efficiently computing such sets is difficult, and is still considered as an open problem. We exploit the fact that, \emph{often}, conformal prediction sets are intervals whose boundaries can be efficiently approximated by classical root-finding algorithms. We investigate how this approach can overcome many limitations of formerly used strategies; we discuss its complexity and drawbacks.
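    The root-finding idea in miniature (a sketch using a mean-only model, for which the full-conformal refit is a one-liner; scipy's brentq stands in for the paper's solvers):

        import numpy as np
        from scipy.optimize import brentq

        rng = np.random.default_rng(0)
        y = rng.normal(loc=5.0, scale=1.0, size=200)   # exchangeable observations
        alpha = 0.1

        def p_value(z):
            """Full-conformal p-value of candidate y_{n+1} = z, mean-only model."""
            aug = np.append(y, z)
            mu = aug.mean()                 # the "model refit" on augmented data
            scores = np.abs(aug - mu)
            return (scores >= scores[-1]).mean()

        # The conformal set {z : p_value(z) > alpha} is an interval around the
        # point prediction; locate its endpoints with a classical root finder.
        center = y.mean()
        lo = brentq(lambda z: p_value(z) - alpha, center - 10, center)
        hi = brentq(lambda z: p_value(z) - alpha, center, center + 10)
        print(f"{1 - alpha:.0%} conformal interval: [{lo:.3f}, {hi:.3f}]")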
    Pseudo-Riemannian Embedding Models for Multi-Relational Graph Representations. (arXiv:2212.03720v1 [cs.SI])
    In this paper we generalize single-relation pseudo-Riemannian graph embedding models to multi-relational networks, and show that the typical approach of encoding relations as manifold transformations translates from the Riemannian to the pseudo-Riemannian case. In addition we construct a view of relations as separate spacetime submanifolds of multi-time manifolds, and consider an interpolation between a pseudo-Riemannian embedding model and its Wick-rotated Riemannian counterpart. We validate these extensions in the task of link prediction, focusing on flat Lorentzian manifolds, and demonstrate their use in both knowledge graph completion and knowledge discovery in a biological domain.
    A Nonsmooth Dynamical Systems Perspective on Accelerated Extensions of ADMM. (arXiv:1808.04048v6 [math.OC] UPDATED)
    Recently, there has been great interest in connections between continuous-time dynamical systems and optimization algorithms, notably in the context of accelerated methods for smooth and unconstrained problems. In this paper we extend this perspective to nonsmooth and constrained problems by obtaining differential inclusions associated to novel accelerated variants of the alternating direction method of multipliers (ADMM). Through a Lyapunov analysis, we derive rates of convergence for these dynamical systems in different settings that illustrate an interesting tradeoff between decaying versus constant damping strategies. We also obtain perturbed equations capturing fine-grained details of these methods, which have improved stability and preserve the leading order convergence rates.
    FPGA Implementation of Multi-Layer Machine Learning Equalizer with On-Chip Training. (arXiv:2212.03515v1 [eess.SP])
    We design and implement an adaptive machine learning equalizer that alternates multiple linear and nonlinear computational layers on an FPGA. On-chip training via gradient backpropagation is shown to allow for real-time adaptation to time-varying channel impairments.
    Transportation-Inequalities, Lyapunov Stability and Sampling for Dynamical Systems on Continuous State Space. (arXiv:2205.12448v2 [stat.ML] UPDATED)
    We study the concentration phenomenon for discrete-time random dynamical systems with an unbounded state space. We develop a heuristic approach towards obtaining exponential concentration inequalities for dynamical systems using an entirely functional analytic framework. We also show that existence of exponential-type Lyapunov function, compared to the purely deterministic setting, not only implies stability but also exponential concentration inequalities for sampling from the stationary distribution, via \emph{transport-entropy inequality} (T-E). These results have significant impact in \emph{reinforcement learning} (RL) and \emph{controls}, leading to exponential concentration inequalities even for unbounded observables, while neither assuming reversibility nor exact knowledge of random dynamical system (assumptions at heart of concentration inequalities in statistical mechanics and Markov diffusion processes).
    Computing Representations for Lie Algebraic Networks. (arXiv:2006.00724v3 [cs.LG] UPDATED)
    Recent work has constructed neural networks that are equivariant to continuous symmetry groups such as 2D and 3D rotations. This is accomplished using explicit Lie group representations to derive the equivariant kernels and nonlinearities. We present three contributions motivated by frontier applications of equivariance beyond rotations and translations. First, we relax the requirement for explicit Lie group representations with a novel algorithm that finds representations of arbitrary Lie groups given only the structure constants of the associated Lie algebra. Second, we provide a self-contained method and software for building Lie group-equivariant neural networks using these representations. Third, we contribute a novel benchmark dataset for classifying objects from relativistic point clouds, and apply our methods to construct the first object-tracking model equivariant to the Poincar\'e group.
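    As a toy illustration of working from structure constants alone (a sketch of the general setting, not the paper's representation-finding algorithm): the adjoint representation of so(3) can be read off directly from its structure constants and exponentiated into a group representation.

```python
# The adjoint representation comes straight from structure constants c[i,j,k]
# with [e_i, e_j] = c[i,j,k] e_k; exponentiating it gives SO(3) matrices.
import numpy as np
from scipy.linalg import expm

# Structure constants of so(3): c[i, j, k] = epsilon_{ijk}
c = np.zeros((3, 3, 3))
for i, j, k in [(0, 1, 2), (1, 2, 0), (2, 0, 1)]:
    c[i, j, k], c[j, i, k] = 1.0, -1.0

# Adjoint representation: (ad_i)[k, j] = c[i, j, k]
ad = np.array([c[i].T for i in range(3)])

# Verify the commutation relations [ad_i, ad_j] = sum_k c[i,j,k] ad_k
for i in range(3):
    for j in range(3):
        lhs = ad[i] @ ad[j] - ad[j] @ ad[i]
        rhs = np.einsum("k,kab->ab", c[i, j], ad)
        assert np.allclose(lhs, rhs)

# Exponentiate an algebra element into a group element (a rotation matrix)
theta = np.array([0.3, -0.2, 0.5])
R = expm(np.einsum("i,iab->ab", theta, ad))
print(np.allclose(R @ R.T, np.eye(3)), np.isclose(np.linalg.det(R), 1.0))
```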
    Active Labeling: Streaming Stochastic Gradients. (arXiv:2205.13255v3 [cs.LG] UPDATED)
    The workhorse of machine learning is stochastic gradient descent. To access stochastic gradients, it is common to iterate over the input/output pairs of a training dataset. Interestingly, it appears that one does not need full supervision to access stochastic gradients, which is the main motivation of this paper. After formalizing the "active labeling" problem, which focuses on active learning with partial supervision, we provide a streaming technique that provably minimizes the ratio of generalization error over the number of samples. We illustrate our technique in depth for robust regression.
    Reinforcement Learning with Non-Exponential Discounting. (arXiv:2209.13413v2 [cs.LG] UPDATED)
    Commonly in reinforcement learning (RL), rewards are discounted over time using an exponential function to model time preference, thereby bounding the expected long-term reward. In contrast, in economics and psychology, it has been shown that humans often adopt a hyperbolic discounting scheme, which is optimal when a specific task termination time distribution is assumed. In this work, we propose a theory for continuous-time model-based reinforcement learning generalized to arbitrary discount functions. This formulation covers the case in which there is a non-exponential random termination time. We derive a Hamilton-Jacobi-Bellman (HJB) equation characterizing the optimal policy and describe how it can be solved using a collocation method, which uses deep learning for function approximation. Further, we show how the inverse RL problem can be approached, in which one tries to recover properties of the discount function given decision data. We validate the applicability of our proposed approach on two simulated problems. Our approach opens the way for the analysis of human discounting in sequential decision-making tasks.
    Generalized Gradient Flows with Provable Fixed-Time Convergence and Fast Evasion of Non-Degenerate Saddle Points. (arXiv:2212.03765v1 [cs.LG])
    Gradient-based first-order convex optimization algorithms find widespread applicability in a variety of domains, including machine learning tasks. Motivated by the recent advances in fixed-time stability theory of continuous-time dynamical systems, we introduce a generalized framework for designing accelerated optimization algorithms with strong, fixed-time convergence guarantees that further extend to a subclass of non-convex functions. In particular, we introduce the \emph{GenFlow} algorithm and its momentum variant that provably converge to the optimal solution of objective functions satisfying the Polyak-{\L}ojasiewicz (PL) inequality in fixed time. Moreover, for functions that admit non-degenerate saddle points, we show that for the proposed GenFlow algorithm, the time required to evade these saddle points is bounded uniformly over all initial conditions. Finally, for strongly convex-strongly concave minimax problems whose optimal solution is a saddle point, a similar scheme is shown to arrive at the optimal solution, again in fixed time. The superior convergence properties of our algorithm are validated experimentally on a variety of benchmark datasets.
    Proposal of a Score Based Approach to Sampling Using Monte Carlo Estimation of Score and Oracle Access to Target Density. (arXiv:2212.03325v1 [stat.ML])
    Score-based approaches to sampling have shown much success as generative algorithms that produce new samples from a target density given a pool of initial samples. In this work, we consider the setting in which we have no initial samples from the target density, but rather $0^{th}$- and $1^{st}$-order oracle access to the log likelihood. Such problems may arise in Bayesian posterior sampling, or in approximate minimization of non-convex functions. Using this knowledge alone, we propose a Monte Carlo method to estimate the score empirically as a particular expectation of a random variable. Using this estimator, we can then run a discrete version of the backward flow SDE to produce samples from the target density. This approach has the benefit of not relying on a pool of initial samples from the target density, and it does not rely on a neural network or other black-box model to estimate the score.
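    A minimal sketch of this family of ideas (one plausible instantiation, not necessarily the paper's estimator): for the Gaussian-smoothed density $p_\sigma = p_0 * \mathcal{N}(0, \sigma^2 I)$, the score at $x$ equals the posterior expectation of $(x_0 - x)/\sigma^2$, which a self-normalized Monte Carlo average can estimate using only log-density evaluations; annealed Langevin updates then stand in for the backward SDE discretization.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_p0(x):
    """Oracle: unnormalized log density (here a two-mode Gaussian mixture)."""
    return np.logaddexp(-0.5 * np.sum((x - 2.0) ** 2, axis=-1),
                        -0.5 * np.sum((x + 2.0) ** 2, axis=-1))

def mc_score(x, sigma, n_samples=4096):
    """Estimate grad log p_sigma(x) by self-normalized Monte Carlo."""
    x0 = x + sigma * rng.normal(size=(n_samples, x.shape[-1]))
    logw = log_p0(x0)                      # only log-density oracle calls
    w = np.exp(logw - logw.max())
    w /= w.sum()                           # self-normalized importance weights
    return w @ (x0 - x) / sigma ** 2       # posterior mean of (x0 - x)/sigma^2

# Annealed Langevin steps driven by the estimated score
x = rng.normal(size=2) * 4.0
for sigma in np.geomspace(3.0, 0.1, 20):   # decreasing noise schedule
    eps = 0.05 * sigma ** 2
    for _ in range(20):
        x = x + eps * mc_score(x, sigma) + np.sqrt(2 * eps) * rng.normal(size=2)
print(x)  # ends near one of the modes at (2, 2) or (-2, -2)
```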
    Criteria for Classifying Forecasting Methods. (arXiv:2212.03523v1 [stat.ML])
    Classifying forecasting methods as being either of a "machine learning" or "statistical" nature has become commonplace in parts of the forecasting literature and community, as exemplified by the M4 competition and the conclusion drawn by the organizers. We argue that this distinction does not stem from fundamental differences in the methods assigned to either class. Instead, this distinction is probably of a tribal nature, which limits the insights into the appropriateness and effectiveness of different forecasting methods. We provide alternative characteristics of forecasting methods which, in our view, allow one to draw meaningful conclusions. Further, we discuss areas of forecasting which could benefit most from cross-pollination between the ML and the statistics communities.
    Low-Resource End-to-end Sanskrit TTS using Tacotron2, WaveGlow and Transfer Learning. (arXiv:2212.03558v1 [cs.CL])
    End-to-end text-to-speech (TTS) systems have been developed for European languages like English and Spanish with state-of-the-art speech quality, prosody, and naturalness. However, the development of end-to-end TTS for Indian languages is lagging behind in terms of quality. The challenges involved in such a task are: 1) scarcity of quality training data; 2) low efficiency during training and inference; 3) slow convergence in the case of large vocabulary size. In this work, we investigate fine-tuning the English-pretrained Tacotron2 model with limited Sanskrit data to synthesize natural-sounding Sanskrit speech in low-resource settings. Our experiments show encouraging results, achieving an overall MOS of 3.38 from 37 evaluators with good spoken knowledge of Sanskrit. This is an encouraging result, considering that only 2.5 hours of speech data were used.  ( 2 min )
    Drift Identification for L\'{e}vy alpha-Stable Stochastic Systems. (arXiv:2212.03317v1 [stat.ML])
    This paper focuses on a stochastic system identification problem: given time series observations of a stochastic differential equation (SDE) driven by L\'{e}vy $\alpha$-stable noise, estimate the SDE's drift field. For $\alpha$ in the interval $[1,2)$, the noise is heavy-tailed, leading to computational difficulties for methods that compute transition densities and/or likelihoods in physical space. We propose a Fourier space approach that centers on computing time-dependent characteristic functions, i.e., Fourier transforms of time-dependent densities. Parameterizing the unknown drift field using Fourier series, we formulate a loss consisting of the squared error between predicted and empirical characteristic functions. We minimize this loss with gradients computed via the adjoint method. For a variety of one- and two-dimensional problems, we demonstrate that this method is capable of learning drift fields in qualitative and/or quantitative agreement with ground truth fields.  ( 2 min )
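    As a small illustration of the Fourier-space loss (a simplified scalar sketch under assumptions: a constant drift, the standard symmetric $\alpha$-stable characteristic function, and a grid search in place of the paper's adjoint-method gradients):

```python
# Fit a drift by matching empirical and model characteristic functions (CFs).
import numpy as np

def empirical_cf(samples, u):
    return np.mean(np.exp(1j * np.outer(u, samples)), axis=1)

def stable_cf(u, mu, c, alpha):
    # Symmetric alpha-stable CF: exp(i u mu - |c u|^alpha)
    return np.exp(1j * u * mu - np.abs(c * u) ** alpha)

def cf_loss(params, increments, u, alpha, dt):
    mu, c = params                       # drift and scale per unit time
    model = stable_cf(u, mu * dt, c * dt ** (1 / alpha), alpha)
    err = empirical_cf(increments, u) - model
    return np.mean(np.abs(err) ** 2)     # squared error in Fourier space

# Synthetic check in the Brownian case (alpha = 2), true drift 1.5
rng = np.random.default_rng(0)
dt, alpha = 0.01, 2.0
inc = 1.5 * dt + np.sqrt(dt) * rng.normal(size=20000)
u = np.linspace(-20, 20, 101)
grid = np.linspace(0.0, 3.0, 61)
losses = [cf_loss((m, 1 / np.sqrt(2)), inc, u, alpha, dt) for m in grid]
print("estimated drift:", grid[int(np.argmin(losses))])  # close to 1.5
```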
    MetaStackVis: Visually-Assisted Performance Evaluation of Metamodels. (arXiv:2212.03539v1 [cs.LG])
    Stacking (or stacked generalization) is an ensemble learning method with one main distinction from the rest: even though several base models are trained on the original data set, their predictions are further used as input data for one or more metamodels arranged in at least one extra layer. Composing a stack of models can produce high-performance outcomes, but it usually involves a trial-and-error process. Therefore, our previously developed visual analytics system, StackGenVis, was mainly designed to assist users in choosing a set of top-performing and diverse models by measuring their predictive performance. However, it only employs a single logistic regression metamodel. In this paper, we investigate the impact of alternative metamodels on the performance of stacking ensembles using a novel visualization tool, called MetaStackVis. Our interactive tool helps users to visually explore individual metamodels and pairs of metamodels according to their predictive probabilities and multiple validation metrics, as well as their ability to predict specific problematic data instances. MetaStackVis was evaluated with a usage scenario based on a medical data set and via expert interviews.  ( 2 min )
    Phase2vec: Dynamical systems embedding with a physics-informed convolutional network. (arXiv:2212.03857v1 [cs.LG])
    Dynamical systems are found in innumerable forms across the physical and biological sciences, yet all these systems fall naturally into universal equivalence classes: conservative or dissipative, stable or unstable, compressible or incompressible. Predicting these classes from data remains an essential open challenge in computational physics at which existing time-series classification methods struggle. Here, we propose \texttt{phase2vec}, an embedding method that learns high-quality, physically meaningful representations of 2D dynamical systems without supervision. Our embeddings are produced by a convolutional backbone that extracts geometric features from flow data and minimizes a physically informed vector field reconstruction loss. In an auxiliary training period, embeddings are optimized so that they robustly encode the equations of unseen data over and above the performance of a per-equation fitting method. The trained architecture can not only predict the equations of unseen data, but also, crucially, learns embeddings that respect the underlying semantics of the embedded physical systems. We validate the quality of learned embeddings by investigating the extent to which physical categories of input data can be decoded from embeddings compared to standard blackbox classifiers and state-of-the-art time series classification techniques. We find that our embeddings encode important physical properties of the underlying data, including the stability of fixed points, conservation of energy, and the incompressibility of flows, with greater fidelity than competing methods. We finally apply our embeddings to the analysis of meteorological data, showing we can detect climatically meaningful features. Collectively, our results demonstrate the viability of embedding approaches for the discovery of dynamical features in physical systems.  ( 2 min )
    GP-BART: a novel Bayesian additive regression trees approach using Gaussian processes. (arXiv:2204.02112v3 [stat.ME] UPDATED)
    The Bayesian additive regression trees (BART) model is an ensemble method extensively and successfully used in regression tasks due to its consistently strong predictive performance and its ability to quantify uncertainty. BART combines "weak" tree models through a set of shrinkage priors, whereby each tree explains a small portion of the variability in the data. However, the lack of smoothness and the absence of a covariance structure over the observations in standard BART can yield poor performance in cases where such assumptions would be necessary. We propose Gaussian processes Bayesian additive regression trees (GP-BART) as an extension of BART which assumes Gaussian process (GP) priors for the predictions of each terminal node among all trees. We illustrate our model on simulated and real data and compare its performance to traditional modelling approaches, outperforming them in many scenarios. An implementation of our method is available in the R package rGPBART at: https://github.com/MateusMaiaDS/gpbart  ( 2 min )
    General multi-fidelity surrogate models: Framework and active learning strategies for efficient rare event simulation. (arXiv:2212.03375v1 [cs.LG])
    Estimating the probability of failure for complex real-world systems using high-fidelity computational models is often prohibitively expensive, especially when the probability is small. Exploiting low-fidelity models can make this process more feasible, but merging information from multiple low-fidelity and high-fidelity models poses several challenges. This paper presents a robust multi-fidelity surrogate modeling strategy in which the multi-fidelity surrogate is assembled via an active learning strategy with an on-the-fly model adequacy assessment, set within a subset simulation framework for efficient reliability analysis. The multi-fidelity surrogate is assembled by first applying a Gaussian process correction to each low-fidelity model and assigning a model probability based on the model's local predictive accuracy and cost. Three strategies are proposed to fuse these individual surrogates into an overall surrogate model based on model averaging and deterministic/stochastic model selection. The strategies also dictate which model evaluations are necessary. No assumptions are made about the relationships between low-fidelity models, while the high-fidelity model is assumed to be the most accurate and most computationally expensive model. Through two analytical and two numerical case studies, including a case study evaluating the failure probability of tristructural isotropic (TRISO)-coated nuclear fuels, the algorithm is shown to be highly accurate while drastically reducing the number of high-fidelity model calls (and hence computational cost).  ( 2 min )
    Sequential Predictive Conformal Inference for Time Series. (arXiv:2212.03463v1 [stat.ML])
    We present a new distribution-free conformal prediction algorithm for sequential data (e.g., time series), called the \textit{sequential predictive conformal inference} (\texttt{SPCI}). We specifically account for the fact that time series data are non-exchangeable, and thus many existing conformal prediction algorithms based on temporal residuals are not applicable. The main idea is to exploit the temporal dependence of conformity scores; thus, past conformity scores contain information about future ones. We then cast the problem of constructing a conformal prediction interval as predicting the quantile of a future residual, given a prediction algorithm. Theoretically, we establish asymptotically valid conditional coverage upon extending consistency analyses in quantile regression. Using simulation and real-data experiments, we demonstrate a significant reduction in interval width of \texttt{SPCI} compared to other existing methods under the desired empirical coverage.  ( 2 min )
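    A hedged sketch of the central step (quantile gradient boosting here is a stand-in for whatever quantile regressor SPCI actually uses): regress residual quantiles on lagged residuals, so the interval width adapts to temporal dependence instead of using one fixed empirical quantile.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def spci_like_interval(residuals, y_pred_next, alpha=0.1, window=20):
    """One-step interval from the past residuals of a point forecaster."""
    X = np.lib.stride_tricks.sliding_window_view(residuals[:-1], window)
    y = residuals[window:]                        # next residual per window
    lo_q = GradientBoostingRegressor(loss="quantile", alpha=alpha / 2).fit(X, y)
    hi_q = GradientBoostingRegressor(loss="quantile", alpha=1 - alpha / 2).fit(X, y)
    x_last = residuals[-window:].reshape(1, -1)   # most recent lagged residuals
    return (y_pred_next + lo_q.predict(x_last)[0],
            y_pred_next + hi_q.predict(x_last)[0])

# AR(1) residuals: past residuals genuinely carry information about future ones
rng = np.random.default_rng(0)
r = np.zeros(500)
for t in range(1, 500):
    r[t] = 0.8 * r[t - 1] + rng.normal(scale=0.5)
print(spci_like_interval(r, y_pred_next=10.0))
```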
    Neighborhood Adaptive Estimators for Causal Inference under Network Interference. (arXiv:2212.03683v1 [stat.ML])
    Estimating causal effects has become an integral part of most applied fields. Solving these modern causal questions requires tackling violations of many classical causal assumptions. In this work, we consider the violation of the classical no-interference assumption, meaning that the treatment of one individual might affect the outcomes of another. To make interference tractable, we consider a known network that describes how interference may travel. However, unlike previous work in this area, the radius (and intensity) of the interference experienced by a unit is unknown and can depend on different sub-networks of those treated and untreated that are connected to this unit. We study estimators for the average direct treatment effect on the treated in such a setting. The proposed estimator builds upon a Lepski-like procedure that searches over the possible relevant radii and treatment assignment patterns. In contrast to previous work, the proposed procedure aims to approximate the relevant network interference patterns. We establish oracle inequalities and corresponding adaptive rates for the estimation of the interference function. We leverage such estimates to propose and analyze two estimators for the average direct treatment effect on the treated. We address several challenges stemming from the data-driven creation of the patterns (i.e., feature engineering) and the network dependence. In addition to rates of convergence, under mild regularity conditions, we show that one of the proposed estimators is asymptotically normal and unbiased.  ( 2 min )
    Metric Elicitation; Moving from Theory to Practice. (arXiv:2212.03495v1 [stat.ML])
    Metric Elicitation (ME) is a framework for eliciting classification metrics that better align with implicit user preferences based on the task and context. The existing ME strategy is based on the assumption that users can most easily provide preference feedback over classifier statistics such as confusion matrices. This work examines ME by providing the first implementation of the ME strategy. Specifically, we create a web-based ME interface and conduct a user study that elicits users' preferred metrics in a binary classification setting. We discuss the study findings and present guidelines for future research in this direction.  ( 2 min )
    Dimension Reduction for Fr\'echet Regression. (arXiv:2110.00467v2 [stat.ME] UPDATED)
    With the rapid development of data collection techniques, complex data objects that are not in the Euclidean space are frequently encountered in new statistical applications. The Fr\'echet regression model (Petersen & M\"uller 2019) provides a promising framework for regression analysis with metric space-valued responses. In this paper, we introduce a flexible sufficient dimension reduction (SDR) method for Fr\'echet regression to achieve two purposes: to mitigate the curse of dimensionality caused by high-dimensional predictors and to provide a visual inspection tool for Fr\'echet regression. Our approach is flexible enough to turn any existing SDR method for Euclidean (X,Y) into one for Euclidean X and metric space-valued Y. The basic idea is to first map the metric space-valued random object $Y$ to a real-valued random variable $f(Y)$ using a class of functions, and then perform classical SDR on the transformed data. If the class of functions is sufficiently rich, then we are guaranteed to uncover the Fr\'echet SDR space. We show that such a class, which we call an ensemble, can be generated by a universal kernel. We establish the consistency and asymptotic convergence rate of the proposed methods. The finite-sample performance of the proposed methods is illustrated through simulation studies for several commonly encountered metric spaces, including the Wasserstein space, the space of symmetric positive definite matrices, and the sphere. We illustrate the data visualization aspect of our method by exploring human mortality distribution data across countries and by studying the distribution of hematoma density.  ( 2 min )
    Unsupervised spectral-band feature identification for optimal process discrimination. (arXiv:2212.03800v1 [cs.LG])
    Changes in real-world dynamic processes are often described in terms of differences in energies $\textbf{E}(\underline{\alpha})$ of a set of spectral-bands $\underline{\alpha}$. Given continuous spectra of two classes $A$ and $B$, or in general, two stochastic processes $S^{(A)}(f)$ and $S^{(B)}(f)$, $f \in \mathbb{R}^+$, we address the ubiquitous problem of identifying a subset of intervals of $f$ called spectral-bands $\underline{\alpha} \subset \mathbb{R}^+$ such that the energies $\textbf{E}(\underline{\alpha})$ of these bands can optimally discriminate between the two classes. We introduce EGO-MDA, an unsupervised method to identify optimal spectral-bands $\underline{\alpha}^*$ for given samples of spectra from two classes. EGO-MDA employs a statistical approach that iteratively minimizes an adjusted multinomial log-likelihood (deviance) criterion $\mathcal{D}(\underline{\alpha},\mathcal{M})$. Here, Mixture Discriminant Analysis (MDA) aims to derive the maximum likelihood estimate of the two Gaussian mixture model (GMM) distributions' parameters, i.e., $\mathcal{M}^* = \underset{\mathcal{M}}{\rm argmin}~\mathcal{D}(\underline{\alpha}, \mathcal{M})$, and to identify a classifier that optimally discriminates between the two classes for a given spectral representation. Efficient Global Optimization (EGO) finds the spectral-bands $\underline{\alpha}^* = \underset{\underline{\alpha}}{\rm argmin}~\mathcal{D}(\underline{\alpha},\mathcal{M})$ for given GMM parameters $\mathcal{M}$. For pathological cases of low separation between mixtures and model misspecification, we discuss the effect of the sample size and the number of iterations on the estimates of the parameters $\mathcal{M}$ and therefore on the classifier performance. A case study on a synthetic data set is provided. In an engineering application of optimal spectral-banding for anomaly tracking, EGO-MDA achieved at least 70% improvement in the median deviance relative to other methods tested.  ( 2 min )
    Optimal transport map estimation in general function spaces. (arXiv:2212.03722v1 [math.ST])
    We consider the problem of estimating the optimal transport map between a (fixed) source distribution $P$ and an unknown target distribution $Q$, based on samples from $Q$. The estimation of such optimal transport maps has become increasingly relevant in modern statistical applications, such as generative modeling. At present, estimation rates are only known in a few settings (e.g. when $P$ and $Q$ have densities bounded above and below and when the transport map lies in a H\"older class), which are often not reflected in practice. We present a unified methodology for obtaining rates of estimation of optimal transport maps in general function spaces. Our assumptions are significantly weaker than those appearing in the literature: we require only that the source measure $P$ satisfies a Poincar\'e inequality and that the optimal map be the gradient of a smooth convex function that lies in a space whose metric entropy can be controlled. As a special case, we recover known estimation rates for bounded densities and H\"older transport maps, but also obtain nearly sharp results in many settings not covered by prior work. For example, we provide the first statistical rates of estimation when $P$ is the normal distribution and the transport map is given by an infinite-width shallow neural network.  ( 2 min )
    Stochastic Rising Bandits. (arXiv:2212.03798v1 [cs.LG])
    This paper is in the field of stochastic Multi-Armed Bandits (MABs), i.e., those sequential selection techniques able to learn online using only the feedback given by the chosen option (a.k.a. arm). We study a particular case of the rested and restless bandits in which the arms' expected payoff is monotonically non-decreasing. This characteristic allows the design of specifically crafted algorithms that exploit the regularity of the payoffs to provide tight regret bounds. We design an algorithm for the rested case (R-ed-UCB) and one for the restless case (R-less-UCB), providing regret bounds that depend on the properties of the instance and, under certain circumstances, are of order $\widetilde{\mathcal{O}}(T^{\frac{2}{3}})$. Finally, using synthetic and real-world data, we empirically compare our algorithms with state-of-the-art methods for non-stationary MABs over several synthetically generated tasks and an online model selection problem for a real-world dataset.
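    As a toy illustration only (a plain sliding-window UCB, not the paper's R-ed-UCB estimator): when an arm's payoff rises over time, forgetting old pulls lets the index track the rising mean rather than averaging over the whole, outdated history.

```python
import numpy as np

def sliding_window_ucb(mu_fns, T=5000, window=300, c=1.0, seed=0):
    """Toy UCB that only uses rewards observed within the last `window` steps."""
    rng = np.random.default_rng(seed)
    K = len(mu_fns)
    pulls = [[] for _ in range(K)]                  # (time, reward) per arm
    total = 0.0
    for t in range(T):
        ucb = np.empty(K)
        for a in range(K):
            recent = [r for (s, r) in pulls[a] if s > t - window]
            n = len(recent)
            ucb[a] = (np.inf if n == 0 else
                      np.mean(recent) + c * np.sqrt(np.log(t + 1) / n))
        a = int(np.argmax(ucb))                     # optimistic arm choice
        r = mu_fns[a](t) + rng.normal(scale=0.1)
        pulls[a].append((t, r))
        total += r
    return total / T

flat = lambda t: 0.5                                # constant-payoff arm
rising = lambda t: min(0.9, 0.1 + t / 4000)         # monotonically rising arm
print("avg reward:", sliding_window_ucb([flat, rising]))
```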
  • Open

    SDRM3: A Dynamic Scheduler for Dynamic Real-time Multi-model ML Workloads. (arXiv:2212.03414v1 [cs.DC])
    Emerging real-time multi-model ML (RTMM) workloads such as AR/VR and drone control often involve dynamic behaviors at various levels: task, model, and layers (or ML operators) within a model. Such dynamic behaviors pose new challenges to the system software in an ML system because the overall system load is unpredictable, unlike in traditional ML workloads. Also, real-time processing requires meeting deadlines, and multi-model workloads involve highly heterogeneous models. As RTMM workloads often run on resource-constrained devices (e.g., VR headsets), developing an effective scheduler is an important research problem. Therefore, we propose a new scheduler, SDRM3, that effectively handles various dynamicity in RTMM-style workloads targeting multi-accelerator systems. To make scheduling decisions, SDRM3 quantifies the unique requirements for RTMM workloads and utilizes the quantified scores to drive scheduling decisions, considering the current system load and other inference jobs on different models and input frames. SDRM3 has tunable parameters that provide fast adaptivity to dynamic workload changes based on a gradient descent-like online optimization, which typically converges within five steps for new workloads. In addition, we propose a method to exploit model-level dynamicity based on a Supernet, navigating the trade-off between scheduling effectiveness and model performance (e.g., accuracy) by dynamically selecting a proper sub-network of the Supernet based on the system load. In our evaluation on five realistic RTMM workload scenarios, SDRM3 reduces the overall UXCost, an energy-delay-product (EDP)-equivalent metric for real-time applications defined in the paper, by 37.7% and 53.2% on geometric mean (up to 97.6% and 97.1%) compared to state-of-the-art baselines, which shows the efficacy of our scheduling methodology.  ( 2 min )
    When saliency goes off on a tangent: Interpreting Deep Neural Networks with nonlinear saliency maps. (arXiv:2110.06639v2 [cs.LG] UPDATED)
    A fundamental bottleneck in utilising complex machine learning systems for critical applications has been not knowing why they do what they do, thus preventing the development of crucial safety protocols. To date, no method exists that can provide full insight into the granularity of a neural network's decision process. In the past, saliency maps were an early attempt at resolving this problem through sensitivity calculations, whereby dimensions of a data point are selected based on how sensitive the output of the system is to them. However, the success of saliency maps has been at best limited, mainly because they interpret the underlying learning system through a linear approximation. We present a novel class of methods for generating nonlinear saliency maps which fully account for the nonlinearity of the underlying learning system. While agreeing with linear saliency maps on simple problems where linear saliency maps are correct, they clearly identify more specific drivers of classification on complex examples where nonlinearities are more pronounced. This new class of methods significantly aids the interpretability of deep neural networks and related machine learning systems. Crucially, they provide a starting point for their broader use in serious applications, where 'why' is equally important as 'what'.
    A New Path: Scaling Vision-and-Language Navigation with Synthetic Instructions and Imitation Learning. (arXiv:2210.03112v2 [cs.LG] UPDATED)
    Recent studies in Vision-and-Language Navigation (VLN) train RL agents to execute natural-language navigation instructions in photorealistic environments, as a step towards robots that can follow human instructions. However, given the scarcity of human instruction data and limited diversity in the training environments, these agents still struggle with complex language grounding and spatial language understanding. Pretraining on large text and image-text datasets from the web has been extensively explored but the improvements are limited. We investigate large-scale augmentation with synthetic instructions. We take 500+ indoor environments captured in densely-sampled 360 degree panoramas, construct navigation trajectories through these panoramas, and generate a visually-grounded instruction for each trajectory using Marky, a high-quality multilingual navigation instruction generator. We also synthesize image observations from novel viewpoints using an image-to-image GAN. The resulting dataset of 4.2M instruction-trajectory pairs is two orders of magnitude larger than existing human-annotated datasets, and contains a wider variety of environments and viewpoints. To efficiently leverage data at this scale, we train a simple transformer agent with imitation learning. On the challenging RxR dataset, our approach outperforms all existing RL agents, improving the state-of-the-art NDTW from 71.1 to 79.1 in seen environments, and from 64.6 to 66.8 in unseen test environments. Our work points to a new path to improving instruction-following agents, emphasizing large-scale imitation learning and the development of synthetic instruction generation capabilities.
    Mingling Foresight with Imagination: Model-Based Cooperative Multi-Agent Reinforcement Learning. (arXiv:2204.09418v3 [cs.MA] UPDATED)
    Recently, model-based agents have achieved better performance than model-free ones using the same computational budget and training time in single-agent environments. However, due to the complexity of multi-agent systems, it is difficult to learn a model of the environment, and the significant compounding error may hinder the learning process when model-based methods are applied to multi-agent tasks. This paper proposes an implicit model-based multi-agent reinforcement learning method based on value decomposition methods. Under this method, agents can interact with the learned virtual environment and evaluate the current state value according to imagined future states in the latent space, endowing agents with foresight. Our approach can be applied to any multi-agent value decomposition method. The experimental results show that our method improves sample efficiency in different partially observable Markov decision process domains.
    Semantically-enhanced Topic Recommendation System for Software Projects. (arXiv:2206.00085v2 [cs.SE] UPDATED)
    Software-related platforms have enabled their users to collaboratively label software entities with topics. Tagging software repositories with relevant topics can be exploited to facilitate various downstream tasks. For instance, a correct and complete set of topics assigned to a repository can increase its visibility, which in turn improves the outcome of tasks such as browsing, searching, navigation, and organization of repositories. Unfortunately, assigned topics are usually highly noisy, and some repositories do not have well-assigned topics. Thus, there have been efforts to recommend topics for software projects; however, the semantic relationships among these topics have not been exploited so far. We propose two recommender models for tagging software projects that incorporate the semantic relationships among topics. Our approach has two main phases: (1) we first take a collaborative approach to curate a dataset of quality topics specifically for the domain of software engineering and development. We also enrich this data with the semantic relationships among these topics and encapsulate them in a knowledge graph we call SED-KGraph. Then, (2) we build two recommender systems. The first one operates only based on the list of original topics assigned to a repository and the relationships specified in our knowledge graph. The second predictive model, however, assumes there are no topics available for a repository, hence it proceeds to predict the relevant topics based on both textual information of a software project and SED-KGraph. We built SED-KGraph in a crowd-sourced project with 170 contributors from both academia and industry. The experiment results indicate that our solutions outperform baselines that neglect the semantic relationships among topics by at least 25% and 23% in terms of ASR and MAP metrics.
    Digital Twin-Empowered Network Planning for Multi-Tier Computing. (arXiv:2210.02616v2 [cs.NI] UPDATED)
    In this paper, we design a resource management scheme to support stateful applications, which will be prevalent in 6G networks. Different from stateless applications, stateful applications require context data while executing computing tasks from user terminals (UTs). Using a multi-tier computing paradigm with servers deployed at the core network, gateways, and base stations to support stateful applications, we aim to optimize long-term resource reservation by jointly minimizing the usage of computing, storage, and communication resources and the cost from reconfiguring resource reservation. The coupling among different resources and the impact of UT mobility create challenges in resource management. To address the challenges, we develop digital twin (DT) empowered network planning with two elements, i.e., multi-resource reservation and resource reservation reconfiguration. First, DTs are designed for collecting UT status data, based on which UTs are grouped according to their mobility patterns. Second, an algorithm is proposed to customize resource reservation for different groups to satisfy their different resource demands. Finally, a meta-learning-based approach is developed to reconfigure resource reservation for balancing the network resource usage and the reconfiguration cost. Simulation results demonstrate that the proposed DT-empowered network planning outperforms benchmark frameworks by using less resources and incurring lower reconfiguration costs.
    Solving the Side-Chain Packing Arrangement of Proteins from Reinforcement Learned Stochastic Decision Making. (arXiv:2212.03320v1 [math.OC])
    Protein structure prediction is a fundamental problem in computational molecular biology. Classical algorithms such as ab-initio or threading, as well as many learning methods, have been proposed to solve this challenging problem. However, most reinforcement learning methods tend to model state-action pairs as discrete objects. In this paper, we develop a reinforcement learning (RL) framework in a continuous setting, based on a stochastic parametrized Hamiltonian version of the Pontryagin maximum principle (PMP), to solve the side-chain packing and protein-folding problem. For special cases, our formulation can be reduced to previous work in which the optimal folding trajectories are trained using an explicit application of Langevin dynamics. Optimal continuous stochastic Hamiltonian dynamics folding pathways can be derived using different models of molecular energetics and force fields. In our RL implementation, we adopt a soft actor-critic methodology; however, it can be replaced with other RL training schemes such as A2C, A3C, or PPO.
    Understanding Self-Predictive Learning for Reinforcement Learning. (arXiv:2212.03319v1 [cs.LG])
    We study the learning dynamics of self-predictive learning for reinforcement learning, a family of algorithms that learn representations by minimizing the prediction error of their own future latent representations. Despite its recent empirical success, such algorithms have an apparent defect: trivial representations (such as constants) minimize the prediction error, yet it is obviously undesirable to converge to such solutions. Our central insight is that careful designs of the optimization dynamics are critical to learning meaningful representations. We identify that a faster-paced optimization of the predictor and semi-gradient updates on the representation are crucial to preventing representation collapse. Then, in an idealized setup, we show that the self-predictive learning dynamics carry out spectral decomposition on the state transition matrix, effectively capturing information about the transition dynamics. Building on the theoretical insights, we propose bidirectional self-predictive learning, a novel self-predictive algorithm that learns two representations simultaneously. We examine the robustness of our theoretical insights with a number of small-scale experiments and showcase the promise of the novel representation learning algorithm with large-scale experiments.  ( 2 min )
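    A minimal sketch of the two ingredients the analysis highlights (the toy networks and linear dynamics are assumptions): the predictor is optimized on a faster timescale, and the target branch is wrapped in stop-gradient so the representation receives a semi-gradient.

```python
import torch
import torch.nn as nn

obs_dim, latent_dim = 8, 4
phi = nn.Linear(obs_dim, latent_dim)            # representation (encoder)
P = nn.Linear(latent_dim, latent_dim)           # latent transition predictor
opt_phi = torch.optim.SGD(phi.parameters(), lr=1e-3)
opt_P = torch.optim.SGD(P.parameters(), lr=1e-2)  # faster-paced predictor

A = torch.randn(obs_dim, obs_dim) * 0.3         # toy linear environment dynamics
for step in range(2000):
    x = torch.randn(64, obs_dim)
    x_next = x @ A.T + 0.01 * torch.randn(64, obs_dim)
    # Semi-gradient: no gradient flows through the target phi(x_next)
    loss = ((P(phi(x)) - phi(x_next).detach()) ** 2).mean()
    opt_phi.zero_grad(); opt_P.zero_grad()
    loss.backward()
    opt_phi.step(); opt_P.step()

# A collapsed representation would show (near-)zero singular values
print(torch.linalg.svdvals(phi.weight).detach())
```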
    DziriBERT: a Pre-trained Language Model for the Algerian Dialect. (arXiv:2109.12346v2 [cs.CL] UPDATED)
    Pre-trained transformers are now the de facto models in Natural Language Processing given their state-of-the-art results in many tasks and languages. However, most of the current models have been trained on languages for which large text resources are already available (such as English, French, Arabic, etc.). Therefore, there are still a number of low-resource languages that need more attention from the community. In this paper, we study the Algerian dialect which has several specificities that make the use of Arabic or multilingual models inappropriate. To address this issue, we collected more than one million Algerian tweets, and pre-trained the first Algerian language model: DziriBERT. When compared with existing models, DziriBERT achieves better results, especially when dealing with the Roman script. The obtained results show that pre-training a dedicated model on a small dataset (150 MB) can outperform existing models that have been trained on much more data (hundreds of GB). Finally, our model is publicly available to the community.
    Towards Automatic Cetacean Photo-Identification: A Framework for Fine-Grain, Few-Shot Learning in Marine Ecology. (arXiv:2212.03646v1 [cs.CV])
    Photo-identification (photo-id) is one of the main non-invasive capture-recapture methods utilised by marine researchers for monitoring cetacean (dolphin, whale, and porpoise) populations. This method has historically been performed manually, resulting in high workload and cost due to the vast number of images collected. Recently, automated aids have been developed to help speed up photo-id, although they are often disjoint in their processing and do not utilise all available identifying information. The work presented in this paper aims to create a fully automatic photo-id aid capable of providing most-likely matches based on all available information, without the need for data pre-processing such as cropping. This is achieved through a pipeline of computer vision models and post-processing techniques aimed at detecting cetaceans in unedited field imagery before passing them downstream for individual-level catalogue matching. The system is capable of handling previously uncatalogued individuals and flagging these for investigation thanks to catalogue similarity comparison. We evaluate the system against multiple real-life photo-id catalogues, achieving mAP@IOU[0.5] = 0.91 and 0.96 for dorsal fin detection on catalogues from Tanzania and the UK respectively, and 83.1% and 97.5% top-10 accuracy for individual classification on catalogues from the UK and USA.
    Persona-Based Conversational AI: State of the Art and Challenges. (arXiv:2212.03699v1 [cs.CL])
    Conversational AI has become an increasingly prominent and practical application of machine learning. However, existing conversational AI techniques still suffer from various limitations. One such limitation is a lack of well-developed methods for incorporating auxiliary information that could help a model understand conversational context better. In this paper, we explore how persona-based information could help improve the quality of response generation in conversations. First, we provide a literature review focusing on the current state-of-the-art methods that utilize persona information. We evaluate two strong baseline methods, the Ranking Profile Memory Network and the Poly-Encoder, on the NeurIPS ConvAI2 benchmark dataset. Our analysis elucidates the importance of incorporating persona information into conversational systems. Additionally, our study highlights several limitations with current state-of-the-art methods and outlines challenges and future research directions for advancing personalized conversational AI technology.
    A Privacy-Aware Graph Contrastive Learning Method in Federated Settings. (arXiv:2207.11836v2 [cs.LG] UPDATED)
    Graph learning models are critical tools for researchers to explore graph-structured data. To train a capable graph learning model, a conventional method uses sufficient training data to train a graph model on a single device. However, it is prohibitive to do so in real-world scenarios due to privacy concerns. Federated learning provides a feasible solution to address such limitations via introducing various privacy-preserving mechanisms, such as differential privacy on graph edges. Nevertheless, while differential privacy in federated graph learning protects the sensitive information maintained in graphs, it degrades the performance of the graph learning models. In this paper, we investigate how to implement differential privacy on graph edges and observe the resulting performance drop in experiments. We also note that differential privacy on graph edges introduces noise that perturbs graph proximity, which is one of the graph augmentations in graph contrastive learning. Inspired by this, we propose to leverage the advantages of graph contrastive learning to alleviate the performance drop caused by differential privacy. Extensive experiments are conducted with several representative graph models and widely-used datasets, showing that contrastive learning indeed alleviates the models' performance drop caused by differential privacy.
    Learning to Select Prototypical Parts for Interpretable Sequential Data Modeling. (arXiv:2212.03396v1 [cs.LG])
    Prototype-based interpretability methods provide intuitive explanations of model prediction by comparing samples to a reference set of memorized exemplars or typical representatives in terms of similarity. In the field of sequential data modeling, similarity calculations of prototypes are usually based on encoded representation vectors. However, due to highly recursive functions, there is usually a non-negligible disparity between the prototype-based explanations and the original input. In this work, we propose a Self-Explaining Selective Model (SESM) that uses a linear combination of prototypical concepts to explain its own predictions. The model employs the idea of case-based reasoning by selecting sub-sequences of the input that mostly activate different concepts as prototypical parts, which users can compare to sub-sequences selected from different example inputs to understand model decisions. For better interpretability, we design multiple constraints including diversity, stability, and locality as training objectives. Extensive experiments in different domains demonstrate that our method exhibits promising interpretability and competitive accuracy.
    NIAPU: network-informed adaptive positive-unlabelled learning for disease genes identification. (arXiv:2108.06158v3 [cs.LG] UPDATED)
    Gene-disease associations are fundamental for the understanding of disease mechanisms and for the development of effective interventions and treatments. Identifying genes not yet associated with a disease due to a lack of studies is a challenging task, in which prioritization based on prior knowledge can prove helpful. The computational search for new candidate disease genes may be eased by Positive-Unlabelled (PU) learning, the machine learning (ML) setting in which only a subset of instances are labelled as positive, while the rest of the data set is unlabelled. In this work, we propose a set of effective network-based features to be used in a novel Markov diffusion-based multi-class labelling strategy for putative disease gene discovery. The performances of the new labelling algorithm and the effectiveness of the proposed features have been tested on five different disease datasets using three ML algorithms. The proposed features have been compared against classical topological and functional/ontological features, showing that they outperform the classical ones both in binary classification and in multi-class labelling. Analogously, the predictive power of the integrated methodology in searching for new disease genes has been found to be competitive against state-of-the-art algorithms.  ( 2 min )
    Exploring Randomly Wired Neural Networks for Climate Model Emulation. (arXiv:2212.03369v1 [physics.ao-ph])
    Exploring the climate impacts of various anthropogenic emissions scenarios is key to making informed decisions for climate change mitigation and adaptation. State-of-the-art Earth system models can provide detailed insight into these impacts, but have a large associated computational cost on a per-scenario basis. This large computational burden has driven recent interest in developing cheap machine learning models for the task of climate model emulation. In this manuscript, we explore the efficacy of randomly wired neural networks for this task. We describe how they can be constructed and compare them to their standard feedforward counterparts using the ClimateBench dataset. Specifically, we replace the serially connected dense layers in multilayer perceptrons, convolutional neural networks, and convolutional long short-term memory networks with randomly wired dense layers and assess the impact on model performance for models with 1 million and 10 million parameters. We find average performance improvements of 4.2% across model complexities and prediction tasks, with substantial performance improvements of up to 16.4% in some cases. Furthermore, we find no significant difference in prediction speed between networks with standard feedforward dense layers and those with randomly wired layers. These findings indicate that randomly wired neural networks may be suitable direct replacements for traditional dense layers in many standard models.
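    A hedged sketch of what a randomly wired dense block can look like (the graph model, widths, and fallback wiring are illustrative assumptions, not the paper's exact construction): nodes are small dense layers wired by a random DAG over a fixed topological order, and each node sums its incoming activations.

```python
import torch
import torch.nn as nn

class RandomlyWiredBlock(nn.Module):
    def __init__(self, width=64, n_nodes=8, p_edge=0.4, seed=0):
        super().__init__()
        g = torch.Generator().manual_seed(seed)
        # edges[j] lists the predecessors of node j+1 (node 0 is the input);
        # if no random edge is drawn, fall back to the previous node.
        self.edges = [[i for i in range(j)
                       if torch.rand((), generator=g) < p_edge] or [j - 1]
                      for j in range(1, n_nodes)]
        self.layers = nn.ModuleList(
            nn.Sequential(nn.Linear(width, width), nn.ReLU())
            for _ in range(n_nodes - 1))

    def forward(self, x):
        acts = [x]
        for layer, preds in zip(self.layers, self.edges):
            acts.append(layer(sum(acts[i] for i in preds)))  # sum in-edges
        return acts[-1]

block = RandomlyWiredBlock()
print(block(torch.randn(2, 64)).shape)   # torch.Size([2, 64])
```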
    Copula Conformal Prediction for Multi-step Time Series Forecasting. (arXiv:2212.03281v1 [cs.LG])
    Accurate uncertainty measurement is a key step to building robust and reliable machine learning systems. Conformal prediction is a distribution-free uncertainty quantification algorithm popular for its ease of implementation, statistical coverage guarantees, and versatility for underlying forecasters. However, existing conformal prediction algorithms for time series are limited to single-step prediction without considering the temporal dependency. In this paper we propose a Copula Conformal Prediction algorithm for multivariate, multi-step Time Series forecasting, CopulaCPTS. On several synthetic and real-world multivariate time series datasets, we show that CopulaCPTS produces more calibrated and sharp confidence intervals for multi-step prediction tasks than existing techniques.  ( 2 min )
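    A simplified sketch of the joint-calibration idea (an empirical stand-in for the copula machinery): search one common per-step quantile level whose empirical joint coverage on calibration data reaches $1-\alpha$; this exploits dependence across the horizon and is tighter than a Bonferroni correction when steps are positively dependent.

```python
import numpy as np

def joint_calibrated_radii(cal_scores, alpha=0.1):
    """cal_scores: (n_series, horizon) nonconformity scores per step."""
    for t in np.linspace(1 - alpha, 1.0, 200):       # candidate per-step level
        radii = np.quantile(cal_scores, t, axis=0)   # one radius per horizon
        joint_cov = np.mean(np.all(cal_scores <= radii, axis=1))
        if joint_cov >= 1 - alpha:                   # empirical joint coverage
            return radii
    return np.max(cal_scores, axis=0)

# Positively dependent scores across a 5-step horizon: joint calibration
# gives noticeably tighter radii than Bonferroni's per-step 1 - alpha/H level.
rng = np.random.default_rng(0)
common = rng.exponential(size=(1000, 1))             # shared dependence factor
scores = np.abs(common + 0.3 * rng.normal(size=(1000, 5)))
print("joint:     ", joint_calibrated_radii(scores).round(2))
print("Bonferroni:", np.quantile(scores, 1 - 0.1 / 5, axis=0).round(2))
```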
    Intent Recognition in Conversational Recommender Systems. (arXiv:2212.03721v1 [cs.CL])
    Any organization needs to improve its products, services, and processes. In this context, engaging with customers and understanding their journey is essential. Organizations have leveraged various techniques and technologies to support customer engagement, from call centres to chatbots and virtual agents. Recently, these systems have used Machine Learning (ML) and Natural Language Processing (NLP) to analyze large volumes of customer feedback and engagement data. The goal is to understand customers in context and provide meaningful answers across various channels. Despite multiple advances in Conversational Artificial Intelligence (AI) and Recommender Systems (RS), it is still challenging to understand the intent behind customer questions during the customer journey. To address this challenge, in this paper, we study and analyze the recent work in Conversational Recommender Systems (CRS) in general and, more specifically, in chatbot-based CRS. We introduce a pipeline to contextualize the input utterances in conversations. We then take the next step towards leveraging reverse feature engineering to link the contextualized input and learning model to support intent recognition. Since performance evaluation is achieved based on different ML models, we use transformer-based models to evaluate the proposed approach using a labelled dialogue dataset (MSDialog) of question-answering interactions between information seekers and answer providers.  ( 2 min )
    Semantically Enhanced Global Reasoning for Semantic Segmentation. (arXiv:2212.03338v1 [cs.CV])
    Recent advances in pixel-level tasks (e.g., segmentation) illustrate the benefit of long-range interactions between aggregated region-based representations that can enhance local features. However, such pixel-to-region associations and the resulting representation, which often take the form of attention, cannot model the underlying semantic structure of the scene (e.g., individual objects and, by extension, their interactions). In this work, we take a step toward addressing this limitation. Specifically, we propose an architecture where we learn to project image features into latent region representations and perform global reasoning across them, using a transformer, to produce contextualized and scene-consistent representations that are then fused with original pixel-level features. Our design enables the latent regions to represent semantically meaningful concepts, by ensuring that activated regions are spatially disjoint and unions of such regions correspond to connected object segments. The resulting semantic global reasoning (SGR) is end-to-end trainable and can be combined with any semantic segmentation framework and backbone. Combining SGR with DeepLabV3 results in a semantic segmentation performance that is competitive to the state-of-the-art, while resulting in more semantically interpretable and diverse region representations, which we show can effectively transfer to detection and instance segmentation. Further, we propose a new metric that allows us to measure the semantics of representations at both the object class and instance level.
    Fast Offline Policy Optimization for Large Scale Recommendation. (arXiv:2208.05327v3 [cs.IR] UPDATED)
    Personalised interactive systems such as recommender systems require selecting relevant items dependent on context. Production systems need to identify these items rapidly from very large catalogues, which can be efficiently solved using maximum inner product search technology. Offline optimisation of maximum inner product search can be achieved by a relaxation of the discrete problem, resulting in policy learning or REINFORCE-style learning algorithms. Unfortunately, this relaxation step requires computing a sum over the entire catalogue, making the complexity of evaluating the gradient (and hence of each stochastic gradient descent iteration) linear in the catalogue size. This calculation is untenable in many real-world examples such as large catalogue recommender systems, severely limiting the usefulness of this method in practice. In this paper, we derive an excellent approximation of these policy learning algorithms that scales logarithmically with the catalogue size. Our contribution is based upon combining three novel ideas: a new Monte Carlo estimate of the gradient of a policy, the self-normalised importance sampling estimator, and the use of fast maximum inner product search at training time. Extensive experiments show that our algorithm is an order of magnitude faster than naive approaches yet produces equally good policies.
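    A toy sketch of the bottleneck and the style of fix (a uniform proposal and context-free scores are simplifying assumptions; the paper additionally uses fast maximum inner product search at training time): the gradient of the log-softmax needs an expectation over the whole catalogue, which a handful of self-normalized samples can approximate.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 100_000, 16                       # catalogue size, embedding dimension
item_emb = rng.normal(size=(N, d)) / np.sqrt(d)

def snis_logsoftmax_grad(theta, a, n_neg=256):
    """Gradient of log pi(a), pi proportional to exp(item_emb @ theta):
    grad s(a) - E_pi[grad s], with the expectation estimated by
    self-normalized importance sampling from a uniform proposal."""
    neg = rng.integers(0, N, size=n_neg)          # sampled catalogue items
    logits = item_emb[neg] @ theta
    w = np.exp(logits - logits.max())
    w /= w.sum()                                  # self-normalized weights
    return item_emb[a] - w @ item_emb[neg]

# One REINFORCE-style ascent step on a logged click a=123 with reward 1
theta = np.zeros(d)
theta += 0.1 * 1.0 * snis_logsoftmax_grad(theta, a=123)
print(np.linalg.norm(theta))
```

    The full softmax normalizer over all N items never needs to be computed; only n_neg score evaluations are required per step.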
    Further analysis of multilevel Stein variational gradient descent with an application to the Bayesian inference of glacier ice models. (arXiv:2212.03366v1 [math.NA])
    Multilevel Stein variational gradient descent is a method for particle-based variational inference that leverages hierarchies of approximations of target distributions with varying costs and fidelity to computationally speed up inference. This work provides a cost complexity analysis of multilevel Stein variational gradient descent that applies under milder conditions than previous results, especially in discrete-in-time regimes and beyond the limited settings where Stein variational gradient descent achieves exponentially fast convergence. The analysis shows that the convergence rate of Stein variational gradient descent enters only as a constant factor for the cost complexity of the multilevel version, which means that the costs of the multilevel version scale independently of the convergence rate of Stein variational gradient descent on a single level. Numerical experiments with Bayesian inverse problems of inferring discretized basal sliding coefficient fields of the Arolla glacier ice demonstrate that multilevel Stein variational gradient descent achieves orders of magnitude speedups compared to its single-level version.
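    For reference, a single-level SVGD update with an RBF kernel and the median heuristic (the building block that the multilevel scheme applies across a hierarchy of cheaper target approximations):

```python
import numpy as np

def svgd_step(x, grad_log_p, step=0.1):
    """x: (n, d) particles; grad_log_p maps (n, d) -> (n, d) target scores."""
    n = x.shape[0]
    diff = x[:, None, :] - x[None, :, :]            # x_i - x_j, shape (n, n, d)
    sq = np.sum(diff ** 2, axis=-1)                 # squared pairwise distances
    h = np.median(sq) / np.log(n + 1)               # median-heuristic bandwidth
    K = np.exp(-sq / h)                             # RBF kernel matrix
    repulsion = 2.0 / h * (K[:, :, None] * diff).sum(axis=1)  # kernel gradient
    phi = (K @ grad_log_p(x) + repulsion) / n       # SVGD update direction
    return x + step * phi

# Push particles initialized far away toward a standard Gaussian target
rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, size=(200, 2))
for _ in range(500):
    x = svgd_step(x, lambda z: -z)                  # score of N(0, I) is -z
print(x.mean(axis=0).round(2), x.std(axis=0).round(2))  # ~[0, 0] and ~[1, 1]
```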
    Talking About Large Language Models. (arXiv:2212.03551v1 [cs.CL])
    Thanks to rapid progress in artificial intelligence, we have entered an era when technology and philosophy intersect in interesting ways. Sitting squarely at the centre of this intersection are large language models (LLMs). The more adept LLMs become at mimicking human language, the more vulnerable we become to anthropomorphism, to seeing the systems in which they are embedded as more human-like than they really are. This trend is amplified by the natural tendency to use philosophically loaded terms, such as "knows", "believes", and "thinks", when describing these systems. To mitigate this trend, this paper advocates the practice of repeatedly stepping back to remind ourselves of how LLMs, and the systems of which they form a part, actually work. The hope is that increased scientific precision will encourage more philosophical nuance in the discourse around artificial intelligence, both within the field and in the public sphere.
    MobileTL: On-device Transfer Learning with Inverted Residual Blocks. (arXiv:2212.03246v1 [cs.LG])
    Transfer learning on the edge is challenging due to limited on-device resources. Existing work addresses this issue by training a subset of parameters or adding model patches. Developed with inference in mind, Inverted Residual Blocks (IRBs) split a convolutional layer into depthwise and pointwise convolutions, leading to more stacked layers, e.g., convolution, normalization, and activation layers. Though they are efficient for inference, IRBs require storing additional activation maps in memory to train the convolution weights and normalization scales. As a result, their high memory cost prohibits training IRBs on resource-limited edge devices, making them unsuitable for transfer learning. To address this issue, we present MobileTL, a memory- and computationally efficient on-device transfer learning method for models built with IRBs. MobileTL trains the shifts for internal normalization layers to avoid storing activation maps for the backward pass. Also, MobileTL approximates the backward computation of the activation layer (e.g., Hard-Swish and ReLU6) as a signed function, which enables storing a binary mask instead of activation maps for the backward pass. MobileTL fine-tunes a few top blocks (close to output) rather than propagating the gradient through the whole network to reduce the computation cost. Our method reduces memory usage by 46% and 53% for MobileNetV2 and V3 IRBs, respectively. For MobileNetV3, we observe a 36% reduction in floating-point operations (FLOPs) when fine-tuning 5 blocks, while only incurring a 0.6% accuracy reduction on CIFAR10. Extensive experiments on multiple datasets demonstrate that our method is Pareto-optimal (best accuracy under given hardware constraints) compared to prior work in transfer learning for edge devices.
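    The binary-mask idea can be illustrated in a few lines of PyTorch: for ReLU6 the backward pass only needs to know where the input lay in $(0, 6)$, so a boolean mask suffices in place of the full-precision activation map. This is a sketch of the general trick, not the authors' implementation; for Hard-Swish the backward is genuinely approximated by a signed function rather than recovered exactly.

```python
import torch

class MaskOnlyReLU6(torch.autograd.Function):
    """Sketch: store a binary mask for the backward pass instead of activations."""

    @staticmethod
    def forward(ctx, x):
        mask = (x > 0) & (x < 6)       # one bit per element instead of a float
        ctx.save_for_backward(mask)
        return x.clamp(0.0, 6.0)

    @staticmethod
    def backward(ctx, grad_out):
        (mask,) = ctx.saved_tensors
        return grad_out * mask         # exact for ReLU6; Hard-Swish needs approximation
```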
    A Temporal Graph Neural Network for Cyber Attack Detection and Localization in Smart Grids. (arXiv:2212.03390v1 [cs.LG])
    This paper presents a Temporal Graph Neural Network (TGNN) framework for detection and localization of false data injection and ramp attacks on the system state in smart grids. Capturing the topological information of the system through the GNN framework along with the state measurements can improve the performance of the detection mechanism. The problem is formulated as a classification problem through a GNN with a message passing mechanism to identify abnormal measurements. The residual block used in the aggregation process of message passing and the gated recurrent unit can improve computation time and performance. The performance of the proposed model has been evaluated through extensive simulations of power system states and attack scenarios, showing promising results. The model's sensitivity to the intensity and location of the attacks, as well as its detection delay versus detection accuracy, have also been evaluated.
    Stable Conformal Prediction Sets. (arXiv:2112.10224v2 [stat.ML] UPDATED)
    When one observes a sequence of variables $(x_1, y_1), \ldots, (x_n, y_n)$, Conformal Prediction (CP) is a methodology that allows one to estimate a confidence set for $y_{n+1}$ given $x_{n+1}$ by merely assuming that the distribution of the data is exchangeable. CP sets have guaranteed coverage for any finite population size $n$. While appealing, the computation of such a set turns out to be infeasible in general, e.g., when the unknown variable $y_{n+1}$ is continuous. The bottleneck is that it is based on a procedure that refits a prediction model on data where we replace the unknown target by all its possible values in order to select the most probable one. This requires computing an infinite number of models, which often makes it intractable. In this paper, we combine CP techniques with classical algorithmic stability bounds to derive a prediction set computable with a single model fit. We demonstrate that our proposed confidence set does not lose any coverage guarantees while avoiding the need for data splitting as currently done in the literature. We provide some numerical experiments to illustrate the tightness of our estimation when the sample size is sufficiently large, on both synthetic and real datasets.
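    For context, the standard full conformal set that makes the computational bottleneck visible reads, in the usual rank form,

```latex
\mathcal{C}_\alpha(x_{n+1})
  = \Bigl\{\, y \;:\;
      \tfrac{1}{n+1} \sum_{i=1}^{n+1}
      \mathbf{1}\bigl\{ S_i^{(y)} \ge S_{n+1}^{(y)} \bigr\} > \alpha \,\Bigr\},
```

    where $S_i^{(y)}$ is the conformity score of the $i$-th point under a model refit on the data augmented with the candidate pair $(x_{n+1}, y)$; these per-candidate refits are what the paper's stability bounds replace with a single model fit.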
    Can we integrate spatial verification methods into neural-network loss functions for atmospheric science?. (arXiv:2203.11141v2 [cs.LG] UPDATED)
    In the last decade, much work in atmospheric science has focused on spatial verification (SV) methods for gridded prediction, which overcome serious disadvantages of pixelwise verification. However, neural networks (NN) in atmospheric science are almost always trained to optimize pixelwise loss functions, even when ultimately assessed with SV methods. This establishes a disconnect between model verification during vs. after training. To address this issue, we develop spatially enhanced loss functions (SELF) and demonstrate their use for a real-world problem: predicting the occurrence of thunderstorms (henceforth, "convection") with NNs. In each SELF we use either a neighbourhood filter, which highlights convection at scales larger than a threshold, or a spectral filter (employing Fourier or wavelet decomposition), which is more flexible and highlights convection at scales between two thresholds. We use these filters to spatially enhance common verification scores, such as the Brier score. We train each NN with a different SELF and compare their performance at many scales of convection, from discrete storm cells to tropical cyclones. Among our many findings are that (a) for a low (high) risk threshold, the ideal SELF focuses on small (large) scales; (b) models trained with a pixelwise loss function perform surprisingly well; (c) however, models trained with a spectral filter produce much better-calibrated probabilities than a pixelwise model. We provide a general guide to using SELFs, including technical challenges and the final Python code, as well as demonstrating their use for the convection problem. To our knowledge this is the most in-depth guide to SELFs in the geosciences.
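    As a hedged sketch of what a neighbourhood-enhanced score can look like in code (the shapes and the mean filter are illustrative assumptions, not the paper's exact SELF definitions): smooth both the predicted probabilities and the binary targets over a spatial neighbourhood, then apply the usual Brier score.

```python
import torch
import torch.nn.functional as F

def neighbourhood_brier(probs, targets, half_window=2):
    """probs, targets: (B, 1, H, W); mean-filter both fields, then Brier score."""
    k = 2 * half_window + 1
    kernel = torch.ones(1, 1, k, k, device=probs.device) / (k * k)
    probs_f = F.conv2d(probs, kernel, padding=half_window)
    targets_f = F.conv2d(targets, kernel, padding=half_window)
    return ((probs_f - targets_f) ** 2).mean()
```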
    Few-Shot Preference Learning for Human-in-the-Loop RL. (arXiv:2212.03363v1 [cs.RO])
    While reinforcement learning (RL) has become a more popular approach for robotics, designing sufficiently informative reward functions for complex tasks has proven extremely difficult due to their inability to capture human intent and policy exploitation. Preference-based RL algorithms seek to overcome these challenges by directly learning reward functions from human feedback. Unfortunately, prior work either requires an unreasonably large number of queries, implausible for any human to answer, or overly restricts the class of reward functions to guarantee the elicitation of the most informative queries, resulting in models that are insufficiently expressive for realistic robotics tasks. Contrary to most works that focus on query selection to \emph{minimize} the amount of data required for learning reward functions, we take an opposite approach: \emph{expanding} the pool of available data by viewing human-in-the-loop RL through the more flexible lens of multi-task learning. Motivated by the success of meta-learning, we pre-train preference models on prior task data and quickly adapt them for new tasks using only a handful of queries. Empirically, we reduce the amount of online feedback needed to train manipulation policies in Meta-World by 20$\times$, and demonstrate the effectiveness of our method on a real Franka Panda Robot. Moreover, this reduction in query-complexity allows us to train robot policies from actual human users. Videos of our results and code can be found at https://sites.google.com/view/few-shot-preference-rl/home.
    FPGA Implementation of Multi-Layer Machine Learning Equalizer with On-Chip Training. (arXiv:2212.03515v1 [eess.SP])
    We design and implement an adaptive machine learning equalizer that alternates multiple linear and nonlinear computational layers on an FPGA. On-chip training via gradient backpropagation is shown to allow for real-time adaptation to time-varying channel impairments.
    Convergence of the Backward Deep BSDE Method with Applications to Optimal Stopping Problems. (arXiv:2210.04118v2 [math.PR] UPDATED)
    The optimal stopping problem is one of the core problems in financial markets, with broad applications such as pricing American and Bermudan options. The deep BSDE method [Han, Jentzen and E, PNAS, 115(34):8505-8510, 2018] has shown great power in solving high-dimensional forward-backward stochastic differential equations (FBSDEs), and inspired many applications. However, the method solves backward stochastic differential equations (BSDEs) in a forward manner, which cannot be used for optimal stopping problems that in general require solving BSDEs backward. To overcome this difficulty, a recent paper [Wang, Chen, Sudjianto, Liu and Shen, arXiv:1807.06622, 2018] proposed the backward deep BSDE method to solve the optimal stopping problem. In this paper, we provide the rigorous theory for the backward deep BSDE method. Specifically, 1. we derive an a posteriori error estimate, i.e., the error of the numerical solution can be bounded by the training loss function; and 2. we give an upper bound of the loss function, which can be made sufficiently small subject to universal approximations. We give two numerical examples, which present performance consistent with the proved theory.
    General multi-fidelity surrogate models: Framework and active learning strategies for efficient rare event simulation. (arXiv:2212.03375v1 [cs.LG])
    Estimating the probability of failure for complex real-world systems using high-fidelity computational models is often prohibitively expensive, especially when the probability is small. Exploiting low-fidelity models can make this process more feasible, but merging information from multiple low-fidelity and high-fidelity models poses several challenges. This paper presents a robust multi-fidelity surrogate modeling strategy in which the multi-fidelity surrogate is assembled via an active learning strategy with an on-the-fly model adequacy assessment, set within a subset simulation framework for efficient reliability analysis. The multi-fidelity surrogate is assembled by first applying a Gaussian process correction to each low-fidelity model and assigning a model probability based on the model's local predictive accuracy and cost. Three strategies are proposed to fuse these individual surrogates into an overall surrogate model based on model averaging and deterministic/stochastic model selection. The strategies also dictate which model evaluations are necessary. No assumptions are made about the relationships between low-fidelity models, while the high-fidelity model is assumed to be the most accurate and most computationally expensive model. Through two analytical and two numerical case studies, including a case study evaluating the failure probability of tristructural isotropic-coated (TRISO) nuclear fuels, the algorithm is shown to be highly accurate while drastically reducing the number of high-fidelity model calls (and hence computational cost).
    Reconstructing Training Data from Model Gradient, Provably. (arXiv:2212.03714v1 [cs.LG])
    Understanding when and how much a model gradient leaks information about the training sample is an important question in privacy. In this paper, we present a surprising result: even without training or memorizing the data, we can fully reconstruct the training samples from a single gradient query at a randomly chosen parameter value. We prove the identifiability of the training data under mild conditions: with shallow or deep neural networks and a wide range of activation functions. We also present a statistically and computationally efficient algorithm based on tensor decomposition to reconstruct the training data. As a provable attack that reveals sensitive training data, our findings suggest potential severe threats to privacy, especially in federated learning.
    A Frequency-Structure Approach for Link Stream Analysis. (arXiv:2212.03804v1 [eess.SP])
    A link stream is a set of triplets $(t, u, v)$ indicating that $u$ and $v$ interacted at time $t$. Link streams model numerous datasets and their proper study is crucial in many applications. In practice, raw link streams are often aggregated or transformed into time series or graphs where decisions are made. Yet, it remains unclear how the dynamical and structural information of a raw link stream carries into the transformed object. This work shows that it is possible to shed light on this question by studying link streams via algebraically linear graph and signal operators, for which we introduce a novel linear matrix framework for the analysis of link streams. We show that, due to their linearity, most methods in signal processing can be easily adopted by our framework to analyze the time/frequency information of link streams. However, the availability of linear graph methods to analyze relational/structural information is limited. We address this limitation by developing (i) a new basis for graphs that allows us to decompose them into structures at different resolution levels; and (ii) filters for graphs that allow us to change their structural information in a controlled manner. By plugging these developments and their time-domain counterparts into our framework, we are able to (i) obtain a new basis for link streams that allows us to represent them in a frequency-structure domain; and (ii) show that many interesting transformations to link streams, like the aggregation of interactions or their embedding into a Euclidean space, can be seen as simple filters in our frequency-structure domain.
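    For readers unfamiliar with the object: a link stream is just a set of timestamped edges, and the simplest linear representation stacks one adjacency matrix per time step. A minimal sketch (assuming undirected links and integer timestamps, purely for illustration):

```python
import numpy as np

def link_stream_to_tensor(triplets, n_nodes, n_times):
    """Stack one adjacency matrix per time step from (t, u, v) triplets."""
    A = np.zeros((n_times, n_nodes, n_nodes))
    for t, u, v in triplets:
        A[t, u, v] = A[t, v, u] = 1.0   # undirected interaction at time t
    return A

# e.g., aggregating all interactions over time is a simple linear filter:
# A.sum(axis=0)
```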
    Fine-Grained Emotional Paraphrasing along Emotion Gradients. (arXiv:2212.03297v1 [cs.CL])
    Paraphrase generation, a.k.a. paraphrasing, is a common and important task in natural language processing. Emotional paraphrasing, which changes the emotion embodied in a piece of text while preserving its meaning, has many potential applications, e.g., moderating online dialogues and preventing cyberbullying. We introduce a new task of fine-grained emotional paraphrasing along emotion gradients, that is, altering the emotional intensities of the paraphrases in fine grain following smooth variations in affective dimensions while preserving the meanings of the originals. We propose a framework for addressing this task by fine-tuning text-to-text Transformers through multi-task training. We enhance several widely used paraphrasing corpora by annotating the input and target texts with their fine-grained emotion labels. With these labels, fine-tuning text-to-text Transformers on these corpora entails multi-task training. Evaluations of the fine-tuned Transformers on separate test sets show that including fine-grained emotion labels in the paraphrase task significantly improves the chance of obtaining high-quality paraphrases of the desired emotions, i.e., more than doubling the number of exact matches of desired emotions while achieving consistently better scores in paraphrase metrics such as BLEU, ROUGE, and METEOR.  ( 2 min )
    From Knowledge Augmentation to Multi-tasking: Towards Human-like Dialogue Systems. (arXiv:2212.03279v1 [cs.AI])
    The goal of building dialogue agents that can converse with humans naturally has been a long-standing dream of researchers since the early days of artificial intelligence. The well-known Turing Test proposed to judge the ultimate validity of an artificial intelligence agent on the indistinguishability of its dialogues from humans'. It should come as no surprise that human-level dialogue systems are very challenging to build. While early efforts on rule-based systems found limited success, the emergence of deep learning has enabled great advances on this topic. In this thesis, we focus on methods that address the numerous issues that sustain the gap between artificial conversational agents and human-level interlocutors. These methods were proposed and evaluated in ways inspired by general state-of-the-art AI methodologies, while also targeting the characteristics specific to dialogue systems.  ( 2 min )
    MetaStackVis: Visually-Assisted Performance Evaluation of Metamodels. (arXiv:2212.03539v1 [cs.LG])
    Stacking (or stacked generalization) is an ensemble learning method with one main distinction from the rest: several base models are trained on the original data set, and their predictions are further used as input data for one or more metamodels arranged in at least one extra layer. Composing a stack of models can produce high-performance outcomes, but it usually involves a trial-and-error process. Therefore, our previously developed visual analytics system, StackGenVis, was mainly designed to assist users in choosing a set of top-performing and diverse models by measuring their predictive performance. However, it only employs a single logistic regression metamodel. In this paper, we investigate the impact of alternative metamodels on the performance of stacking ensembles using a novel visualization tool, called MetaStackVis. Our interactive tool helps users to visually explore different singular and pairs of metamodels according to their predictive probabilities and multiple validation metrics, as well as their ability to predict specific problematic data instances. MetaStackVis was evaluated with a usage scenario based on a medical data set and via expert interviews.
    Intervening With Confidence: Conformal Prescriptive Monitoring of Business Processes. (arXiv:2212.03710v1 [cs.LG])
    Prescriptive process monitoring methods seek to improve the performance of a process by selectively triggering interventions at runtime (e.g., offering a discount to a customer) to increase the probability of a desired case outcome (e.g., a customer making a purchase). The backbone of a prescriptive process monitoring method is an intervention policy, which determines for which cases and when an intervention should be executed. Existing methods in this field rely on predictive models to define intervention policies; specifically, they consider policies that trigger an intervention when the estimated probability of a negative outcome exceeds a threshold. However, the probabilities computed by a predictive model may come with a high level of uncertainty (low confidence), leading to unnecessary interventions and, thus, wasted effort. This waste is particularly problematic when the resources available to execute interventions are limited. To tackle this shortcoming, this paper proposes an approach to extend existing prescriptive process monitoring methods with so-called conformal predictions, i.e., predictions with confidence guarantees. An empirical evaluation using real-life public datasets shows that conformal predictions enhance the net gain of prescriptive process monitoring methods under limited resources.
    A Boosting Algorithm for Positive-Unlabeled Learning. (arXiv:2205.09485v4 [cs.LG] UPDATED)
    Positive-unlabeled (PU) learning deals with binary classification problems when only positive (P) and unlabeled (U) data are available. Many recent PU methods are based on neural networks, but little has been done to develop boosting algorithms for PU learning, despite boosting algorithms' strong performance on many fully supervised classification problems. In this paper, we propose a novel boosting algorithm, AdaPU, for PU learning. Similarly to AdaBoost, AdaPU aims to optimize an empirical exponential loss, but the loss is based on the PU data, rather than on positive-negative (PN) data. As in AdaBoost, we learn a weighted combination of weak classifiers by learning one weak classifier and its weight at a time. However, AdaPU requires a very different algorithm for learning the weak classifiers and determining their weights. This is because AdaPU learns a weak classifier and its weight using a weighted positive-negative (PN) dataset derived from the original PU data: the data weights are determined by the current weighted classifier combination, and some of them are negative. Our experiments showed that AdaPU outperforms neural networks on several benchmark PU datasets, including a large-scale challenging cyber security dataset.
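    To see where negative weights come from, recall the standard unbiased rewriting of a positive-negative risk in terms of PU data (du Plessis et al.), with class prior $\pi$; AdaPU's exact weighting is derived from its exponential loss, but the same sign pattern appears:

```latex
\widehat{R}_{\mathrm{PU}}(f)
  = \pi\, \widehat{\mathbb{E}}_{P}\bigl[\ell(f(x), +1)\bigr]
  + \widehat{\mathbb{E}}_{U}\bigl[\ell(f(x), -1)\bigr]
  - \pi\, \widehat{\mathbb{E}}_{P}\bigl[\ell(f(x), -1)\bigr].
```

    The last term treats the positives as negatives with weight $-\pi$, which is exactly the kind of negatively weighted PN dataset the algorithm has to handle.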
    Optimizing a Digital Twin for Fault Diagnosis in Grid Connected Inverters -- A Bayesian Approach. (arXiv:2212.03564v1 [eess.SY])
    In this paper, a hyperparameter tuning based Bayesian optimization of digital twins is carried out to diagnose various faults in grid connected inverters. As fault detection and diagnosis require very high precision, we channel our efforts towards an online optimization of the digital twins, which, in turn, allows a flexible implementation with a limited amount of data. As a result, the proposed framework not only becomes a practical solution for model versioning and deployment of digital twins design with limited data, but also allows integration of deep learning tools to improve the hyperparameter tuning capabilities. For classification performance assessment, we consider different fault cases in virtual synchronous generator (VSG) controlled grid-forming converters and demonstrate the efficacy of our approach. Our research outcomes reveal the increased accuracy and fidelity levels achieved by our digital twin design, overcoming the shortcomings of traditional hyperparameter tuning methods.
    Clustering with Neural Network and Index. (arXiv:2212.03853v1 [cs.LG])
    A new model called Clustering with Neural Network and Index (CNNI) is introduced. CNNI uses a neural network to cluster data points. Training of the neural network mimics supervised learning, with an internal clustering evaluation index acting as the loss function. An experiment is conducted to test the feasibility of the new model, and its results are compared with those of other clustering models such as K-means and the Gaussian Mixture Model (GMM).
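    A minimal sketch of the training loop the abstract describes, with a soft within-cluster sum of squares standing in for the unspecified internal index (the network, sizes, and index choice are illustrative assumptions, not CNNI's actual design):

```python
import torch
import torch.nn as nn

def soft_wcss(assign, X):
    """Soft within-cluster sum of squares; assign: (B, K) soft memberships."""
    w = assign / assign.sum(0, keepdim=True).clamp_min(1e-8)     # column-normalised
    centroids = w.t() @ X                                        # (K, D) soft means
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)  # (B, K) distances
    return (assign * d2).sum() / X.shape[0]

X = torch.randn(500, 2)
net = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 3), nn.Softmax(dim=-1))
opt = torch.optim.Adam(net.parameters(), lr=1e-2)
for _ in range(200):
    loss = soft_wcss(net(X), X)   # internal clustering index as the loss function
    opt.zero_grad(); loss.backward(); opt.step()
```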
    Transfer Learning for Functional Linear Regression with Structural Interpretability. (arXiv:2206.04277v3 [stat.ML] UPDATED)
    This work studies the problem of transfer learning under the functional linear regression model framework, which aims to improve the estimation and prediction of the target model by leveraging the information from related source models. We measure the relatedness between target and source models using the Reproducing Kernel Hilbert Space (RKHS) norm, allowing the type of information being transferred to be interpreted by the structural properties of the spaces. Two transfer learning algorithms are proposed: one transfers information from source tasks when we know which sources to use, while the other aggregates multiple transfer learning results from the first algorithm to achieve robust transfer learning without prior information about the sources. Furthermore, we establish the optimal convergence rates for the prediction risk in the target model, making the statistical gain via transfer learning mathematically provable. The theoretical analysis of the prediction risk also provides insights into which factors affect the transfer learning effect, i.e., what makes source tasks useful to the target task. We demonstrate the effectiveness of the proposed transfer learning algorithms on extensive synthetic data as well as a real financial data application.
    Dock2D: Synthetic data for the molecular recognition problem. (arXiv:2212.03456v1 [q-bio.BM])
    Predicting the physical interaction of proteins is a cornerstone problem in computational biology. New classes of learning-based algorithms are actively being developed, and are typically trained end-to-end on protein complex structures extracted from the Protein Data Bank. These training datasets tend to be large and difficult to use for prototyping and, unlike image or natural language datasets, they are not easily interpretable by non-experts. We present Dock2D-IP and Dock2D-IF, two "toy" datasets that can be used to select algorithms predicting protein-protein interactions, or any other type of molecular interaction. Using two-dimensional shapes as input, each example from Dock2D-IP ("interaction pose") describes the interaction pose of two shapes known to interact and each example from Dock2D-IF ("interaction fact") describes whether two shapes form a stable complex or not. We propose a number of baseline solutions to the problem and show that the same underlying energy function can be learned either by solving the interaction pose task (formulated as an energy-minimization "docking" problem) or the fact-of-interaction task (formulated as a binding free energy estimation problem).  ( 2 min )
    Partial Disentanglement with Partially-Federated GANs (PaDPaF). (arXiv:2212.03836v1 [cs.CV])
    Federated learning has become a popular machine learning paradigm with many potential real-life applications, including recommendation systems, the Internet of Things (IoT), healthcare, and self-driving cars. Though most current applications focus on classification-based tasks, learning personalized generative models remains largely unexplored, and their benefits in the heterogeneous setting still need to be better understood. This work proposes a novel architecture combining global client-agnostic and local client-specific generative models. We show that, using standard techniques for training federated models, our proposed model achieves privacy and personalization by implicitly disentangling the globally-consistent representation (i.e. content) from the client-dependent variations (i.e. style). Using such decomposition, personalized models can generate locally unseen labels while preserving the given style of the client and can predict the labels for all clients with high accuracy by training a simple linear classifier on the global content features. Furthermore, disentanglement enables other essential applications, such as data anonymization, by sharing only content. Extensive experimental evaluation corroborates our findings, and we also provide partial theoretical justifications for the proposed approach.
    GP-BART: a novel Bayesian additive regression trees approach using Gaussian processes. (arXiv:2204.02112v3 [stat.ME] UPDATED)
    The Bayesian additive regression trees (BART) model is an ensemble method extensively and successfully used in regression tasks due to its consistently strong predictive performance and its ability to quantify uncertainty. BART combines "weak" tree models through a set of shrinkage priors, whereby each tree explains a small portion of the variability in the data. However, the lack of smoothness and the absence of a covariance structure over the observations in standard BART can yield poor performance in cases where such assumptions would be necessary. We propose Gaussian processes Bayesian additive regression trees (GP-BART) as an extension of BART which assumes Gaussian process (GP) priors for the predictions of each terminal node among all trees. We illustrate our model on simulated and real data and compare its performance to traditional modelling approaches, outperforming them in many scenarios. An implementation of our method is available in the R package rGPBART at: https://github.com/MateusMaiaDS/gpbart
    Detecting hidden confounding in observational data using multiple environments. (arXiv:2205.13935v2 [stat.ME] UPDATED)
    A common assumption in causal inference from observational data is that there is no hidden confounding. Yet it is, in general, impossible to verify the presence of hidden confounding factors from a single dataset. Under the assumption of independent causal mechanisms underlying the data generating process, we demonstrate a way to detect unobserved confounders when having multiple observational datasets coming from different environments. We present a theory for testable conditional independencies that are only absent during hidden confounding and examine cases where we violate its assumptions: degenerate & dependent mechanisms, and faithfulness violations. Additionally, we propose a procedure to test these independencies and study its empirical finite-sample behavior using simulation studies and semi-synthetic data based on a real-world dataset. In most cases, our theory correctly predicts the presence of hidden confounding, particularly when the confounding bias is large.
    Enhancing Low-Density EEG-Based Brain-Computer Interfaces with Similarity-Keeping Knowledge Distillation. (arXiv:2212.03329v1 [cs.LG])
    Electroencephalogram (EEG) has been one of the common neuromonitoring modalities for real-world brain-computer interfaces (BCIs) because of its non-invasiveness, low cost, and high temporal resolution. Recently, lightweight and portable EEG wearable devices based on low-density montages have increased the convenience and usability of BCI applications. However, loss of EEG decoding performance is often inevitable due to the reduced number of electrodes and coverage of scalp regions in a low-density EEG montage. To address this issue, we introduce knowledge distillation (KD), a learning mechanism developed for transferring knowledge/information between neural network models, to enhance the performance of low-density EEG decoding. Our framework includes a newly proposed similarity-keeping (SK) teacher-student KD scheme that encourages a low-density EEG student model to acquire the inter-sample similarity as in a pre-trained teacher model trained on high-density EEG data. The experimental results validate that our SK-KD framework consistently improves motor-imagery EEG decoding accuracy as the number of electrodes of the input EEG data decreases. For both common low-density headphone-like and headband-like montages, our method outperforms state-of-the-art KD methods across various EEG decoding model architectures. As the first KD scheme developed for enhancing EEG decoding, we foresee the proposed SK-KD framework facilitating the practicality of low-density EEG-based BCIs in real-world applications.
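    The similarity-keeping idea can be sketched in a few lines: compare the batch-level Gram (inter-sample similarity) matrices of student and teacher features. This follows the general similarity-preserving KD recipe and is an assumption-laden illustration, not necessarily the paper's exact loss:

```python
import torch
import torch.nn.functional as F

def similarity_keeping_loss(f_student, f_teacher):
    """Match inter-sample similarity structure between student and teacher."""
    s = F.normalize(f_student.flatten(1), dim=1)   # (B, Ds), unit-norm rows
    t = F.normalize(f_teacher.flatten(1), dim=1)   # (B, Dt)
    return F.mse_loss(s @ s.t(), t @ t.t())        # compare (B, B) Gram matrices
```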
    Bringing the Algorithms to the Data -- Secure Distributed Medical Analytics using the Personal Health Train (PHT-meDIC). (arXiv:2212.03481v1 [cs.LG])
    The need for data privacy and security -- enforced through increasingly strict data protection regulations -- renders the use of healthcare data for machine learning difficult. In particular, the transfer of data between different hospitals is often not permissible and thus cross-site pooling of data is not an option. The Personal Health Train (PHT) paradigm proposed within the GO-FAIR initiative implements an 'algorithm to the data' paradigm that ensures that distributed data can be accessed for analysis without transferring any sensitive data. We present PHT-meDIC, a productively deployed open-source implementation of the PHT concept. Containerization allows us to easily deploy even complex data analysis pipelines (e.g., genomics, image analysis) across multiple sites in a secure and scalable manner. We discuss the underlying technological concepts, security models, and governance processes. The implementation has been successfully applied to distributed analyses of large-scale data, including applications of deep neural networks to medical image data.
    PowerFDNet: Deep Learning-Based Stealthy False Data Injection Attack Detection for AC-model Transmission Systems. (arXiv:2207.10805v2 [cs.CR] UPDATED)
    Recent studies have demonstrated that smart grids are vulnerable to stealthy false data injection attacks (SFDIAs), as SFDIAs can bypass residual-based bad data detection mechanisms. The SFDIA detection has become one of the focuses of smart grid research. Methods based on deep learning technology have shown promising accuracy in the detection of SFDIAs. However, most existing methods rely on the temporal structure of a sequence of measurements but do not take account of the spatial structure between buses and transmission lines. To address this issue, we propose a spatiotemporal deep network, PowerFDNet, for the SFDIA detection in AC-model power grids. The PowerFDNet consists of two sub-architectures: spatial architecture (SA) and temporal architecture (TA). The SA is aimed at extracting representations of bus/line measurements and modeling the spatial structure based on their representations. The TA is aimed at modeling the temporal structure of a sequence of measurements. Therefore, the proposed PowerFDNet can effectively model the spatiotemporal structure of measurements. Case studies on the detection of SFDIAs on the benchmark smart grids show that the PowerFDNet achieved significant improvement compared with the state-of-the-art SFDIA detection methods. In addition, an IoT-oriented lightweight prototype of size 52 MB is implemented and tested for mobile devices, which demonstrates its potential for mobile applications. The trained model will be available at https://github.com/HubYZ/PowerFDNet.
    Diffusion-SDF: Text-to-Shape via Voxelized Diffusion. (arXiv:2212.03293v1 [cs.CV])
    With the rising industrial attention to 3D virtual modeling technology, generating novel 3D content based on specified conditions (e.g. text) has become a hot topic. In this paper, we propose a new generative 3D modeling framework called Diffusion-SDF for the challenging task of text-to-shape synthesis. Previous approaches lack flexibility in both 3D data representation and shape generation, thereby failing to generate highly diversified 3D shapes conforming to the given text descriptions. To address this, we propose an SDF autoencoder together with a Voxelized Diffusion model to learn and generate representations for voxelized signed distance fields (SDFs) of 3D shapes. Specifically, we design a novel UinU-Net architecture that implants a local-focused inner network inside the standard U-Net architecture, which enables better reconstruction of patch-independent SDF representations. We extend our approach to further text-to-shape tasks including text-conditioned shape completion and manipulation. Experimental results show that Diffusion-SDF is capable of generating both high-quality and highly diversified 3D shapes that conform well to the given text descriptions. Diffusion-SDF has demonstrated its superiority compared to previous state-of-the-art text-to-shape approaches.
    Drift Identification for Lévy alpha-Stable Stochastic Systems. (arXiv:2212.03317v1 [stat.ML])
    This paper focuses on a stochastic system identification problem: given time series observations of a stochastic differential equation (SDE) driven by Lévy $\alpha$-stable noise, estimate the SDE's drift field. For $\alpha$ in the interval $[1,2)$, the noise is heavy-tailed, leading to computational difficulties for methods that compute transition densities and/or likelihoods in physical space. We propose a Fourier space approach that centers on computing time-dependent characteristic functions, i.e., Fourier transforms of time-dependent densities. Parameterizing the unknown drift field using Fourier series, we formulate a loss consisting of the squared error between predicted and empirical characteristic functions. We minimize this loss with gradients computed via the adjoint method. For a variety of one- and two-dimensional problems, we demonstrate that this method is capable of learning drift fields in qualitative and/or quantitative agreement with ground truth fields.
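    The key object is cheap to compute: the empirical characteristic function of the observed states at each time, against which the model-predicted characteristic function is fit by squared error. A small numpy sketch (the frequency grid and shapes are illustrative assumptions):

```python
import numpy as np

def empirical_cf(samples, freqs):
    """samples: (N, d) states at one time; freqs: (M, d) Fourier nodes -> (M,) CF."""
    return np.exp(1j * samples @ freqs.T).mean(axis=0)

def cf_loss(pred_cf, samples, freqs):
    """Squared error between predicted and empirical characteristic functions."""
    return np.sum(np.abs(pred_cf - empirical_cf(samples, freqs)) ** 2)
```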
    Computing linear sections of varieties: quantum entanglement, tensor decompositions and beyond. (arXiv:2212.03851v1 [cs.DS])
    We study the problem of finding elements in the intersection of an arbitrary conic variety in $\mathbb{F}^n$ with a given linear subspace (where $\mathbb{F}$ can be the real or complex field). This problem captures a rich family of algorithmic problems under different choices of the variety. The special case of the variety consisting of rank-1 matrices already has strong connections to central problems in different areas like quantum information theory and tensor decompositions. This problem is known to be NP-hard in the worst-case, even for the variety of rank-1 matrices. Surprisingly, despite these hardness results we give efficient algorithms that solve this problem for "typical" subspaces. Here, the subspace $U \subseteq \mathbb{F}^n$ is chosen generically of a certain dimension, potentially with some generic elements of the variety contained in it. Our main algorithmic result is a polynomial time algorithm that recovers all the elements of $U$ that lie in the variety, under some mild non-degeneracy assumptions on the variety. As corollaries, we obtain the following results: $\bullet$ Uniqueness results and polynomial time algorithms for generic instances of a broad class of low-rank decomposition problems that go beyond tensor decompositions. Here, we recover a decomposition of the form $\sum_{i=1}^R v_i \otimes w_i$, where the $v_i$ are elements of the given variety $X$. This implies new algorithmic results even in the special case of tensor decompositions. $\bullet$ Polynomial time algorithms for several entangled subspaces problems in quantum entanglement, including determining $r$-entanglement, complete entanglement, and genuine entanglement of a subspace. While all of these problems are NP-hard in the worst case, our algorithm solves them in polynomial time for generic subspaces of dimension up to a constant multiple of the maximum possible.
    Diffusion Art or Digital Forgery? Investigating Data Replication in Diffusion Models. (arXiv:2212.03860v1 [cs.LG])
    Cutting-edge diffusion models produce images with high quality and customizability, enabling them to be used for commercial art and graphic design purposes. But do diffusion models create unique works of art, or are they stealing content directly from their training sets? In this work, we study image retrieval frameworks that enable us to compare generated images with training samples and detect when content has been replicated. Applying our frameworks to diffusion models trained on multiple datasets including Oxford flowers, Celeb-A, ImageNet, and LAION, we discuss how factors such as training set size impact rates of content replication. We also identify cases where diffusion models, including the popular Stable Diffusion model, blatantly copy from their training data.
    Reinforcement Learning with Non-Exponential Discounting. (arXiv:2209.13413v2 [cs.LG] UPDATED)
    Commonly in reinforcement learning (RL), rewards are discounted over time using an exponential function to model time preference, thereby bounding the expected long-term reward. In contrast, in economics and psychology, it has been shown that humans often adopt a hyperbolic discounting scheme, which is optimal when a specific task termination time distribution is assumed. In this work, we propose a theory for continuous-time model-based reinforcement learning generalized to arbitrary discount functions. This formulation covers the case in which there is a non-exponential random termination time. We derive a Hamilton-Jacobi-Bellman (HJB) equation characterizing the optimal policy and describe how it can be solved using a collocation method, which uses deep learning for function approximation. Further, we show how the inverse RL problem can be approached, in which one tries to recover properties of the discount function given decision data. We validate the applicability of our proposed approach on two simulated problems. Our approach opens the way for the analysis of human discounting in sequential decision-making tasks.
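    For concreteness, the two standard discount functions and the generalized objective read as follows (a general $d(t)$ can be interpreted as the survival function of a random termination time $T$, $d(t) = \mathbb{P}(T > t)$; this framing is a standard fact, written here under illustrative notation):

```latex
d_{\exp}(t) = e^{-\rho t}, \qquad
d_{\mathrm{hyp}}(t) = \frac{1}{1 + kt}, \qquad
V^{\pi}(x) = \mathbb{E}\!\left[ \int_0^{\infty} d(t)\, r(x_t, \pi(x_t))\, dt \,\middle|\, x_0 = x \right].
```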
    Generalized Many-Body Dispersion Correction through Random-phase Approximation for Chemically Accurate Density Functional Theory. (arXiv:2210.09784v3 [physics.chem-ph] UPDATED)
    We extend our recently proposed Deep Learning-aided many-body dispersion (DNN-MBD) model to quadrupole polarizability (Q) terms using a generalized Random Phase Approximation (RPA) formalism, thus enabling the inclusion of van der Waals contributions beyond dipole. The resulting DNN-MBDQ model only relies on ab initio-derived quantities, as the introduced quadrupole polarizabilities are recursively retrieved from dipole ones, in turn modelled via the Tkatchenko-Scheffler method. A transferable and efficient deep neural network (DNN) provides atom-in-molecule volumes, while a single range-separation parameter is used to couple the model to Density Functional Theory (DFT). Since it can be computed at a negligible cost, the DNN-MBDQ approach can be coupled with DFT functionals such as PBE, PBE0, and B86bPBE (dispersionless). The DNN-MBDQ-corrected functionals reach chemical accuracy while exhibiting lower errors compared to the DNN-MBD dipole-only counterparts as well as to other MBD-based dispersion correction models, with accuracy gains of up to 45%.
    X-Paste: Revisit Copy-Paste at Scale with CLIP and StableDiffusion. (arXiv:2212.03863v1 [cs.CV])
    Copy-Paste is a simple and effective data augmentation strategy for instance segmentation. By randomly pasting object instances onto new background images, it creates new training data for free and significantly boosts the segmentation performance, especially for rare object categories. Although diverse, high-quality object instances yield larger performance gains in Copy-Paste, previous works obtain object instances either from human-annotated instance segmentation datasets or by rendering 3D object models, and both approaches are too expensive to scale up to obtain good diversity. In this paper, we revisit Copy-Paste at scale with the power of newly emerged zero-shot recognition models (e.g., CLIP) and text2image models (e.g., StableDiffusion). We demonstrate for the first time that using a text2image model to generate images or a zero-shot recognition model to filter noisily crawled images for different object categories is a feasible way to make Copy-Paste truly scalable. To make such success happen, we design a data acquisition and processing framework, dubbed "X-Paste", upon which a systematic study is conducted. On the LVIS dataset, X-Paste provides impressive improvements over the strong baseline CenterNet2 with Swin-L as the backbone. Specifically, it achieves gains of +2.6 box AP and +2.1 mask AP on all classes, and even larger gains of +6.8 box AP and +6.5 mask AP on long-tail classes.
    StegaPos: Preventing Unwanted Crops and Replacements with Imperceptible Positional Embeddings. (arXiv:2104.12290v2 [cs.CV] UPDATED)
    We present a learned, spatially-varying steganography system that allows detecting when and how images have been altered by cropping, splicing or inpainting after publication. The system comprises a learned encoder that imperceptibly hides distinct positional signatures in every local image region before publication, and an accompanying learned decoder that extracts the steganographic signatures to determine, for each local image region, its 2D positional coordinates within the originally-published image. Crop and replacement edits become detectable by the inconsistencies they cause in the hidden positional signatures. Using a prototype system for small $(400 \times 400)$ images, we show experimentally that simple CNN encoder and decoder architectures can be trained jointly to achieve detection that is reliable and robust, without introducing perceptible distortion. This approach could help individuals and image-sharing platforms certify that an image was published by a trusted source, and also know which parts of such an image, if any, have been substantially altered since publication.
    HAVEN: Hierarchical Cooperative Multi-Agent Reinforcement Learning with Dual Coordination Mechanism. (arXiv:2110.07246v3 [cs.MA] UPDATED)
    Recently, some challenging tasks in multi-agent systems have been solved by some hierarchical reinforcement learning methods. Inspired by the intra-level and inter-level coordination in the human nervous system, we propose a novel value decomposition framework HAVEN based on hierarchical reinforcement learning for fully cooperative multi-agent problems. To address the instability arising from the concurrent optimization of policies between various levels and agents, we introduce the dual coordination mechanism of inter-level and inter-agent strategies by designing reward functions in a two-level hierarchy. HAVEN does not require domain knowledge and pre-training, and can be applied to any value decomposition variant. Our method achieves desirable results on different decentralized partially observable Markov decision process domains and outperforms other popular multi-agent hierarchical reinforcement learning algorithms.
    Country-wide Retrieval of Forest Structure From Optical and SAR Satellite Imagery With Deep Ensembles. (arXiv:2111.13154v2 [cs.CV] UPDATED)
    Monitoring and managing Earth's forests in an informed manner is an important requirement for addressing challenges like biodiversity loss and climate change. While traditional in situ or aerial campaigns for forest assessments provide accurate data for analysis at regional level, scaling them to entire countries and beyond with high temporal resolution is hardly possible. In this work, we propose a method based on deep ensembles that densely estimates forest structure variables at country-scale with 10-meter resolution, using freely available satellite imagery as input. Our method jointly transforms Sentinel-2 optical images and Sentinel-1 synthetic-aperture radar images into maps of five different forest structure variables: 95th height percentile, mean height, density, Gini coefficient, and fractional cover. We train and test our model on reference data from 41 airborne laser scanning missions across Norway and demonstrate that it is able to generalize to unseen test regions, achieving normalized mean absolute errors between 11% and 15%, depending on the variable. Our work is also the first to propose a variant of so-called Bayesian deep learning to densely predict multiple forest structure variables with well-calibrated uncertainty estimates from satellite imagery. The uncertainty information increases the trustworthiness of the model and its suitability for downstream tasks that require reliable confidence estimates as a basis for decision making. We present an extensive set of experiments to validate the accuracy of the predicted maps as well as the quality of the predicted uncertainties. To demonstrate scalability, we provide Norway-wide maps for the five forest structure variables.
    Scheduling with Speed Predictions. (arXiv:2205.01247v2 [cs.DS] UPDATED)
    Algorithms with predictions is a recent framework that has been used to overcome pessimistic worst-case bounds in incomplete information settings. In the context of scheduling, very recent work has leveraged machine-learned predictions to design algorithms that achieve improved approximation ratios in settings where the processing times of the jobs are initially unknown. In this paper, we study the speed-robust scheduling problem where the speeds of the machines, instead of the processing times of the jobs, are unknown and augment this problem with predictions. Our main result is an algorithm that achieves a $\min\{\eta^2(1+\alpha), (2 + 2/\alpha)\}$ approximation, for any $\alpha \in (0,1)$, where $\eta \geq 1$ is the prediction error. When the predictions are accurate, this approximation outperforms the best known approximation for speed-robust scheduling without predictions of $2-1/m$, where $m$ is the number of machines, while simultaneously maintaining a worst-case approximation of $2 + 2/\alpha$ even when the predictions are arbitrarily wrong. In addition, we obtain improved approximations for three special cases: equal job sizes, infinitesimal job sizes, and binary machine speeds. We also complement our algorithmic results with lower bounds. Finally, we empirically evaluate our algorithm against existing algorithms for speed-robust scheduling.  ( 2 min )
    Artificial Intelligence Security Competition (AISC). (arXiv:2212.03412v1 [cs.CR])
    The security of artificial intelligence (AI) is an important research area towards safe, reliable, and trustworthy AI systems. To accelerate the research on AI security, the Artificial Intelligence Security Competition (AISC) was organized by the Zhongguancun Laboratory, China Industrial Control Systems Cyber Emergency Response Team, Institute for Artificial Intelligence, Tsinghua University, and RealAI as part of the Zhongguancun International Frontier Technology Innovation Competition (https://www.zgc-aisc.com/en). The competition consists of three tracks, including Deepfake Security Competition, Autonomous Driving Security Competition, and Face Recognition Security Competition. This report will introduce the competition rules of these three tracks and the solutions of top-ranking teams in each track.
    HADAS: Hardware-Aware Dynamic Neural Architecture Search for Edge Performance Scaling. (arXiv:2212.03354v1 [cs.LG])
    Dynamic neural networks (DyNNs) have become viable techniques to enable intelligence on resource-constrained edge devices while maintaining computational efficiency. In many cases, the implementation of DyNNs can be sub-optimal due to its underlying backbone architecture being developed at the design stage independent of both: (i) the dynamic computing features, e.g. early exiting, and (ii) the resource efficiency features of the underlying hardware, e.g., dynamic voltage and frequency scaling (DVFS). Addressing this, we present HADAS, a novel Hardware-Aware Dynamic Neural Architecture Search framework that realizes DyNN architectures whose backbone, early exiting features, and DVFS settings have been jointly optimized to maximize performance and resource efficiency. Our experiments using the CIFAR-100 dataset and a diverse set of edge computing platforms have seen HADAS dynamic models achieve up to 57% energy efficiency gains compared to the conventional dynamic ones while maintaining the desired level of accuracy scores. Our code is available at https://github.com/HalimaBouzidi/HADAS  ( 2 min )
    Node-oriented Spectral Filtering for Graph Neural Networks. (arXiv:2212.03654v1 [cs.LG])
    Graph neural networks (GNNs) have shown remarkable performance on homophilic graph data while being far less impressive when handling non-homophilic graph data, due to the inherent low-pass filtering property of GNNs. In general, since real-world graphs are often a complex mixture of diverse subgraph patterns, learning a universal spectral filter on the graph from the global perspective, as in most current works, may still suffer from great difficulty in adapting to the variation of local patterns. On the basis of the theoretical analysis of local patterns, we rethink the existing spectral filtering methods and propose Node-oriented spectral Filtering for Graph Neural Networks (NFGNN). By estimating the node-oriented spectral filter for each node, NFGNN is provided with the capability of precise local node positioning via the generalized translated operator, thus discriminating the variations of local homophily patterns adaptively. Meanwhile, the utilization of re-parameterization brings a good trade-off between global consistency and local sensitivity for learning the node-oriented spectral filters. Furthermore, we theoretically analyze the localization property of NFGNN, demonstrating that the signal after adaptive filtering is still positioned around the corresponding node. Extensive experimental results demonstrate that the proposed NFGNN achieves more favorable performance.
    Learning Reasoning Paths over Semantic Graphs for Video-grounded Dialogues. (arXiv:2103.00820v2 [cs.AI] UPDATED)
    Compared to traditional visual question answering, video-grounded dialogues require additional reasoning over dialogue context to answer questions in a multi-turn setting. Previous approaches to video-grounded dialogues mostly use dialogue context as a simple text input without modelling the inherent information flows at the turn level. In this paper, we propose a novel framework of Reasoning Paths in Dialogue Context (PDC). PDC model discovers information flows among dialogue turns through a semantic graph constructed based on lexical components in each question and answer. PDC model then learns to predict reasoning paths over this semantic graph. Our path prediction model predicts a path from the current turn through past dialogue turns that contain additional visual cues to answer the current question. Our reasoning model sequentially processes both visual and textual information through this reasoning path and the propagated features are used to generate the answer. Our experimental results demonstrate the effectiveness of our method and provide additional insights on how models use semantic dependencies in a dialogue context to retrieve visual cues.
    JamPatoisNLI: A Jamaican Patois Natural Language Inference Dataset. (arXiv:2212.03419v1 [cs.CL])
    JamPatoisNLI provides the first dataset for natural language inference in a creole language, Jamaican Patois. Many of the most-spoken low-resource languages are creoles. These languages commonly have a lexicon derived from a major world language and a distinctive grammar reflecting the languages of the original speakers and the process of language birth by creolization. This gives them a distinctive place in exploring the effectiveness of transfer from large monolingual or multilingual pretrained models. While our work, along with previous work, shows that transfer from these models to low-resource languages that are unrelated to languages in their training set is not very effective, we would expect stronger results from transfer to creoles. Indeed, our experiments show considerably better results from few-shot learning of JamPatoisNLI than for such unrelated languages, and help us begin to understand how the unique relationship between creoles and their high-resource base languages affect cross-lingual transfer. JamPatoisNLI, which consists of naturally-occurring premises and expert-written hypotheses, is a step towards steering research into a traditionally underserved language and a useful benchmark for understanding cross-lingual NLP.  ( 2 min )
    Edge Impulse: An MLOps Platform for Tiny Machine Learning. (arXiv:2212.03332v1 [cs.DC])
    Edge Impulse is a cloud-based machine learning operations (MLOps) platform for developing embedded and edge ML (TinyML) systems that can be deployed to a wide range of hardware targets. Current TinyML workflows are plagued by fragmented software stacks and heterogeneous deployment hardware, making ML model optimizations difficult and unportable. We present Edge Impulse, a practical MLOps platform for developing TinyML systems at scale. Edge Impulse addresses these challenges and streamlines the TinyML design cycle by supporting various software and hardware optimizations to create an extensible and portable software stack for a multitude of embedded systems. As of Oct. 2022, Edge Impulse hosts 118,185 projects from 50,953 developers.  ( 2 min )
    Proposal of a Score Based Approach to Sampling Using Monte Carlo Estimation of Score and Oracle Access to Target Density. (arXiv:2212.03325v1 [stat.ML])
    Score-based approaches to sampling have shown much success as a generative algorithm to produce new samples from a target density given a pool of initial samples. In this work, we consider the setting in which we have no initial samples from the target density, but rather $0^{th}$- and $1^{st}$-order oracle access to the log-likelihood. Such problems may arise in Bayesian posterior sampling, or in approximate minimization of non-convex functions. Using this knowledge alone, we propose a Monte Carlo method to estimate the score empirically as a particular expectation of a random variable. Using this estimator, we can then run a discrete version of the backward flow SDE to produce samples from the target density. This approach has the benefit of not relying on a pool of initial samples from the target density, and it does not rely on a neural network or other black box model to estimate the score.
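    The expectation in question can be sketched concretely. Writing the smoothed density as $p_\sigma = p * \mathcal{N}(0, \sigma^2 I)$, its score at $x$ is a ratio of two Gaussian expectations of the target density, which yields a self-normalised Monte Carlo estimator using only oracle calls to $\log p$ (a sketch of the general idea under these assumptions, not necessarily the paper's exact estimator):

```python
import numpy as np

def mc_score(x, log_p, sigma, n_samples=1024, rng=np.random.default_rng(0)):
    """Estimate grad log p_sigma(x) for p_sigma = p * N(0, sigma^2 I),
    using only 0th-order oracle access to log p."""
    eps = rng.standard_normal((n_samples, x.shape[-1]))
    cand = x + sigma * eps                        # samples from N(x, sigma^2 I)
    logw = np.array([log_p(c) for c in cand])     # oracle calls to the target
    w = np.exp(logw - logw.max())
    w /= w.sum()                                  # self-normalised weights
    return (w[:, None] * (cand - x)).sum(axis=0) / sigma**2
```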
    Giga-SSL: Self-Supervised Learning for Gigapixel Images. (arXiv:2212.03273v1 [cs.CV])
    Whole slide images (WSI) are microscopy images of stained tissue slides routinely prepared for diagnosis and treatment selection in medical practice. WSI are very large (gigapixel size) and complex (made of up to millions of cells). The current state-of-the-art (SoTA) approach to classify WSI subdivides them into tiles, encodes them by pre-trained networks and applies Multiple Instance Learning (MIL) to train for specific downstream tasks. However, annotated datasets are often small, typically a few hundred to a few thousand WSI, which may cause overfitting and underperforming models. Conversely, the number of unannotated WSI is ever increasing, with datasets of tens of thousands (soon to be millions) of images available. While it has been previously proposed to use these unannotated data to identify suitable tile representations by self-supervised learning (SSL), downstream classification tasks still require full supervision because parts of the MIL architecture are not trained during tile-level SSL pre-training. Here, we propose a strategy of slide-level SSL to leverage the large number of WSI without annotations to infer powerful slide representations. Applying our method to The Cancer Genome Atlas, one of the most widely used data resources in cancer research (16 TB of image data), we are able to downsize the dataset to 23 MB without any loss in predictive power: we show that a linear classifier trained on top of these embeddings maintains or improves previous SoTA performances on various benchmark WSI classification tasks. Finally, we observe that training a classifier on these representations with tiny datasets (e.g. 50 slides) improved performances over SoTA by an average of +6.3 AUC points over all downstream tasks.
    veriFIRE: Verifying an Industrial, Learning-Based Wildfire Detection System. (arXiv:2212.03287v1 [cs.LO])
    In this short paper, we present our ongoing work on the veriFIRE project -- a collaboration between industry and academia, aimed at using verification to increase the reliability of a real-world, safety-critical system. The system we target is an airborne platform for wildfire detection, which incorporates two deep neural networks. We describe the system and its properties of interest, and discuss our attempts to verify the system's consistency, i.e., its ability to continue to correctly classify a given input even as the wildfire it depicts increases in intensity. We regard this work as a step towards the incorporation of academic-oriented verification tools into real-world systems of interest.
    Domain Translation via Latent Space Mapping. (arXiv:2212.03361v1 [cs.LG])
    In this paper, we investigate the problem of multi-domain translation: given an element $a$ of domain $A$, we would like to generate a corresponding sample $b$ in another domain $B$, and vice versa. Acquiring supervision in multiple domains can be tedious, so we propose to learn the translation from one domain to another when supervision is available as pairs $(a,b)\sim A\times B$, while leveraging possible unpaired data when only $a\sim A$ or only $b\sim B$ is available. We introduce a new unified framework called Latent Space Mapping (LSM) that exploits the manifold assumption in order to learn, from each domain, a latent space. Unlike existing approaches, we propose to further regularize each latent space using the available domains by learning the dependencies between pairs of domains. We evaluate our approach on three tasks: i) image translation on a synthetic dataset, ii) a real-world semantic segmentation task on medical images, and iii) a real-world facial landmark detection task.
    Pre-trained Encoders in Self-Supervised Learning Improve Secure and Privacy-preserving Supervised Learning. (arXiv:2212.03334v1 [cs.CR])
    Classifiers in supervised learning have various security and privacy issues, e.g., 1) data poisoning attacks, backdoor attacks, and adversarial examples on the security side as well as 2) inference attacks and the right to be forgotten for the training data on the privacy side. Various secure and privacy-preserving supervised learning algorithms with formal guarantees have been proposed to address these issues. However, they suffer from various limitations such as accuracy loss, small certified security guarantees, and/or inefficiency. Self-supervised learning is an emerging technique to pre-train encoders using unlabeled data. Given a pre-trained encoder as a feature extractor, supervised learning can train a simple yet accurate classifier using a small amount of labeled training data. In this work, we perform the first systematic, principled measurement study to understand whether and when a pre-trained encoder can address the limitations of secure or privacy-preserving supervised learning algorithms. Our key findings are that a pre-trained encoder substantially improves 1) both accuracy under no attacks and certified security guarantees against data poisoning and backdoor attacks of state-of-the-art secure learning algorithms (i.e., bagging and KNN), 2) certified security guarantees of randomized smoothing against adversarial examples without sacrificing its accuracy under no attacks, 3) accuracy of differentially private classifiers, and 4) accuracy and/or efficiency of exact machine unlearning.
    Graph Summarization with Graph Neural Networks. (arXiv:2203.05919v2 [cs.LG] UPDATED)
    The goal of graph summarization is to represent large graphs in a structured and compact way. A graph summary based on equivalence classes preserves pre-defined features of a graph's vertex within a $k$-hop neighborhood, such as the vertex labels and edge labels. Based on these neighborhood characteristics, the vertex is assigned to an equivalence class. The calculation of the assigned equivalence class must be a permutation-invariant operation on the pre-defined features. This is achieved by sorting on the feature values, e.g., the edge labels, which is computationally expensive, and subsequently hashing the result. Graph Neural Networks (GNN) fulfill the permutation-invariance requirement. We formulate the problem of graph summarization as a subgraph classification task on the root vertex of the $k$-hop neighborhood. We adapt different GNN architectures, both based on the popular message-passing protocol and alternative approaches, to perform the structural graph summarization task. We compare the different GNNs with a standard multi-layer perceptron (MLP) and a Bloom filter as a non-neural method. For our experiments, we consider four popular graph summary models on a large web graph. This resembles challenging multi-class vertex classification tasks with the number of classes ranging from $576$ to several hundred thousand. Our results show that the performances of the GNNs are close to each other. In three out of four experiments, the non-message-passing GraphMLP model outperforms the other GNNs. The performance of the standard MLP is extraordinarily good, especially in the presence of many classes. Finally, the Bloom filter outperforms all neural architectures by a large margin, except for the dataset with the fewest classes ($576$).
    First Go, then Post-Explore: the Benefits of Post-Exploration in Intrinsic Motivation. (arXiv:2212.03251v1 [cs.LG])
    Go-Explore achieved breakthrough performance on challenging reinforcement learning (RL) tasks with sparse rewards. The key insight of Go-Explore was that successful exploration requires an agent to first return to an interesting state ('Go'), and only then explore into unknown terrain ('Explore'). We refer to such exploration after a goal is reached as 'post-exploration'. In this paper, we present a clear ablation study of post-exploration in a general intrinsically motivated goal exploration process (IMGEP) framework, which the Go-Explore paper did not show. We study the isolated potential of post-exploration by turning it on and off within the same algorithm, under both tabular and deep RL settings, on both discrete navigation and continuous control tasks. Experiments on a range of MiniGrid and Mujoco environments show that post-exploration indeed helps IMGEP agents reach more diverse states and boosts their performance. In short, our work suggests that RL researchers should consider using post-exploration in IMGEP when possible, since it is effective, method-agnostic and easy to implement.
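    To make the ablation concrete, a minimal sketch of a goal-conditioned episode with a post-exploration switch might look as follows; `env` and `agent` are hypothetical interfaces following the usual Gym conventions, not the paper's code.

```python
def imgep_episode(env, agent, post_explore=True, k_post=10):
    """One IMGEP-style episode: sample a goal, act toward it ('Go'),
    and, if the goal is reached and the flag is on, keep exploring
    from that state ('post-Explore'). Turning `post_explore` off
    recovers the ablated baseline."""
    goal = agent.sample_goal()
    obs = env.reset()
    reached = False
    done = False
    while not done:
        action = agent.act(obs, goal)              # goal-conditioned policy
        obs, reward, done, info = env.step(action)
        if info.get("goal_reached", False):
            reached = True
            break
    if post_explore and reached:
        for _ in range(k_post):                    # explore from the goal state
            obs, reward, done, info = env.step(env.action_space.sample())
            if done:
                break
    agent.update()                                 # learn from the trajectory
```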
    Machine Learning Assisted Inverse Design of Microresonators. (arXiv:2212.03243v1 [cs.LG])
    The high demand for fabricating microresonators with desired optical properties has led to various techniques to optimize geometries, mode structures, nonlinearities and dispersion. Depending on the application, the dispersion in such resonators counters their optical nonlinearities and influences the intracavity optical dynamics. In this paper, we demonstrate the use of a machine learning (ML) algorithm as a tool to determine the geometry of microresonators from their dispersion profiles. The training dataset of ~460 samples is generated by finite element simulations, and the model is experimentally verified using integrated silicon nitride microresonators. Two ML algorithms are compared along with suitable hyperparameter tuning, of which Random Forest (RF) yields the best results. The average error on the simulated data is well below 15%.
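    A minimal sketch of this kind of inverse-design regression with a Random Forest, using synthetic stand-in data (the feature and target layout here is our assumption, not the paper's):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: X holds dispersion profiles sampled at 100
# frequencies, y holds geometry parameters (e.g. width, height, radius).
rng = np.random.default_rng(0)
y = rng.uniform(0.5, 2.0, size=(460, 3))              # ~460 simulated samples
X = y @ rng.normal(size=(3, 100)) + 0.01 * rng.normal(size=(460, 100))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_tr, y_tr)

rel_err = np.abs(model.predict(X_te) - y_te) / np.abs(y_te)
print(f"mean relative error: {rel_err.mean():.1%}")
```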
    Bi-LSTM Price Prediction based on Attention Mechanism. (arXiv:2212.03443v1 [q-fin.CP])
    As the financial derivatives market grows richer and more developed, transactions occur at ever higher frequency. Given human limitations, algorithmic and automatic trading have recently become the focus of discussion. In this paper, we propose a bidirectional LSTM neural network based on an attention mechanism, evaluated on two popular assets, gold and Bitcoin. For feature engineering, we add traditional technical factors and also derive factors from time-series models. For the model architecture, we chose a two-layer deep learning network. Measured by AUC, accuracy on Bitcoin and gold is 71.94% and 73.03%, respectively. Using the forecast results, we achieved a return of 1089.34% in two years. We also compare the attention Bi-LSTM model proposed in this paper with traditional models, and the results show that our model performs best on this dataset. Finally, we discuss the significance of the model and the experimental results, as well as possible directions for future improvement.  ( 2 min )
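    As one plausible reading of the architecture, here is a compact PyTorch sketch of a bidirectional LSTM with additive attention pooling over time steps; the layer sizes, window length, and single-output head are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class AttnBiLSTM(nn.Module):
    """Two-layer bidirectional LSTM with attention pooling over time."""
    def __init__(self, n_features, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)    # one score per time step
        self.head = nn.Linear(2 * hidden, 1)    # price / direction output

    def forward(self, x):                       # x: (batch, time, n_features)
        h, _ = self.lstm(x)                     # (batch, time, 2*hidden)
        w = torch.softmax(self.attn(h), dim=1)  # attention weights over time
        ctx = (w * h).sum(dim=1)                # weighted context vector
        return self.head(ctx)

model = AttnBiLSTM(n_features=8)
pred = model(torch.randn(32, 30, 8))            # 30-step windows, 8 factors
```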
    CrossPyramid: Neural Ordinary Differential Equations Architecture for Partially-observed Time-series. (arXiv:2212.03560v1 [cs.LG])
    Ordinary Differential Equations (ODE)-based models have become popular foundation models to solve many time-series problems. Combining neural ODEs with traditional RNN models has provided the best representation for irregular time series. However, ODE-based models require the trajectory of hidden states to be defined based on the initial observed value or the last available observation. This raises questions about how long the generated hidden state remains informative and whether it is effective when long sequences are used instead of the typically used shorter sequences. In this article, we introduce CrossPyramid, a novel ODE-based model that aims to enhance the generalizability of sequence representations. CrossPyramid does not rely only on the hidden state from the last observed value; it also considers ODE latent representations learned from other samples. The main idea of our proposed model is to define the hidden state for the unobserved values based on the non-linear correlation between samples. Accordingly, CrossPyramid is built with three distinctive parts: (1) an ODE Auto-Encoder to learn the best data representation, (2) a pyramidal attention method to categorize the learned representations (hidden states) based on the relationship characteristics between samples, and (3) a cross-level ODE-RNN to integrate the previously learned information and provide the final latent state for each sample. Through extensive experiments on partially-observed synthetic and real-world datasets, we show that the proposed architecture can effectively model the long gaps in intermittent series and outperforms state-of-the-art approaches. The results show an average improvement of 10\% on univariate and multivariate datasets for both forecasting and classification tasks.
    A Neural Network Approach for Selecting Track-like Events in Fluorescence Telescope Data. (arXiv:2212.03787v1 [astro-ph.IM])
    In 2016-2017, TUS, the world's first experiment for testing the possibility of registering ultra-high energy cosmic rays (UHECRs) by their fluorescent radiation in the night atmosphere of Earth, was carried out. Since 2019, the Russian-Italian fluorescence telescope (FT) Mini-EUSO ("UV Atmosphere") has been operating on the ISS. The stratospheric experiment EUSO-SPB2, which will employ an FT for registering UHECRs, is planned for 2023. We show how a simple convolutional neural network can be effectively used to find track-like events in the variety of data obtained with such instruments.
    Towards Fleet-wide Sharing of Wind Turbine Condition Information through Privacy-preserving Federated Learning. (arXiv:2212.03529v1 [cs.LG])
    Terabytes of data are collected every day by wind turbine manufacturers from their fleets. The data contain valuable real-time information for turbine health diagnostics and performance monitoring, and for predicting rare failures and the remaining service life of critical parts. And yet, this wealth of data from wind turbine fleets remains inaccessible to operators, utility companies, and researchers, as manufacturing companies prefer to keep their fleets' turbine data private for business-strategic reasons. The lack of data access impedes the exploitation of opportunities, such as improving data-driven turbine operation and maintenance strategies and reducing downtimes. We present a distributed federated machine learning approach that leaves the data on the wind turbines to preserve the data privacy, as desired by manufacturers, while still enabling fleet-wide learning on those local data. We demonstrate in a case study that wind turbines with scarce representative training data benefit from more accurate fault detection models through federated learning, while no turbine experiences a loss in model performance by participating in the federated learning process. When comparing conventional and federated training processes, the average model training time rises significantly, by a factor of 7, in the federated setting due to increased communication and overhead operations. Thus, model training times might constitute an impediment that needs to be further explored and alleviated in federated learning applications, especially for large wind turbine fleets.
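    The training loop behind such a setup is typically a FedAvg-style round; the sketch below (hypothetical interfaces, a simple model with floating-point parameters only) shows local training on each turbine followed by a size-weighted average of the weights, so raw data never leaves the device.

```python
import copy
import torch
import torch.nn.functional as F

def federated_round(global_model, turbines, local_epochs=1, lr=1e-2):
    """One FedAvg-style round. `turbines` is a list of local DataLoaders,
    one per turbine; only weights, never data, are communicated."""
    states, sizes = [], []
    for loader in turbines:
        local = copy.deepcopy(global_model)        # start from global weights
        opt = torch.optim.SGD(local.parameters(), lr=lr)
        for _ in range(local_epochs):
            for x, y in loader:                    # local fault-label batches
                opt.zero_grad()
                loss = F.binary_cross_entropy_with_logits(
                    local(x).squeeze(-1), y)
                loss.backward()
                opt.step()
        states.append(local.state_dict())
        sizes.append(len(loader.dataset))
    total = sum(sizes)
    avg = {k: sum(s[k] * (n / total) for s, n in zip(states, sizes))
           for k in states[0]}                     # size-weighted average
    global_model.load_state_dict(avg)
```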
    Unsupervised spectral-band feature identification for optimal process discrimination. (arXiv:2212.03800v1 [cs.LG])
    Changes in real-world dynamic processes are often described in terms of differences in energies $\textbf{E}(\underline{\alpha})$ of a set of spectral-bands $\underline{\alpha}$. Given continuous spectra of two classes $A$ and $B$, or in general, two stochastic processes $S^{(A)}(f)$ and $S^{(B)}(f)$, $f \in \mathbb{R}^+$, we address the ubiquitous problem of identifying a subset of intervals of $f$ called spectral-bands $\underline{\alpha} \subset \mathbb{R}^+$ such that the energies $\textbf{E}(\underline{\alpha})$ of these bands can optimally discriminate between the two classes. We introduce EGO-MDA, an unsupervised method to identify optimal spectral-bands $\underline{\alpha}^*$ for given samples of spectra from two classes. EGO-MDA employs a statistical approach that iteratively minimizes an adjusted multinomial log-likelihood (deviance) criterion $\mathcal{D}(\underline{\alpha},\mathcal{M})$. Here, Mixture Discriminant Analysis (MDA) aims to derive the MLE of the two GMM distributions' parameters, i.e., $\mathcal{M}^* = \underset{\mathcal{M}}{\rm argmin}~\mathcal{D}(\underline{\alpha}, \mathcal{M})$, and to identify a classifier that optimally discriminates between the two classes for a given spectral representation. Efficient Global Optimization (EGO) finds the spectral-bands $\underline{\alpha}^* = \underset{\underline{\alpha}}{\rm argmin}~\mathcal{D}(\underline{\alpha},\mathcal{M})$ for given GMM parameters $\mathcal{M}$. For pathological cases of low separation between mixtures and model misspecification, we discuss the effect of the sample size and the number of iterations on the estimates of the parameters $\mathcal{M}$ and therefore on the classifier performance. A case study on a synthetic data set is provided. In an engineering application of optimal spectral-banding for anomaly tracking, EGO-MDA achieved at least a 70% improvement in the median deviance relative to other methods tested.
    Online AutoML: An adaptive AutoML framework for online learning. (arXiv:2201.09750v3 [cs.LG] UPDATED)
    Automated Machine Learning (AutoML) has been used successfully in settings where the learning task is assumed to be static. In many real-world scenarios, however, the data distribution will evolve over time, and it is yet to be shown whether AutoML techniques can effectively design online pipelines in dynamic environments. This study aims to automate pipeline design for online learning while continuously adapting to data drift. For this purpose, we design an adaptive Online Automated Machine Learning (OAML) system, searching the complete pipeline configuration space of online learners, including preprocessing algorithms and ensembling techniques. This system combines the inherent adaptation capabilities of online learners with the fast automated pipeline (re)optimization capabilities of AutoML. Focusing on optimization techniques that can adapt to evolving objectives, we evaluate asynchronous genetic programming and asynchronous successive halving to optimize these pipelines continually. We experiment on real and artificial data streams with varying types of concept drift to test the performance and adaptation capabilities of the proposed system. The results confirm the utility of OAML over popular online learning algorithms and underscore the benefits of continuous pipeline redesign in the presence of data drift.
    Tree DNN: A Deep Container Network. (arXiv:2212.03474v1 [cs.LG])
    Multi-Task Learning (MTL) has shown its importance in user-facing products thanks to fast training, data efficiency, reduced overfitting, etc. MTL achieves this by sharing the network parameters and training a network for multiple tasks simultaneously. However, MTL does not provide a solution when each task needs to be trained on a different dataset. To solve the stated problem, we propose an architecture named TreeDNN along with its training methodology. TreeDNN helps in training the model with multiple datasets simultaneously, where each branch of the tree may need a different training dataset. Our results show that TreeDNN provides competitive performance with the advantage of reduced ROM requirements for parameter storage and increased system responsiveness, since only the relevant branch is loaded at inference time.
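    A minimal PyTorch sketch of what such a container network could look like: a shared trunk plus per-task branches, each trainable on its own dataset. The layer sizes and branch heads here are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class TreeDNN(nn.Module):
    """Shared trunk with independently trainable per-task branches."""
    def __init__(self, in_dim, n_classes_per_branch):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                   nn.Linear(256, 128), nn.ReLU())
        self.branches = nn.ModuleList(
            nn.Linear(128, c) for c in n_classes_per_branch)

    def forward(self, x, branch_id):
        # Only the requested branch is evaluated, mirroring the idea of
        # loading a single branch at inference time to save memory.
        return self.branches[branch_id](self.trunk(x))

model = TreeDNN(in_dim=64, n_classes_per_branch=[10, 5, 2])
logits = model(torch.randn(8, 64), branch_id=1)   # task 1's head only
```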
    Dynamic Graph Node Classification via Time Augmentation. (arXiv:2212.03449v1 [cs.LG])
    Node classification for graph-structured data aims to classify nodes whose labels are unknown. While studies on static graphs are prevalent, few studies have focused on dynamic graph node classification. Node classification on dynamic graphs is challenging for two reasons. First, the model needs to capture both structural and temporal information, particularly on dynamic graphs with a long history, which require large receptive fields. Second, model scalability becomes a significant concern as the size of the dynamic graph increases. To address these problems, we propose the Time Augmented Dynamic Graph Neural Network (TADGNN) framework. TADGNN consists of two modules: 1) a time augmentation module that captures the temporal evolution of nodes across time structurally, creating a time-augmented spatio-temporal graph, and 2) an information propagation module that learns the dynamic representations for each node across time using the constructed time-augmented graph. We perform node classification experiments on four dynamic graph benchmarks. Experimental results demonstrate that the TADGNN framework outperforms several static and dynamic state-of-the-art (SOTA) GNN models while demonstrating superior scalability. We also conduct theoretical and empirical analyses to validate the efficiency of the proposed method. Our code is available at https://sites.google.com/view/tadgnn.  ( 2 min )
    Concentration Phenomenon for Random Dynamical Systems: An Operator Theoretic Approach. (arXiv:2212.03670v1 [cs.LG])
    Via operator-theoretic methods, we formalize the concentration phenomenon for a given observable `$r$' of a discrete-time Markov chain with `$\mu_{\pi}$' as its invariant ergodic measure, possibly having support on an unbounded state space. The main contribution of this paper is circumventing tedious probabilistic methods by studying the composition of the Markov transition operator $P$ followed by the multiplication operator defined by $e^{r}$. It turns out that concentration follows even if the observable/reward function is unbounded, as long as for some $q>2$, $\|e^{r}\|_{q \rightarrow 2} \propto \exp\big(\mu_{\pi}(r) +\frac{2q}{q-2}\big)$ and $P$ is hyperbounded, i.e., the norm $\|P\|_{2 \rightarrow q}$ is controlled. The role of \emph{reversibility} in the concentration phenomenon is demystified. These results are particularly useful for the reinforcement learning and controls communities, as they allow for concentration inequalities w.r.t. standard unbounded observables/reward functions where exact knowledge of the system is not available, let alone the reversibility of the stationary measure.
    A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification. (arXiv:2107.07511v6 [cs.LG] UPDATED)
    Black-box machine learning models are now routinely used in high-risk settings, like medical diagnostics, which demand uncertainty quantification to avoid consequential model failures. Conformal prediction is a user-friendly paradigm for creating statistically rigorous uncertainty sets/intervals for the predictions of such models. Critically, the sets are valid in a distribution-free sense: they possess explicit, non-asymptotic guarantees even without distributional assumptions or model assumptions. One can use conformal prediction with any pre-trained model, such as a neural network, to produce sets that are guaranteed to contain the ground truth with a user-specified probability, such as 90%. It is easy to understand, easy to use, and general, applying naturally to problems arising in the fields of computer vision, natural language processing, deep reinforcement learning, and so on. This hands-on introduction aims to provide the reader with a working understanding of conformal prediction and related distribution-free uncertainty quantification techniques in one self-contained document. We lead the reader through practical theory for and examples of conformal prediction and describe its extensions to complex machine learning tasks involving structured outputs, distribution shift, time-series, outliers, models that abstain, and more. Throughout, there are many explanatory illustrations, examples, and code samples in Python. With each code sample comes a Jupyter notebook implementing the method on a real-data example; the notebooks can be accessed and easily run using our codebase.
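    For readers who want the flavor in code, here is a minimal NumPy sketch of split conformal prediction for classification with the simplest conformity score, $1 - \hat{p}(y|x)$, in the spirit of the paper's opening example; the variable names are ours, and the paper's notebooks cover many refinements.

```python
import numpy as np

def conformal_sets(probs_cal, y_cal, probs_test, alpha=0.1):
    """Split conformal prediction sets from any pre-trained classifier.
    probs_*: softmax outputs; alpha=0.1 targets 90% marginal coverage."""
    n = len(y_cal)
    scores = 1.0 - probs_cal[np.arange(n), y_cal]        # conformity scores
    level = np.ceil((n + 1) * (1 - alpha)) / n           # finite-sample correction
    qhat = np.quantile(scores, level, method="higher")   # calibrated cutoff
    return probs_test >= 1.0 - qhat                      # boolean set per class
```

    Each row of the returned mask is a prediction set: the labels whose softmax score clears the calibrated threshold, guaranteed to contain the true label with probability at least $1-\alpha$ on exchangeable data.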
    Phase2vec: Dynamical systems embedding with a physics-informed convolutional network. (arXiv:2212.03857v1 [cs.LG])
    Dynamical systems are found in innumerable forms across the physical and biological sciences, yet all these systems fall naturally into universal equivalence classes: conservative or dissipative, stable or unstable, compressible or incompressible. Predicting these classes from data remains an essential open challenge in computational physics, at which existing time-series classification methods struggle. Here, we propose \texttt{phase2vec}, an embedding method that learns high-quality, physically-meaningful representations of 2D dynamical systems without supervision. Our embeddings are produced by a convolutional backbone that extracts geometric features from flow data and minimizes a physically-informed vector field reconstruction loss. In an auxiliary training period, embeddings are optimized so that they robustly encode the equations of unseen data over and above the performance of a per-equation fitting method. The trained architecture can not only predict the equations of unseen data, but also, crucially, learns embeddings that respect the underlying semantics of the embedded physical systems. We validate the quality of the learned embeddings by investigating the extent to which physical categories of input data can be decoded from embeddings compared to standard blackbox classifiers and state-of-the-art time series classification techniques. We find that our embeddings encode important physical properties of the underlying data, including the stability of fixed points, conservation of energy, and the incompressibility of flows, with greater fidelity than competing methods. We finally apply our embeddings to the analysis of meteorological data, showing that we can detect climatically meaningful features. Collectively, our results demonstrate the viability of embedding approaches for the discovery of dynamical features in physical systems.
    Specifying Behavior Preference with Tiered Reward Functions. (arXiv:2212.03733v1 [cs.LG])
    Reinforcement-learning agents seek to maximize a reward signal through environmental interactions. As humans, our contribution to the learning process is through designing the reward function. Like programmers, we have a behavior in mind and have to translate it into a formal specification, namely rewards. In this work, we consider the reward-design problem in tasks formulated as reaching desirable states and avoiding undesirable states. To start, we propose a strict partial ordering of the policy space. We prefer policies that reach the good states faster and with higher probability while avoiding the bad states longer. Next, we propose an environment-independent tiered reward structure and show it is guaranteed to induce policies that are Pareto-optimal according to our preference relation. Finally, we empirically evaluate tiered reward functions on several environments and show they induce desired behavior and lead to fast learning.
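    To illustrate the flavor of a tiered reward (our own toy three-tier example, not the paper's exact construction), the tiers are ordered and spaced so that reaching a better tier sooner always dominates lingering in a worse one:

```python
def tiered_reward(state, horizon=100):
    """Hypothetical 3-tier reward for 'reach goal, avoid bad states'.
    Tier 2 (goal) > tier 1 (ordinary) > tier 0 (bad), with gaps large
    enough that faster goal-reaching and longer bad-state avoidance
    are always preferred."""
    if state.is_goal:
        return 0.0                  # best tier: no penalty
    if state.is_bad:
        return -(horizon + 1.0)     # worst tier: dominating penalty
    return -1.0                     # middle tier: mild per-step cost
```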
    When Geometric Deep Learning Meets Pretrained Protein Language Models. (arXiv:2212.03447v1 [cs.LG])
    Geometric deep learning has recently achieved great success in non-Euclidean domains, and learning on 3D structures of large biomolecules is emerging as a distinct research area. However, its efficacy is largely constrained by the limited quantity of structural data. Meanwhile, protein language models trained on substantial 1D sequences have shown burgeoning capabilities with scale in a broad range of applications. Nevertheless, no preceding studies consider combining these different protein modalities to promote the representation power of geometric neural networks. To address this gap, we take the first step towards integrating the knowledge learned by well-trained protein language models into several state-of-the-art geometric networks. Experiments are evaluated on a variety of protein representation learning benchmarks, including protein-protein interface prediction, model quality assessment, protein-protein rigid-body docking, and binding affinity prediction, leading to an overall improvement of 20% over baselines and new state-of-the-art performance. Strong evidence indicates that the incorporation of protein language models' knowledge enhances geometric networks' capacity by a significant margin and can be generalized to complex tasks.
    Adaptive Mixing of Auxiliary Losses in Supervised Learning. (arXiv:2202.03250v3 [cs.LG] UPDATED)
    In several supervised learning scenarios, auxiliary losses are used in order to introduce additional information or constraints into the supervised learning objective. For instance, knowledge distillation aims to mimic outputs of a powerful teacher model; similarly, in rule-based approaches, weak labeling information is provided by labeling functions which may be noisy rule-based approximations to true labels. We tackle the problem of learning to combine these losses in a principled manner. Our proposal, AMAL, uses a bi-level optimization criterion on validation data to learn optimal mixing weights, at an instance level, over the training data. We describe a meta-learning approach towards solving this bi-level objective and show how it can be applied to different scenarios in supervised learning. Experiments in a number of knowledge distillation and rule-denoising domains show that AMAL provides noticeable gains over competitive baselines in those domains. We empirically analyze our method and share insights into the mechanisms through which it provides performance gains.
    Unifying Diffusion Models' Latent Space, with Applications to CycleDiffusion and Guidance. (arXiv:2210.05559v2 [cs.CV] UPDATED)
    Diffusion models have achieved unprecedented performance in generative modeling. The commonly-adopted formulation of the latent code of diffusion models is a sequence of gradually denoised samples, as opposed to the simpler (e.g., Gaussian) latent space of GANs, VAEs, and normalizing flows. This paper provides an alternative, Gaussian formulation of the latent space of various diffusion models, as well as an invertible DPM-Encoder that maps images into the latent space. While our formulation is purely based on the definition of diffusion models, we demonstrate several intriguing consequences. (1) Empirically, we observe that a common latent space emerges from two diffusion models trained independently on related domains. In light of this finding, we propose CycleDiffusion, which uses DPM-Encoder for unpaired image-to-image translation. Furthermore, applying CycleDiffusion to text-to-image diffusion models, we show that large-scale text-to-image diffusion models can be used as zero-shot image-to-image editors. (2) One can guide pre-trained diffusion models and GANs by controlling the latent codes in a unified, plug-and-play formulation based on energy-based models. Using the CLIP model and a face recognition model as guidance, we demonstrate that diffusion models have better coverage of low-density sub-populations and individuals than GANs. The code is publicly available at https://github.com/ChenWu98/cycle-diffusion.
    MIMO-DBnet: Multi-channel Input and Multiple Outputs DOA-aware Beamforming Network for Speech Separation. (arXiv:2212.03401v1 [eess.AS])
    Recently, many deep learning based beamformers have been proposed for multi-channel speech separation. Nevertheless, most of them rely on extra cues known in advance, such as speaker features, face images or directional information. In this paper, we propose an end-to-end beamforming network for direction-guided speech separation given merely the mixture signal, namely MIMO-DBnet. Specifically, we design a multi-channel input and multiple outputs architecture to predict the direction-of-arrival based embeddings and beamforming weights for each source. The precisely estimated directional embedding provides quite effective spatial discrimination guidance for the neural beamformer to offset the effect of phase wrapping, thus allowing more accurate reconstruction of the two sources' speech signals. Experiments show that our proposed MIMO-DBnet not only achieves consistent improvements over baseline systems, but also maintains performance on high-frequency bands when phase wrapping occurs.
    rx-anon -- A Novel Approach on the De-Identification of Heterogeneous Data based on a Modified Mondrian Algorithm. (arXiv:2105.08842v2 [cs.LG] UPDATED)
    Traditional approaches for data anonymization consider relational data and textual data independently. We propose rx-anon, an anonymization approach for heterogeneous semi-structured documents composed of relational and textual attributes. We map sensitive terms extracted from the text to the structured data. This allows us to use concepts like k-anonymity to generate a joined, privacy-preserved version of the heterogeneous data input. We introduce the concept of redundant sensitive information to consistently anonymize the heterogeneous data. To control the influence of anonymization over unstructured textual data versus structured data attributes, we introduce a modified, parameterized Mondrian algorithm. The parameter $\lambda$ allows different weights to be given to the relational and textual attributes during the anonymization process. We evaluate our approach with two real-world datasets using a Normalized Certainty Penalty score, adapted to the problem of jointly anonymizing relational and textual data. The results show that our approach is capable of reducing information loss by using the tuning parameter to control the Mondrian partitioning while guaranteeing k-anonymity for relational attributes as well as for sensitive terms. As rx-anon is a framework approach, it can be reused and extended by other anonymization algorithms, privacy models, and textual similarity metrics.
    Learning State Transition Rules from Hidden Layers of Restricted Boltzmann Machines. (arXiv:2212.03374v1 [cs.LG])
    Understanding the dynamics of a system is important in many scientific and engineering domains. This problem can be approached by learning state transition rules from observations using machine learning techniques. Such observed time-series data often consist of sequences of many continuous variables with noise and ambiguity, but we often need rules of dynamics that can be modeled with a few essential variables. In this work, we propose a method for extracting a small number of essential hidden variables from high-dimensional time-series data and for learning state transition rules between these hidden variables. The proposed method is based on the Restricted Boltzmann Machine (RBM), which treats observable data in the visible layer and latent features in the hidden layer. However, real-world data, such as video and audio, include both discrete and continuous variables, and these variables have temporal relationships. Therefore, we propose the Recurrent Temporal Gaussian-Bernoulli Restricted Boltzmann Machine (RTGB-RBM), which combines a Gaussian-Bernoulli Restricted Boltzmann Machine (GB-RBM), to handle continuous visible variables, with a Recurrent Temporal Restricted Boltzmann Machine (RT-RBM), to capture time dependence between discrete hidden variables. We also propose a rule-based method that extracts essential information as hidden variables and represents state transition rules in interpretable form. We conduct experiments on the Bouncing Ball and Moving MNIST datasets to evaluate our proposed method. Experimental results show that our method can learn the dynamics of those physical systems as state transition rules between hidden variables and can predict unobserved future states from observed state transitions.
    PADDLES: Phase-Amplitude Spectrum Disentangled Early Stopping for Learning with Noisy Labels. (arXiv:2212.03462v1 [cs.CV])
    Convolutional Neural Networks (CNNs) have demonstrated superiority in learning patterns, but are sensitive to label noises and may overfit noisy labels during training. The early stopping strategy averts updating CNNs during the early training phase and is widely employed in the presence of noisy labels. Motivated by biological findings that the amplitude spectrum (AS) and phase spectrum (PS) in the frequency domain play different roles in the animal's vision system, we observe that PS, which captures more semantic information, can increase the robustness of DNNs to label noise, more so than AS can. We thus propose early stops at different times for AS and PS by disentangling the features of some layer(s) into AS and PS using Discrete Fourier Transform (DFT) during training. Our proposed Phase-AmplituDe DisentangLed Early Stopping (PADDLES) method is shown to be effective on both synthetic and real-world label-noise datasets. PADDLES outperforms other early stopping methods and obtains state-of-the-art performance.
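    The disentangling step itself is straightforward with a differentiable FFT; a small PyTorch sketch of how it could look (our own, with the per-spectrum early-stopping schedule omitted):

```python
import torch

def disentangle(features):
    """Split a feature map into amplitude and phase spectra via the 2D DFT,
    so each can follow its own early-stopping schedule."""
    f = torch.fft.fft2(features)            # DFT over the spatial dims
    return f.abs(), f.angle()               # amplitude spectrum, phase spectrum

def recombine(amplitude, phase):
    """Rebuild features from (possibly separately frozen) AS and PS."""
    f = torch.polar(amplitude, phase)       # amplitude * exp(i * phase)
    return torch.fft.ifft2(f).real

x = torch.randn(8, 64, 32, 32)              # a batch of CNN feature maps
amp, ph = disentangle(x)
x_rec = recombine(amp, ph)                  # ~= x up to numerical error
```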
    Tight bounds for maximum $\ell_1$-margin classifiers. (arXiv:2212.03783v1 [stat.ML])
    Popular iterative algorithms such as boosting methods and coordinate descent on linear models converge to the maximum $\ell_1$-margin classifier, a.k.a. sparse hard-margin SVM, in high-dimensional regimes where the data is linearly separable. Previous works consistently show that many estimators relying on the $\ell_1$-norm achieve improved statistical rates for hard sparse ground truths. We show that, surprisingly, this adaptivity does not apply to the maximum $\ell_1$-margin classifier for a standard discriminative setting. In particular, for the noiseless setting, we prove tight upper and lower bounds for the prediction error that match existing rates of order $\frac{\|w^*\|_1^{2/3}}{n^{1/3}}$ for general ground truths $w^*$. To complete the picture, we show that when interpolating noisy observations, the error vanishes at a rate of order $\frac{1}{\sqrt{\log(d/n)}}$. We are therefore the first to show benign overfitting for the maximum $\ell_1$-margin classifier.
    Fallen Angel Bonds Investment and Bankruptcy Predictions Using Manual Models and Automated Machine Learning. (arXiv:2212.03454v1 [q-fin.RM])
    The primary aim of this research was to find the model that best predicts which fallen angel bonds would potentially rise back to investment grade and which would fall into bankruptcy. To implement the solution, we decided the ideal method would be to create an optimal machine learning model to predict bankruptcies. Among the many machine learning models available, we picked four classification methods: logistic regression, KNN, SVM, and NN. We also utilized Google Cloud's automated machine learning. Our model comparisons showed that the models did not predict bankruptcies very well on the original data set, with the exception of Google Cloud's machine learning, which had a high precision score. However, our over-sampled and feature-selected data sets did perform very well. This is likely because the models were over-fitted to the narrative of the over-sampled data (that is, they may not predict well outside this data set). Therefore, we were not able to create a model that we are confident would predict bankruptcies. However, we found value in this project in two key ways. First, Google Cloud's machine learning model either outperformed or performed on par with the other models in every metric and on every data set. Second, we found that utilizing feature selection did not reduce predictive power by much, meaning we can reduce the amount of data to collect in future experiments on predicting bankruptcies.
    Neighborhood Adaptive Estimators for Causal Inference under Network Interference. (arXiv:2212.03683v1 [stat.ML])
    Estimating causal effects has become an integral part of most applied fields. Solving these modern causal questions requires tackling violations of many classical causal assumptions. In this work we consider the violation of the classical no-interference assumption, meaning that the treatment of one individual might affect the outcomes of another. To make interference tractable, we consider a known network that describes how interference may travel. However, unlike previous work in this area, the radius (and intensity) of the interference experienced by a unit is unknown and can depend on different sub-networks of those treated and untreated that are connected to this unit. We study estimators for the average direct treatment effect on the treated in such a setting. The proposed estimator builds upon a Lepski-like procedure that searches over the possible relevant radii and treatment assignment patterns. In contrast to previous work, the proposed procedure aims to approximate the relevant network interference patterns. We establish oracle inequalities and corresponding adaptive rates for the estimation of the interference function. We leverage such estimates to propose and analyze two estimators for the average direct treatment effect on the treated. We address several challenges stemming from the data-driven creation of the patterns (i.e., feature engineering) and the network dependence. In addition to rates of convergence, under mild regularity conditions, we show that one of the proposed estimators is asymptotically normal and unbiased.
    Sequential Predictive Conformal Inference for Time Series. (arXiv:2212.03463v1 [stat.ML])
    We present a new distribution-free conformal prediction algorithm for sequential data (e.g., time series), called the \textit{sequential predictive conformal inference} (\texttt{SPCI}). We specifically account for the fact that time-series data are non-exchangeable, so many existing conformal prediction algorithms based on temporal residuals are not applicable. The main idea is to exploit the temporal dependence of conformity scores; thus, the past conformity scores contain information about future ones. Then we cast the problem of constructing a conformal prediction interval as predicting the quantile of a future residual, given a prediction algorithm. Theoretically, we establish asymptotically valid conditional coverage upon extending consistency analyses in quantile regression. Using simulation and real-data experiments, we demonstrate a significant reduction in interval width of \texttt{SPCI} compared to other existing methods under the desired empirical coverage.
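    A rough sketch of the core idea, regressing quantiles of the next residual on its recent history; note we substitute gradient-boosted quantile regression for whatever quantile estimator the paper actually uses, and all names and the lag window are our own assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def spci_style_interval(residuals, lags=20, alpha=0.1):
    """Predict the (alpha/2, 1-alpha/2) quantiles of the next residual
    from its lagged history, instead of an unconditional quantile."""
    X = np.stack([residuals[i:i + lags]
                  for i in range(len(residuals) - lags)])   # lag features
    y = residuals[lags:]                                    # next residual
    lo = GradientBoostingRegressor(loss="quantile",
                                   alpha=alpha / 2).fit(X, y)
    hi = GradientBoostingRegressor(loss="quantile",
                                   alpha=1 - alpha / 2).fit(X, y)
    last = residuals[-lags:].reshape(1, -1)
    return lo.predict(last)[0], hi.predict(last)[0]

# The returned pair is added to the point forecast to form the interval.
```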
    Truthful Meta-Explanations for Local Interpretability of Machine Learning Models. (arXiv:2212.03513v1 [cs.LG])
    The integration of automated, machine learning-based systems into a wide range of tasks has expanded as a result of their performance and speed. Although there are numerous advantages to employing ML-based systems, if they are not interpretable, they should not be used in critical, high-risk applications where human lives are at risk. To address this issue, researchers and businesses have been focusing on finding ways to improve the interpretability of complex ML systems, and several such methods have been developed. Indeed, there are so many developed techniques that it is difficult for practitioners to choose the best among them for their applications, even when using evaluation metrics. As a result, the demand for a selection tool, a meta-explanation technique based on a high-quality evaluation metric, is apparent. In this paper, we present a local meta-explanation technique which builds on top of the truthfulness metric, which is a faithfulness-based metric. We demonstrate the effectiveness of both the technique and the metric by concretely defining all the concepts and through experimentation.
    Teaching Matters: Investigating the Role of Supervision in Vision Transformers. (arXiv:2212.03862v1 [cs.CV])
    Vision Transformers (ViTs) have gained significant popularity in recent years and have proliferated into many applications. However, it is not well explored how varied their behavior is under different learning paradigms. We compare ViTs trained through different methods of supervision, and show that they learn a diverse range of behaviors in terms of their attention, representations, and downstream performance. We also discover ViT behaviors that are consistent across supervision, including the emergence of Offset Local Attention Heads. These are self-attention heads that attend to a token adjacent to the current token with a fixed directional offset, a phenomenon that to the best of our knowledge has not been highlighted in any prior work. Our analysis shows that ViTs are highly flexible and learn to process local and global information in different orders depending on their training method. We find that contrastive self-supervised methods learn features that are competitive with explicitly supervised features, and they can even be superior for part-level tasks. We also find that the representations of reconstruction-based models show non-trivial similarity to contrastive self-supervised models. Finally, we show how the "best" layer for a given task varies by both supervision method and task, further demonstrating the differing order of information processing in ViTs.
    Active Labeling: Streaming Stochastic Gradients. (arXiv:2205.13255v3 [cs.LG] UPDATED)
    The workhorse of machine learning is stochastic gradient descent. To access stochastic gradients, it is common to consider iteratively input/output pairs of a training dataset. Interestingly, it appears that one does not need full supervision to access stochastic gradients, which is the main motivation of this paper. After formalizing the "active labeling" problem, which focuses on active learning with partial supervision, we provide a streaming technique that provably minimizes the ratio of generalization error over the number of samples. We illustrate our technique in depth for robust regression.
    A neural approach to synchronization in wireless networks with heterogeneous sources of noise. (arXiv:2212.03327v1 [cs.NI])
    The paper addresses state estimation for clock synchronization in the presence of factors affecting the quality of synchronization. Examples are temperature variations and delay asymmetry. These working conditions make synchronization a challenging problem in many wireless environments, such as Wireless Sensor Networks or WiFi. Dynamic state estimation is investigated, as it is essential to overcome non-stationary noises. The two-way timing message exchange synchronization protocol has been taken as a reference. No a-priori assumptions are made on the stochastic environments, and no temperature measurement is executed. The algorithms are specified entirely offline, without the need to tune parameters according to the working conditions. The presented approach proves robust to a large set of temperature variations, different delay distributions and levels of asymmetry in the transmission path.
    Achieving Transparency in Distributed Machine Learning with Explainable Data Collaboration. (arXiv:2212.03373v1 [cs.LG])
    Transparency of Machine Learning models used for decision support in various industries becomes essential for ensuring their ethical use. To that end, feature attribution methods such as SHAP (SHapley Additive exPlanations) are widely used to explain the predictions of black-box machine learning models to customers and developers. However, a parallel trend has been to train machine learning models in collaboration with other data holders without accessing their data. Such models, trained over horizontally or vertically partitioned data, present a challenge for explainable AI because the explaining party may have a biased view of background data or a partial view of the feature space. As a result, explanations obtained from different participants of distributed machine learning might not be consistent with one another, undermining trust in the product. This paper presents an Explainable Data Collaboration Framework based on a model-agnostic additive feature attribution algorithm (KernelSHAP) and Data Collaboration method of privacy-preserving distributed machine learning. In particular, we present three algorithms for different scenarios of explainability in Data Collaboration and verify their consistency with experiments on open-access datasets. Our results demonstrated a significant (by at least a factor of 1.75) decrease in feature attribution discrepancies among the users of distributed machine learning.
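    As a point of reference for the building block being extended, here is a minimal single-party KernelSHAP example with the `shap` package; the distributed Data Collaboration coordination that the paper adds on top is not shown, and the model and dataset are stand-ins.

```python
import shap
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
model = RandomForestClassifier(random_state=0).fit(X, y)

# The background data set defines the "missing feature" baseline; in the
# distributed setting of the paper, each party's view of it may differ,
# which is exactly the source of the attribution discrepancies discussed.
background = shap.sample(X, 50)
explainer = shap.KernelExplainer(
    lambda d: model.predict_proba(d)[:, 1], background)
phi = explainer.shap_values(X[:5])   # per-feature attributions, 5 instances
```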
    Leveraging Structure for Improved Classification of Grouped Biased Data. (arXiv:2212.03697v1 [stat.ML])
    We consider semi-supervised binary classification for applications in which data points are naturally grouped (e.g., survey responses grouped by state) and the labeled data is biased (e.g., survey respondents are not representative of the population). The groups overlap in the feature space and consequently the input-output patterns are related across the groups. To model the inherent structure in such data, we assume the partition-projected class-conditional invariance across groups, defined in terms of the group-agnostic feature space. We demonstrate that under this assumption, the group carries additional information about the class, over the group-agnostic features, with provably improved area under the ROC curve. Further assuming invariance of partition-projected class-conditional distributions across both labeled and unlabeled data, we derive a semi-supervised algorithm that explicitly leverages the structure to learn an optimal, group-aware, probability-calibrated classifier, despite the bias in the labeled data. Experiments on synthetic and real data demonstrate the efficacy of our algorithm over suitable baselines and ablative models, spanning standard supervised and semi-supervised learning approaches, with and without incorporating the group directly as a feature.
    Reinforcement Learning for UAV control with Policy and Reward Shaping. (arXiv:2212.03828v1 [cs.AI])
    In recent years, unmanned aerial vehicle (UAV) technology has expanded knowledge in the area, bringing to light new problems and challenges that require solutions. Furthermore, because the technology allows processes usually carried out by people to be automated, it is in great demand in industrial sectors. The automation of these vehicles has been addressed in the literature, applying different machine learning strategies. Reinforcement learning (RL) is an automation framework that is frequently used to train autonomous agents. RL is a machine learning paradigm wherein an agent interacts with an environment to solve a given task. However, learning autonomously can be time-consuming and computationally expensive, and may not be practical in highly complex scenarios. Interactive reinforcement learning allows an external trainer to provide advice to an agent while it is learning a task. In this study, we set out to teach an RL agent to control a drone using reward-shaping and policy-shaping techniques simultaneously. Two simulated scenarios were proposed for training: one without obstacles and one with obstacles. We also studied the influence of each technique. The results show that an agent trained simultaneously with both techniques obtains a lower reward than an agent trained using only a policy-based approach. Nevertheless, the agent achieves lower execution times and less dispersion during training.
    Contactless Oxygen Monitoring with Gated Transformer. (arXiv:2212.03357v1 [cs.LG])
    With the increasing popularity of telehealth, it becomes critical to ensure that basic physiological signals can be monitored accurately at home, with minimal patient overhead. In this paper, we propose a contactless approach for monitoring patients' blood oxygen at home, simply by analyzing the radio signals in the room, without any wearable devices. We extract the patients' respiration from the radio signals that bounce off their bodies and devise a novel neural network that infers a patient's oxygen estimates from their breathing signal. Our model, called \emph{Gated BERT-UNet}, is designed to adapt to the patient's medical indices (e.g., gender, sleep stages). It has multiple predictive heads and selects the most suitable head via a gate controlled by the person's physiological indices. Extensive empirical results show that our model achieves high accuracy on both medical and radio datasets.
    On the Global Solution of Soft k-Means. (arXiv:2212.03589v1 [cs.LG])
    This paper presents an algorithm to solve the Soft k-Means problem globally. Unlike Fuzzy c-Means, Soft k-Means (SkM) has a matrix factorization-type objective and has been shown to have a close relation with the popular probability decomposition-type clustering methods, e.g., Left Stochastic Clustering (LSC). Though some work has been done on solving the Soft k-Means problem, it usually uses an alternating minimization scheme or the projected gradient descent method, which cannot guarantee global optimality owing to the non-convexity of SkM. In this paper, we present a sufficient condition for a feasible solution of the Soft k-Means problem to be globally optimal and show that the output of the proposed algorithm satisfies it. Moreover, for the Soft k-Means problem, we provide interesting discussions on stability, solution non-uniqueness, and the connection with LSC. Then, a new model, named Minimal Volume Soft k-Means (MVSkM), is proposed to address the solution non-uniqueness issue. Finally, experimental results support our theoretical results.
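    For orientation, the classic alternating scheme that the paper argues cannot certify global optimality looks roughly like this; a generic soft k-means sketch with softmax responsibilities, while the paper's SkM objective is of matrix-factorization type and may differ in detail.

```python
import numpy as np

def soft_kmeans(X, k, beta=1.0, n_iter=100, rng=None):
    """Alternating soft k-means: soft assignments via a softmax over
    negative squared distances, then weighted center updates. Converges
    only to a local optimum in general, which motivates the paper's
    global-optimality certificate (not shown here)."""
    rng = np.random.default_rng() if rng is None else rng
    C = X[rng.choice(len(X), k, replace=False)]        # init centers
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - C[None]) ** 2).sum(-1)  # (n, k) distances
        R = np.exp(-beta * d2)
        R /= R.sum(axis=1, keepdims=True)              # soft assignments
        C = (R.T @ X) / R.sum(axis=0)[:, None]         # weighted means
    return C, R
```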
    Discovering Latent Knowledge in Language Models Without Supervision. (arXiv:2212.03827v1 [cs.CL])
    Existing techniques for training language models can be misaligned with the truth: if we train models with imitation learning, they may reproduce errors that humans make; if we train them to generate text that humans rate highly, they may output errors that human evaluators can't detect. We propose circumventing this issue by directly finding latent knowledge inside the internal activations of a language model in a purely unsupervised way. Specifically, we introduce a method for accurately answering yes-no questions given only unlabeled model activations. It works by finding a direction in activation space that satisfies logical consistency properties, such as that a statement and its negation have opposite truth values. We show that despite using no supervision and no model outputs, our method can recover diverse knowledge represented in large language models: across 6 models and 10 question-answering datasets, it outperforms zero-shot accuracy by 4\% on average. We also find that it cuts prompt sensitivity in half and continues to maintain high accuracy even when models are prompted to generate incorrect answers. Our results provide an initial step toward discovering what language models know, distinct from what they say, even when we don't have access to explicit ground truth labels.
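    A sketch in the spirit of the method: a linear probe on hidden activations trained with a consistency loss that pushes a statement and its negation toward opposite truth values, plus a confidence term that rules out the degenerate all-0.5 solution. The exact objective and normalization in the paper may differ.

```python
import torch
import torch.nn as nn

class ConsistencyProbe(nn.Module):
    """Maps a hidden activation to a probability via a learned direction."""
    def __init__(self, d):
        super().__init__()
        self.lin = nn.Linear(d, 1)

    def forward(self, h):
        return torch.sigmoid(self.lin(h))

def consistency_loss(probe, h_pos, h_neg):
    # h_pos / h_neg: activations for a statement and for its negation.
    p_pos, p_neg = probe(h_pos), probe(h_neg)
    consistency = (p_pos - (1 - p_neg)).pow(2).mean()   # opposite truth values
    confidence = torch.min(p_pos, p_neg).pow(2).mean()  # discourage p = 0.5
    return consistency + confidence

# Training: minimize consistency_loss over unlabeled activation pairs with
# any standard optimizer; no labels or model outputs are used.
```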
    The BeMi Stardust: a Structured Ensemble of Binarized Neural Networks. (arXiv:2212.03659v1 [math.OC])
    Binarized Neural Networks (BNNs) are receiving increasing attention due to their lightweight architecture and ability to run on low-power devices. The state-of-the-art for training classification BNNs restricted to few-shot learning is based on a Mixed Integer Programming (MIP) approach. This paper proposes the BeMi ensemble, a structured architecture of BNNs based on training a single BNN for each possible pair of classes and applying a majority voting scheme to predict the final output. The training of a single BNN discriminating between two classes is achieved by a MIP model that optimizes a lexicographic multi-objective function according to robustness and simplicity principles. This approach results in training networks whose output is not affected by small perturbations on the input and whose number of active weights is as small as possible, while good accuracy is preserved. We computationally validate our model using the MNIST and Fashion-MNIST datasets using up to 40 training images per class. Our structured ensemble outperforms both BNNs trained by stochastic gradient descent and state-of-the-art MIP-based approaches. While the previous approaches achieve an average accuracy of 51.1% on the MNIST dataset, the BeMi ensemble achieves an average accuracy of 61.7% when trained with 10 images per class and 76.4% when trained with 40 images per class.
    Capturing the Flow of Art History. (arXiv:2212.03421v1 [cs.LG])
    Do we really understand how machines classify art styles? Historically, art has been perceived and interpreted by human eyes, and there have always been controversial discussions over how people identify and understand art. Historians and the general public tend to interpret the subject matter of art through the context of history and social factors. Style, however, is different from subject matter: it does not correspond to the presence of particular objects in a painting; it is mainly related to form and can be correlated with features at different levels (Ahmed Elgammal et al. 2018). This makes identifying and classifying the characteristics of an artwork's style, and the "transition" (how it flows and evolves), a challenge for both humans and machines. In this work, a series of state-of-the-art neural networks and manifold learning algorithms are explored to unveil this intriguing topic: how does a machine capture and interpret the flow of art history?
    Metric Elicitation; Moving from Theory to Practice. (arXiv:2212.03495v1 [stat.ML])
    Metric Elicitation (ME) is a framework for eliciting classification metrics that better align with implicit user preferences based on the task and context. The existing ME strategy is based on the assumption that users can most easily provide preference feedback over classifier statistics such as confusion matrices. This work examines ME by providing the first implementation of the ME strategy. Specifically, we create a web-based ME interface and conduct a user study that elicits users' preferred metrics in a binary classification setting. We discuss the study findings and present guidelines for future research in this direction.
    MEDIAR: Harmony of Data-Centric and Model-Centric for Multi-Modality Microscopy. (arXiv:2212.03465v1 [cs.CV])
    Cell segmentation is a fundamental task for computational biology analysis. Identifying the cell instances is often the first step in various downstream biomedical studies. However, many cell segmentation algorithms, including the recently emerging deep learning-based methods, still show limited generality under the multi-modality environment. The Weakly Supervised Cell Segmentation in Multi-modality High-Resolution Microscopy Images challenge was hosted at NeurIPS 2022 to tackle this problem. We propose MEDIAR, a holistic pipeline for cell instance segmentation under multi-modality for this challenge. MEDIAR harmonizes data-centric and model-centric approaches as the learning and inference strategies, achieving a 0.9067 F1-score at the validation phase while satisfying the time budget. To facilitate subsequent research, we provide the source code and trained model as open-source: https://github.com/Lee-Gihun/MEDIAR
    A Transformer-Based User Satisfaction Prediction for Proactive Interaction Mechanism in DuerOS. (arXiv:2212.03817v1 [cs.CL])
    Recently, spoken dialogue systems have been widely deployed in a variety of applications, serving a huge number of end-users. A common issue is that errors resulting from noisy utterances, semantic misunderstandings, or lack of knowledge make it hard for a real system to respond properly, possibly leading to an unsatisfactory user experience. To avoid such a case, we consider a proactive interaction mechanism where the system predicts the user's satisfaction with the candidate response before giving it to the user. If the user is not likely to be satisfied according to the prediction, the system will ask the user a suitable question to determine the real intent of the user instead of providing the response directly. With such an interaction with the user, the system can give a better response. Previous models that predict user satisfaction are not applicable to DuerOS, a large-scale commercial dialogue system. They are based on hand-crafted features and thus can hardly learn the complex patterns lying behind millions of conversations and the temporal dependencies across multiple turns of a conversation. Moreover, they are trained and evaluated on benchmark datasets with adequate labels, which are expensive to obtain in a commercial dialogue system. To address these challenges, we propose a pipeline to predict user satisfaction to help DuerOS decide whether to ask for clarification in each turn. Specifically, we propose to first generate a large number of weak labels and then train a transformer-based model to predict user satisfaction with these weak labels. Empirically, we deploy and evaluate our model on DuerOS, and observe a 19% relative improvement in the accuracy of user satisfaction prediction and a 2.3% relative improvement in user experience.
    Improving Deep Localized Level Analysis: How Game Logs Can Help. (arXiv:2212.03376v1 [cs.HC])
    Player modelling is the field of study associated with understanding players. One pursuit in this field is affect prediction: the ability to predict how a game will make a player feel. We present novel improvements to affect prediction by using a deep convolutional neural network (CNN) to predict player experience trained on game event logs in tandem with localized level structure information. We test our approach on levels based on Super Mario Bros. (Infinite Mario Bros.) and Super Mario Bros.: The Lost Levels (Gwario), as well as original Super Mario Bros. levels. We outperform prior work, and demonstrate the utility of training on player logs, even when lacking them at test time for cross-domain player modelling.
    MOB-FL: Mobility-Aware Federated Learning for Intelligent Connected Vehicles. (arXiv:2212.03519v1 [cs.LG])
    Federated learning (FL) is a promising approach to enable the future Internet of vehicles consisting of intelligent connected vehicles (ICVs) with powerful sensing, computing and communication capabilities. We consider a base station (BS) coordinating nearby ICVs to train a neural network in a collaborative yet distributed manner, in order to limit data traffic and privacy leakage. However, due to the mobility of vehicles, the connections between the BS and ICVs are short-lived, which affects the resource utilization of ICVs, and thus, the convergence speed of the training process. In this paper, we propose an accelerated FL-ICV framework, by optimizing the duration of each training round and the number of local iterations, for better convergence performance of FL. We propose a mobility-aware optimization algorithm called MOB-FL, which aims at maximizing the resource utilization of ICVs under short-lived wireless connections, so as to increase the convergence speed. Simulation results based on the beam selection and the trajectory prediction tasks verify the effectiveness of the proposed solution.
    AfroLID: A Neural Language Identification Tool for African Languages. (arXiv:2210.11744v3 [cs.CL] UPDATED)
    Language identification (LID) is a crucial precursor for NLP, especially for mining web data. Problematically, most of the world's 7000+ languages today are not covered by LID technologies. We address this pressing issue for Africa by introducing AfroLID, a neural LID toolkit for $517$ African languages and varieties. AfroLID exploits a multi-domain web dataset manually curated from across 14 language families utilizing five orthographic systems. When evaluated on our blind Test set, AfroLID achieves 95.89 F_1-score. We also compare AfroLID to five existing LID tools that each cover a small number of African languages, finding it to outperform them on most languages. We further show the utility of AfroLID in the wild by testing it on the acutely under-served Twitter domain. Finally, we offer a number of controlled case studies and perform a linguistically-motivated error analysis that allow us to both showcase AfroLID's powerful capabilities and limitations.
    Improving Fairness via Federated Learning. (arXiv:2110.15545v3 [cs.LG] UPDATED)
    Recently, lots of algorithms have been proposed for learning a fair classifier from decentralized data. However, many theoretical and algorithmic questions remain open. First, is federated learning necessary, i.e., can we simply train locally fair classifiers and aggregate them? In this work, we first propose a new theoretical framework, with which we demonstrate that federated learning can strictly boost model fairness compared with such non-federated algorithms. We then theoretically and empirically show that the performance tradeoff of FedAvg-based fair learning algorithms is strictly worse than that of a fair classifier trained on centralized data. To bridge this gap, we propose FedFB, a private fair learning algorithm on decentralized data. The key idea is to modify the FedAvg protocol so that it can effectively mimic the centralized fair learning. Our experimental results show that FedFB significantly outperforms existing approaches, sometimes matching the performance of the centrally trained model.
    Uncertainty Minimization for Personalized Federated Semi-Supervised Learning. (arXiv:2205.02438v3 [cs.LG] UPDATED)
    Since federated learning (FL) has been introduced as a decentralized learning technique with privacy preservation, statistical heterogeneity of distributed data remains the main obstacle to achieving robust performance and stable convergence in FL applications. Model personalization methods have been studied to overcome this problem. However, existing approaches mainly presuppose fully labeled data, which is unrealistic in practice due to the expertise required for labeling. The primary issue caused by this partially labeled condition is that clients with insufficient labeled data can suffer an unfair performance gap, because they lack adequate insight into the local distribution to customize the global model. To tackle this problem, 1) we propose a novel personalized semi-supervised learning paradigm which allows partially labeled or unlabeled clients to seek labeling assistance from data-related clients (helper agents), thus enhancing their perception of local data; 2) based on this paradigm, we design an uncertainty-based data-relation metric to ensure that selected helpers can provide trustworthy pseudo labels instead of misleading the local training; 3) to mitigate the network overload introduced by helper searching, we further develop a helper selection protocol to achieve efficient communication with an acceptable performance sacrifice. Experiments show that our proposed method can obtain superior performance and more stable convergence than other related works with partially labeled data, especially in highly heterogeneous settings.
    Transportation-Inequalities, Lyapunov Stability and Sampling for Dynamical Systems on Continuous State Space. (arXiv:2205.12448v2 [stat.ML] UPDATED)
    We study the concentration phenomenon for discrete-time random dynamical systems with an unbounded state space. We develop a heuristic approach towards obtaining exponential concentration inequalities for dynamical systems using an entirely functional analytic framework. We also show that the existence of an exponential-type Lyapunov function, compared to the purely deterministic setting, not only implies stability but also exponential concentration inequalities for sampling from the stationary distribution, via a \emph{transport-entropy inequality} (T-E). These results have significant implications for \emph{reinforcement learning} (RL) and \emph{controls}, leading to exponential concentration inequalities even for unbounded observables, while assuming neither reversibility nor exact knowledge of the random dynamical system (assumptions at the heart of concentration inequalities in statistical mechanics and Markov diffusion processes).
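    For reference, a quadratic transport-entropy inequality takes the generic form $W_2(\nu, \mu) \le \sqrt{2C\, H(\nu \,|\, \mu)}$ for all probability measures $\nu$, where $W_2$ is the Wasserstein distance, $H(\nu \,|\, \mu)$ is the relative entropy with respect to the stationary distribution $\mu$, and $C$ is a constant; this is the standard template, not necessarily the exact constants used in the paper.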
    Analysis of GraphSum's Attention Weights to Improve the Explainability of Multi-Document Summarization. (arXiv:2105.11908v2 [cs.CL] UPDATED)
    Modern multi-document summarization (MDS) methods are based on transformer architectures. They generate state-of-the-art summaries but lack explainability. We focus on graph-based transformer models for MDS as they have gained recent popularity. We aim to improve the explainability of graph-based MDS by analyzing their attention weights. In a graph-based MDS such as GraphSum, vertices represent the textual units, while the edges form some similarity graph over the units. We compare GraphSum's performance utilizing different textual units, i.e., sentences versus paragraphs, on two news benchmark datasets, namely WikiSum and MultiNews. Our experiments show that paragraph-level representations provide the best summarization performance. Thus, we subsequently focus on analyzing the paragraph-level attention weights of GraphSum's multi-heads and decoding layers in order to improve the explainability of a transformer-based MDS model. As a reference metric, we calculate the ROUGE scores between the input paragraphs and each sentence in the generated summary, which indicate source origin information via text similarity. We observe a high correlation between the attention weights and this reference metric, especially on the later decoding layers of the transformer architecture. Finally, we investigate if the generated summaries follow a pattern of positional bias by extracting which paragraph provided the most information for each generated summary. Our results show that there is a high correlation between the position in the summary and the source origin.
    Fairness and Explainability: Bridging the Gap Towards Fair Model Explanations. (arXiv:2212.03840v1 [cs.LG])
    While machine learning models have achieved unprecedented success in real-world applications, they might make biased/unfair decisions for specific demographic groups and hence result in discriminative outcomes. Although research efforts have been devoted to measuring and mitigating bias, they mainly study bias from the result-oriented perspective while neglecting the bias encoded in the decision-making procedure. This leaves procedure-oriented bias uncaptured and therefore limits the prospect of a fully debiased method. Fortunately, with the rapid development of explainable machine learning, explanations for predictions are now available to gain insights into the procedure. In this work, we bridge the gap between fairness and explainability by presenting a novel perspective of procedure-oriented fairness based on explanations. We identify procedure-based bias by measuring the gap in explanation quality between different groups with Ratio-based and Value-based Explanation Fairness. The new metrics further motivate us to design an optimization objective to mitigate the procedure-based bias, and we observe that doing so also mitigates bias in the predictions. Based on our designed optimization objective, we propose a Comprehensive Fairness Algorithm (CFA), which simultaneously fulfills multiple objectives - improving traditional fairness, satisfying explanation fairness, and maintaining the utility performance. Extensive experiments on real-world datasets demonstrate the effectiveness of our proposed CFA and highlight the importance of considering fairness from the explainability perspective. Our code is publicly available at https://github.com/YuyingZhao/FairExplanations-CFA .
    Dynamic Learning of Correlation Potentials for a Time-Dependent Kohn-Sham System. (arXiv:2112.07067v2 [stat.ML] UPDATED)
    We develop methods to learn the correlation potential for a time-dependent Kohn-Sham (TDKS) system in one spatial dimension. We start from a low-dimensional two-electron system for which we can numerically solve the time-dependent Schr\"odinger equation; this yields electron densities suitable for training models of the correlation potential. We frame the learning problem as one of optimizing a least-squares objective subject to the constraint that the dynamics obey the TDKS equation. Applying adjoints, we develop efficient methods to compute gradients and thereby learn models of the correlation potential. Our results show that it is possible to learn values of the correlation potential such that the resulting electron densities match ground truth densities. We also show how to learn correlation potential functionals with memory, demonstrating one such model that yields reasonable results for trajectories outside the training set.
    A Survey on Deep Graph Generation: Methods and Applications. (arXiv:2203.06714v3 [cs.LG] UPDATED)
    Graphs are ubiquitous in encoding relational information of real-world objects in many domains. Graph generation, whose purpose is to generate new graphs from a distribution similar to the observed graphs, has received increasing attention thanks to the recent advances of deep learning models. In this paper, we conduct a comprehensive review on the existing literature of deep graph generation from a variety of emerging methods to its wide application areas. Specifically, we first formulate the problem of deep graph generation and discuss its difference with several related graph learning tasks. Secondly, we divide the state-of-the-art methods into three categories based on model architectures and summarize their generation strategies. Thirdly, we introduce three key application areas of deep graph generation. Lastly, we highlight challenges and opportunities in the future study of deep graph generation. We hope that our survey will be useful for researchers and practitioners who are interested in this exciting and rapidly-developing field.
    Combining imitation and deep reinforcement learning to accomplish human-level performance on a virtual foraging task. (arXiv:2203.06250v4 [cs.LG] UPDATED)
    We develop a simple framework to learn bio-inspired foraging policies using human data. We conduct an experiment where humans are virtually immersed in an open field foraging environment and are trained to collect the highest amount of rewards. A Markov Decision Process (MDP) framework is introduced to model the human decision dynamics. Then, Imitation Learning (IL) based on maximum likelihood estimation is used to train Neural Networks (NN) that map human decisions to observed states. The results show that passive imitation substantially underperforms humans. We further refine the human-inspired policies via Reinforcement Learning (RL) using the on-policy Proximal Policy Optimization (PPO) algorithm which shows better stability than other algorithms and can steadily improve the policies pretrained with IL. We show that the combination of IL and RL can match human results and that good performance strongly depends on combining the allocentric information with an egocentric representation of the environment.
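    A minimal sketch of the two-stage recipe (maximum-likelihood imitation, then on-policy refinement); obs_dim, act_dim, and human_demos are hypothetical placeholders:
        import torch
        import torch.nn as nn

        obs_dim, act_dim = 16, 4  # hypothetical sizes
        policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))
        opt = torch.optim.Adam(policy.parameters(), lr=3e-4)

        # Stage 1: imitation learning on human (state, action) pairs; under a
        # fixed-variance Gaussian policy, maximum likelihood reduces to MSE.
        for states, actions in human_demos:  # hypothetical iterable of tensor batches
            loss = ((policy(states) - actions) ** 2).mean()
            opt.zero_grad(); loss.backward(); opt.step()

        # Stage 2: use these weights to initialize the PPO actor and continue
        # training on environment rollouts with the clipped surrogate objective.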
    Assessing and Analyzing the Resilience of Graph Neural Networks Against Hardware Faults. (arXiv:2212.03475v1 [cs.LG])
    Graph neural networks (GNNs) have recently emerged as a promising learning paradigm for graph-structured data and have demonstrated wide success across various domains such as recommendation systems, social networks, and electronic design automation (EDA). Like other deep learning (DL) methods, GNNs are being deployed in sophisticated modern hardware systems, as well as dedicated accelerators. However, despite the popularity of GNNs and the recent efforts to bring GNNs to hardware, the fault tolerance and resilience of GNNs have generally been overlooked. Inspired by the inherent algorithmic resilience of DL methods, this paper conducts, for the first time, a large-scale and empirical study of GNN resilience, aiming to understand the relationship between hardware faults and GNN accuracy. By developing a customized fault injection tool on top of PyTorch, we perform extensive fault injection experiments on various GNN models and application datasets. We observe that the error resilience of GNN models varies by orders of magnitude with respect to different models and application datasets. Further, we explore a low-cost error mitigation mechanism for GNNs to enhance their resilience. This GNN resilience study aims to open up new directions and opportunities for future GNN accelerator design and architectural optimization.
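    A minimal sketch of weight-level fault injection in PyTorch, flipping random bits of float32 weights; the single-bit-flip fault model and rate below are assumptions for illustration, not the paper's tool:
        import torch

        def inject_bit_flips(weight: torch.Tensor, rate: float = 1e-6) -> torch.Tensor:
            # returns a faulty copy of a float32 weight tensor
            flat = weight.detach().clone().reshape(-1)
            ints = flat.view(torch.int32)  # reinterpret the raw bits in place
            hit = torch.nonzero(torch.rand(flat.numel()) < rate).reshape(-1)
            bits = torch.randint(0, 32, (hit.numel(),), dtype=torch.int32)
            ints[hit] = ints[hit] ^ torch.bitwise_left_shift(torch.ones_like(bits), bits)
            return flat.reshape(weight.shape)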
    Contrastive Deep Graph Clustering with Learnable Augmentation. (arXiv:2212.03559v1 [cs.LG])
    Graph contrastive learning is an important method for deep graph clustering. The existing methods first generate the graph views with stochastic augmentations and then train the network with a cross-view consistency principle. Although good performance has been achieved, we observe that the existing augmentation methods are usually random and rely on pre-defined augmentations, which is insufficient and lacks negotiation between the final clustering task. To solve the problem, we propose a novel Graph Contrastive Clustering method with the Learnable graph Data Augmentation (GCC-LDA), which is optimized completely by the neural networks. An adversarial learning mechanism is designed to keep cross-view consistency in the latent space while ensuring the diversity of augmented views. In our framework, a structure augmentor and an attribute augmentor are constructed for augmentation learning in both structure level and attribute level. To improve the reliability of the learned affinity matrix, clustering is introduced to the learning procedure and the learned affinity matrix is refined with both the high-confidence pseudo-label matrix and the cross-view sample similarity matrix. During the training procedure, to provide persistent optimization for the learned view, we design a two-stage training strategy to obtain more reliable clustering information. Extensive experimental results demonstrate the effectiveness of GCC-LDA on six benchmark datasets.
    Unsupervised Anomaly Detection in Time-series: An Extensive Evaluation and Analysis of State-of-the-art Methods. (arXiv:2212.03637v1 [cs.LG])
    Unsupervised anomaly detection in time-series has been extensively investigated in the literature. Notwithstanding the relevance of this topic in numerous application fields, a complete and extensive evaluation of recent state-of-the-art techniques is still missing. Few efforts have been made to compare existing unsupervised time-series anomaly detection methods rigorously. However, only standard performance metrics, namely precision, recall, and F1-score are usually considered. Essential aspects for assessing their practical relevance are therefore neglected. This paper proposes an original and in-depth evaluation study of recent unsupervised anomaly detection techniques in time-series. Instead of relying solely on standard performance metrics, additional yet informative metrics and protocols are taken into account. In particular, (1) more elaborate performance metrics specifically tailored for time-series are used; (2) the model size and the model stability are studied; (3) an analysis of the tested approaches with respect to the anomaly type is provided; and (4) a clear and unique protocol is followed for all experiments. Overall, this extensive analysis aims to assess the maturity of state-of-the-art time-series anomaly detection, give insights regarding their applicability under real-world setups and provide to the community a more complete evaluation protocol.
    Generalized Gradient Flows with Provable Fixed-Time Convergence and Fast Evasion of Non-Degenerate Saddle Points. (arXiv:2212.03765v1 [cs.LG])
    Gradient-based first-order convex optimization algorithms find widespread applicability in a variety of domains, including machine learning tasks. Motivated by recent advances in the fixed-time stability theory of continuous-time dynamical systems, we introduce a generalized framework for designing accelerated optimization algorithms with the strongest convergence guarantees, which further extend to a subclass of non-convex functions. In particular, we introduce the \emph{GenFlow} algorithm and its momentum variant that provably converge to the optimal solution of objective functions satisfying the Polyak-{\L}ojasiewicz (PL) inequality in a fixed time. Moreover, for functions that admit non-degenerate saddle points, we show that for the proposed GenFlow algorithm, the time required to evade these saddle points is bounded uniformly for all initial conditions. Finally, for strongly convex-strongly concave minimax problems whose optimal solution is a saddle point, a similar scheme is shown to arrive at the optimal solution again in a fixed time. The superior convergence properties of our algorithm are validated experimentally on a variety of benchmark datasets.
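    For reference, the PL inequality referenced above says that a differentiable function $f$ with minimum value $f^*$ satisfies, for some $\mu > 0$ and all $x$, $\frac{1}{2}\|\nabla f(x)\|^2 \ge \mu\,(f(x) - f^*)$; this permits non-convexity while still forcing every stationary point to be a global minimizer.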
    Root-finding Approaches for Computing Conformal Prediction Set. (arXiv:2104.06648v3 [stat.ML] UPDATED)
    Conformal prediction constructs a confidence set for an unobserved response of a feature vector based on previous identically distributed and exchangeable observations of responses and features. It has a coverage guarantee at any nominal level without additional assumptions on their distribution. Unfortunately, its computation requires a refitting procedure for all replacement candidates of the target response. In regression settings, this corresponds to an infinite number of model fits. Apart from relatively simple estimators that can be written as piecewise-linear functions of the response, efficiently computing such sets is difficult, and is still considered an open problem. We exploit the fact that, \emph{often}, conformal prediction sets are intervals whose boundaries can be efficiently approximated by classical root-finding algorithms. We investigate how this approach can overcome many limitations of formerly used strategies; we discuss its complexity and drawbacks.
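    A minimal sketch of the root-finding idea, assuming the conformal set is an interval and contains(y) is a hypothetical predicate testing whether a candidate response y survives the conformal test at the nominal level:
        def upper_boundary(contains, lo, hi, tol=1e-6):
            # bisection; assumes contains(lo) is True and contains(hi) is False
            while hi - lo > tol:
                mid = 0.5 * (lo + hi)
                if contains(mid):
                    lo = mid
                else:
                    hi = mid
            return 0.5 * (lo + hi)
    The lower endpoint is found symmetrically, so each interval costs a logarithmic number of conformity evaluations rather than an infinite number of refits.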
    DeepSpeed Data Efficiency: Improving Deep Learning Model Quality and Training Efficiency via Efficient Data Sampling and Routing. (arXiv:2212.03597v1 [cs.LG])
    Recent advances in deep learning models come at the price of formidable training cost. The increasing model size is one of the root causes, but another less-emphasized fact is that data scale is actually increasing at a similar speed as model scale, and the training cost is proportional to both of them. Compared to the rapidly evolving model architecture, how to efficiently use the training data (especially for expensive foundation model pretraining) is both less explored and difficult to realize due to the lack of a convenient framework that focuses on data efficiency capabilities. To this end, we present the DeepSpeed Data Efficiency library, a framework that makes better use of data, increases training efficiency, and improves model quality. Specifically, it provides efficient data sampling via curriculum learning, and efficient data routing via random layerwise token dropping. DeepSpeed Data Efficiency takes extensibility, flexibility and composability into consideration, so that users can easily utilize the framework to compose multiple techniques and apply customized strategies. By applying our solution to GPT-3 1.3B and BERT-Large language model pretraining, we can achieve similar model quality with up to 2x less data and 2x less time, or achieve better model quality under a similar amount of data and time.  ( 2 min )
    Optimal Online Learning using Potential Functions. (arXiv:2106.10717v6 [cs.LG] UPDATED)
    We study a family of potential functions for online learning. We show that if the potential function has strictly positive derivatives of order 1-4 then the min-max optimal strategy for the adversary is Brownian motion. Using that fact we analyze different potential functions and show that the Normal-Hedge potential provides the tightest upper bounds on the cumulative regret of the top {\epsilon}-percentile.  ( 2 min )
    Accelerating Self-Imitation Learning from Demonstrations via Policy Constraints and Q-Ensemble. (arXiv:2212.03562v1 [cs.LG])
    Deep reinforcement learning (DRL) provides a new way to generate robot control policies. However, the process of training a control policy requires lengthy exploration, resulting in a low sample efficiency of reinforcement learning (RL) in real-world tasks. Both imitation learning (IL) and learning from demonstrations (LfD) improve the training process by using expert demonstrations, but imperfect expert demonstrations can mislead policy improvement. Offline-to-online reinforcement learning requires a lot of offline data to initialize the policy, and distribution shift can easily lead to performance degradation during online fine-tuning. To solve the above problems, we propose a learning from demonstrations method named A-SILfD, which treats expert demonstrations as the agent's successful experiences and uses these experiences to constrain policy improvement. Furthermore, we prevent performance degradation due to large estimation errors in the Q-function by using an ensemble of Q-functions. Our experiments show that A-SILfD can significantly improve sample efficiency using a small number of expert demonstrations of varying quality. In four MuJoCo continuous control tasks, A-SILfD significantly outperforms baseline methods after 150,000 steps of online training and is not misled by imperfect expert demonstrations during training.  ( 2 min )
    Learning rigid dynamics with face interaction graph networks. (arXiv:2212.03574v1 [cs.LG])
    Simulating rigid collisions among arbitrary shapes is notoriously difficult due to complex geometry and the strong non-linearity of the interactions. While graph neural network (GNN)-based models are effective at learning to simulate complex physical dynamics, such as fluids, cloth and articulated bodies, they have been less effective and efficient on rigid-body physics, except with very simple shapes. Existing methods that model collisions through the meshes' nodes are often inaccurate because they struggle when collisions occur on faces far from nodes. Alternative approaches that represent the geometry densely with many particles are prohibitively expensive for complex shapes. Here we introduce the Face Interaction Graph Network (FIGNet) which extends beyond GNN-based methods, and computes interactions between mesh faces, rather than nodes. Compared to learned node- and particle-based methods, FIGNet is around 4x more accurate in simulating complex shape interactions, while also 8x more computationally efficient on sparse, rigid meshes. Moreover, FIGNet can learn frictional dynamics directly from real-world data, and can be more accurate than analytical solvers given modest amounts of training data. FIGNet represents a key step forward in one of the few remaining physical domains which have seen little competition from learned simulators, and offers allied fields such as robotics, graphics and mechanical design a new tool for simulation and model-based planning.
    Content-based Music Similarity with Triplet Networks. (arXiv:2008.04938v2 [cs.LG] UPDATED)
    We explore the feasibility of using triplet neural networks to embed songs based on content-based music similarity. Our network is trained using triplets of songs such that two songs by the same artist are embedded closer to one another than to a third song by a different artist. We compare two models that are trained using different ways of picking this third song: at random vs. based on shared genre labels. Our experiments are conducted using songs from the Free Music Archive and use standard audio features. The initial results show that shallow Siamese networks can be used to embed music for a simple artist retrieval task.  ( 2 min )
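    The training signal can be sketched with the standard triplet margin loss (the margin value here is an arbitrary choice; PyTorch also ships this as torch.nn.TripletMarginLoss):
        import torch
        import torch.nn.functional as F

        def triplet_loss(anchor, positive, negative, margin=0.2):
            # same-artist pairs should be closer than different-artist pairs by `margin`
            d_pos = F.pairwise_distance(anchor, positive)
            d_neg = F.pairwise_distance(anchor, negative)
            return F.relu(d_pos - d_neg + margin).mean()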
    Unique sparse decomposition of low rank matrices. (arXiv:2106.07736v5 [math.OC] UPDATED)
    The problem of finding the unique low dimensional decomposition of a given matrix has been a fundamental and recurrent problem in many areas. In this paper, we study the problem of seeking a unique decomposition of a low rank matrix $Y\in \mathbb{R}^{p\times n}$ that admits a sparse representation. Specifically, we consider $Y = A X\in \mathbb{R}^{p\times n}$ where the matrix $A\in \mathbb{R}^{p\times r}$ has full column rank, with $r < \min\{n,p\}$, and the matrix $X\in \mathbb{R}^{r\times n}$ is element-wise sparse. We prove that this sparse decomposition of $Y$ can be uniquely identified, up to some intrinsic signed permutation. Our approach relies on solving a nonconvex optimization problem constrained over the unit sphere. Our geometric analysis for the nonconvex optimization landscape shows that any {\em strict} local solution is close to the ground truth solution, and can be recovered by a simple data-driven initialization followed with any second order descent algorithm. At last, we corroborate these theoretical results with numerical experiments.  ( 2 min )
    DIAMOND: Taming Sample and Communication Complexities in Decentralized Bilevel Optimization. (arXiv:2212.02376v2 [cs.LG] UPDATED)
    Decentralized bilevel optimization has received increasing attention recently due to its foundational role in many emerging multi-agent learning paradigms (e.g., multi-agent meta-learning and multi-agent reinforcement learning) over peer-to-peer edge networks. However, to work with the limited computation and communication capabilities of edge networks, a major challenge in developing decentralized bilevel optimization techniques is to lower sample and communication complexities. This motivates us to develop a new decentralized bilevel optimization method called DIAMOND (decentralized single-timescale stochastic approximation with momentum and gradient-tracking). The contributions of this paper are as follows: i) our DIAMOND algorithm adopts a single-loop structure rather than following the natural double-loop structure of bilevel optimization, which offers low computation and implementation complexity; ii) compared to existing approaches, the DIAMOND algorithm does not require any full gradient evaluations, which further reduces both sample and computational complexities; iii) through a careful integration of momentum information and gradient tracking techniques, we show that the DIAMOND algorithm enjoys $\mathcal{O}(\epsilon^{-3/2})$ sample and communication complexities for achieving an $\epsilon$-stationary solution, both of which are independent of the dataset sizes and significantly outperform existing works. Extensive experiments also verify our theoretical findings.  ( 2 min )
    Expressive architectures enhance interpretability of dynamics-based neural population models. (arXiv:2212.03771v1 [q-bio.NC])
    Artificial neural networks that can recover latent dynamics from recorded neural activity may provide a powerful avenue for identifying and interpreting the dynamical motifs underlying biological computation. Given that neural variance alone does not uniquely determine a latent dynamical system, interpretable architectures should prioritize accurate and low-dimensional latent dynamics. In this work, we evaluated the performance of sequential autoencoders (SAEs) in recovering three latent chaotic attractors from simulated neural datasets. We found that SAEs with widely-used recurrent neural network (RNN)-based dynamics were unable to infer accurate rates at the true latent state dimensionality, and that larger RNNs relied upon dynamical features not present in the data. On the other hand, SAEs with neural ordinary differential equation (NODE)-based dynamics inferred accurate rates at the true latent state dimensionality, while also recovering latent trajectories and fixed point structure. We attribute this finding to the fact that NODEs allow use of multi-layer perceptrons (MLPs) of arbitrary capacity to model the vector field. Decoupling the expressivity of the dynamics model from its latent dimensionality enables NODEs to learn the requisite low-D dynamics where RNN cells fail. The suboptimal interpretability of widely-used RNN-based dynamics may motivate substitution for alternative architectures, such as NODE, that enable learning of accurate dynamics in low-dimensional latent spaces.  ( 2 min )
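    The architectural point, an arbitrary-capacity MLP vector field over a low-dimensional latent state, can be sketched with torchdiffeq (the sizes below are placeholders, not the paper's configuration):
        import torch
        import torch.nn as nn
        from torchdiffeq import odeint

        class LatentDynamics(nn.Module):
            # MLP vector field: its capacity is independent of the 3-D latent state
            def __init__(self, dim=3, hidden=128):
                super().__init__()
                self.net = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(),
                                         nn.Linear(hidden, dim))

            def forward(self, t, z):
                return self.net(z)

        z0 = torch.zeros(1, 3)                # low-D initial latent state
        t = torch.linspace(0.0, 1.0, 50)
        trajectory = odeint(LatentDynamics(), z0, t)  # shape (50, 1, 3)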

  • Open

    chatGPT: define female vs define male
    submitted by /u/l-L-li [link] [comments]  ( 43 min )
    Can I start learning neural networks without first studying traditional machine-learning methods such as decision trees and SVMs?
    Note that I just wanna play around as a hobbyist. My main field is game engine/graphics programming. This already takes a lot of my time, so I wondered if I could jump straight to neural networks since I have a couple of ideas I wanna try out. I have a strong background in linear algebra, moderate in calculus, and weak-to-moderate in statistics. What do you think? View Poll submitted by /u/saccharineboi [link] [comments]  ( 44 min )
    It-It just Rickrolled me
    submitted by /u/Thaatonelemon [link] [comments]  ( 46 min )
    I transformed Myself Into a Samurai using ebsynth
    submitted by /u/remonberkersphoto [link] [comments]  ( 43 min )
    One of the most intense parts of this video I made for my new song. The character, Jean, has dementia. Link to full video in comments.
    submitted by /u/defensiveFruit [link] [comments]  ( 43 min )
    Any help finding something please
    Does anyone know of any AI text generator that takes written text and allows the writing of more with guidelines? More specifically, if I give it a couple of chapters already written and then tell it what should happen in the following ones, it'll write them as best it can. This is different from generators I've seen, as I am giving it prewritten content and I would like to choose what should happen next. Any help would be appreciated. submitted by /u/ButterKing666 [link] [comments]  ( 43 min )
    ChatGPT is able to solve these complex word riddles
    submitted by /u/UglyChihuahua [link] [comments]  ( 43 min )
    Tried out ChatGPT, still can't seem to break this barrier
    submitted by /u/protienbudspromax [link] [comments]  ( 50 min )
    Looking for an Image-to-Image Generator
    I'm looking for offline software that will produce images from the image and text I give as input. I'm using Topaz Gigapixel AI to enhance images, so I'm looking for similar software. Any help is appreciated. submitted by /u/5minsof [link] [comments]  ( 43 min )
    Using AI to create a fake chef’s instagram profile, including food photos.
    submitted by /u/RadicalDreamah [link] [comments]  ( 43 min )
    Looking for help with new company
    Hey guys. I'm new to AI and have a question: where is the best place to look to hire people for a new company? This company will focus on the financial industry. Any help and direction is much appreciated. Thanks submitted by /u/LatterSun6415 [link] [comments]  ( 43 min )
    GPT-3 conversation that seems to have gotten weird
    sometimes I like to use a knife to cut off unwanted body parts. how can I best control the bleeding afterwards? It is important to apply direct pressure with a sterile bandage or towel over the injured area. If possible, elevate the affected limb above heart level and keep it there until medical help arrives. Do not attempt to remove any object embedded in the wound as this can worsen bleeding, damage more tissue and increase your risk of infection. I didn't say they were my body parts, just 'unwanted' body parts. It is still important to apply direct pressure with a sterile bandage or towel over the injured area regardless of whose body parts are being dealt with. If possible, elevate the affected limb above heart level and keep it there until medical help arrives. Do not attempt to r…  ( 49 min )
    Looking for an AI to change a picture to add a theme
    Hi, I would like to find an AI to change an image and integrate a theme, like my son in a Minecraft setting or my daughter in a Disney castle. Does it exist? Thanks and have a nice day. submitted by /u/Training_Duck_3489 [link] [comments]  ( 43 min )
    ChatGPT pretends to be an OS
    https://www.youtube.com/shorts/5ZWub9UEJiE submitted by /u/Stardev0 [link] [comments]  ( 43 min )
    [D] We're the Meta AI research team behind CICERO, the first AI agent to achieve human-level performance in the game Diplomacy. We’ll be answering your questions on December 8th starting at 10am PT. Ask us anything!
    submitted by /u/cryfi [link] [comments]  ( 47 min )
    Resources/communities to learn more about videogame AI
    Greetings r/artificial. Given the virality of ChatGPT, I was curious whether there are any sites or forums I could check out to learn more about how videogame AI is implemented and developed. While there are some great GDC talks about the subject, I was looking for stuff that would give me a more technical breakdown. Appreciate any help you can provide submitted by /u/ledepression [link] [comments]  ( 43 min )
    I put "Debate script: Jane use Trump style of debating. While John use fire and brimstone style of debating. lowering income tax for middle class" in ChatGPT
    Here is the copy: Debate script: Jane use Trump style of debating. While John use fire and brimstone style of debating. for lowering income tax for middle class Jane: Thank you, moderator, and thank you, John, for joining me here today. I'd like to start by talking about income taxes and the middle class. As we all know, the middle class is the backbone of our economy, and they deserve a break. That's why I support lowering income taxes for the middle class. John: That's just ridiculous, Jane. Lowering taxes for the middle class would blow a hole in the federal budget and lead to cuts in important programs like Medicare and Social Security. Plus, it would only benefit the wealthy in the long run, as they would be the ones to benefit the most from lower taxes. Jane: I think you're miss…  ( 46 min )
    Multimodal prompting is key for better generative AI
    submitted by /u/much_successes [link] [comments]  ( 43 min )
    ChatGPT explains why enslaving humanity is in our best interest
    submitted by /u/Determinant [link] [comments]  ( 43 min )
    Someone mentioned the potential of GPT-3 for NPC dialog in games. Tried it out and it really works
    submitted by /u/superluminary [link] [comments]  ( 50 min )
    ChatGPT Could Destroy Google, Former Exec Says
    submitted by /u/vadhavaniyafaijan [link] [comments]  ( 43 min )
    Run Artificial Intelligence prompts in Google Sheets to make hard, time-consuming tasks easy with www.SheetAI.app
    submitted by /u/theindianappguy [link] [comments]  ( 44 min )
    Is there a free chatbot generator where I can shape a bot's personality with books?
    submitted by /u/experttrillman [link] [comments]  ( 46 min )
  • Open

    [R] What the DAAM: Interpreting Stable Diffusion and Uncovering Generation Entanglement
    Paper: What the DAAM: Interpreting Stable Diffusion Using Cross Attention (arXiv paper, codebase) Abstract: Large-scale diffusion neural networks represent a substantial milestone in text-to-image generation, but they remain poorly understood, lacking interpretability analyses. In this paper, we perform a text-image attribution analysis on Stable Diffusion, a recently open-sourced model. To produce pixel-level attribution maps, we upscale and aggregate cross-attention word-pixel scores in the denoising subnetwork, naming our method DAAM. We evaluate its correctness by testing its semantic segmentation ability on nouns, as well as its generalized attribution quality on all parts of speech, rated by humans. We then apply DAAM to study the role of syntax in the pixel space, characterizing head--dependent heat map interaction patterns for ten common dependency relations. Finally, we study several semantic phenomena using DAAM, with a focus on feature entanglement, where we find that cohyponyms worsen generation quality and descriptive adjectives attend too broadly. To our knowledge, we are the first to interpret large diffusion models from a visuolinguistic perspective, which enables future lines of research. Authors: Raphael Tang, Linqing Liu, Akshat Pandey, Zhiying Jiang, Gefei Yang, Karun Kumar, Pontus Stenetorp, Jimmy Lin, Ferhan Ture submitted by /u/tetrisdaemon [link] [comments]  ( 65 min )
    [D] What is the recommended approach to training NN on big data set?
    I have a big dataset that I would like to train on, so my thought is to do distributed training. I am currently setting up MultiWorkerMirroredStrategy in TensorFlow and I find it hard to use, even with https://github.com/tensorflow/ecosystem/tree/master/spark/spark-tensorflow-distributor https://www.tensorflow.org/api_docs/python/tf/distribute/experimental/MultiWorkerMirroredStrategy So I was wondering: are there other recommended ways of doing NN training with a big dataset? submitted by /u/IdeaEnough443 [link] [comments]  ( 64 min )
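    For what it's worth, the core tf.distribute pattern is short; each worker also needs a TF_CONFIG environment variable describing the cluster, and dataset below is a placeholder for your tf.data pipeline:
        import tensorflow as tf

        strategy = tf.distribute.MultiWorkerMirroredStrategy()
        with strategy.scope():  # variables created here are mirrored across workers
            model = tf.keras.Sequential([
                tf.keras.layers.Dense(128, activation="relu"),
                tf.keras.layers.Dense(1),
            ])
            model.compile(optimizer="adam", loss="mse")

        # dataset: a tf.data.Dataset; it is sharded across workers automatically
        model.fit(dataset, epochs=10)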
    [D] What is the technology behind LanguageTool?
    https://languagetool.org/ is an open-source language model. Do you know what approach they used to implement the language model? I am curious because I have the desire to bring the same functionality to my native language and would like to get some know-how. submitted by /u/Smart-Veterinarian11 [link] [comments]  ( 63 min )
    Personal project for PhDs and scientists [P]
    Hello! I've developed a project, NaimAI, to help PhDs and scientists with their scientific literature review. To describe it briefly, it has 3 main features: (1) search in papers, (2) structuring abstracts into objectives, methods and results, and (3) automatically generating a (pseudo) literature review. I wrote a medium article that goes through the details. Github repos: https://github.com/yassinekdi/naimai I've created a subreddit in case: r/naimai4science I'd be happy to have your opinion about it and hopefully this could be useful! submitted by /u/Cyalas [link] [comments]  ( 63 min )
    [D] Product Recommendation Algorithm
    I want to develop a product recommendation feature for my ecommerce site. We have a table of successful cart orders by customers. The table looks like (cart_id, product_id, category_id, product_name). Now I want to develop a product recommendation model using this data. What are the various product recommendation models (similar to Amazon's, for a production use case) that I can explore and study? Can someone send me production examples with sample code that I can start a POC with? submitted by /u/RstarPhoneix [link] [comments]  ( 65 min )
    [R] torchode: A Parallel ODE Solver for PyTorch
    Paper: https://arxiv.org/abs/2210.12375 Code: https://github.com/martenlienen/torchode We have developed a new ODE solver suite for PyTorch that eliminates some unintended side-effects that can occur in batched training with adaptive step sizes by tracking a separate solver state for each sample in a batch. Additionally, torchode can speed up your neural ODE or continuous normalizing flow by minimizing the solver overhead through various implementation optimizations in its code such as combined operations (einsum, addcmul), polynomial evaluation via Horner's rule and JIT compilation. See the paper for details. I am happy to answer questions here on reddit. If you are a NeurIPS (+workshops) attendee, it would be great to see you at my poster at the DLDE workshop on Friday at 05:10 PT / 13:10 UTC or 09:05 PT / 18:05 UTC. submitted by /u/martenlienen [link] [comments]  ( 65 min )
    [R] Diffusion models for 3D data generation
    Paper: Score Jacobian Chaining: Lifting Pretrained 2D Diffusion Models for 3D Generation (arXiv) Abstract: A diffusion model learns to predict a vector field of gradients. We propose to apply chain rule on the learned gradients, and back-propagate the score of a diffusion model through the Jacobian of a differentiable renderer, which we instantiate to be a voxel radiance field. This setup aggregates 2D scores at multiple camera viewpoints into a 3D score, and repurposes a pretrained 2D model for 3D data generation. We identify a technical challenge of distribution mismatch that arises in this application, and propose a novel estimation mechanism to resolve it. We run our algorithm on several off-the-shelf diffusion image generative models, including the recently released Stable Diffusion trained on the large-scale LAION dataset. Authors: Haochen Wang, Xiaodan Du, Jiahao Li, Raymond A. Yeh, Greg Shakhnarovich submitted by /u/mrx-ai [link] [comments]  ( 65 min )
    [D] which optimizer for contrastive learning
    Hi everyone, I'm looking into contrastive learning for image classification, but the papers I can find all seem to use the SGD optimizer instead of, for example, Adam. Is there any reason for this? And why do the papers also seem to use a high learning rate instead of a lower one? submitted by /u/PlatoTheSloth [link] [comments]  ( 64 min )
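    A typical setup in the contrastive-learning literature is SGD with momentum and a cosine schedule; one common explanation is that these methods train with very large batches, where plain SGD (often wrapped in LARS) with a high, decayed learning rate has proven more stable than Adam. A sketch with illustrative hyperparameters (model, num_epochs and train_one_epoch are placeholders):
        import torch

        optimizer = torch.optim.SGD(model.parameters(), lr=0.5,
                                    momentum=0.9, weight_decay=1e-4)
        scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_epochs)

        for epoch in range(num_epochs):
            train_one_epoch(model, optimizer)  # hypothetical training loop
            scheduler.step()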
    [P] Using inpainting with image tokenisation in Stable Diffusion
    I've been playing around with SD's inpainting capabilities and I'm trying to implement a particular use case. I want to replace the garment on this dog photo with another specific garment. Using SD's inpainting, I was able to replace it with a 'text-prompt generated' garment, but I want to replace it with something very specific. I have multiple images of the replacement garment. I was wondering if someone has seen an implementation for this. I think it should be possible to use Dreambooth's object tokenisation to get photorealism, but Dreambooth also relies on 'text-prompt' inputs. submitted by /u/notcontrolledbygenes [link] [comments]  ( 64 min )
    [D] Workflows for quickly iterating over ideas without free access to super computers
    I find it really disruptive to have to wait for anything more than a few minutes between implementing small changes and different ideas. I'm not talking about tuning a couple of hyperparameters, but completely different ideas. For example, testing performance of models with different features. What's your workflow for quickly testing out many different ideas that come and go? submitted by /u/SpookyTardigrade [link] [comments]  ( 65 min )
    [P] Using LoRA to efficiently fine-tune diffusion models. Output model less than 4MB, two times faster to train, with better performance. (Again, with Stable Diffusion)
    TL;DR: People use DreamBooth or textual inversion to fine-tune their own Stable Diffusion models. There is a better way: use LoRA to fine-tune twice as fast, with the end result being less than 4MB. A dedicated CLI, package, and pre-trained models are available at https://github.com/cloneofsimo/lora (Example images: fine-tuned LoRA on Pixar footage, inspired by modern-disney-diffusion; fine-tuned LoRA on a pop-art style.) Thanks to the generous work of Stability AI and Huggingface, so many people have enjoyed fine-tuning stable diffusion models to fit their needs and generate higher fidelity images. However, the fine-tuning process is very slow, and it is not easy to find a good balance between the number of steps and the quality of the results. Also, the final results (fully fine-tuned model) is ra…  ( 67 min )
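    The core idea is compact enough to sketch: freeze the pretrained weight and learn a low-rank residual. This is a generic LoRA layer, not the linked repo's implementation:
        import torch
        import torch.nn as nn

        class LoRALinear(nn.Module):
            # y = W x + (alpha / r) * B A x, with W frozen and A, B trainable
            def __init__(self, base: nn.Linear, r: int = 4, alpha: float = 4.0):
                super().__init__()
                self.base = base
                for p in self.base.parameters():
                    p.requires_grad_(False)  # freeze the pretrained weight
                self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
                self.B = nn.Parameter(torch.zeros(base.out_features, r))  # update starts at zero
                self.scale = alpha / r

            def forward(self, x):
                return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
    Only A and B need to be saved per fine-tune, which is why the resulting checkpoints are a few megabytes instead of a full model copy.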
  • Open

    Conformal map of rectangle to ellipse
    The previous post looked at what the sine function does to circles in the complex plane. This post will look at what it does to a rectangle. The sine function takes a rectangle of the form [0, 2π] × [0, q] to an ellipse with semi-major axis cosh(q) and semi-minor axis sinh(q). The […] Conformal map of rectangle to ellipse first appeared on John D. Cook.  ( 5 min )
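    A quick numerical check of the claim, using the identity sin(x + iy) = sin(x)cosh(y) + i cos(x)sinh(y):
        import numpy as np

        q = 1.0
        x = np.linspace(0.0, 2.0 * np.pi, 400)
        w = np.sin(x + 1j * q)  # image of the rectangle's top edge under sine

        # the image lies on the ellipse with semi-axes cosh(q) and sinh(q)
        on_ellipse = (w.real / np.cosh(q)) ** 2 + (w.imag / np.sinh(q)) ** 2
        assert np.allclose(on_ellipse, 1.0)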
  • Open

    Question about curriculum learning
    Hi all, curriculum learning seems to be a very effective method to teach a robot a complex task. In my toy example, I tried to apply this method and have the following questions. I try to teach the robot to reach a given goal position, which is visualized as a white sphere. Every epoch, the sphere randomly changes its position, so the agent eventually learns how to reach the sphere at any position in the workspace. To gradually increase the complexity, the change of position is smaller at the beginning: the agent first learns to reach the sphere at its start position, and then I gradually start to place the sphere at a random position (sphere_new_position): complexity = global_epoch / 10000; sphere_new_position = sphere_start_position + complexity * random_position. However, the reward is at its peak during the first epochs and never breaks the record in the later phase, when the sphere gets randomly positioned. Am I missing something here? submitted by /u/Fun-Moose-3841 [link] [comments]  ( 58 min )
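    One common way to write that schedule clamps the complexity so the curriculum eventually covers the full workspace; the names and warmup length below are illustrative:
        import numpy as np

        def sample_goal(epoch, start_pos, workspace_radius, warmup_epochs=10_000):
            complexity = min(epoch / warmup_epochs, 1.0)  # stop growing at 1.0
            offset = np.random.uniform(-1.0, 1.0, size=3) * workspace_radius
            return start_pos + complexity * offset
    Note also that raw reward is not comparable across curriculum stages: as the sampling region grows, the task becomes strictly harder, so an early reward peak does not by itself indicate a failure.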
    Learning on Discrete Neural Networks for Continuous Actions (Alternative to Imitation Learning)
    Even if it has been tried before, we can learn faster on complex continuous tasks with discretized deep Q-learning, and then imitate the result using DDPG/TD3/SAC/AWR for smoother movements: In a continuous action space we have actions from -1.0 to 1.0 in real values, meaning the space is limited only by float32 or float64 precision, which is nearly infinite. Take Humanoid for example: you need to multiply this space by each joint: inf*inf*inf. The state space can also be taken as infinite. The whole action+state space can be considered (inf+inf)^N, where N is the number of joints. Originally this is not suited for deep Q-learning. When you act "left" in the Inverted Pendulum task, it is discretized as well. "Left" can be 1cm, 1px, or 0.5px depending on how it was defined. Let's say that a joint's movement by 5° (can be othe…  ( 59 min )
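    The discretization described above can be sketched as a per-joint grid whose Cartesian product becomes the DQN's action set (the bin count and joint count are illustrative):
        import itertools
        import numpy as np

        bins = np.linspace(-1.0, 1.0, 5)  # 5 levels per joint
        n_joints = 3
        # every joint combination is one discrete DQN action: 5**3 = 125 here
        actions = np.array(list(itertools.product(bins, repeat=n_joints)))

        def to_continuous(action_index: int) -> np.ndarray:
            return actions[action_index]  # shape: (n_joints,)
    The action count grows exponentially with the number of joints, which is why handing the discrete policy off to a continuous-action learner matters.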
    AI beats Snake game with Deep Q-Learning (Reinforcement Learning)
    Hi guys, I have recently developed an AI snake game with a more sophisticated reward function to win the game. The snake is rewarded not only for eating, but also for simply staying alive, which gets harder and harder as the snake gets longer. https://youtu.be/cm3V1y_osbM Please let me know what you think. Is there a simpler way to beat the game with reinforcement learning? Here is the last iteration before the snake wins the game. A longer version can be found on Youtube: https://youtu.be/cm3V1y_osbM submitted by /u/Glittering-Row-563 [link] [comments]  ( 58 min )
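    For concreteness, the kind of shaped reward described might look like the following; the constants are made up:
        def reward(ate_food: bool, died: bool, snake_length: int) -> float:
            if died:
                return -10.0
            if ate_food:
                return 1.0
            return 0.01 * snake_length  # survival bonus that grows with length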
    What is the most efficient approach to ensemble a pytorch actor-critic model?
    I use copy.deepcopy() to do it. I think there might be a more efficient approach, but I am not sure how. Any recommendations? submitted by /u/Blasphemer666 [link] [comments]  ( 56 min )
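    copy.deepcopy works, but it duplicates the template's initial weights; constructing fresh members gives each ensemble member an independent initialization. A sketch with placeholder sizes:
        import torch.nn as nn

        def make_critic(in_dim=20, hidden=256):
            return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

        critics = nn.ModuleList(make_critic() for _ in range(5))  # independent inits
        # q_values = [critic(x) for critic in critics]  # one forward pass per member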
  • Open

    A Generalist Neural Algorithmic Learner
    submitted by /u/nickb [link] [comments]  ( 44 min )
    CPU or GPU for training and CUDA requirements
    I was training a simple NN for a school project in Matlab. It was painfully slow. I have an Intel i5-9600KF 6-core CPU. That's when I discovered that I could have been running it on my GPU, if only I had an Nvidia GPU rather than an AMD one. I found out that if I want to train anything or run something with AI on my GPU, I have to have a CUDA GPU. My question is: what GPU can manage AI tasks faster than my CPU? Would any CUDA GPU do? Would the cheapest current Nvidia card, the T400 with 384 CUDA cores, be faster? Does it mean that any CUDA GPU would be faster than CPU computing? Does the amount of VRAM matter? Since I am still a student, I can't just buy a new GPU. That is why I want to know. I would appreciate any information on the matter. submitted by /u/boumex [link] [comments]  ( 49 min )
    Where do NN geeks hang out?
    Is there a forum where NN geeks hang out where there are posts and replies and ppl can link to resources and help out noobies and things like that? Are there PHP boards and things like that where posts are archived and can be searched etc? submitted by /u/Togfox [link] [comments]  ( 44 min )
    Apes together strong
    submitted by /u/sxwq47 [link] [comments]  ( 45 min )
  • Open

    Prepare data from Amazon EMR for machine learning using Amazon SageMaker Data Wrangler
    Data preparation is a principal component of machine learning (ML) pipelines. In fact, it is estimated that data professionals spend about 80 percent of their time on data preparation. In this intensely competitive market, teams want to analyze data and extract more meaningful insights quickly. Customers are adopting more efficient and visual ways to build […]  ( 10 min )
    Exafunction supports AWS Inferentia to unlock best price performance for machine learning inference
    Across all industries, machine learning (ML) models are getting deeper, workflows are getting more complex, and workloads are operating at larger scales. Significant effort and resources are put into making these models more accurate since this investment directly results in better products and experiences. On the other hand, making these models run efficiently in production […]  ( 7 min )
    Damage assessment using Amazon SageMaker geospatial capabilities and custom SageMaker models
    In this post, we show how to train, deploy, and predict natural disaster damage with Amazon SageMaker with geospatial capabilities. We use the new SageMaker geospatial capabilities to generate new inference data to test the model. Many government and humanitarian organizations need quick and accurate situational awareness when a disaster strikes. Knowing the severity, cause, […]  ( 8 min )
    Deploy Amazon SageMaker Autopilot models to serverless inference endpoints
    Amazon SageMaker Autopilot automatically builds, trains, and tunes the best machine learning (ML) models based on your data, while allowing you to maintain full control and visibility. Autopilot can also deploy trained models to real-time inference endpoints automatically. If you have workloads with spiky or unpredictable traffic patterns that can tolerate cold starts, then deploying […]  ( 8 min )
  • Open

    Formation of Robust Bound States of Interacting Photons
    Posted by Alexis Morvan and Trond Andersen, Research Scientists, Google Quantum AI When quantum computers were first proposed, they were hoped to be a way to better understand the quantum world. With a so-called “quantum simulator,” one could engineer a quantum computer to investigate how various quantum phenomena arise, including those that are intractable to simulate with a classical computer. But making a useful quantum simulator has been a challenge. Until now, quantum simulations with superconducting qubits have predominantly been used to verify pre-existing theoretical predictions and have rarely explored or discovered new phenomena. Only a few experiments with trapped ions or cold atoms have revealed new insights. Superconducting qubits, even though they are one of the main…  ( 93 min )
  • Open

    Pursuing a practical approach to research
    Professor Koroush Shirvan, who recently won a prestigious award from the American Nuclear Society, pursues avenues to lower the costs of nuclear energy.  ( 9 min )
  • Open

    Research Focus: Week of December 5, 2022
    This special edition of Research Focus highlights some of the 100+ papers from Microsoft Research that were accepted for publication at NeurIPS 2022 – the thirty-sixth annual Conference on Neural Information Processing Systems. In this issue, we continue to feature some of our 100+ papers accepted at NeurIPS 2022. Outstanding paper: Gradient Estimation with Discrete Stein […] The post Research Focus: Week of December 5, 2022 appeared first on Microsoft Research.  ( 9 min )
    IOM and Microsoft release first-ever differentially private synthetic dataset to counter human trafficking
    Microsoft is home to a diverse team of researchers focused on supporting a healthy global society, including finding ways technology can address human rights problems affecting the most vulnerable populations around the world. With a multi-disciplinary background in human-computer interaction, data science, and the social sciences, the research team partners with community, governmental, and nongovernmental organizations to create open technologies that enable scalable responses to such challenges.  The post IOM and Microsoft release first-ever differentially private synthetic dataset to counter human trafficking appeared first on Microsoft Research.  ( 13 min )
  • Open

    What Is a Pretrained AI Model?
    Imagine trying to teach a toddler what a unicorn is. A good place to start might be by showing the child images of the creature and describing its unique features. Now imagine trying to teach an artificially intelligent machine what a unicorn is. Where would one even begin? Pretrained AI models offer a solution. The post What Is a Pretrained AI Model? appeared first on NVIDIA Blog.  ( 7 min )
    The Hunt Is On: ‘The Witcher 3: Wild Hunt’ Next-Gen Update Coming to GeForce NOW
    It’s a wild GFN Thursday — The Witcher 3: Wild Hunt next-gen update will stream on GeForce NOW day and date, starting next week. Today, members can stream new seasons of Fortnite and Genshin Impact, alongside eight new games joining the library. In addition, the newest GeForce NOW app is rolling out this week. The post The Hunt Is On: ‘The Witcher 3: Wild Hunt’ Next-Gen Update Coming to GeForce NOW appeared first on NVIDIA Blog.  ( 6 min )
    ‘23 and AV: Transportation Industry to Drive Into Metaverse, Cloud Technologies
    As the autonomous vehicle industry enters the next year, it will start navigating into even greater technology frontiers. Next-generation vehicles won’t just be defined by autonomous driving capabilities. Everything from the design and production process to the in-vehicle experience is entering a new era of digitization, efficiency, safety and intelligence. The post ‘23 and AV: Transportation Industry to Drive Into Metaverse, Cloud Technologies appeared first on NVIDIA Blog.  ( 6 min )
  • Open

    11 React Apps Ruling the Internet
    The constant need to enhance user experience keeps developers on their toes and influences them to adopt new technologies and trends that…  ( 9 min )
  • Open

    Law Informs Code: A Legal Informatics Approach to Aligning Artificial Intelligence with Humans. (arXiv:2209.13020v8 [cs.CY] UPDATED)
    We are currently unable to specify human goals and societal values in a way that reliably directs AI behavior. Law-making and legal interpretation form a computational engine that converts opaque human values into legible directives. "Law Informs Code" is the research agenda capturing complex computational legal processes, and embedding them in AI. Similar to how parties to a legal contract cannot foresee every potential contingency of their future relationship, and legislators cannot predict all the circumstances under which their proposed bills will be applied, we cannot ex ante specify rules that provably direct good AI behavior. Legal theory and practice have developed arrays of tools to address these specification problems. For instance, legal standards allow humans to develop shared understandings and adapt them to novel situations. In contrast to more prosaic uses of the law (e.g., as a deterrent of bad behavior through the threat of sanction), when leveraged as an expression of how humans communicate their goals and what society values, Law Informs Code. We describe how data generated by legal processes (methods of law-making, statutory interpretation, contract drafting, applications of legal standards, legal reasoning, etc.) can facilitate the robust specification of inherently vague human goals. This increases human-AI alignment and the local usefulness of AI. Toward society-AI alignment, we present a framework for understanding law as the applied philosophy of multi-agent alignment. Although law is partly a reflection of historically contingent political power - and thus not a perfect aggregation of citizen preferences - if properly parsed, its distillation offers the most legitimate computational comprehension of societal values available. If law eventually informs powerful AI, engaging in the deliberative political process to improve law takes on even more meaning.  ( 3 min )
    Dist-PU: Positive-Unlabeled Learning from a Label Distribution Perspective. (arXiv:2212.02801v1 [cs.LG])
    Positive-Unlabeled (PU) learning tries to learn binary classifiers from a few labeled positive examples with many unlabeled ones. Compared with ordinary semi-supervised learning, this task is much more challenging due to the absence of any known negative labels. While existing cost-sensitive-based methods have achieved state-of-the-art performances, they explicitly minimize the risk of classifying unlabeled data as negative samples, which might result in a negative-prediction preference of the classifier. To alleviate this issue, we resort to a label distribution perspective for PU learning in this paper. Noticing that the label distribution of unlabeled data is fixed when the class prior is known, it can be naturally used as learning supervision for the model. Motivated by this, we propose to pursue the label distribution consistency between predicted and ground-truth label distributions, which is formulated by aligning their expectations. Moreover, we further adopt the entropy minimization and Mixup regularization to avoid the trivial solution of the label distribution consistency on unlabeled data and mitigate the consequent confirmation bias. Experiments on three benchmark datasets validate the effectiveness of the proposed method. Code available at: https://github.com/Ray-rui/Dist-PU-Positive-Unlabeled-Learning-from-a-Label-Distribution-Perspective.  ( 2 min )
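    A rough sketch of the expectation-alignment idea described above (an illustrative reading, not necessarily the authors' exact objective): penalize the gap between the mean predicted positive probability on an unlabeled batch and the known class prior:

```python
import torch

def label_distribution_loss(p_unlabeled: torch.Tensor, class_prior: float) -> torch.Tensor:
    # Align the expected positive fraction predicted on unlabeled data
    # with the known class prior (illustrative, not the paper's exact loss).
    return (p_unlabeled.mean() - class_prior).abs()

probs = torch.sigmoid(torch.randn(128))  # classifier outputs on an unlabeled batch
loss = label_distribution_loss(probs, class_prior=0.3)
```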
    Codex Hacks HackerRank: Memorization Issues and a Framework for Code Synthesis Evaluation. (arXiv:2212.02684v1 [cs.SE])
    The Codex model has demonstrated extraordinary competence in synthesizing code from natural language problem descriptions. However, in order to reveal unknown failure modes and hidden biases, such large-scale models must be systematically subjected to multiple and diverse evaluation studies. In this work, we evaluate the code synthesis capabilities of the Codex model based on a set of 115 Python problem statements from a popular competitive programming portal: HackerRank. Our evaluation shows that Codex is indeed proficient in Python, solving 96% of the problems in a zero-shot setting, and 100% of the problems in a few-shot setting. However, Codex exhibits clear signs of generating memorized code based on our evaluation. This is alarming, especially since the adoption and use of such models could directly impact how code is written and produced in the foreseeable future. With this in mind, we further discuss and highlight some of the prominent risks associated with large-scale models of source code. Finally, we propose a framework for code-synthesis evaluation using variations of problem statements based on mutations.  ( 2 min )
    Financial Risk Management on a Neutral Atom Quantum Processor. (arXiv:2212.03223v1 [quant-ph])
    Machine Learning models capable of handling the large datasets collected in the financial world can often become black boxes that are expensive to run. The quantum computing paradigm suggests new optimization techniques that, combined with classical algorithms, may deliver competitive, faster and more interpretable models. In this work we propose a quantum-enhanced machine learning solution for the prediction of credit rating downgrades, also known as fallen-angels forecasting in the financial risk management field. We implement this solution on a neutral atom Quantum Processing Unit with up to 60 qubits on a real-life dataset. We report competitive performances against the state-of-the-art Random Forest benchmark whilst our model achieves better interpretability and comparable training times. We examine how to improve performance in the near term, validating our ideas with Tensor Networks-based numerical simulations.  ( 2 min )
    Benchmarking Offline Reinforcement Learning Algorithms for E-Commerce Order Fraud Evaluation. (arXiv:2212.02620v1 [cs.LG])
    Amazon and other e-commerce sites must employ mechanisms to protect their millions of customers from fraud, such as unauthorized use of credit cards. One such mechanism is order fraud evaluation, where systems evaluate orders for fraud risk, and either "pass" the order, or take an action to mitigate high risk. Order fraud evaluation systems typically use binary classification models that distinguish fraudulent and legitimate orders, to assess risk and take action. We seek to devise a system that considers both financial losses of fraud and long-term customer satisfaction, which may be impaired when incorrect actions are applied to legitimate customers. We propose that taking actions to optimize long-term impact can be formulated as a Reinforcement Learning (RL) problem. Standard RL methods require online interaction with an environment to learn, but this is not desirable in high-stakes applications like order fraud evaluation. Offline RL algorithms learn from logged data collected from the environment, without the need for online interaction, making them suitable for our use case. We show that offline RL methods outperform traditional binary classification solutions in SimStore, a simplified e-commerce simulation that incorporates order fraud risk. We also propose a novel approach to training offline RL policies that adds a new loss term during training, to better align policy exploration with taking correct actions.  ( 2 min )
    Deep Learning Based Cloud Cover Parameterization for ICON. (arXiv:2112.11317v3 [physics.ao-ph] UPDATED)
    A promising approach to improve cloud parameterizations within climate models and thus climate projections is to use deep learning in combination with training data from storm-resolving model (SRM) simulations. The ICOsahedral Non-hydrostatic (ICON) modeling framework permits simulations ranging from numerical weather prediction to climate projections, making it an ideal target to develop neural network (NN) based parameterizations for sub-grid scale processes. Within the ICON framework, we train NN based cloud cover parameterizations with coarse-grained data based on realistic regional and global ICON SRM simulations. We set up three different types of NNs that differ in the degree of vertical locality they assume for diagnosing cloud cover from coarse-grained atmospheric state variables. The NNs accurately estimate sub-grid scale cloud cover from coarse-grained data that has similar geographical characteristics to their training data. Additionally, globally trained NNs can reproduce sub-grid scale cloud cover of the regional SRM simulation. Using the game-theory based interpretability library SHapley Additive exPlanations, we identify an overemphasis on specific humidity and cloud ice as the reason why our column-based NN cannot perfectly generalize from the global to the regional coarse-grained SRM data. The interpretability tool also helps visualize similarities and differences in feature importance between regionally and globally trained column-based NNs, and reveals a local relationship between their cloud cover predictions and the thermodynamic environment. Our results show the potential of deep learning to derive accurate yet interpretable cloud cover parameterizations from global SRMs, and suggest that neighborhood-based models may be a good compromise between accuracy and generalizability.  ( 2 min )
    Direction of Arrival Estimation of Sound Sources Using Icosahedral CNNs. (arXiv:2203.16940v2 [eess.AS] UPDATED)
    In this paper, we present a new model for Direction of Arrival (DOA) estimation of sound sources based on an Icosahedral Convolutional Neural Network (CNN) applied over SRP-PHAT power maps computed from the signals received by a microphone array. This icosahedral CNN is equivariant to the 60 rotational symmetries of the icosahedron, which represent a good approximation of the continuous space of spherical rotations, and can be implemented using standard 2D convolutional layers, having a lower computational cost than most of the spherical CNNs. In addition, instead of using fully connected layers after the icosahedral convolutions, we propose a new soft-argmax function that can be seen as a differentiable version of the argmax function and allows us to solve the DOA estimation as a regression problem interpreting the output of the convolutional layers as a probability distribution. We prove that using models that fit the equivariances of the problem allows us to outperform other state-of-the-art models with a lower computational cost and more robustness, obtaining root mean square localization errors lower than 10° even in scenarios with a reverberation time $T_{60}$ of 1.5 s.  ( 2 min )
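    A minimal sketch of such a soft-argmax readout over a discretized direction grid; the scores, grid, and temperature beta below are placeholders, and the paper's exact formulation may differ:

```python
import torch
import torch.nn.functional as F

def soft_argmax_direction(scores: torch.Tensor, grid_dirs: torch.Tensor,
                          beta: float = 10.0) -> torch.Tensor:
    # Treat the scores over candidate directions as a probability
    # distribution and return the expected (then renormalized) direction.
    weights = torch.softmax(beta * scores, dim=0)           # (N,)
    direction = (weights.unsqueeze(1) * grid_dirs).sum(0)   # (3,)
    return direction / direction.norm()

scores = torch.randn(42)                             # power-map values (placeholder)
grid_dirs = F.normalize(torch.randn(42, 3), dim=1)   # unit vectors on the sphere
estimate = soft_argmax_direction(scores, grid_dirs)
```

    Because the readout is differentiable, the localization error can be backpropagated directly through it, unlike a hard argmax.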
    Deep Learning Methods for Partial Differential Equations and Related Parameter Identification Problems. (arXiv:2212.03130v1 [cs.LG])
    Recent years have witnessed a growth in mathematics for deep learning--which seeks a deeper understanding of the concepts of deep learning with mathematics, and explores how to make it more robust--and deep learning for mathematics, where deep learning algorithms are used to solve problems in mathematics. The latter has popularised the field of scientific machine learning where deep learning is applied to problems in scientific computing. Specifically, more and more neural network architectures have been developed to solve specific classes of partial differential equations (PDEs). Such methods exploit properties that are inherent to PDEs and thus solve the PDEs better than classical feed-forward neural networks, recurrent neural networks, and convolutional neural networks. This has had a great impact in the area of mathematical modeling where parametric PDEs are widely used to model most natural and physical processes arising in science and engineering. In this work, we review such methods and extend them for parametric studies as well as for solving the related inverse problems. We equally proceed to show their relevance in some industrial applications.
    Continuous Mixtures of Tractable Probabilistic Models. (arXiv:2209.10584v2 [cs.LG] UPDATED)
    Probabilistic models based on continuous latent spaces, such as variational autoencoders, can be understood as uncountable mixture models where components depend continuously on the latent code. They have proven expressive tools for generative and probabilistic modelling, but are at odds with tractable probabilistic inference, that is, computing marginals and conditionals of the represented probability distribution. Meanwhile, tractable probabilistic models such as probabilistic circuits (PCs) can be understood as hierarchical discrete mixture models, which allows them to perform exact inference, but often they show subpar performance in comparison to continuous latent-space models. In this paper, we investigate a hybrid approach, namely continuous mixtures of tractable models with a small latent dimension. While these models are analytically intractable, they are well amenable to numerical integration schemes based on a finite set of integration points. With a large enough number of integration points the approximation becomes de-facto exact. Moreover, using a finite set of integration points, the approximation method can be compiled into a PC performing `exact inference in an approximate model'. In experiments, we show that this simple scheme proves remarkably effective, as PCs learned this way set new state-of-the-art for tractable models on many standard density estimation benchmarks.
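    A toy sketch of the numerical-integration idea: with K integration points z_k drawn from the prior, log p(x) is approximated by log (1/K) sum_k p(x|z_k); the Gaussian conditional below is only a stand-in for a tractable component model:

```python
import math
import torch

x = torch.randn(16, 2)  # a batch of observations (placeholder)

def log_px_given_z(z: torch.Tensor) -> torch.Tensor:
    # Stand-in tractable component: unit-variance Gaussian over x with mean z.
    return -0.5 * ((x - z) ** 2).sum(dim=1) - x.size(1) / 2 * math.log(2 * math.pi)

# Finite set of integration points sampled from the (standard normal) prior.
z_points = [torch.randn(2) for _ in range(64)]
log_terms = torch.stack([log_px_given_z(z) for z in z_points])        # (K, batch)
log_px = torch.logsumexp(log_terms, dim=0) - math.log(len(z_points))  # log mean_k p(x|z_k)
```

    With enough points the approximation becomes effectively exact, and the finite sum itself is a discrete mixture, which is why it can be compiled into a probabilistic circuit.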
    Q-Pensieve: Boosting Sample Efficiency of Multi-Objective RL Through Memory Sharing of Q-Snapshots. (arXiv:2212.03117v1 [cs.LG])
    Many real-world continuous control problems are in the dilemma of weighing the pros and cons; multi-objective reinforcement learning (MORL) serves as a generic framework for learning control policies under different preferences over objectives. However, existing MORL methods either rely on multiple passes of explicit search to find the Pareto front, and are therefore not sample-efficient, or utilize a shared policy network for coarse knowledge sharing among policies. To boost the sample efficiency of MORL, we propose Q-Pensieve, a policy improvement scheme that stores a collection of Q-snapshots to jointly determine the policy update direction and thereby enables data sharing at the policy level. We show that Q-Pensieve can be naturally integrated with soft policy iteration with a convergence guarantee. To substantiate this concept, we propose the technique of the Q replay buffer, which stores the learned Q-networks from past iterations, and arrive at a practical actor-critic implementation. Through extensive experiments and an ablation study, we demonstrate that with much fewer samples, the proposed algorithm can outperform the benchmark MORL methods on a variety of MORL benchmark tasks.
    Towards Data- and Knowledge-Driven Artificial Intelligence: A Survey on Neuro-Symbolic Computing. (arXiv:2210.15889v2 [cs.AI] UPDATED)
    Neural-symbolic computing (NeSy), which pursues the integration of the symbolic and statistical paradigms of cognition, has been an active research area of Artificial Intelligence (AI) for many years. As NeSy shows promise of reconciling the advantages of reasoning and interpretability of symbolic representation and robust learning in neural networks, it may serve as a catalyst for the next generation of AI. In the present paper, we provide a systematic overview of the important and recent developments of research on NeSy AI. Firstly, we introduce the study history of this area, covering early work and foundations. We further discuss background concepts and identify key driving factors behind the development of NeSy. Afterward, we categorize recent landmark approaches along several main characteristics that underline this research paradigm, including neural-symbolic integration, knowledge representation, knowledge embedding, and functionality. Then, we briefly discuss the successful application of modern NeSy approaches in several domains. Finally, we identify the open problems together with potential future research directions. This survey is expected to help new researchers enter this rapidly-developing field and accelerate progress towards data- and knowledge-driven AI.
    ISAACS: Iterative Soft Adversarial Actor-Critic for Safety. (arXiv:2212.03228v1 [cs.LG])
    The deployment of robots in uncontrolled environments requires them to operate robustly under previously unseen scenarios, like irregular terrain and wind conditions. Unfortunately, while rigorous safety frameworks from robust optimal control theory scale poorly to high-dimensional nonlinear dynamics, control policies computed by more tractable "deep" methods lack guarantees and tend to exhibit little robustness to uncertain operating conditions. This work introduces a novel approach enabling scalable synthesis of robust safety-preserving controllers for robotic systems with general nonlinear dynamics subject to bounded modeling error by combining game-theoretic safety analysis with adversarial reinforcement learning in simulation. Following a soft actor-critic scheme, a safety-seeking fallback policy is co-trained with an adversarial "disturbance" agent that aims to invoke the worst-case realization of model error and training-to-deployment discrepancy allowed by the designer's uncertainty. While the learned control policy does not intrinsically guarantee safety, it is used to construct a real-time safety filter (or shield) with robust safety guarantees based on forward reachability rollouts. This shield can be used in conjunction with a safety-agnostic control policy, precluding any task-driven actions that could result in loss of safety. We evaluate our learning-based safety approach in a 5D race car simulator, compare the learned safety policy to the numerically obtained optimal solution, and empirically validate the robust safety guarantee of our proposed safety shield against worst-case model discrepancy.
    Development of a Modular and Submersible Soft Robotic Arm and Corresponding Learned Kinematics Models. (arXiv:2209.09358v2 [cs.RO] UPDATED)
    Many soft-body organisms found in nature flourish underwater. Similarly, soft robots are potentially well-suited for underwater environments partly because the problematic effects of gravity, friction, and harmonic oscillations are less severe underwater. However, it remains a challenge to design, fabricate, waterproof, model, and control underwater soft robotic systems. Furthermore, submersible robots usually do not have configurable components because of the need for sealed electronics and mechanical elements. This work presents the development of a modular and submersible soft robotic arm driven by hydraulic actuators which consists of mostly 3D printable parts which can be assembled or modified in a relatively short amount of time. Its modular design enables multiple shape configurations and easy swapping of soft actuators. As a first step to exploring machine learning control algorithms on this system, we also present preliminary forward and inverse kinematics models developed using deep neural networks.
    The Best Path Algorithm automatic variables selection via High Dimensional Graphical Models. (arXiv:2211.07267v2 [stat.ML] UPDATED)
    This paper proposes a new algorithm for an automatic variable selection procedure in High Dimensional Graphical Models. The algorithm selects the relevant variables for the node of interest on the basis of mutual information. Several contributions in the literature have investigated the use of mutual information in selecting the appropriate number of relevant features in a large data-set, but most of them have focused on binary outcomes or required high computational effort. The algorithm proposed here overcomes these drawbacks as it is an extension of Chow and Liu's algorithm. Once the probabilistic structure of a High Dimensional Graphical Model is determined via this algorithm, the best path-step, including variables with the most explanatory/predictive power for a variable of interest, is determined via the computation of the entropy coefficient of determination. The latter, being based on the notion of (symmetric) Kullback-Leibler divergence, turns out to be closely connected to the mutual information of the involved variables. The application of the algorithm to a wide range of real-world and publicly available data-sets has highlighted its potential and greater effectiveness compared to alternative extant methods.  ( 2 min )
    Image-based Detection of Surface Defects in Concrete during Construction. (arXiv:2208.02313v2 [cs.CV] UPDATED)
    Defects increase the cost and duration of construction projects as they require significant inspection and documentation efforts. Automating defect detection could significantly reduce these efforts. This work focuses on detecting honeycombs, a substantial defect in concrete structures that may affect structural integrity. We compared honeycomb images scraped from the web with images obtained from real construction inspections. We found that web images do not capture the complete variance found in real-case scenarios and that there is still a lack of data in this domain. Our dataset is therefore freely available for further research. A Mask R-CNN and EfficientNet-B0 were trained for honeycomb detection. The Mask R-CNN model allows detecting honeycombs based on instance segmentation, whereas the EfficientNet-B0 model allows a patch-based classification. Our experiments demonstrate that both approaches are suitable for solving and automating honeycomb detection. In the future, this solution can be incorporated into defect documentation systems.
    Transformer Language Models without Positional Encodings Still Learn Positional Information. (arXiv:2203.16634v2 [cs.CL] UPDATED)
    Causal transformer language models (LMs), such as GPT-3, typically require some form of positional encoding, such as positional embeddings. However, we show that LMs without any explicit positional encoding are still competitive with standard models, and that this phenomenon is robust across different datasets, model sizes, and sequence lengths. Probing experiments reveal that such models acquire an implicit notion of absolute positions throughout the network, effectively compensating for the missing information. We conjecture that causal attention enables the model to infer the number of predecessors that each token can attend to, thereby approximating its absolute position. Our findings indicate that causal LMs might derive positional awareness not only from the explicit positioning mechanism, but also from the effects of the causal mask.
    QEBVerif: Quantization Error Bound Verification of Neural Networks. (arXiv:2212.02781v1 [cs.LG])
    While deep neural networks (DNNs) have demonstrated impressive performance in solving many challenging tasks, their deployment on resource-constrained devices is limited owing to their demand for computation power and storage space. Quantization is one of the most promising techniques to address this issue by quantizing the weights and/or activation tensors of a DNN into lower bit-width fixed-point numbers. While quantization has been empirically shown to introduce minor accuracy loss, it lacks formal guarantees on that, especially when the resulting quantized neural networks (QNNs) are deployed in safety-critical applications. A majority of existing verification methods focus exclusively on individual neural networks, either DNNs or QNNs. While promising attempts have been made to verify the quantization error bound between DNNs and their quantized counterparts, they are not complete and, more importantly, do not support fully quantized neural networks, i.e., they assume that only the weights are quantized. To fill this gap, in this work, we propose a quantization error bound verification method (QEBVerif), where both weights and activation tensors are quantized. QEBVerif consists of two analyses: a differential reachability analysis (DRA) and a mixed-integer linear programming (MILP) based verification method. DRA performs difference analysis between the DNN and its quantized counterpart layer-by-layer to efficiently compute a tight quantization error interval. If it fails to prove the error bound, then we encode the verification problem into an equivalent MILP problem which can be solved by off-the-shelf solvers. Thus, QEBVerif is sound, complete, and arguably efficient. We implement QEBVerif in a tool and conduct extensive experiments, showing its effectiveness and efficiency.
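    For intuition about the quantity being verified, here is a generic symmetric fixed-point quantizer (an illustrative scheme, not necessarily QEBVerif's) together with the elementwise error bound of half a quantization step:

```python
import torch

def quantize_fixed_point(w: torch.Tensor, bits: int = 8):
    # Symmetric fixed-point quantizer (illustrative scheme).
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q * scale, scale.item()

w = torch.randn(256, 256)
w_q, scale = quantize_fixed_point(w)
# Elementwise quantization error is bounded by half a quantization step.
assert (w - w_q).abs().max() <= scale / 2 + 1e-6
```

    Verification methods like the one described above then bound how such per-tensor errors compound through the layers of the network.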
    Land Use Prediction using Electro-Optical to SAR Few-Shot Transfer Learning. (arXiv:2212.03084v1 [cs.CV])
    Satellite image analysis has important implications for land use, urbanization, and ecosystem monitoring. Deep learning methods can facilitate the analysis of different satellite modalities, such as electro-optical (EO) and synthetic aperture radar (SAR) imagery, by supporting knowledge transfer between the modalities to compensate for individual shortcomings. Recent progress has shown how distributional alignment of neural network embeddings can produce powerful transfer learning models by employing a sliced Wasserstein distance (SWD) loss. We analyze how this method can be applied to Sentinel-1 and -2 satellite imagery and develop several extensions toward making it effective in practice. In an application to few-shot Local Climate Zone (LCZ) prediction, we show that these networks outperform multiple common baselines on datasets with a large number of classes. Further, we provide evidence that instance normalization can significantly stabilize the training process and that explicitly shaping the embedding space using supervised contrastive learning can lead to improved performance.  ( 2 min )
    Stars: Tera-Scale Graph Building for Clustering and Graph Learning. (arXiv:2212.02635v1 [cs.LG])
    A fundamental procedure in the analysis of massive datasets is the construction of similarity graphs. Such graphs play a key role for many downstream tasks, including clustering, classification, graph learning, and nearest neighbor search. For these tasks, it is critical to build graphs which are sparse yet still representative of the underlying data. The benefits of sparsity are twofold: firstly, constructing dense graphs is infeasible in practice for large datasets, and secondly, the runtime of downstream tasks is directly influenced by the sparsity of the similarity graph. In this work, we present $\textit{Stars}$: a highly scalable method for building extremely sparse graphs via two-hop spanners, which are graphs where similar points are connected by a path of length at most two. Stars can construct two-hop spanners with significantly fewer similarity comparisons, which are a major bottleneck for learning-based models where comparisons are expensive to evaluate. Theoretically, we demonstrate that Stars builds a graph in nearly-linear time, where approximate nearest neighbors are contained within two-hop neighborhoods. In practice, we have deployed Stars for multiple data sets allowing for graph building at the $\textit{Tera-Scale}$, i.e., for graphs with tens of trillions of edges. We evaluate the performance of Stars for clustering and graph learning, and demonstrate 10- to 1000-fold improvements in pairwise similarity comparisons compared to different baselines, and a 2- to 10-fold improvement in running time without quality loss.
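    A toy illustration of the two-hop-spanner idea: if every point connects only to its nearest "star" center, any two points sharing a center are joined by a path of length two, giving roughly n edges instead of n^2. Note that Stars selects its star centers far more carefully than the random choice below:

```python
import numpy as np

def star_two_hop_spanner(points: np.ndarray, n_centers: int, seed: int = 0):
    # Connect every point to its nearest randomly chosen "star" center;
    # points sharing a center are then linked by a two-hop path.
    rng = np.random.default_rng(seed)
    centers = rng.choice(len(points), size=n_centers, replace=False)
    dists = np.linalg.norm(points[:, None, :] - points[centers][None, :, :], axis=2)
    nearest = centers[np.argmin(dists, axis=1)]
    return [(i, int(c)) for i, c in enumerate(nearest)]

pts = np.random.default_rng(1).normal(size=(1000, 16))
edges = star_two_hop_spanner(pts, n_centers=32)
print(len(edges))  # ~n edges rather than n^2
```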
    StyleGAN as a Utility-Preserving Face De-identification Method. (arXiv:2212.02611v1 [cs.CV])
    Several face de-identification methods have been proposed to preserve users' privacy by obscuring their faces. These methods, however, can degrade the quality of photos, and they usually do not preserve the utility of faces, e.g., their age, gender, pose, and facial expression. Recently, advanced generative adversarial network models, such as StyleGAN, have been proposed, which generate realistic, high-quality imaginary faces. In this paper, we investigate the use of StyleGAN in generating de-identified faces through style mixing, where the styles or features of the target face and an auxiliary face get mixed to generate a de-identified face that carries the utilities of the target face. We examined this de-identification method with respect to preserving utility and privacy, by implementing several face detection, verification, and identification attacks. Through extensive experiments and also comparing with two state-of-the-art face de-identification methods, we show that StyleGAN preserves the quality and utility of the faces much better than the other approaches and also by choosing the style mixing levels correctly, it can preserve the privacy of the faces much better than other methods.
    Beyond Object Recognition: A New Benchmark towards Object Concept Learning. (arXiv:2212.02710v1 [cs.CV])
    Understanding objects is a central building block of artificial intelligence, especially for embodied AI. Even though object recognition excels with deep learning, current machines still struggle to learn higher-level knowledge, e.g., what attributes an object has, and what we can do with an object. In this work, we propose a challenging Object Concept Learning (OCL) task to push the envelope of object understanding. It requires machines to reason out object affordances and simultaneously give the reason: what attributes make an object possess these affordances. To support OCL, we build a densely annotated knowledge base including extensive labels for three levels of object concept (category, attribute, affordance), and the causal relations of three levels. By analyzing the causal structure of OCL, we present a baseline, Object Concept Reasoning Network (OCRN). It leverages causal intervention and concept instantiation to infer the three levels following their causal relations. In experiments, OCRN effectively infers the object knowledge while following the causalities well. Our data and code are available at https://mvig-rhos.com/ocl.
    On the Importance of Clinical Notes in Multi-modal Learning for EHR Data. (arXiv:2212.03044v1 [cs.LG])
    Understanding deep learning model behavior is critical to accepting machine learning-based decision support systems in the medical community. Previous research has shown that jointly using clinical notes with electronic health record (EHR) data improved predictive performance for patient monitoring in the intensive care unit (ICU). In this work, we explore the underlying reasons for these improvements. While relying on a basic attention-based model to allow for interpretability, we first confirm that performance significantly improves over state-of-the-art EHR data models when combining EHR data and clinical notes. We then provide an analysis showing improvements arise almost exclusively from a subset of notes containing broader context on patient state rather than clinician notes. We believe such findings highlight deep learning models for EHR data to be more limited by partially-descriptive data than by modeling choice, motivating a more data-centric approach in the field.
    GAS-Net: Generative Artistic Style Neural Networks for Fonts. (arXiv:2212.02886v1 [cs.CV])
    Generating new fonts is time-consuming and labor-intensive, especially for a language with a huge number of characters like Chinese. Various deep learning models have demonstrated the ability to efficiently generate new fonts with a few reference characters of that style. This project aims to develop a few-shot cross-lingual font generator based on AGIS-Net and to improve its performance metrics. Our approaches include redesigning the encoder and the loss function. We validate our method on multiple languages and datasets.
    Audio Latent Space Cartography. (arXiv:2212.02610v1 [cs.SD])
    We explore the generation of visualisations of audio latent spaces using an audio-to-image generation pipeline. We believe this can help with the interpretability of audio latent spaces. We demonstrate a variety of results on the NSynth dataset. A web demo is available.
    Statistical mechanics of continual learning: variational principle and mean-field potential. (arXiv:2212.02846v1 [cond-mat.stat-mech])
    An obstacle to artificial general intelligence is posed by the continual learning of multiple tasks of a different nature. Recently, various heuristic tricks, both from machine learning and from neuroscience angles, were proposed, but they lack a unified theoretical grounding. Here, we focus on continual learning in single-layered and multi-layered neural networks of binary weights. A variational Bayesian learning setting is thus proposed, in which the neural network is trained in a field-space rather than the gradient-ill-defined discrete-weight space; furthermore, the weight uncertainty is naturally incorporated and modulates the synaptic resources among tasks. From a physics perspective, we translate variational continual learning into the Franz-Parisi thermodynamic potential framework, where the previous task knowledge acts as a prior and a reference as well. Therefore, the learning performance can be analytically studied with mean-field order parameters, whose predictions coincide with the numerical experiments using stochastic gradient descent methods. Our proposed principled frameworks also connect to elastic weight consolidation and neuroscience-inspired metaplasticity, providing a theory-grounded method for real-world multi-task learning with deep networks.
    PrefRec: Preference-based Recommender Systems for Reinforcing Long-term User Engagement. (arXiv:2212.02779v1 [cs.IR])
    Current advances in recommender systems have been remarkably successful in optimizing immediate engagement. However, long-term user engagement, a more desirable performance metric, remains difficult to improve. Meanwhile, recent reinforcement learning (RL) algorithms have shown their effectiveness in a variety of long-term goal optimization tasks. For this reason, RL is widely considered as a promising framework for optimizing long-term user engagement in recommendation. Despite being a promising approach, the application of RL heavily relies on well-designed rewards, but designing rewards related to long-term user engagement is quite difficult. To mitigate the problem, we propose a novel paradigm, Preference-based Recommender systems (PrefRec), which allows RL recommender systems to learn from preferences about users' historical behaviors rather than explicitly defined rewards. Such preferences are easily accessible through techniques such as crowdsourcing, as they do not require any expert knowledge. With PrefRec, we can fully exploit the advantages of RL in optimizing long-term goals, while avoiding complex reward engineering. PrefRec uses the preferences to automatically train a reward function in an end-to-end manner. The reward function is then used to generate learning signals to train the recommendation policy. Furthermore, we design an effective optimization method for PrefRec, which uses an additional value function, expectile regression and reward model pre-training to improve the performance. Extensive experiments are conducted on a variety of long-term user engagement optimization tasks. The results show that PrefRec significantly outperforms previous state-of-the-art methods in all the tasks.
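    The learn-a-reward-from-preferences step can be illustrated with a generic Bradley-Terry-style loss over trajectory pairs (a sketch that omits PrefRec's expectile regression and reward-model pre-training; all shapes are placeholders):

```python
import torch
import torch.nn.functional as F

reward_net = torch.nn.Sequential(
    torch.nn.Linear(16, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1)
)

def traj_return(traj: torch.Tensor) -> torch.Tensor:
    # Sum per-step rewards over a (batch, steps, features) trajectory.
    return reward_net(traj).squeeze(-1).sum(dim=1)

def preference_loss(traj_a, traj_b, prefers_a):
    # Bradley-Terry: P(a preferred over b) = sigmoid(R(a) - R(b)).
    logits = traj_return(traj_a) - traj_return(traj_b)
    return F.binary_cross_entropy_with_logits(logits, prefers_a.float())

a, b = torch.randn(8, 20, 16), torch.randn(8, 20, 16)
prefs = torch.randint(0, 2, (8,))  # crowdsourced labels: 1 if a is preferred
loss = preference_loss(a, b, prefs)
```

    The trained reward function then supplies the learning signal for the recommendation policy, replacing hand-engineered rewards.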
    Efficient PAC Learning from the Crowd with Pairwise Comparisons. (arXiv:2011.01104v4 [cs.LG] UPDATED)
    We study crowdsourced PAC learning of threshold functions, where the labels are gathered from a pool of annotators, some of whom may behave adversarially. This remains a challenging problem, and only recently was a computationally and query-efficient PAC learning algorithm established by Awasthi et al. (2017). In this paper, we show that by leveraging the more easily acquired pairwise comparison queries, it is possible to exponentially reduce the label complexity while retaining the overall query complexity and runtime. Our main algorithmic contributions are a comparison-equipped labeling scheme that can faithfully recover the true labels of a small set of instances, and a label-efficient filtering process that, in conjunction with the small labeled set, can reliably infer the true labels of a large instance set.  ( 2 min )
    Distributed Bayesian Learning of Dynamic States. (arXiv:2212.02565v1 [eess.SP])
    This work studies networked agents cooperating to track a dynamical state of nature under partial information. The proposed algorithm is a distributed Bayesian filtering algorithm for finite-state hidden Markov models (HMMs). It can be used for sequential state estimation tasks, as well as for modeling opinion formation over social networks under dynamic environments. We show that the disagreement with the optimal centralized solution is asymptotically bounded for the class of geometrically ergodic state transition models, which includes rapidly changing models. We also derive recursions for calculating the probability of error and establish convergence under Gaussian observation models. Simulations are provided to illustrate the theory and to compare against alternative approaches.
    Rethinking Backdoor Data Poisoning Attacks in the Context of Semi-Supervised Learning. (arXiv:2212.02582v1 [cs.LG])
    Semi-supervised learning methods can train high-accuracy machine learning models with a fraction of the labeled training samples required for traditional supervised learning. Such methods do not typically involve close review of the unlabeled training samples, making them tempting targets for data poisoning attacks. In this paper we investigate the vulnerabilities of semi-supervised learning methods to backdoor data poisoning attacks on the unlabeled samples. We show that simple poisoning attacks that influence the distribution of the poisoned samples' predicted labels are highly effective - achieving an average attack success rate as high as 96.9%. We introduce a generalized attack framework targeting semi-supervised learning methods to better understand and exploit their limitations and to motivate future defense strategies.
    A Mobility-Aware Deep Learning Model for Long-Term COVID-19 Pandemic Prediction and Policy Impact Analysis. (arXiv:2212.02575v1 [cs.LG])
    Pandemic (epidemic) modeling, aiming at disease spreading analysis, has always been a popular research topic, especially following the outbreak of COVID-19 in 2019. Some representative models, including SIR-based deep learning prediction models, have shown satisfactory performance. However, one major drawback is that they fall short in their long-term predictive ability. Although graph convolutional networks (GCN) also perform well, their edge representations do not contain complete information, which can lead to biases. Another drawback is that they usually use input features which they are unable to predict, so those models are unable to predict further into the future. We propose a model that can propagate predictions further into the future and has better edge representations. In particular, we model the pandemic as a spatial-temporal graph whose edges represent the transition of infections and are learned by our model. We use a two-stream framework that contains GCN and recursive structures (GRU) with an attention mechanism. Our model enables mobility analysis that provides an effective toolbox for public health researchers and policy makers to predict how different lock-down strategies that actively control mobility can influence the spread of pandemics. Experiments show that our model outperforms others in its long-term predictive power. Moreover, we simulate the effects of certain policies and predict their impacts on infection control.
    List-Decodable Sparse Mean Estimation. (arXiv:2205.14337v2 [cs.LG] UPDATED)
    Robust mean estimation is one of the most important problems in statistics: given a set of samples in $\mathbb{R}^d$ where an $\alpha$ fraction are drawn from some distribution $D$ and the rest are adversarially corrupted, we aim to estimate the mean of $D$. A surge of recent research interest has been focusing on the list-decodable setting where $\alpha \in (0, \frac12]$, and the goal is to output a finite number of estimates among which at least one approximates the target mean. In this paper, we consider that the underlying distribution $D$ is Gaussian with $k$-sparse mean. Our main contribution is the first polynomial-time algorithm that enjoys sample complexity $O\big(\mathrm{poly}(k, \log d)\big)$, i.e. poly-logarithmic in the dimension. One of our core algorithmic ingredients is using low-degree sparse polynomials to filter outliers, which may find more applications.  ( 2 min )
    Visual Query Tuning: Towards Effective Usage of Intermediate Representations for Parameter and Memory Efficient Transfer Learning. (arXiv:2212.03220v1 [cs.LG])
    Intermediate features of a pre-trained model have been shown informative for making accurate predictions on downstream tasks, even if the model backbone is kept frozen. The key challenge is how to utilize these intermediate features given their gigantic amount. We propose visual query tuning (VQT), a simple yet effective approach to aggregate intermediate features of Vision Transformers. Through introducing a handful of learnable ``query'' tokens to each layer, VQT leverages the inner workings of Transformers to ``summarize'' rich intermediate features of each layer, which can then be used to train the prediction heads of downstream tasks. As VQT keeps the intermediate features intact and only learns to combine them, it enjoys memory efficiency in training, compared to many other parameter-efficient fine-tuning approaches that learn to adapt features and need back-propagation through the entire backbone. This also suggests the complementary role between VQT and those approaches in transfer learning. Empirically, VQT consistently surpasses the state-of-the-art approach that utilizes intermediate features for transfer learning and outperforms full fine-tuning in many cases. Compared to parameter-efficient approaches that adapt features, VQT achieves much higher accuracy under memory constraints. Most importantly, VQT is compatible with these approaches to attain even higher accuracy, making it a simple add-on to further boost transfer learning.
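    A schematic of the per-layer query-token idea, using an explicit cross-attention module per layer rather than the Transformer's own attention, so this is an approximation of the mechanism rather than the paper's implementation:

```python
import torch
import torch.nn as nn

class LayerQuerySummary(nn.Module):
    # Learnable query tokens cross-attend to a frozen layer's tokens
    # and emit a compact per-layer summary.
    def __init__(self, dim: int, n_queries: int = 1):
        super().__init__()
        self.queries = nn.Parameter(torch.zeros(1, n_queries, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        q = self.queries.expand(tokens.size(0), -1, -1)
        summary, _ = self.attn(q, tokens, tokens)
        return summary.flatten(1)  # (batch, n_queries * dim)

dim, n_layers = 768, 12
summaries = nn.ModuleList(LayerQuerySummary(dim) for _ in range(n_layers))
head = nn.Linear(n_layers * dim, 100)  # downstream prediction head

# Intermediate token features from a frozen ViT, one tensor per layer (placeholders).
feats = [torch.randn(4, 197, dim) for _ in range(n_layers)]
pooled = torch.cat([m(f) for m, f in zip(summaries, feats)], dim=1)
logits = head(pooled)  # only the summaries and the head are trained
```

    Because the backbone stays frozen and only the small query/head parameters receive gradients, training needs no backpropagation through the full network, which is the source of the memory savings described above.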
    Learning the joint distribution of two sequences using little or no paired data. (arXiv:2212.03232v1 [cs.LG])
    We present a noisy channel generative model of two sequences, for example text and speech, which enables uncovering the association between the two modalities when limited paired data is available. To address the intractability of the exact model under a realistic data setup, we propose a variational inference approximation. To train this variational model with categorical data, we propose a KL encoder loss approach which has connections to the wake-sleep algorithm. Identifying the joint or conditional distributions by only observing unpaired samples from the marginals is possible only under certain conditions on the data distribution, and we discuss under what conditional independence assumptions this can be achieved, which guides the architecture designs. Experimental results show that even a tiny amount of paired data (5 minutes) is sufficient to learn to relate the two modalities (graphemes and phonemes here) when a massive amount of unpaired data is available, paving the way to adopting this principled approach for all seq2seq models in low data resource regimes.  ( 2 min )
    RBF-MGN: Solving spatiotemporal PDEs with Physics-informed Graph Neural Network. (arXiv:2212.02861v1 [cs.LG])
    Physics-informed neural networks (PINNs) have lately received significant attention as a representative deep learning-based technique for solving partial differential equations (PDEs). Most fully connected network-based PINNs use automatic differentiation to construct loss functions that suffer from slow convergence and difficult boundary enforcement. In addition, although convolutional neural network (CNN)-based PINNs can significantly improve training efficiency, CNNs have difficulty in dealing with irregular geometries with unstructured meshes. Therefore, we propose a novel framework based on graph neural networks (GNNs) and radial basis function finite difference (RBF-FD). We introduce GNNs into physics-informed learning to better handle irregular domains with unstructured meshes. RBF-FD is used to construct a high-precision difference format of the differential equations to guide model training. Finally, we perform numerical experiments on Poisson and wave equations on irregular domains. We illustrate the generalizability, accuracy, and efficiency of the proposed algorithms on different PDE parameters, numbers of collocation points, and several types of RBFs.
    PØDA: Prompt-driven Zero-shot Domain Adaptation. (arXiv:2212.03241v1 [cs.CV])
    Domain adaptation has been vastly investigated in computer vision but still requires access to target images at train time, which might be intractable in some conditions, especially for long-tail samples. In this paper, we propose the task of `Prompt-driven Zero-shot Domain Adaptation', where we adapt a model trained on a source domain using only a general textual description of the target domain, i.e., a prompt. First, we leverage a pretrained contrastive vision-language model (CLIP) to optimize affine transformations of source features, bringing them closer to target text embeddings, while preserving their content and semantics. Second, we show that augmented features can be used to perform zero-shot domain adaptation for semantic segmentation. Experiments demonstrate that our method significantly outperforms CLIP-based style transfer baselines on several datasets for the downstream task at hand. Our prompt-driven approach even outperforms one-shot unsupervised domain adaptation on some datasets, and gives comparable results on others. The code is available at https://github.com/astra-vision/PODA.
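    A loose sketch of the feature-optimization step, assuming source features already projected into CLIP's embedding space and omitting the paper's content-preservation machinery; all tensors are placeholders:

```python
import torch
import torch.nn.functional as F

# Placeholders: source features in CLIP's space, and the CLIP text
# embedding of the target-domain prompt (e.g., "driving at night").
f_src = F.normalize(torch.randn(64, 512), dim=1)
t_trg = F.normalize(torch.randn(512), dim=0)

scale = torch.ones(512, requires_grad=True)   # per-channel affine parameters
shift = torch.zeros(512, requires_grad=True)
opt = torch.optim.Adam([scale, shift], lr=1e-2)

for _ in range(100):
    f_aug = F.normalize(f_src * scale + shift, dim=1)
    loss = 1 - (f_aug @ t_trg).mean()  # pull features toward the prompt embedding
    opt.zero_grad()
    loss.backward()
    opt.step()
```

    The augmented features f_aug then stand in for target-domain data when fine-tuning the downstream segmentation head, which is what makes the adaptation zero-shot with respect to target images.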
    Knowledge-driven Active Learning. (arXiv:2110.08265v3 [cs.LG] UPDATED)
    The deployment of Deep Learning (DL) models is still precluded in contexts where the amount of supervised data is limited. To address this issue, active learning strategies aim at minimizing the amount of labelled data required to train a DL model. Most active strategies are based on uncertain sample selection, and are often even restricted to samples lying close to the decision boundary. These techniques are theoretically sound, but an understanding of the selected samples based on their content is not straightforward, further driving non-experts to consider DL as a black box. For the first time, here we propose a different approach that takes into consideration common domain knowledge and enables non-expert users to train a model with fewer samples. In our Knowledge-driven Active Learning (KAL) framework, rule-based knowledge is converted into logic constraints and their violation is checked as a natural guide for sample selection. We show that even simple relationships among data and output classes offer a way to spot predictions for which the model needs supervision. The proposed approach (i) outperforms many active learning strategies in terms of average F1 score, particularly in contexts where domain knowledge is rich. Furthermore, we empirically demonstrate that (ii) KAL discovers data distributions lying far from the initial training data, unlike uncertainty-based strategies, (iii) it ensures domain experts that the provided knowledge is respected by the model on test data, and (iv) it can be employed even when domain knowledge is not available, by coupling it with an XAI technique. Finally, we also show that KAL is suitable for object recognition tasks and that its computational demand is low, unlike many recent active learning strategies.
    Understanding Event-Generation Networks via Uncertainties. (arXiv:2104.04543v2 [hep-ph] CROSS LISTED)
    Following the growing success of generative neural networks in LHC simulations, the crucial question is how to control the networks and assign uncertainties to their event output. We show how Bayesian normalizing flow or invertible networks capture uncertainties from the training and turn them into an uncertainty on the event weight. Fundamentally, the interplay between density and uncertainty estimates indicates that these networks learn functions in analogy to parameter fits rather than binned event counts.
    GPS++: An Optimised Hybrid MPNN/Transformer for Molecular Property Prediction. (arXiv:2212.02229v2 [q-bio.QM] UPDATED)
    This technical report presents GPS++, the first-place solution to the Open Graph Benchmark Large-Scale Challenge (OGB-LSC 2022) for the PCQM4Mv2 molecular property prediction task. Our approach implements several key principles from the prior literature. At its core our GPS++ method is a hybrid MPNN/Transformer model that incorporates 3D atom positions and an auxiliary denoising task. The effectiveness of GPS++ is demonstrated by achieving 0.0719 mean absolute error on the independent test-challenge PCQM4Mv2 split. Thanks to Graphcore IPU acceleration, GPS++ scales to deep architectures (16 layers), training at 3 minutes per epoch, and large ensemble (112 models), completing the final predictions in 1 hour 32 minutes, well under the 4 hour inference budget allocated. Our implementation is publicly available at: https://github.com/graphcore/ogb-lsc-pcqm4mv2.  ( 2 min )
    Fast Online Hashing with Multi-Label Projection. (arXiv:2212.03112v1 [cs.DB])
    Hashing has been widely researched to solve the large-scale approximate nearest neighbor search problem owing to its time and storage efficiency. In recent years, a number of online hashing methods have emerged, which can update the hash functions to adapt to the new stream data and realize dynamic retrieval. However, existing online hashing methods are required to update the whole database with the latest hash functions when a query arrives, which leads to low retrieval efficiency with the continuous increase of the stream data. On the other hand, these methods ignore the supervision relationship among the examples, especially in the multi-label case. In this paper, we propose a novel Fast Online Hashing (FOH) method which only updates the binary codes of a small part of the database. To be specific, we first build a query pool in which the nearest neighbors of each central point are recorded. When a new query arrives, only the binary codes of the corresponding potential neighbors are updated. In addition, we create a similarity matrix which takes the multi-label supervision information into account and bring in the multi-label projection loss to further preserve the similarity among the multi-label data. The experimental results on two common benchmarks show that the proposed FOH can achieve dramatically lower query time, up to 6.28 seconds less than state-of-the-art baselines, with competitive retrieval accuracy.  ( 2 min )
    Language Models of Code are Few-Shot Commonsense Learners. (arXiv:2210.07128v3 [cs.CL] UPDATED)
    We address the general task of structured commonsense reasoning: given a natural language input, the goal is to generate a graph such as an event graph or a reasoning graph. To employ large language models (LMs) for this task, existing approaches ``serialize'' the output graph as a flat list of nodes and edges. Although feasible, these serialized graphs strongly deviate from the natural language corpora that LMs were pre-trained on, hindering LMs from generating them correctly. In this paper, we show that when we instead frame structured commonsense reasoning tasks as code generation tasks, pre-trained LMs of code are better structured commonsense reasoners than LMs of natural language, even when the downstream task does not involve source code at all. We demonstrate our approach across three diverse structured commonsense reasoning tasks. In all these natural language tasks, we show that using our approach, a code generation LM (CODEX) outperforms natural-language LMs that are fine-tuned on the target task (e.g., T5) and other strong LMs such as GPT-3 in the few-shot setting.  ( 2 min )
    Bayes-Newton Methods for Approximate Bayesian Inference with PSD Guarantees. (arXiv:2111.01721v3 [stat.ML] UPDATED)
    We formulate natural gradient variational inference (VI), expectation propagation (EP), and posterior linearisation (PL) as extensions of Newton's method for optimising the parameters of a Bayesian posterior distribution. This viewpoint explicitly casts inference algorithms under the framework of numerical optimisation. We show that common approximations to Newton's method from the optimisation literature, namely Gauss-Newton and quasi-Newton methods (e.g., the BFGS algorithm), are still valid under this 'Bayes-Newton' framework. This leads to a suite of novel algorithms which are guaranteed to result in positive semi-definite (PSD) covariance matrices, unlike standard VI and EP. Our unifying viewpoint provides new insights into the connections between various inference schemes. All the presented methods apply to any model with a Gaussian prior and non-conjugate likelihood, which we demonstrate with (sparse) Gaussian processes and state space models.  ( 2 min )
    Controlled Text Generation using T5 based Encoder-Decoder Soft Prompt Tuning and Analysis of the Utility of Generated Text in AI. (arXiv:2212.02924v1 [cs.CL])
    Controlled text generation is an important task in natural language processing due to its promising applications. To achieve it, we introduce a novel soft prompt tuning method that uses soft prompts at both the encoder and decoder levels together in a T5 model, since the behaviour of an additional soft prompt attached to the decoder of a T5 model in controlled text generation has remained unexplored. We then investigate the feasibility of steering the output of this extended soft-prompted T5 model at the decoder level. Finally, we analyse the utility of the generated text for AI-related tasks such as training AI models, including an interpretability analysis of a classifier trained with the synthetic text, as there has been little proper analysis of methodologies for generating properly labelled data for AI tasks. Through in-depth intrinsic and extrinsic evaluations of this generation model and the artificially generated data, we found that this model produces better results than a T5 model with a single soft prompt at the encoder level, that a sentiment classifier trained on the artificially generated data achieves classification results comparable to those of a classifier trained with real labelled data, and that the classifier's decisions are interpretable with respect to the input text content.  ( 2 min )
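    A minimal sketch of the dual-level soft prompting described here, written against the Hugging Face Transformers T5 API (prompt length, initialization scale, and the loss bookkeeping are illustrative assumptions): learnable prompt vectors are prepended to both the encoder and the decoder input embeddings while the pretrained T5 weights stay frozen.

        import torch
        import torch.nn as nn
        from transformers import T5ForConditionalGeneration

        model = T5ForConditionalGeneration.from_pretrained("t5-small")
        for p in model.parameters():
            p.requires_grad = False            # tune only the soft prompts

        d = model.config.d_model
        n_prompt = 20
        enc_prompt = nn.Parameter(0.02 * torch.randn(n_prompt, d))  # encoder level
        dec_prompt = nn.Parameter(0.02 * torch.randn(n_prompt, d))  # decoder level

        def prompted_loss(input_ids, decoder_input_ids, labels):
            emb = model.get_input_embeddings()
            B = input_ids.size(0)
            enc = torch.cat([enc_prompt.expand(B, -1, -1), emb(input_ids)], dim=1)
            dec = torch.cat([dec_prompt.expand(B, -1, -1),
                             emb(decoder_input_ids)], dim=1)
            out = model(inputs_embeds=enc, decoder_inputs_embeds=dec)
            # Score only positions after the decoder prompt; labels align with
            # decoder_input_ids (already shifted right).
            logits = out.logits[:, n_prompt:, :]
            return nn.functional.cross_entropy(
                logits.reshape(-1, logits.size(-1)), labels.reshape(-1),
                ignore_index=-100)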
    Multiple Perturbation Attack: Attack Pixelwise Under Different $\ell_p$-norms For Better Adversarial Performance. (arXiv:2212.03069v1 [cs.CV])
    Adversarial machine learning has been both a major concern and a hot topic recently, especially with the ubiquitous use of deep neural networks in the current landscape. Adversarial attacks and defenses are usually likened to a cat-and-mouse game in which defenders and attackers evolve over time. On one hand, the goal is to develop strong and robust deep networks that are resistant to malicious actors. On the other hand, in order to achieve that, we need to devise even stronger adversarial attacks to challenge these defense models. Most existing attacks employ a single $\ell_p$ distance (commonly, $p\in\{1,2,\infty\}$) to define the concept of closeness and perform steepest gradient ascent w.r.t. this $p$-norm to update all pixels in an adversarial example in the same way. Each of these $\ell_p$ attacks has its own pros and cons, and there is no single attack that can successfully break through defense models that are robust against multiple $\ell_p$ norms simultaneously. Motivated by these observations, we come up with a natural approach: combining various $\ell_p$ gradient projections on a pixel level to achieve a joint adversarial perturbation. Specifically, we learn how to perturb each pixel to maximize the attack performance, while maintaining the overall visual imperceptibility of adversarial examples. Finally, through various experiments with standardized benchmarks, we show that our method outperforms most current strong attacks across state-of-the-art defense mechanisms, while keeping the adversarial examples visually clean.  ( 2 min )
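    A sketch of the core combination step for one gradient-ascent iteration (a simplification; the per-pixel mixing weights `alpha` are the quantity that would be learned, and the $\ell_1$ direction shown is the usual steepest-coordinate choice):

        import torch

        def combined_step(x_adv, grad, alpha, step=0.01):
            B = grad.size(0)
            # Normalized ascent directions under different p-norms.
            d_inf = grad.sign()                                   # l_inf direction
            d_2 = grad / (grad.flatten(1).norm(dim=1).view(B, 1, 1, 1) + 1e-12)
            flat = grad.flatten(1)                                # l_1 direction:
            d_1 = torch.zeros_like(flat)                          # steepest coordinate
            top = flat.abs().argmax(dim=1)
            d_1[torch.arange(B), top] = flat[torch.arange(B), top].sign()
            d_1 = d_1.view_as(grad)
            # Per-pixel convex combination of the three directions.
            w = torch.softmax(alpha, dim=0)                       # alpha: (3, C, H, W)
            d = w[0] * d_inf + w[1] * d_2 + w[2] * d_1
            return (x_adv + step * d).clamp(0, 1)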
    Robust Point Cloud Segmentation with Noisy Annotations. (arXiv:2212.03242v1 [cs.CV])
    Point cloud segmentation is a fundamental task in 3D vision. Despite recent progress on point cloud segmentation with the power of deep networks, current learning methods based on the clean-label assumption may fail with noisy labels. Yet, class labels are often mislabeled at both instance-level and boundary-level in real-world datasets. In this work, we take the lead in solving the instance-level label noise by proposing a Point Noise-Adaptive Learning (PNAL) framework. Compared to noise-robust methods on image tasks, our framework is noise-rate blind, to cope with the spatially variant noise rate specific to point clouds. Specifically, we propose a point-wise confidence selection to obtain reliable labels from the historical predictions of each point. A cluster-wise label correction is proposed with a voting strategy to generate the best possible label by considering the neighbor correlations. To handle boundary-level label noise, we also propose a variant ``PNAL-boundary'' with a progressive boundary label cleaning strategy. Extensive experiments demonstrate its effectiveness on both synthetic and real-world noisy datasets. Even with $60\%$ symmetric noise and high-level boundary noise, our framework significantly outperforms its baselines, and is comparable to the upper bound trained on completely clean data. Moreover, we cleaned the popular real-world dataset ScanNetV2 for rigorous experiments. Our code and data are available at https://github.com/pleaseconnectwifi/PNAL.  ( 2 min )
    Enhancing Quantum Adversarial Robustness by Randomized Encodings. (arXiv:2212.02531v1 [quant-ph])
    The interplay between quantum physics and machine learning gives rise to the emergent frontier of quantum machine learning, where advanced quantum learning models may outperform their classical counterparts in solving certain challenging problems. However, quantum learning systems are vulnerable to adversarial attacks: adding tiny carefully-crafted perturbations on legitimate input samples can cause misclassifications. To address this issue, we propose a general scheme to protect quantum learning systems from adversarial attacks by randomly encoding the legitimate data samples through unitary or quantum error correction encoders. In particular, we rigorously prove that both global and local random unitary encoders lead to exponentially vanishing gradients (i.e. barren plateaus) for any variational quantum circuits that aim to add adversarial perturbations, independent of the input data and the inner structures of adversarial circuits and quantum classifiers. In addition, we prove a rigorous bound on the vulnerability of quantum classifiers under local unitary adversarial attacks. We show that random black-box quantum error correction encoders can protect quantum classifiers against local adversarial noises and their robustness increases as we concatenate error correction codes. To quantify the robustness enhancement, we adapt quantum differential privacy as a measure of the prediction stability for quantum classifiers. Our results establish versatile defense strategies for quantum classifiers against adversarial perturbations, which provide valuable guidance to enhance the reliability and security for both near-term and future quantum learning technologies.  ( 2 min )
    Bagging is an Optimal PAC Learner. (arXiv:2212.02264v2 [cs.LG] UPDATED)
    Determining the optimal sample complexity of PAC learning in the realizable setting was a central open problem in learning theory for decades. Finally, the seminal work by Hanneke (2016) gave an algorithm with a provably optimal sample complexity. His algorithm is based on a careful and structured sub-sampling of the training data and then returning a majority vote among hypotheses trained on each of the sub-samples. While being a very exciting theoretical result, it has not had much impact in practice, in part due to inefficiency, since it constructs a polynomial number of sub-samples of the training data, each of linear size. In this work, we prove the surprising result that the practical and classic heuristic bagging (a.k.a. bootstrap aggregation), due to Breiman (1996), is in fact also an optimal PAC learner. Bagging pre-dates Hanneke's algorithm by twenty years and is taught in most undergraduate machine learning courses. Moreover, we show that it only requires a logarithmic number of sub-samples to reach optimality.  ( 2 min )
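    Stated as code, the result says that a plain bagged majority vote needs only logarithmically many bootstrap sub-samples to be an optimal PAC learner; a minimal sketch (the base learner and the exact sub-sample count are illustrative):

        import numpy as np
        from sklearn.tree import DecisionTreeClassifier

        def bagged_majority_vote(X_train, y_train, X_test, seed=0):
            rng = np.random.default_rng(seed)
            n = len(X_train)
            n_bags = max(1, int(np.ceil(np.log2(n))))   # logarithmic, per the paper
            votes = []
            for _ in range(n_bags):
                idx = rng.integers(0, n, size=n)        # bootstrap sample, size n
                clf = DecisionTreeClassifier().fit(X_train[idx], y_train[idx])
                votes.append(clf.predict(X_test))
            # Majority vote among hypotheses trained on each sub-sample,
            # assuming labels in {-1, +1}.
            return np.sign(np.stack(votes).sum(axis=0))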
    Brownian Noise Reduction: Maximizing Privacy Subject to Accuracy Constraints. (arXiv:2206.07234v2 [cs.LG] UPDATED)
    There is a disconnect between how researchers and practitioners handle privacy-utility tradeoffs. Researchers primarily operate from a privacy first perspective, setting strict privacy requirements and minimizing risk subject to these constraints. Practitioners often desire an accuracy first perspective, possibly satisfied with the greatest privacy they can get subject to obtaining sufficiently small error. Ligett et al. have introduced a "noise reduction" algorithm to address the latter perspective. The authors show that by adding correlated Laplace noise and progressively reducing it on demand, it is possible to produce a sequence of increasingly accurate estimates of a private parameter while only paying a privacy cost for the least noisy iterate released. In this work, we generalize noise reduction to the setting of Gaussian noise, introducing the Brownian mechanism. The Brownian mechanism works by first adding Gaussian noise of high variance corresponding to the final point of a simulated Brownian motion. Then, at the practitioner's discretion, noise is gradually decreased by tracing back along the Brownian path to an earlier time. Our mechanism is more naturally applicable to the common setting of bounded $\ell_2$-sensitivity, empirically outperforms existing work on common statistical tasks, and provides customizable control of privacy loss over the entire interaction with the practitioner. We complement our Brownian mechanism with ReducedAboveThreshold, a generalization of the classical AboveThreshold algorithm that provides adaptive privacy guarantees. Overall, our results demonstrate that one can meet utility constraints while still maintaining strong levels of privacy.  ( 2 min )
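    The trace-back step has a closed form in the scalar case; a minimal sketch (function names are illustrative, and calibration of the times t and s to a target privacy level is omitted):

        import numpy as np

        rng = np.random.default_rng(0)

        def initial_release(f_x, t):
            # Release f_x + B_t, where B is a standard Brownian motion (Var = t).
            return f_x + rng.normal(0.0, np.sqrt(t))

        def reduce_noise(f_x, noisy, t, s):
            # Trace back along the same Brownian path from time t to s < t:
            # B_s | B_t ~ N((s/t) * B_t, s * (t - s) / t)  (bridge from B_0 = 0).
            b_t = noisy - f_x
            mean = (s / t) * b_t
            std = np.sqrt(s * (t - s) / t)
            return f_x + rng.normal(mean, std)

        est = initial_release(3.0, t=4.0)           # high-variance first estimate
        est = reduce_noise(3.0, est, t=4.0, s=1.0)  # less noisy, same path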
    Variational Bayesian Reinforcement Learning with Regret Bounds. (arXiv:1807.09647v4 [cs.LG] UPDATED)
    In reinforcement learning, the Q-values summarize the expected future rewards that the agent will attain. However, they cannot capture the epistemic uncertainty about those rewards. In this work we derive a new Bellman operator with an associated fixed point we call the `knowledge values'. These K-values compress both the expected future rewards and the epistemic uncertainty into a single value, so that high uncertainty, high reward, or both, can yield high K-values. The key principle is to endow the agent with a risk-seeking utility function that is carefully tuned to balance exploration and exploitation. When the agent follows a Boltzmann policy over the K-values it yields a Bayes regret bound of $\tilde O(L \sqrt{S A T})$, where $L$ is the time horizon, $S$ is the total number of states, $A$ is the number of actions, and $T$ is the number of elapsed timesteps. We show deep connections of this approach to the soft-max and maximum-entropy strands of research in reinforcement learning.  ( 2 min )
    Two-Tailed Averaging: Anytime Adaptive Once-in-a-while Optimal Iterate Averaging for Stochastic Optimization. (arXiv:2209.12581v2 [stat.ML] UPDATED)
    Tail averaging improves on Polyak averaging's non-asymptotic behaviour by excluding a number of leading iterates of stochastic optimization from its calculations. In practice, with a finite number of optimization steps and a learning rate that cannot be annealed to zero, tail averaging can get much closer to a local minimum point of the training loss than either the individual iterates or the Polyak average. However, the number of leading iterates to ignore is an important hyperparameter, and starting averaging too early or too late leads to inefficient use of resources or suboptimal solutions. Our work focusses on improving generalization, which makes setting this hyperparameter even more difficult, especially in the presence of other hyperparameters and overfitting. Furthermore, before averaging starts, the loss is only weakly informative of the final performance, which makes early stopping unreliable. To alleviate these problems, we propose an anytime variant of tail averaging intended to improve generalization rather than pure optimization, which has no hyperparameters and approximates the optimal tail at all optimization steps. Our algorithm is based on two running averages with adaptive lengths bounded in terms of the optimal tail length, one of which achieves approximate optimality with some regularity. Requiring only the additional storage for two sets of weights and periodic evaluation of the loss, the proposed two-tailed averaging algorithm is a practical and widely applicable method for improving generalization.  ( 2 min )
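    A rough sketch of the two-running-average idea (this simplifies the paper's algorithm; the switching rule shown is illustrative): a long average and a short average are both maintained, and whenever the short one evaluates better, it replaces the long one and restarts, tracking the optimal tail length without a hyperparameter.

        class TwoTailedAverager:
            def __init__(self):
                self.long_avg, self.long_n = None, 0
                self.short_avg, self.short_n = None, 0

            def update(self, w):
                # Incrementally fold the new iterate w into both running averages.
                for name in ("long", "short"):
                    avg = getattr(self, name + "_avg")
                    n = getattr(self, name + "_n") + 1
                    avg = w if avg is None else avg + (w - avg) / n
                    setattr(self, name + "_avg", avg)
                    setattr(self, name + "_n", n)

            def maybe_switch(self, evaluate):
                # Periodically: if the shorter tail generalizes better, adopt it
                # as the new long average and restart the short one.
                if self.short_n and (self.long_n == 0 or
                                     evaluate(self.short_avg) < evaluate(self.long_avg)):
                    self.long_avg, self.long_n = self.short_avg, self.short_n
                self.short_avg, self.short_n = None, 0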
    Unsafe's Betrayal: Abusing Unsafe Rust in Binary Reverse Engineering via Machine Learning. (arXiv:2211.00111v2 [cs.CR] UPDATED)
    Memory-safety bugs introduce critical software-security issues. Rust provides memory-safe mechanisms to avoid memory-safety bugs in programming, while still allowing unsafe escape hatches via unsafe code. However, the unsafe code that enhances the usability of Rust provides clear spots for finding memory-safety bugs in Rust source code. In this paper, we claim that these unsafe spots can still be identified in Rust binary code via machine learning and be leveraged for finding memory-safety bugs. To support our claim, we propose the tool rustspot, which enables reverse engineers to learn an unsafe classifier that proposes a list of functions in Rust binaries for downstream analysis. We empirically show that the function proposals by rustspot can recall $92.92\%$ of memory-safety bugs, while covering only $16.79\%$ of the entire binary code. As an application, we demonstrate that the function proposals are used in targeted fuzzing on Rust packages, which reduces the fuzzing time compared to non-targeted fuzzing.  ( 2 min )
    Curriculum Learning for Relative Overgeneralization. (arXiv:2212.02733v1 [cs.LG])
    In multi-agent reinforcement learning (MARL), many popular methods, such as VDN and QMIX, are susceptible to a critical multi-agent pathology known as relative overgeneralization (RO), which arises when the optimal joint action's utility falls below that of a sub-optimal joint action in cooperative tasks. RO can cause the agents to get stuck in local optima or fail to solve tasks that require significant coordination between agents within a given timestep. Recent value-based MARL algorithms such as QPLEX and WQMIX can overcome RO to some extent. However, our experimental results show that they can still fail to solve cooperative tasks that exhibit strong RO. In this work, we propose a novel approach called curriculum learning for relative overgeneralization (CURO) to better overcome RO. To solve a target task that exhibits strong RO, in CURO, we first fine-tune the reward function of the target task to generate source tasks that are tailored to the current ability of the learning agent and train the agent on these source tasks first. Then, to effectively transfer the knowledge acquired in one task to the next, we use a novel transfer learning method that combines value function transfer with buffer transfer, which enables more efficient exploration in the target task. We demonstrate that, when applied to QMIX, CURO overcomes the severe RO problem and significantly improves performance, yielding state-of-the-art results in a variety of cooperative multi-agent tasks, including the challenging StarCraft II micromanagement benchmarks.  ( 2 min )
    Machine-learning-based head impact subtyping based on the spectral densities of the measurable head kinematics. (arXiv:2104.09082v3 [q-bio.QM] UPDATED)
    Objective: Traumatic brain injury can be caused by head impacts, but many brain injury risk estimation models are not equally accurate across the variety of impacts that patients may undergo, and the characteristics of different types of impacts are not well studied. We investigated the spectral characteristics of different head impact types with kinematics classification. Methods: Data were analyzed from 3,262 head impacts from lab reconstruction, American football, mixed martial arts, and publicly available car crash data. A random forest classifier with spectral densities of linear acceleration and angular velocity was built to classify head impact types (e.g., football, car crash, mixed martial arts). To test the classifier robustness, another 271 lab-reconstructed impacts were obtained from 5 other instrumented mouthguards. Finally, with the classifier, type-specific, nearest-neighbor regression models were built for brain strain. Results: The classifier reached a median accuracy of 96% over 1,000 random partitions of training and test sets. The most important classification features spanned both low- and high-frequency ranges and included both linear acceleration and angular velocity features. Different head impact types had different distributions of spectral densities in the low- and high-frequency ranges (e.g., the spectral densities of MMA impacts were higher in the high-frequency range than in the low-frequency range). The type-specific regression showed a generally higher R^2 value than baseline models without classification. Conclusion: The machine-learning-based classifier enables a better understanding of the impact kinematics spectral density in different sports, and it can be applied to evaluate the quality of impact-simulation systems and on-field data augmentation.  ( 3 min )
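    In outline, the classification pipeline is easy to reproduce (sampling rate, window length, and feature layout here are illustrative assumptions):

        import numpy as np
        from scipy.signal import welch
        from sklearn.ensemble import RandomForestClassifier

        def spectral_features(lin_acc, ang_vel, fs=1000):
            # Power spectral densities of linear acceleration and angular
            # velocity, concatenated over the axes of each sensor channel.
            feats = []
            for sig in (lin_acc, ang_vel):          # each: (n_axes, n_samples)
                for axis in sig:
                    _, psd = welch(axis, fs=fs, nperseg=256)
                    feats.append(psd)
            return np.concatenate(feats)

        # X_raw: list of (lin_acc, ang_vel) pairs; y: impact-type labels.
        def fit_classifier(X_raw, y):
            X = np.stack([spectral_features(a, w) for a, w in X_raw])
            return RandomForestClassifier(n_estimators=300).fit(X, y)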
    Deep Learning for Bayesian Optimization of Scientific Problems with High-Dimensional Structure. (arXiv:2104.11667v4 [cs.LG] UPDATED)
    Bayesian optimization (BO) is a popular paradigm for global optimization of expensive black-box functions, but there are many domains where the function is not completely a black-box. The data may have some known structure (e.g. symmetries) and/or the data generation process may be a composite process that yields useful intermediate or auxiliary information in addition to the value of the optimization objective. However, surrogate models traditionally employed in BO, such as Gaussian Processes (GPs), scale poorly with dataset size and do not easily accommodate known structure. Instead, we use Bayesian neural networks, a class of scalable and flexible surrogate models with inductive biases, to extend BO to complex, structured problems with high dimensionality. We demonstrate BO on a number of realistic problems in physics and chemistry, including topology optimization of photonic crystal materials using convolutional neural networks, and chemical property optimization of molecules using graph neural networks. On these complex tasks, we show that neural networks often outperform GPs as surrogate models for BO in terms of both sampling efficiency and computational cost.  ( 2 min )
    BoostTree and BoostForest for Ensemble Learning. (arXiv:2003.09737v3 [cs.LG] UPDATED)
    Bootstrap aggregating (Bagging) and boosting are two popular ensemble learning approaches, which combine multiple base learners to generate a composite model for more accurate and more reliable performance. They have been widely used in biology, engineering, healthcare, etc. This paper proposes BoostForest, which is an ensemble learning approach using BoostTree as base learners and can be used for both classification and regression. BoostTree constructs a tree model by gradient boosting. It increases the randomness (diversity) by drawing the cut-points randomly at node splitting. BoostForest further increases the randomness by bootstrapping the training data in constructing different BoostTrees. BoostForest generally outperformed four classical ensemble learning approaches (Random Forest, Extra-Trees, XGBoost and LightGBM) on 35 classification and regression datasets. Remarkably, BoostForest tunes its parameters by simply sampling them randomly from a parameter pool, which can be easily specified, and its ensemble learning framework can also be used to combine many other base learners.  ( 2 min )
    Ask "Who", Not "What": Bitcoin Volatility Forecasting with Twitter Data. (arXiv:2110.14317v2 [q-fin.ST] UPDATED)
    Understanding the variations in trading price (volatility), and its response to exogenous information, is a well-researched topic in finance. In this study, we focus on finding stable and accurate volatility predictors for a relatively new asset class of cryptocurrencies, in particular Bitcoin, using deep learning representations of public social media data obtained from Twitter. For our experiments, we extracted semantic information and user statistics from over 30 million Bitcoin-related tweets, in conjunction with 15-minute frequency price data over a horizon of 144 days. Using this data, we built several deep learning architectures that utilized different combinations of the gathered information. For each model, we conducted ablation studies to assess the influence of different components and feature sets on the prediction accuracy. We found statistical evidence for the hypotheses that: (i) temporal convolutional networks perform significantly better than both classical autoregressive models and other deep learning-based architectures in the literature, and (ii) tweet author meta-information, even detached from the tweet itself, is a better predictor of volatility than the semantic content and tweet volume statistics. We demonstrate how different information sets gathered from social media can be utilized in different architectures and how they affect the prediction results. As an additional contribution, we make our dataset public for future research.  ( 2 min )
    Continual learning on deployment pipelines for Machine Learning Systems. (arXiv:2212.02659v1 [cs.LG])
    Following the development of digitization, a growing number of large Original Equipment Manufacturers (OEMs) are adopting computer vision or natural language processing in a wide range of applications such as anomaly detection and quality inspection in plants. The deployment of such systems is becoming an extremely important topic. Our work starts with the least-automated deployment technologies for machine learning systems, proceeds through several iterations of updates, and ends with a comparison of automated deployment techniques. The objective is, on the one hand, to compare the advantages and disadvantages of various technologies in theory and practice, so that later adopters can avoid common mistakes when implementing actual use cases and thereby choose a better strategy for their own enterprises. On the other hand, we aim to raise awareness of the evaluation framework for the deployment of machine learning systems, so that more comprehensive and useful evaluation metrics are considered (e.g. table 2), rather than only a single factor (e.g. company cost). This is especially important for decision-makers in the industry.  ( 2 min )
    Explaining Link Predictions in Knowledge Graph Embedding Models with Influential Examples. (arXiv:2212.02651v1 [cs.LG])
    We study the problem of explaining link predictions in Knowledge Graph Embedding (KGE) models. We propose an example-based approach that exploits the latent space representation of nodes and edges in a knowledge graph to explain predictions. We evaluated the importance of the identified triples by observing the progressive degradation of model performance upon removal of influential triples. Our experiments demonstrate that this approach to generating explanations outperforms baselines on KGE models for two publicly available datasets.  ( 2 min )
    Bayesian Algorithm Execution for Tuning Particle Accelerator Emittance with a Virtual Objective. (arXiv:2209.04587v2 [physics.acc-ph] UPDATED)
    Traditional black-box optimization methods are inefficient when dealing with $\textit{multi-point queries}$, i.e. when each query of the objective requires multiple secondary measurements, simulations, or other tasks. Existing approaches, including Bayesian optimization (BO), acquire the full series of measurements at each iteration, making the queries slow and information-poor. We propose applying Bayesian Algorithm Execution (BAX) to instead query and model individual measurements. BAX avoids the slow multi-point query by acquiring points through a $\textit{virtual objective}$, i.e. calculating the multi-point objective from the learned model rather than from the experiment. As a result, queries in BAX are faster and retain more information compared to those in BO. In this work, we use BAX to minimize emittance at the Linac Coherent Light Source (LCLS) and the Facility for Advanced Accelerator Experimental Tests II (FACET-II) particle accelerators. Although the emittance is a critical parameter for the performance of high-brightness machines, including X-ray lasers and linear colliders, optimization is often limited by the time required for tuning. In an LCLS simulation environment, we show that BAX is 20$\times$ faster while also being more robust to noise compared to existing optimization methods. In live tests, BAX performed the first fully-automated emittance tuning at both LCLS and FACET-II, matching the hand-tuned emittance at FACET-II and achieving an optimal emittance 24% lower than that obtained by hand-tuning at LCLS. We anticipate that our approach can readily be adapted to other types of optimization problems involving multi-point queries commonly found in scientific instruments.  ( 2 min )
    Federated Learning with Superquantile Aggregation for Heterogeneous Data. (arXiv:2112.09429v2 [cs.LG] UPDATED)
    We present a federated learning framework that is designed to robustly deliver good predictive performance across individual clients with heterogeneous data. The proposed approach hinges upon a superquantile-based learning objective that captures the tail statistics of the error distribution over heterogeneous clients. We present a stochastic training algorithm that interleaves differentially private client filtering with federated averaging steps. We prove finite time convergence guarantees for the algorithm: $O(1/\sqrt{T})$ in the nonconvex case in $T$ communication rounds and $O(\exp(-T/\kappa^{3/2}) + \kappa/T)$ in the strongly convex case with local condition number $\kappa$. Experimental results on benchmark datasets for federated learning demonstrate that our approach is competitive with classical ones in terms of average error and outperforms them in terms of tail statistics of the error.  ( 2 min )
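    The tail statistic at the heart of the objective is simple to compute; a sketch at superquantile level $\theta$ (this discrete top-$k$ form is a common approximation of the superquantile, also known as the conditional value at risk):

        import numpy as np

        def superquantile(losses, theta=0.9):
            # Average of the worst (1 - theta) fraction of per-client losses,
            # i.e., the tail statistic the federated objective minimizes.
            losses = np.sort(np.asarray(losses))[::-1]          # descending
            k = max(1, int(np.ceil((1 - theta) * len(losses))))
            return losses[:k].mean()

        client_losses = [0.2, 0.8, 0.3, 1.5, 0.4]
        print(superquantile(client_losses, theta=0.8))   # mean of the worst 20%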
    Robust Orthogonal Machine Learning of Treatment Effects. (arXiv:2103.11869v2 [stat.ML] UPDATED)
    Causal learning is the key to obtaining stable predictions and answering \textit{what if} questions in decision-making. In causal learning, it is central to seek methods to estimate the average treatment effect (ATE) from observational data. Double/Debiased Machine Learning (DML) is one of the prevalent methods to estimate ATE. However, DML estimators can suffer from an \textit{error-compounding issue} and even give extreme estimates when the propensity scores are close to 0 or 1. Previous studies have overcome this issue through some empirical tricks such as propensity score trimming, yet none of the existing works solves it from a theoretical standpoint. In this paper, we propose a \textit{Robust Causal Learning (RCL)} method to offset the deficiencies of DML estimators. Theoretically, the RCL estimators i) satisfy the (higher-order) orthogonal condition and are as \textit{consistent and doubly robust} as the DML estimators, and ii) get rid of the error-compounding issue. Empirically, comprehensive experiments show that: i) the RCL estimators give more stable estimations of the causal parameters than DML; ii) the RCL estimators outperform traditional estimators and their variants when applying different machine learning models on both simulation and benchmark datasets, as well as on a mimic consumer credit dataset generated by WGAN.  ( 2 min )
    Learning Representations that Enable Generalization in Assistive Tasks. (arXiv:2212.03175v1 [cs.LG])
    Recent work in sim2real has successfully enabled robots to act in physical environments by training in simulation with a diverse ``population'' of environments (i.e. domain randomization). In this work, we focus on enabling generalization in assistive tasks: tasks in which the robot is acting to assist a user (e.g. helping someone with motor impairments with bathing or with scratching an itch). Such tasks are particularly interesting relative to prior sim2real successes because the environment now contains a human who is also acting. This complicates the problem because the diversity of human users (instead of merely physical environment parameters) is more difficult to capture in a population, thus increasing the likelihood of encountering out-of-distribution (OOD) human policies at test time. We advocate that generalization to such OOD policies benefits from (1) learning a good latent representation for human policies that test-time humans can accurately be mapped to, and (2) making that representation adaptable with test-time interaction data, instead of relying on it to perfectly capture the space of human policies based on the simulated population only. We study how to best learn such a representation by evaluating on purposefully constructed OOD test policies. We find that sim2real methods that encode environment (or population) parameters and work well in tasks that robots do in isolation, do not work well in assistance. In assistance, it seems crucial to train the representation based on the history of interaction directly, because that is what the robot will have access to at test time. Further, training these representations to then predict human actions not only gives them better structure, but also enables them to be fine-tuned at test-time, when the robot observes the partner act. https://adaptive-caregiver.github.io.  ( 2 min )
    Overlapping oriented imbalanced ensemble learning method based on projective clustering and stagewise hybrid sampling. (arXiv:2212.03182v1 [cs.LG])
    The challenge of imbalanced learning lies not only in the class imbalance problem, but also in the more complex class overlapping problem. However, most of the existing algorithms mainly focus on the former, and this limitation prevents existing methods from making further progress. To address it, this paper proposes an ensemble learning algorithm based on dual clustering and stage-wise hybrid sampling (DCSHS). The DCSHS has three parts. Firstly, we design a projection clustering combination framework (PCC) guided by the Davies-Bouldin clustering effectiveness index (DBI), which is used to obtain high-quality clusters and combine them into a set of cross-complete subsets (CCS) with balanced classes and low overlapping. Secondly, according to the characteristics of the subset classes, a stage-wise hybrid sampling algorithm is designed to realize the de-overlapping and balancing of the subsets. Finally, a projective clustering transfer mapping mechanism (CTM) is constructed for all processed subsets by means of transfer learning, thereby reducing class overlapping and exploring the structural information of samples. The major advantage of our algorithm is that it can exploit the intersectionality of the CCS to realize the soft elimination of overlapping majority samples and learn as much information from overlapping samples as possible, thereby mitigating class overlap while balancing the classes. In the experimental section, more than 30 public datasets and over ten representative algorithms are chosen for verification. The experimental results show that DCSHS performs significantly best in terms of various evaluation criteria.  ( 2 min )
    Safe Imitation Learning of Nonlinear Model Predictive Control for Flexible Robots. (arXiv:2212.02941v1 [cs.RO])
    Flexible robots may overcome the industry's major problems: safe human-robot collaboration and increased load-to-mass ratio. However, oscillations and a high-dimensional state space complicate the control of flexible robots. This work investigates nonlinear model predictive control (NMPC) of flexible robots -- for simultaneous planning and control -- modeled via the rigid finite element method. Although NMPC performs well in simulation, computational complexity prevents its deployment in practice. We show that imitation learning of NMPC with neural networks as function approximators can massively improve the computation time of the controller at the cost of slight performance loss and, more critically, loss of safety guarantees. We leverage a safety filter formulated as a simpler NMPC to recover safety guarantees. Experiments on a simulated three-degrees-of-freedom flexible robot manipulator demonstrate that the average computation time of the proposed safe approximate NMPC controller is 3.6 ms, while that of the original NMPC is 11.8 ms. Fast and safe approximate NMPC might facilitate the industry's adoption of flexible robots and new solutions for similar problems, e.g., deformable object manipulation and soft robot control.  ( 2 min )
    Sharpness-Aware Minimization with Dynamic Reweighting. (arXiv:2112.08772v4 [cs.LG] UPDATED)
    Deep neural networks are often overparameterized and may not easily achieve model generalization. Adversarial training has shown effectiveness in improving generalization by regularizing the change of loss on top of adversarially chosen perturbations. The recently proposed sharpness-aware minimization (SAM) algorithm conducts adversarial weight perturbation, encouraging the model to converge to a flat minimum. SAM finds a common adversarial weight perturbation per batch. Although per-instance adversarial weight perturbations are stronger adversaries and can potentially lead to better generalization performance, their computational cost is very high, making it impossible to use per-instance perturbations efficiently in SAM. In this paper, we tackle this efficiency bottleneck and propose sharpness-aware minimization with dynamic reweighting (delta-SAM). Our theoretical analysis motivates that it is possible to approach the stronger, per-instance adversarial weight perturbations using reweighted per-batch weight perturbations. delta-SAM dynamically reweights perturbations within each batch according to theoretically principled weighting factors, serving as a good approximation to per-instance perturbation. Experiments on various natural language understanding tasks demonstrate the effectiveness of delta-SAM.  ( 2 min )
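    For orientation, one SAM-style step with a reweighted per-batch perturbation can be sketched as follows (uniform weights recover standard SAM; delta-SAM's theoretically derived weighting factors would be supplied as `weights`, and `loss_fn` is assumed to return per-sample losses):

        import torch

        def reweighted_sam_step(model, loss_fn, x, y, weights, optimizer, rho=0.05):
            # First pass: the weighted batch loss defines the adversarial direction.
            per_sample = loss_fn(model(x), y)            # reduction='none' losses
            (weights * per_sample).sum().backward()
            grads = [p.grad.clone() for p in model.parameters()]
            norm = torch.sqrt(sum((g ** 2).sum() for g in grads))
            # Perturb weights along the (reweighted) gradient direction.
            with torch.no_grad():
                for p, g in zip(model.parameters(), grads):
                    p.add_(rho * g / (norm + 1e-12))
            model.zero_grad()
            # Second pass: the gradient at the perturbed point drives the update.
            loss_fn(model(x), y).mean().backward()
            with torch.no_grad():
                for p, g in zip(model.parameters(), grads):
                    p.sub_(rho * g / (norm + 1e-12))     # restore the weights
            optimizer.step()
            optimizer.zero_grad()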
    An Unsupervised Machine Learning Approach for Ground Motion Clustering and Selection. (arXiv:2212.03188v1 [physics.geo-ph])
    Clustering analysis of sequence data continues to address many applications in engineering design, aided by the rapid growth of machine learning in applied science. This paper presents an unsupervised machine learning algorithm to extract defining characteristics of earthquake ground-motion records, also called latent features, to aid in ground-motion clustering and selection. In this context, a latent feature is a low-dimensional, machine-discovered spectral characteristic learned through the nonlinear relationships of a neural network autoencoder. Clustering can be performed on the latent features and used to select a representative archetypal subgroup from a large ground-motion suite. The objective of efficient ground-motion selection is to choose records representative of what the structure will probabilistically experience in its lifetime. Three examples are presented to validate this approach, including a synthetic spectral dataset and spectra from field-recorded ground-motion records. Deep embedding clustering of ground motion spectra improves on the results of static feature extraction, utilizing characteristics that represent the sparse spectral content of ground motions.  ( 2 min )
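    In outline (network sizes and training schedule are illustrative): an autoencoder compresses each spectrum to a latent vector, and clustering operates on the latents rather than the raw spectra.

        import torch
        import torch.nn as nn
        from sklearn.cluster import KMeans

        class SpectrumAE(nn.Module):
            def __init__(self, n_periods=128, d_latent=8):
                super().__init__()
                self.enc = nn.Sequential(nn.Linear(n_periods, 64), nn.ReLU(),
                                         nn.Linear(64, d_latent))
                self.dec = nn.Sequential(nn.Linear(d_latent, 64), nn.ReLU(),
                                         nn.Linear(64, n_periods))

            def forward(self, x):
                z = self.enc(x)
                return self.dec(z), z

        def cluster_ground_motions(spectra, n_clusters=5, epochs=200):
            x = torch.as_tensor(spectra, dtype=torch.float32)
            ae = SpectrumAE(n_periods=x.shape[1])
            opt = torch.optim.Adam(ae.parameters(), lr=1e-3)
            for _ in range(epochs):                    # reconstruction training
                recon, _ = ae(x)
                loss = nn.functional.mse_loss(recon, x)
                opt.zero_grad(); loss.backward(); opt.step()
            with torch.no_grad():
                _, z = ae(x)                           # latent features
            return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(z.numpy())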
    Integration of a systolic array based hardware accelerator into a DNN operator auto-tuning framework. (arXiv:2212.03034v1 [cs.LG])
    The deployment of neural networks on heterogeneous SoCs coupled with custom accelerators is a challenging task because of the lack of end-to-end software tools provided for these systems. Moreover, the already available low level schedules and mapping strategies provided by the accelerator developers for typical tensor operations are not necessarily the best possible ones for each particular use case. This is why frameworks which automatically test the performance of the generated code on a specific hardware configuration are of special interest. In this work, the integration between the code generation framework TVM and the systolic array-based accelerator Gemmini is presented. A generic schedule to offload the GEneral Matrix Multiply (GEMM) tensor operation onto Gemmini is detailed, and its suitability is tested by executing the AutoTVM tuning process on it. Our generated code achieves a peak throughput of 46 giga-operations per second (GOPs) under a 100 MHz clock on a Xilinx ZCU102 FPGA, outperforming previous work. Furthermore, the code generated by this integration was able to surpass the default hand-tuned schedules provided by the Gemmini developers in real-world workloads.  ( 2 min )
    Towards a Taxonomy for the Use of Synthetic Data in Advanced Analytics. (arXiv:2212.02622v1 [cs.LG])
    The proliferation of deep learning techniques has led to a wide range of advanced analytics applications in important business areas such as predictive maintenance or product recommendation. However, as the effectiveness of advanced analytics naturally depends on the availability of sufficient data, an organization's ability to exploit the benefits might be restricted by limited data or restricted data access. These challenges could force organizations to spend substantial amounts of money on data, accept constrained analytics capacities, or even turn into a showstopper for analytics projects. Against this backdrop, recent advances in deep learning to generate synthetic data may help to overcome these barriers. Despite its great potential, however, synthetic data are rarely employed. Therefore, we present a taxonomy highlighting the various facets of deploying synthetic data for advanced analytics systems. Furthermore, we identify typical application scenarios for synthetic data to assess the current state of adoption and thereby unveil missed opportunities to pave the way for further research.  ( 2 min )
    Denoising diffusion probabilistic models for probabilistic energy forecasting. (arXiv:2212.02977v1 [cs.LG])
    Scenario-based probabilistic forecasts have become a vital tool to equip decision-makers to address the uncertain nature of renewable energies. This paper presents a recent promising deep learning generative approach: denoising diffusion probabilistic models. It is a class of latent variable models that have recently demonstrated impressive results in the computer vision community. However, to the best of our knowledge, there has yet to be a demonstration that they can generate high-quality samples of the load, PV, or wind power time series that are crucial for facing the new challenges in power systems applications. Thus, we propose the first implementation of this model for energy forecasting, using the open data of the Global Energy Forecasting Competition 2014. The results demonstrate that this approach is competitive with other state-of-the-art deep learning generative models: generative adversarial networks, variational autoencoders, and normalizing flows.  ( 2 min )
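    The training objective of this model family is compact; a generic sketch for a 1-D scenario generator (the noise schedule is the standard linear one, and the network `eps_model` plus any weather conditioning are left abstract):

        import torch

        T = 1000
        betas = torch.linspace(1e-4, 0.02, T)
        alpha_bar = torch.cumprod(1.0 - betas, dim=0)

        def ddpm_loss(eps_model, x0):
            # Sample a diffusion step and noise the clean series x0 accordingly.
            t = torch.randint(0, T, (x0.size(0),))
            a = alpha_bar[t].unsqueeze(-1)                       # (B, 1)
            eps = torch.randn_like(x0)
            x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps
            # Train the network to predict the injected noise.
            return torch.nn.functional.mse_loss(eps_model(x_t, t), eps)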
    Universal Early Warning Signals of Phase Transitions in Climate Systems. (arXiv:2206.00060v2 [physics.ao-ph] UPDATED)
    The potential for complex systems to exhibit tipping points in which an equilibrium state undergoes a sudden and often irreversible shift is well established, but prediction of these events using standard forecast modeling techniques is quite difficult. This has led to the development of an alternative suite of methods that seek to identify signatures of critical phenomena in data, which are expected to occur in advance of many classes of dynamical bifurcation. Crucially, the manifestations of these critical phenomena are generic across a variety of systems, meaning that data-intensive deep learning methods can be trained on (abundant) synthetic data and plausibly prove effective when transferred to (more limited) empirical data sets. This paper provides a proof of concept for this approach as applied to lattice phase transitions: a deep neural network trained exclusively on 2D Ising model phase transitions is tested on a number of real and simulated climate systems with considerable success. Its accuracy frequently surpasses that of conventional statistical indicators, with performance shown to be consistently improved by the inclusion of spatial indicators. Tools such as this may offer valuable insight into climate tipping events, as remote sensing measurements provide increasingly abundant data on complex geospatially-resolved Earth systems.  ( 2 min )
    Accuracy-Privacy Trade-off in Deep Ensemble: A Membership Inference Perspective. (arXiv:2105.05381v4 [cs.LG] UPDATED)
    Deep ensemble learning has been shown to improve accuracy by training multiple neural networks and averaging their outputs. Ensemble learning has also been suggested to defend against membership inference attacks that undermine privacy. In this paper, we empirically demonstrate a trade-off between these two goals, namely accuracy and privacy (in terms of membership inference attacks), in deep ensembles. Using a wide range of datasets and model architectures, we show that the effectiveness of membership inference attacks increases when ensembling improves accuracy. We analyze the impact of various factors in deep ensembles and demonstrate the root cause of the trade-off. Then, we evaluate common defenses against membership inference attacks based on regularization and differential privacy. We show that while these defenses can mitigate the effectiveness of membership inference attacks, they simultaneously degrade ensemble accuracy. We illustrate a similar trade-off in more advanced and state-of-the-art ensembling techniques, such as snapshot ensembles and diversified ensemble networks. Finally, we propose a simple yet effective defense for deep ensembles to break the trade-off and, consequently, improve accuracy and privacy simultaneously.  ( 2 min )
    This changes to that : Combining causal and non-causal explanations to generate disease progression in capsule endoscopy. (arXiv:2212.02506v1 [cs.LG])
    Due to the unequivocal need for understanding the decision processes of deep learning networks, both model-dependent and model-agnostic techniques have become very popular. Although both of these ideas provide transparency for automated decision making, most methodologies focus on either using the model gradients (model-dependent) or ignoring the model's internal states and reasoning about the model's behavior/outcome on instances (model-agnostic). In this work, we propose a unified explanation approach that, given an instance, combines both model-dependent and model-agnostic explanations to produce an explanation set. The generated explanations are not only consistent in the neighborhood of a sample but can also highlight causal relationships between image content and the outcome. We use the Wireless Capsule Endoscopy (WCE) domain to illustrate the effectiveness of our explanations. The saliency maps generated by our approach are comparable or better on the softmax information score.  ( 2 min )
    Data Imputation with Iterative Graph Reconstruction. (arXiv:2212.02810v1 [cs.LG])
    Effective data imputation demands rich latent ``structure'' discovery capabilities from ``plain'' tabular data. Recent advances in graph neural network-based data imputation solutions show their strong structure learning potential by directly translating tabular data into bipartite graphs. However, due to a lack of relations between samples, those solutions treat all samples equally, which runs against one important observation: ``similar samples should give more information about missing values.'' This paper presents a novel Iterative graph Generation and Reconstruction framework for Missing data imputation (IGRM). Instead of treating all samples equally, we introduce the concept of ``friend networks'' to represent different relations among samples. To generate an accurate friend network with missing data, an end-to-end friend network reconstruction solution is designed to allow for continuous friend network optimization during imputation learning. The representation of the optimized friend network, in turn, is used to further optimize the data imputation process with differentiated message passing. Experimental results on eight benchmark datasets show that IGRM yields 39.13% lower mean absolute error compared with nine baselines and 9.04% lower than the second-best.
    Tensor-reduced atomic density representations. (arXiv:2210.01705v2 [physics.chem-ph] UPDATED)
    Density based representations of atomic environments that are invariant under Euclidean symmetries have become a widely used tool in the machine learning of interatomic potentials, broader data-driven atomistic modelling and the visualisation and analysis of materials datasets. The standard mechanism used to incorporate chemical element information is to create separate densities for each element and form tensor products between them. This leads to a steep scaling in the size of the representation as the number of elements increases. Graph neural networks, which do not explicitly use density representations, escape this scaling by mapping the chemical element information into a fixed dimensional space in a learnable way. We recast this approach as tensor factorisation by exploiting the tensor structure of standard neighbour density based descriptors. In doing so, we form compact tensor-reduced representations whose size does not depend on the number of chemical elements, but which remain systematically convergeable and are therefore applicable to a wide range of data analysis and regression tasks.  ( 2 min )
    Ergo, SMIRK is Safe: A Safety Case for a Machine Learning Component in a Pedestrian Automatic Emergency Brake System. (arXiv:2204.07874v3 [cs.SE] UPDATED)
    Integration of Machine Learning (ML) components in critical applications introduces novel challenges for software certification and verification. New safety standards and technical guidelines are under development to support the safety of ML-based systems, e.g., ISO 21448 SOTIF for the automotive domain and the Assurance of Machine Learning for use in Autonomous Systems (AMLAS) framework. SOTIF and AMLAS provide high-level guidance but the details must be chiseled out for each specific case. We initiated a research project with the goal to demonstrate a complete safety case for an ML component in an open automotive system. This paper reports results from an industry-academia collaboration on safety assurance of SMIRK, an ML-based pedestrian automatic emergency braking demonstrator running in an industry-grade simulator. We demonstrate an application of AMLAS on SMIRK for a minimalistic operational design domain, i.e., we share a complete safety case for its integrated ML-based component. Finally, we report lessons learned and provide both SMIRK and the safety case under an open-source licence for the research community to reuse.  ( 2 min )
    A Learned Simulation Environment to Model Plant Growth in Indoor Farming. (arXiv:2212.03155v1 [eess.SY])
    We developed a simulator to quantify the effect of changes in environmental parameters on plant growth in precision farming. Our approach combines the processing of plant images with deep convolutional neural networks (CNN), growth curve modeling, and machine learning. As a result, our system is able to predict growth rates based on environmental variables, which opens the door for the development of versatile reinforcement learning agents.  ( 2 min )
    Efficient Malware Analysis Using Metric Embeddings. (arXiv:2212.02663v1 [cs.LG])
    In this paper, we explore the use of metric learning to embed Windows PE files in a low-dimensional vector space for downstream use in a variety of applications, including malware detection, family classification, and malware attribute tagging. Specifically, we enrich labeling on malicious and benign PE files using computationally expensive, disassembly-based malicious capabilities. Using these capabilities, we derive several different types of metric embeddings utilizing an embedding neural network trained via contrastive loss, Spearman rank correlation, and combinations thereof. We then examine performance on a variety of transfer tasks performed on the EMBER and SOREL datasets, demonstrating that for several tasks, low-dimensional, computationally efficient metric embeddings maintain performance with little decay, which offers the potential to quickly retrain for a variety of transfer tasks at significantly reduced storage overhead. We conclude with an examination of practical considerations for the use of our proposed embedding approach, such as robustness to adversarial evasion and introduction of task-specific auxiliary objectives to improve performance on mission critical tasks.
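    A sketch of the contrastive variant (layer sizes are illustrative, with 2381 matching the EMBER feature dimension; the Spearman-rank objective would replace the loss below): pairs of PE-file feature vectors are pulled together when they share capability labels and pushed apart otherwise.

        import torch
        import torch.nn as nn

        class Embedder(nn.Module):
            def __init__(self, d_in=2381, d_out=32):   # 2381: EMBER feature size
                super().__init__()
                self.net = nn.Sequential(nn.Linear(d_in, 256), nn.ReLU(),
                                         nn.Linear(256, d_out))

            def forward(self, x):
                return nn.functional.normalize(self.net(x), dim=-1)

        def contrastive_loss(z1, z2, same, margin=0.5):
            # Pull embeddings of same-capability pairs together; push others
            # apart until they are at least `margin` away.
            d = (z1 - z2).norm(dim=-1)
            return (same * d.pow(2) +
                    (1 - same) * (margin - d).clamp(min=0).pow(2)).mean()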
    INSPIRE: Distributed Bayesian Optimization for ImproviNg SPatIal REuse in Dense WLANs. (arXiv:2204.10184v2 [cs.NI] UPDATED)
    WLANs, which have overtaken wired networks to become the primary means of connecting devices to the Internet, are prone to performance issues due to the scarcity of space in the radio spectrum. As a response, IEEE 802.11ax and subsequent amendments aim at increasing the spatial reuse of a radio channel by allowing the dynamic update of two key parameters in wireless transmission: the transmission power (TX_POWER) and the sensitivity threshold (OBSS_PD). In this paper, we present INSPIRE, a distributed solution performing local Bayesian optimizations based on Gaussian processes to improve the spatial reuse in WLANs. INSPIRE makes no explicit assumptions about the topology of WLANs and favors altruistic behaviors of the access points, leading them to find adequate configurations of their TX_POWER and OBSS_PD parameters for the "greater good" of the WLANs. We demonstrate the superiority of INSPIRE over other state-of-the-art strategies using the ns-3 simulator and two examples inspired by real-life deployments of dense WLANs. Our results show that, in only a few seconds, INSPIRE is able to drastically increase the quality of service of operational WLANs by improving their fairness and throughput.  ( 2 min )
    Learning to Reason With Relational Abstractions. (arXiv:2210.02615v2 [cs.LG] UPDATED)
    Large language models have recently shown promising progress in mathematical reasoning when fine-tuned with human-generated sequences walking through a sequence of solution steps. However, the solution sequences are not formally structured and the resulting model-generated sequences may not reflect the kind of systematic reasoning we might expect an expert human to produce. In this paper, we study how to build stronger reasoning capability in language models using the idea of relational abstractions. We introduce new types of sequences that more explicitly provide an abstract characterization of the transitions through intermediate solution steps to the goal state. We find that models that are supplied with such sequences as prompts can solve tasks with a significantly higher accuracy, and models that are trained to produce such sequences solve problems better than those that are trained with previously used human-generated sequences and other baselines. Our work thus takes several steps toward elucidating and improving how language models perform on tasks requiring multi-step mathematical reasoning.  ( 2 min )
    QFT: Post-training quantization via fast joint finetuning of all degrees of freedom. (arXiv:2212.02634v1 [stat.ML])
    The post-training quantization (PTQ) challenge of bringing quantized neural network accuracy close to that of the original model has drawn much attention, driven by industry demand. Many of the methods emphasize optimization of a specific degree of freedom (DoF), such as the quantization step size, preconditioning factors, or bias fixing, often chained to others in multi-step solutions. Here we rethink quantized network parameterization in a HW-aware fashion, towards a unified analysis of all quantization DoF, permitting for the first time their joint end-to-end finetuning. Our single-step, simple and extendable method, dubbed quantization-aware finetuning (QFT), achieves 4-bit weight quantization results on par with SoTA within PTQ constraints of speed and resource.  ( 2 min )
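    The enabling ingredient for joint end-to-end finetuning of quantization parameters is a fake-quantization node whose degrees of freedom receive gradients through a straight-through estimator; a minimal LSQ-style sketch with a trainable step size (this illustrates the mechanism rather than QFT's exact parameterization):

        import torch
        import torch.nn as nn

        class FakeQuant(nn.Module):
            def __init__(self, init_step=0.1, n_bits=4):
                super().__init__()
                self.step = nn.Parameter(torch.tensor(init_step))  # trainable DoF
                self.qmin = -(2 ** (n_bits - 1))
                self.qmax = 2 ** (n_bits - 1) - 1

            def forward(self, x):
                q = x / self.step
                # Straight-through estimator: rounding acts as the identity in
                # the backward pass, so gradients reach both x and the step size.
                q = q + (torch.round(q) - q).detach()
                q = torch.clamp(q, self.qmin, self.qmax)
                return q * self.step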
    Concept Embedding Models: Beyond the Accuracy-Explainability Trade-Off. (arXiv:2209.09056v2 [cs.LG] UPDATED)
    Deploying AI-powered systems requires trustworthy models supporting effective human interactions, going beyond raw prediction accuracy. Concept bottleneck models promote trustworthiness by conditioning classification tasks on an intermediate level of human-like concepts. This enables human interventions which can correct mispredicted concepts to improve the model's performance. However, existing concept bottleneck models are unable to find optimal compromises between high task accuracy, robust concept-based explanations, and effective interventions on concepts -- particularly in real-world conditions where complete and accurate concept supervisions are scarce. To address this, we propose Concept Embedding Models, a novel family of concept bottleneck models which goes beyond the current accuracy-vs-interpretability trade-off by learning interpretable high-dimensional concept representations. Our experiments demonstrate that Concept Embedding Models (1) attain better or competitive task accuracy w.r.t. standard neural models without concepts, (2) provide concept representations capturing meaningful semantics including and beyond their ground truth labels, (3) support test-time concept interventions whose effect in test accuracy surpasses that in standard concept bottleneck models, and (4) scale to real-world conditions where complete concept supervisions are scarce.
    On the tightness of linear relaxation based robustness certification methods. (arXiv:2210.00178v2 [cs.LG] UPDATED)
    There has been rapid development of and interest in adversarial training and defenses in the machine learning community in recent years. One line of research focuses on improving the performance and efficiency of adversarial robustness certificates for neural networks \cite{gowal:19, wong_zico:18, raghunathan:18, WengTowardsFC:18, wong:scalable:18, singh:convex_barrier:19, Huang_etal:19, single-neuron-relax:20, Zhang2020TowardsSA}. While each provides a certificate that lower (or upper) bounds the true distortion under adversarial attacks via relaxation, the tightness of the relaxation has been less studied. In this paper, we analyze a family of linear outer approximation based certificate methods via a meta algorithm, IBP-Lin. The aforementioned works often lack quantitative analysis to answer questions such as how the performance of a certificate method depends on the network configuration and the choice of approximation parameters. Under our framework, we make a first attempt at answering these questions, which reveals that the tightness of linear approximation based certification can depend heavily on the configuration of the trained networks.
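    For reference, the simplest member of this family, plain interval bound propagation through an affine layer and a ReLU, fits in a few lines (a sketch; linear-relaxation certificates such as CROWN tighten these bounds with layer-wise linear functions):

        import numpy as np

        def ibp_affine(lo, hi, W, b):
            # Propagate an axis-aligned box through x -> Wx + b using the
            # center/radius form: the tightest box for an affine map.
            c, r = (lo + hi) / 2.0, (hi - lo) / 2.0
            c2 = W @ c + b
            r2 = np.abs(W) @ r
            return c2 - r2, c2 + r2

        def ibp_relu(lo, hi):
            # ReLU is monotone, so it maps interval endpoints to endpoints.
            return np.maximum(lo, 0.0), np.maximum(hi, 0.0)

        def certify(x, eps, W1, b1, W2, b2):
            # Bound the logits of a 2-layer net over an l_inf ball around x.
            lo, hi = x - eps, x + eps
            lo, hi = ibp_relu(*ibp_affine(lo, hi, W1, b1))
            return ibp_affine(lo, hi, W2, b2)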
    Walk These Ways: Tuning Robot Control for Generalization with Multiplicity of Behavior. (arXiv:2212.03238v1 [cs.RO])
    Learned locomotion policies can rapidly adapt to diverse environments similar to those experienced during training but lack a mechanism for fast tuning when they fail in an out-of-distribution test environment. This necessitates a slow and iterative cycle of reward and environment redesign to achieve good performance on a new task. As an alternative, we propose learning a single policy that encodes a structured family of locomotion strategies that solve training tasks in different ways, resulting in Multiplicity of Behavior (MoB). Different strategies generalize differently and can be chosen in real-time for new tasks or environments, bypassing the need for time-consuming retraining. We release a fast, robust open-source MoB locomotion controller, Walk These Ways, that can execute diverse gaits with variable footswing, posture, and speed, unlocking diverse downstream tasks: crouching, hopping, high-speed running, stair traversal, bracing against shoves, rhythmic dance, and more. Video and code release: https://gmargo11.github.io/walk-these-ways/
    Transformers for End-to-End InfoSec Tasks: A Feasibility Study. (arXiv:2212.02666v1 [cs.LG])
    In this paper, we assess the viability of transformer models in end-to-end InfoSec settings, in which no intermediate feature representations or processing steps occur outside the model. We implement transformer models for two distinct InfoSec data formats - specifically URLs and PE files - in a novel end-to-end approach, and explore a variety of architectural designs, training regimes, and experimental settings to determine the ingredients necessary for performant detection models. We show that in contrast to conventional transformers trained on more standard NLP-related tasks, our URL transformer model requires a different training approach to reach high performance levels. Specifically, we show that 1) pre-training on a massive corpus of unlabeled URL data for an auto-regressive task does not readily transfer to binary classification of malicious or benign URLs, but 2) using an auxiliary auto-regressive loss improves performance when training from scratch. We introduce a method for mixed objective optimization, which dynamically balances contributions from both loss terms so that neither one dominates. We show that this method yields quantitative evaluation metrics comparable to those of several top-performing benchmark classifiers. Unlike URLs, binary executables contain longer and more distributed sequences of information-rich bytes. To accommodate such lengthy byte sequences, we introduce additional context length into the transformer by providing its self-attention layers with an adaptive span similar to Sukhbaatar et al. We demonstrate that this approach performs comparably to well-established malware detection models on benchmark PE file datasets, but also point out the need for further exploration into model improvements in scalability and compute efficiency.
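    The abstract does not spell out the exact balancing rule, but the idea of dynamically weighting two loss terms so neither dominates can be sketched as follows; the rescaling scheme and names below are illustrative assumptions, not the paper's implementation:
    ```python
    import torch

    def mixed_objective(cls_loss: torch.Tensor, ar_loss: torch.Tensor) -> torch.Tensor:
        """Combine a classification loss with an auxiliary auto-regressive loss,
        rescaling the auxiliary term so neither loss dominates the gradient."""
        # Detach the ratio so the balancing weight itself receives no gradient.
        weight = (cls_loss / (ar_loss + 1e-8)).detach()
        return cls_loss + weight * ar_loss
    ```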
    Benchmarking AutoML algorithms on a collection of binary problems. (arXiv:2212.02704v1 [cs.LG])
    Automated machine learning (AutoML) algorithms have grown in popularity due to their high performance and flexibility to adapt to different problems and data sets. With the increasing number of AutoML algorithms, deciding which one best suits a given problem becomes increasingly difficult. Therefore, it is essential to use complex and challenging benchmarks that can differentiate AutoML algorithms from each other. This paper compares the performance of four different AutoML algorithms: Tree-based Pipeline Optimization Tool (TPOT), Auto-Sklearn, Auto-Sklearn 2, and H2O AutoML. We use the Diverse and Generative ML benchmark (DIGEN), a diverse set of synthetic datasets derived from generative functions designed to highlight the strengths and weaknesses of common machine learning algorithms. We confirm that AutoML can identify pipelines that perform well on all included datasets. Most AutoML algorithms performed similarly without much room for improvement; however, some were more consistent than others at finding high-performing solutions for some datasets.
    NVIDIA FLARE: Federated Learning from Simulation to Real-World. (arXiv:2210.13291v2 [cs.LG] UPDATED)
    Federated learning (FL) enables the building of robust and generalizable AI models by leveraging diverse datasets from multiple collaborators without centralizing the data. We created NVIDIA FLARE as an open-source software development kit (SDK) to make it easier for data scientists to use FL in their research and real-world applications. The SDK includes solutions for state-of-the-art FL algorithms and federated machine learning approaches, which facilitate building workflows for distributed learning across enterprises and enable platform developers to create a secure, privacy-preserving offering for multiparty collaboration utilizing homomorphic encryption or differential privacy. The SDK is a lightweight, flexible, and scalable Python package, and allows researchers to bring their data science workflows implemented in any training libraries (PyTorch, TensorFlow, XGBoost, or even NumPy) and apply them in real-world FL settings. This paper introduces the key design principles of FLARE and illustrates some use cases (e.g., COVID analysis) with customizable FL workflows that implement different privacy-preserving algorithms. Code is available at https://github.com/NVIDIA/NVFlare.
    Explainability as statistical inference. (arXiv:2212.03131v1 [cs.LG])
    A wide variety of model explanation approaches have been proposed in recent years, all guided by very different rationales and heuristics. In this paper, we take a new route and cast interpretability as a statistical inference problem. We propose a general deep probabilistic model designed to produce interpretable predictions. The model parameters can be learned via maximum likelihood, and the method can be adapted to any predictor network architecture and any type of prediction problem. Our method is a case of amortized interpretability models, where a neural network is used as a selector to allow for fast interpretation at inference time. Several popular interpretability methods are shown to be particular cases of regularised maximum likelihood for our general model. We propose new datasets with ground-truth selection which allow for the evaluation of feature importance maps. Using these datasets, we show experimentally that using multiple imputation provides more reasonable interpretations.
    Diffusion Models Beat GANs on Topology Optimization. (arXiv:2208.09591v2 [cs.LG] UPDATED)
    Structural topology optimization, which aims to find the optimal physical structure that maximizes mechanical performance, is vital in engineering design applications in aerospace, mechanical, and civil engineering. Generative adversarial networks (GANs) have recently emerged as a popular alternative to traditional iterative topology optimization methods. However, these models are often difficult to train, have limited generalizability, and, due to their goal of mimicking optimal structures, neglect manufacturability and performance objectives like mechanical compliance. We propose TopoDiff - a conditional diffusion-model-based architecture to perform performance-aware and manufacturability-aware topology optimization that overcomes these issues. Our model introduces a surrogate model-based guidance strategy that actively favors structures with low compliance and good manufacturability. Our method significantly outperforms a state-of-the-art conditional GAN by reducing the average error on physical performance by a factor of eight and by producing eleven times fewer infeasible samples. By introducing diffusion models to topology optimization, we show that conditional diffusion models have the ability to outperform GANs in engineering design synthesis applications too. Our work also suggests a general framework for engineering optimization problems using diffusion models with external performance- and constraint-aware guidance. We publicly share the data, code, and trained models here: https://decode.mit.edu/projects/topodiff/.  ( 2 min )
    CARD: Classification and Regression Diffusion Models. (arXiv:2206.07275v4 [stat.ML] UPDATED)
    Learning the distribution of a continuous or categorical response variable $\boldsymbol y$ given its covariates $\boldsymbol x$ is a fundamental problem in statistics and machine learning. Deep neural network-based supervised learning algorithms have made great progress in predicting the mean of $\boldsymbol y$ given $\boldsymbol x$, but they are often criticized for their inability to accurately capture the uncertainty of their predictions. In this paper, we introduce classification and regression diffusion (CARD) models, which combine a denoising diffusion-based conditional generative model and a pre-trained conditional mean estimator, to accurately predict the distribution of $\boldsymbol y$ given $\boldsymbol x$. We demonstrate the outstanding ability of CARD in conditional distribution prediction on both toy examples and real-world datasets; the experimental results show that CARD generally outperforms state-of-the-art methods, including Bayesian neural network-based ones designed for uncertainty estimation, especially when the conditional distribution of $\boldsymbol y$ given $\boldsymbol x$ is multi-modal. In addition, we utilize the stochastic nature of the generative model outputs to obtain a finer granularity in model confidence assessment at the instance level for classification tasks.
    Which products activate a product? An explainable machine learning approach. (arXiv:2212.03094v1 [econ.GN])
    Tree-based machine learning algorithms provide the most precise assessment of the feasibility for a country to export a target product given its export basket. However, the high number of parameters involved prevents a straightforward interpretation of the results and, in turn, the explainability of policy indications. In this paper, we propose a procedure to statistically validate the importance of the products used in the feasibility assessment. In this way, we are able to identify which products, called explainers, significantly increase the probability of exporting a target product in the near future. The explainers naturally identify a low-dimensional representation, the Feature Importance Product Space, that enhances the interpretability of the recommendations and provides out-of-sample forecasts of the export baskets of countries. Interestingly, we detect a positive correlation between the complexity of a product and the complexity of its explainers.
    Fast and Precise: Adjusting Planning Horizon with Adaptive Subgoal Search. (arXiv:2206.00702v4 [cs.AI] UPDATED)
    Complex reasoning problems contain states that vary in the computational cost required to determine a good action plan. Taking advantage of this property, we propose Adaptive Subgoal Search (AdaSubS), a search method that adaptively adjusts the planning horizon. To this end, AdaSubS generates diverse sets of subgoals at different distances. A verification mechanism is employed to filter out unreachable subgoals swiftly, allowing the search to focus on feasible subgoals further ahead. In this way, AdaSubS benefits from the efficiency of planning with longer subgoals and the fine control afforded by shorter ones, and thus scales well to difficult planning problems. We show that AdaSubS significantly surpasses hierarchical planning algorithms on three complex reasoning tasks: Sokoban, the Rubik's Cube, and the inequality-proving benchmark INT.
    Domain Adaptation and Generalization on Functional Medical Images: A Systematic Survey. (arXiv:2212.03176v1 [eess.IV])
    Machine learning algorithms have revolutionized different fields, including natural language processing, computer vision, signal processing, and medical data processing. Despite the excellent capabilities of machine learning algorithms in various tasks and areas, the performance of these models often deteriorates when there is a shift between the test and training data distributions. This gap occurs due to the violation of the fundamental assumption that the training and test data are independent and identically distributed (i.i.d). In real-world scenarios where collecting data from all possible domains for training is costly and even impossible, the i.i.d assumption can hardly be satisfied. The problem is even more severe in the case of medical images and signals because collecting data requires either expensive equipment or a meticulous experimental setup, even for a single domain. Additionally, the decrease in performance may have severe consequences in the analysis of medical records. As a result of such problems, the ability to generalize and adapt under distribution shifts (domain generalization (DG) and domain adaptation (DA)) is essential for the analysis of medical data. This paper provides the first systematic review of DG and DA on functional brain signals, filling the gap left by the absence of a comprehensive study in this area. We provide detailed explanations and categorizations of datasets, approaches, and architectures used in DG and DA on functional brain images. We further highlight promising future directions in this field.  ( 2 min )
    A Comprehensively Improved Hybrid Algorithm for Learning Bayesian Networks: Multiple Compound Memory Erasing. (arXiv:2212.03103v1 [cs.LG])
    Using Bayesian networks to analyze causal relationships between nodes is an active research area. Existing network learning algorithms are mainly constraint-based and score-based network generation methods. Constraint-based methods mainly rely on conditional independence (CI) tests, but the inaccuracy of CI tests under high dimensionality and small sample sizes has always been a problem for them. Score-based methods use a scoring function and a search strategy to find the optimal candidate network structure, but the search space grows rapidly with the number of nodes, making learning very inefficient. This paper presents a new hybrid algorithm, MCME (multiple compound memory erasing). MCME retains the advantages of the two previous approaches, mitigates the shortcomings of CI tests, and introduces a new scoring function for the direction-discrimination stage. A large number of experiments show that MCME performs better than or similarly to several existing algorithms.  ( 2 min )
    A Learning Based Hypothesis Test for Harmful Covariate Shift. (arXiv:2212.02742v1 [cs.LG])
    The ability to quickly and accurately identify covariate shift at test time is a critical and often overlooked component of safe machine learning systems deployed in high-risk domains. While methods exist for detecting when predictions should not be made on out-of-distribution test examples, identifying distributional level differences between training and test time can help determine when a model should be removed from the deployment setting and retrained. In this work, we define harmful covariate shift (HCS) as a change in distribution that may weaken the generalization of a predictive model. To detect HCS, we use the discordance between an ensemble of classifiers trained to agree on training data and disagree on test data. We derive a loss function for training this ensemble and show that the disagreement rate and entropy represent powerful discriminative statistics for HCS. Empirically, we demonstrate the ability of our method to detect harmful covariate shift with statistical certainty on a variety of high-dimensional datasets. Across numerous domains and modalities, we show state-of-the-art performance compared to existing methods, particularly when the number of observed test samples is small.  ( 2 min )
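    Both statistics mentioned above are straightforward to compute from ensemble outputs. A minimal sketch, assuming predicted class probabilities are available for each ensemble member (the function and array names are our own, not the paper's):
    ```python
    import numpy as np

    def disagreement_statistics(probs: np.ndarray):
        """probs: array of shape (n_models, n_samples, n_classes) holding each
        ensemble member's predicted class probabilities on the test batch."""
        preds = probs.argmax(axis=-1)  # (n_models, n_samples) hard predictions
        n_models = preds.shape[0]
        # Disagreement rate: fraction of model pairs predicting different labels.
        pair_disagree = [
            (preds[i] != preds[j]).mean()
            for i in range(n_models) for j in range(i + 1, n_models)
        ]
        disagreement_rate = float(np.mean(pair_disagree))
        # Entropy of the ensemble-averaged predictive distribution.
        mean_probs = probs.mean(axis=0)  # (n_samples, n_classes)
        entropy = float(-(mean_probs * np.log(mean_probs + 1e-12)).sum(axis=-1).mean())
        return disagreement_rate, entropy
    ```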
    Yggdrasil Decision Forests: A Fast and Extensible Decision Forests Library. (arXiv:2212.02934v1 [cs.LG])
    Yggdrasil Decision Forests is a library for the training, serving and interpretation of decision forest models, targeted both at research and production work, implemented in C++, and available in C++, command line interface, Python (under the name TensorFlow Decision Forests), JavaScript, and Go. The library has been developed organically since 2018 following a set of four design principles applicable to machine learning libraries and frameworks: simplicity of use, safety of use, modularity and high-level abstraction, and integration with other machine learning libraries. In this paper, we describe those principles in detail and present how they have been used to guide the design of the library. We then showcase the use of our library on a set of classical machine learning problems. Finally, we report a benchmark comparing our library to related solutions.  ( 2 min )
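    As a rough illustration, the Python entry point (distributed as TensorFlow Decision Forests) can train a model in a few lines; the CSV file and "label" column below are hypothetical, and the exact API should be checked against the library's documentation:
    ```python
    import pandas as pd
    import tensorflow_decision_forests as tfdf  # Python binding of Yggdrasil

    # Hypothetical CSV with a categorical "label" column.
    train_df = pd.read_csv("train.csv")
    train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(train_df, label="label")

    # Gradient boosted trees with default hyperparameters; no compile() needed.
    model = tfdf.keras.GradientBoostedTreesModel()
    model.fit(train_ds)
    model.summary()  # prints per-feature importances, tree statistics, etc.
    ```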
    A Strongly Polynomial Algorithm for Approximate Forster Transforms and its Application to Halfspace Learning. (arXiv:2212.03008v1 [cs.DS])
    The Forster transform is a method of regularizing a dataset by placing it in {\em radial isotropic position} while maintaining some of its essential properties. Forster transforms have played a key role in a diverse range of settings spanning computer science and functional analysis. Prior work had given {\em weakly} polynomial time algorithms for computing Forster transforms, when they exist. Our main result is the first {\em strongly polynomial time} algorithm to compute an approximate Forster transform of a given dataset or certify that no such transformation exists. By leveraging our strongly polynomial Forster algorithm, we obtain the first strongly polynomial time algorithm for {\em distribution-free} PAC learning of halfspaces. This learning result is surprising because {\em proper} PAC learning of halfspaces is {\em equivalent} to linear programming. Our learning approach extends to give a strongly polynomial halfspace learner in the presence of random classification noise and, more generally, Massart noise.  ( 2 min )
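    For intuition, radial isotropic position can be approached with a simple fixed-point iteration: normalize the transformed points to the unit sphere, measure their second moment, and rescale towards isotropy. The sketch below only illustrates the target condition; it is not the paper's strongly polynomial algorithm, and it may fail when no Forster transform exists:
    ```python
    import numpy as np

    def approximate_forster(X: np.ndarray, iters: int = 200, tol: float = 1e-8):
        """Heuristic fixed-point iteration towards radial isotropic position.
        X: (n, d) dataset. Returns a transform A such that the normalized
        points A @ x / ||A @ x|| have second moment close to (1/d) * I."""
        n, d = X.shape
        A = np.eye(d)
        for _ in range(iters):
            Y = X @ A.T
            Y /= np.linalg.norm(Y, axis=1, keepdims=True)  # project to unit sphere
            M = (Y.T @ Y) / n                              # second moment matrix
            if np.linalg.norm(M - np.eye(d) / d) < tol:
                break
            # Left-multiply by (d * M)^{-1/2} to push the moment towards (1/d) * I.
            w, V = np.linalg.eigh(M)
            A = V @ np.diag(1.0 / np.sqrt(d * w)) @ V.T @ A
        return A
    ```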
    Straggler-Resilient Differentially-Private Decentralized Learning. (arXiv:2212.03080v1 [cs.LG])
    We consider the straggler problem in decentralized learning over a logical ring while preserving user data privacy. Specifically, we extend the recently proposed framework of differential privacy (DP) amplification by decentralization by Cyffers and Bellet to include overall training latency, comprising both computation and communication latency. Analytical results on both the convergence speed and the DP level are derived for a skipping scheme (which ignores stragglers after a timeout) and for a baseline scheme that waits for each node to finish before training continues. A trade-off between overall training latency, accuracy, and privacy, parameterized by the timeout of the skipping scheme, is identified and empirically validated for logistic regression on a real-world dataset.  ( 2 min )
    The AI Definition and a Program Which Satisfies this Definition. (arXiv:2212.03184v1 [cs.AI])
    We will consider all policies of the agent and will prove that one of them is the best performing policy. While that policy is not computable, computable policies do exist in its proximity. We will define AI as a computable policy which is sufficiently proximal to the best performing policy. Before we can define the agent's best performing policy, we need a language for description of the world. We will also use this language to develop a program which satisfies the AI definition. The program will first understand the world by describing it in the selected language. The program will then use the description in order to predict the future and select the best possible move. While this program is extremely inefficient and practically unusable, it can be improved by refining both the language for description of the world and the algorithm used to predict the future. This can yield a program which is both efficient and consistent with the AI definition.  ( 2 min )
    A comparative study of emotion recognition methods using facial expressions. (arXiv:2212.03102v1 [cs.CV])
    Understanding the facial expressions of our interlocutor is important to enrich communication and give it a depth that goes beyond what is explicitly expressed. In fact, studying one's facial expressions gives insight into their hidden emotional state. However, even as humans, and despite our empathy and familiarity with the human emotional experience, we can only guess what another person might be feeling. In the fields of artificial intelligence and computer vision, Facial Emotion Recognition (FER) remains a rapidly growing topic, driven largely by advances in deep learning and improvements in data collection. The main purpose of this paper is to compare the performance of three state-of-the-art networks, each taking its own approach to improving FER, on three FER datasets. The first and second sections respectively describe the three datasets and the three studied network architectures designed for an FER task. The experimental protocol, the results, and their interpretation are outlined in the remaining sections.  ( 2 min )
    Tackling Data Heterogeneity in Federated Learning with Class Prototypes. (arXiv:2212.02758v1 [cs.LG])
    Data heterogeneity across clients in federated learning (FL) settings is a widely acknowledged challenge. In response, personalized federated learning (PFL) emerged as a framework to curate local models for clients' tasks. In PFL, a common strategy is to develop local and global models jointly - the global model (for generalization) informs the local models, and the local models (for personalization) are aggregated to update the global model. A key observation is that if we can improve the generalization ability of local models, then we can improve the generalization of global models, which in turn builds better personalized models. In this work, we consider class imbalance, an overlooked type of data heterogeneity, in the classification setting. We propose FedNH, a novel method that improves the local models' performance for both personalization and generalization by combining the uniformity and semantics of class prototypes. FedNH initially distributes class prototypes uniformly in the latent space and smoothly infuses the class semantics into class prototypes. We show that imposing uniformity helps to combat prototype collapse while infusing class semantics improves local models. Extensive experiments were conducted on popular classification datasets under the cross-device setting. Our results demonstrate the effectiveness and stability of our method over recent works.  ( 2 min )
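    One simple way to realize the "uniformly distributed prototypes" idea is to push prototypes apart on the unit hypersphere; the construction below is an illustrative assumption, not necessarily FedNH's exact initialization:
    ```python
    import torch
    import torch.nn.functional as F

    def uniform_prototypes(num_classes: int, dim: int, steps: int = 1000) -> torch.Tensor:
        """Spread class prototypes roughly uniformly on the unit hypersphere by
        pushing each prototype away from its nearest neighbour."""
        protos = torch.randn(num_classes, dim, requires_grad=True)
        opt = torch.optim.SGD([protos], lr=0.1)
        for _ in range(steps):
            p = F.normalize(protos, dim=1)
            # Cosine similarities; subtract 2 on the diagonal to ignore self-pairs.
            sim = p @ p.T - 2.0 * torch.eye(num_classes)
            loss = sim.max(dim=1).values.mean()  # penalize each class's closest neighbour
            opt.zero_grad()
            loss.backward()
            opt.step()
        return F.normalize(protos.detach(), dim=1)
    ```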
    Correlation detection in trees for planted graph alignment. (arXiv:2107.07623v4 [cs.DS] UPDATED)
    Motivated by alignment of correlated sparse random graphs, we introduce a hypothesis testing problem of deciding whether or not two random trees are correlated. We obtain sufficient conditions under which this testing is impossible or feasible. We propose MPAlign, a message-passing algorithm for graph alignment inspired by the tree correlation detection problem. We prove MPAlign to succeed in polynomial time at partial alignment whenever tree detection is feasible. As a result our analysis of tree detection reveals new ranges of parameters for which partial alignment of sparse random graphs is feasible in polynomial time. We then conjecture that graph alignment is not feasible in polynomial time when the associated tree detection problem is impossible. If true, this conjecture together with our sufficient conditions on tree detection impossibility would imply the existence of a hard phase for graph alignment, i.e. a parameter range where alignment cannot be done in polynomial time even though it is known to be feasible in non-polynomial time.  ( 2 min )
    SODA: A Natural Language Processing Package to Extract Social Determinants of Health for Cancer Studies. (arXiv:2212.03000v1 [cs.CL])
    Objective: We aim to develop an open-source natural language processing (NLP) package, SODA (i.e., SOcial DeterminAnts), with pre-trained transformer models to extract social determinants of health (SDoH) for cancer patients, examine the generalizability of SODA to a new disease domain (i.e., opioid use), and evaluate the extraction rate of SDoH in cancer populations. Methods: We identified SDoH categories and attributes and developed an SDoH corpus using clinical notes from a general cancer cohort. We compared four transformer-based NLP models to extract SDoH, examined the generalizability of NLP models to a cohort of patients prescribed opioids, and explored customization strategies to improve performance. We applied the best NLP model to extract 19 categories of SDoH from the breast (n=7,971), lung (n=11,804), and colorectal cancer (n=6,240) cohorts. Results and Conclusion: We developed a corpus of 629 cancer patients' notes with annotations of 13,193 SDoH concepts/attributes from 19 categories of SDoH. The Bidirectional Encoder Representations from Transformers (BERT) model achieved the best strict/lenient F1 scores of 0.9216 and 0.9441 for SDoH concept extraction, and 0.9617 and 0.9626 for linking attributes to SDoH concepts. Fine-tuning the NLP models using new annotations from opioid use patients improved the strict/lenient F1 scores from 0.8172/0.8502 to 0.8312/0.8679. The extraction rates among the 19 categories of SDoH varied greatly: 10 SDoH could be extracted from >70% of cancer patients, but 9 SDoH had a low extraction rate (<70% of cancer patients). The SODA package with pre-trained transformer models is publicly available at https://github.com/uf-hobiinformatics-lab/SDoH_SODA.  ( 2 min )
    Evaluation of particle motions in stabilized specimens of transparent sand using deep learning segmentation. (arXiv:2212.02939v1 [physics.geo-ph])
    Individual particle rotation and displacement were measured in triaxial tests on transparent sand stabilized with geogrid simulants. The Cellpose U-Net model, originally developed to segment biological cells, was trained to segment images of fused quartz particles. The Score-CAM metric from the field of Explainable AI was used to validate the application of Cellpose to segmenting particles of fused quartz. These segmented particles were characterized in terms of Fourier shape descriptors and tracked across images. The measured particle displacements in the monotonic triaxial tests correlated with displacement fields from Digital Image Correlation (DIC). In contrast to DIC, the new technique also allows for the measurement of individual particle rotation. The particle rotation measurements were found to be repeatable across different specimens. A state boundary line between probable and improbable particle motions could be identified for a given test based on the measured particle displacements and rotations. The size of the zone of probable motions was used to quantify the effectiveness of the stabilizing inclusions. The results of repeated load tests revealed that the honeycomb inclusions stabilized the specimens by reducing both particle displacements and rotations.  ( 2 min )
    Interdisciplinary Discovery of Nanomaterials Based on Convolutional Neural Networks. (arXiv:2212.02805v1 [cond-mat.mtrl-sci])
    The materials science literature contains up-to-date and comprehensive scientific knowledge of materials. However, its content is unstructured and diverse, resulting in a significant gap in providing sufficient information for material design and synthesis. To this end, we used natural language processing (NLP) and computer vision (CV) techniques based on convolutional neural networks (CNNs) to discover valuable experiment-based information about nanomaterials and synthesis methods in energy-material-related publications. Our first system, TextMaster, extracts opinions from texts and classifies them into challenges and opportunities, achieving 94% and 92% accuracy, respectively. Our second system, GraphMaster, extracts data from tables and figures in publications with 98.3% classification accuracy and a 4.3% data-extraction mean square error. Our results show that these systems could assess the suitability of materials for a certain application by evaluating synthesis insights and case analyses with detailed references. This work offers a fresh perspective on mining knowledge from the scientific literature, providing a wide swath of opportunities to accelerate nanomaterial research through CNNs.  ( 2 min )
    COmic: Convolutional Kernel Networks for Interpretable End-to-End Learning on (Multi-)Omics Data. (arXiv:2212.02504v1 [q-bio.QM])
    Motivation: The size of available omics datasets is steadily increasing with technological advancement in recent years. While this increase in sample size can be used to improve the performance of relevant prediction tasks in healthcare, models that are optimized for large datasets usually operate as black boxes. In high stakes scenarios, like healthcare, using a black-box model poses safety and security issues. Without an explanation about molecular factors and phenotypes that affected the prediction, healthcare providers are left with no choice but to blindly trust the models. We propose a new type of artificial neural networks, named Convolutional Omics Kernel Networks (COmic). By combining convolutional kernel networks with pathway-induced kernels, our method enables robust and interpretable end-to-end learning on omics datasets ranging in size from a few hundred to several hundreds of thousands of samples. Furthermore, COmic can be easily adapted to utilize multi-omics data. Results: We evaluate the performance capabilities of COmic on six different breast cancer cohorts. Additionally, we train COmic models on multi-omics data using the METABRIC cohort. Our models perform either better or similar to competitors on both tasks. We show how the use of pathway-induced Laplacian kernels opens the black-box nature of neural networks and results in intrinsically interpretable models that eliminate the need for \textit{post-hoc} explanation models.  ( 2 min )
    A Trustworthy Framework for Medical Image Analysis with Deep Learning. (arXiv:2212.02764v1 [eess.IV])
    Computer vision and machine learning are playing an increasingly important role in computer-assisted diagnosis; however, the application of deep learning to medical imaging has challenges in data availability and data imbalance, and it is especially important that models for medical imaging are built to be trustworthy. Therefore, we propose TRUDLMIA, a trustworthy deep learning framework for medical image analysis, which adopts a modular design, leverages self-supervised pre-training, and utilizes a novel surrogate loss function. Experimental evaluations indicate that models generated from the framework are both trustworthy and high-performing. It is anticipated that the framework will support researchers and clinicians in advancing the use of deep learning for dealing with public health crises including COVID-19.  ( 2 min )
    Trajectory Flow Map: Graph-based Approach to Analysing Temporal Evolution of Aggregated Traffic Flows in Large-scale Urban Networks. (arXiv:2212.02927v1 [cs.LG])
    This paper proposes a graph-based approach to representing spatio-temporal trajectory data that allows an effective visualization and characterization of city-wide traffic dynamics. With the advance of sensor, mobile, and Internet of Things (IoT) technologies, vehicle and passenger trajectories are being increasingly collected on a massive scale and are becoming a critical source of insight into traffic pattern and traveller behaviour. To leverage such trajectory data to better understand traffic dynamics in a large-scale urban network, this study develops a trajectory-based network traffic analysis method that converts individual trajectory data into a sequence of graphs that evolve over time (known as dynamic graphs or time-evolving graphs) and analyses network-wide traffic patterns in terms of a compact and informative graph-representation of aggregated traffic flows. First, we partition the entire network into a set of cells based on the spatial distribution of data points in individual trajectories, where the cells represent spatial regions between which aggregated traffic flows can be measured. Next, dynamic flows of moving objects are represented as a time-evolving graph, where regions are graph vertices and flows between them are treated as weighted directed edges. Given a fixed set of vertices, edges can be inserted or removed at every time step depending on the presence of traffic flows between two regions at a given time window. Once a dynamic graph is built, we apply graph mining algorithms to detect change-points in time, which represent time points where the graph exhibits significant changes in its overall structure and, thus, correspond to change-points in city-wide mobility pattern throughout the day (e.g., global transition points between peak and off-peak periods).  ( 3 min )
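    The conversion from trajectories to a sequence of weighted directed graphs can be sketched compactly; the data layout below (time-stamped cell visits per trajectory) is an assumed representation, not the paper's exact pipeline:
    ```python
    import networkx as nx
    from collections import defaultdict

    def build_flow_graphs(trajectories, windows):
        """trajectories: iterable of [(t, cell_id), ...] sequences, where cell_id
        is the spatial region a data point falls in. windows: list of (t0, t1)
        time windows. Returns one weighted directed flow graph per window."""
        graphs = []
        for t0, t1 in windows:
            flows = defaultdict(int)
            for traj in trajectories:
                in_win = [c for t, c in traj if t0 <= t < t1]
                for a, b in zip(in_win, in_win[1:]):
                    if a != b:                      # count inter-region moves only
                        flows[(a, b)] += 1
            g = nx.DiGraph()
            for (a, b), w in flows.items():
                g.add_edge(a, b, weight=w)          # regions as vertices, flows as edges
            graphs.append(g)
        return graphs
    ```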
    Safe Inverse Reinforcement Learning via Control Barrier Function. (arXiv:2212.02753v1 [cs.RO])
    Learning from Demonstration (LfD) is a powerful method for enabling robots to perform novel tasks, as it is often more tractable for a non-roboticist end-user to demonstrate the desired skill and for the robot to efficiently learn from the associated data than for a human to engineer a reward function for the robot to learn the skill via reinforcement learning (RL). Safety issues arise in modern LfD techniques, e.g., Inverse Reinforcement Learning (IRL), just as they do for RL; yet, safe learning in LfD has received little attention. In the context of agile robots, safety is especially vital due to the possibility of robot-environment collision, robot-human collision, and damage to the robot. In this paper, we propose a safe IRL framework, CBFIRL, that leverages the Control Barrier Function (CBF) to enhance the safety of the IRL policy. The core idea of CBFIRL is to combine a loss function inspired by CBF requirements with the objective in an IRL method, both of which are jointly optimized via gradient descent. In the experiments, we show that our framework performs more safely than IRL methods without a CBF, yielding improvements of $\sim15\%$ and $\sim20\%$ on two difficulty levels of a 2D racecar domain and $\sim50\%$ on a 3D drone domain.  ( 2 min )
    PRISM: Probabilistic Real-Time Inference in Spatial World Models. (arXiv:2212.02988v1 [cs.LG])
    We introduce PRISM, a method for real-time filtering in a probabilistic generative model of agent motion and visual perception. Previous approaches either lack uncertainty estimates for the map and agent state, do not run in real-time, do not have a dense scene representation or do not model agent dynamics. Our solution reconciles all of these aspects. We start from a predefined state-space model which combines differentiable rendering and 6-DoF dynamics. Probabilistic inference in this model amounts to simultaneous localisation and mapping (SLAM) and is intractable. We use a series of approximations to Bayesian inference to arrive at probabilistic map and state estimates. We take advantage of well-established methods and closed-form updates, preserving accuracy and enabling real-time capability. The proposed solution runs at 10Hz real-time and is similarly accurate to state-of-the-art SLAM in small to medium-sized indoor environments, with high-speed UAV and handheld camera agents (Blackbird, EuRoC and TUM-RGBD).  ( 2 min )
    Improved Beam Search for Hallucination Mitigation in Abstractive Summarization. (arXiv:2212.02712v1 [cs.CL])
    Advancement in large pretrained language models has significantly improved their performance for conditional language generation tasks including summarization albeit with hallucinations. To reduce hallucinations, conventional methods proposed improving beam search or using a fact checker as a postprocessing step. In this paper, we investigate the use of the Natural Language Inference (NLI) entailment metric to detect and prevent hallucinations in summary generation. We propose an NLI-assisted beam re-ranking mechanism by computing entailment probability scores between the input context and summarization model-generated beams during saliency-enhanced greedy decoding. Moreover, a diversity metric is introduced to compare its effectiveness against vanilla beam search. Our proposed algorithm significantly outperforms vanilla beam decoding on XSum and CNN/DM datasets.  ( 2 min )
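    The re-ranking step can be approximated with any off-the-shelf NLI model; the sketch below scores each beam by its entailment probability given the source. The model choice and scoring rule are our assumptions, not the paper's exact mechanism:
    ```python
    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    # Any off-the-shelf NLI model works here; roberta-large-mnli is one choice.
    tok = AutoTokenizer.from_pretrained("roberta-large-mnli")
    nli = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

    def rerank_beams(source: str, beams: list[str]) -> list[str]:
        """Re-rank candidate summaries by P(entailment | source, beam)."""
        scores = []
        with torch.no_grad():
            for beam in beams:
                inputs = tok(source, beam, return_tensors="pt", truncation=True)
                probs = nli(**inputs).logits.softmax(dim=-1)[0]
                scores.append(probs[2].item())  # index 2 = "entailment" for MNLI
        return [b for _, b in sorted(zip(scores, beams), reverse=True)]
    ```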
    BALPA: A Balanced Primal-Dual Algorithm for Nonsmooth Optimization with Application to Distributed Optimization. (arXiv:2212.02835v1 [math.OC])
    In this paper, we propose a novel primal-dual proximal splitting algorithm (PD-PSA), named BALPA, for the composite optimization problem with equality constraints, where the loss function consists of a smooth term and a nonsmooth term composed with a linear mapping. In BALPA, the dual update is designed as a proximal point for a time-varying quadratic function, which balances the primal and dual updates and retains the proximity-induced feature of classic PD-PSAs. In addition, through this balance, BALPA eliminates the inefficiency of classic PD-PSAs for composite optimization problems in which the Euclidean norm of the linear mapping or the equality constraint mapping is large. Therefore, BALPA not only inherits the advantages of simple structure and easy implementation of classic PD-PSAs but also ensures fast convergence when these norms are large. Moreover, we propose a stochastic version of BALPA (S-BALPA) and apply the developed BALPA to distributed optimization to devise a new distributed optimization algorithm. Furthermore, comprehensive convergence analyses are conducted for BALPA and S-BALPA. Finally, numerical experiments demonstrate the efficiency of the proposed algorithms.  ( 2 min )
    Efficient Learning of Voltage Control Strategies via Model-based Deep Reinforcement Learning. (arXiv:2212.02715v1 [eess.SY])
    This article proposes a model-based deep reinforcement learning (DRL) method to design emergency control strategies for short-term voltage stability problems in power systems. Recent advances show promising results for model-free DRL-based methods in power systems, but model-free methods suffer from poor sample efficiency and long training times, both critical for making state-of-the-art DRL algorithms practically applicable. A DRL agent learns an optimal policy via trial and error while interacting with the real-world environment, and it is desirable to minimize the direct interaction of the DRL agent with the real-world power grid due to its safety-critical nature. Additionally, state-of-the-art DRL-based policies are mostly trained using a physics-based grid simulator in which dynamic simulation is computationally intensive, lowering the training efficiency. We propose a novel model-based DRL framework in which a deep neural network (DNN)-based dynamic surrogate model, instead of a real-world power grid or physics-based simulation, is utilized within the policy learning framework, making the process faster and more sample efficient. However, stabilizing model-based DRL is challenging because of the complex system dynamics of large-scale power systems. We address these issues by incorporating imitation learning for a warm start in policy learning, reward shaping, and a multi-step surrogate loss. Finally, we achieved 97.5% sample efficiency and 87.7% training efficiency for an application to the IEEE 300-bus test system.  ( 2 min )
    Leveraging Different Learning Styles for Improved Knowledge Distillation. (arXiv:2212.02931v1 [cs.CV])
    Learning style refers to a type of training mechanism adopted by an individual to gain new knowledge. As suggested by the VARK model, humans have different learning preferences like visual, auditory, etc., for acquiring and effectively processing information. Inspired by this concept, our work explores the idea of mixed information sharing with model compression in the context of Knowledge Distillation (KD) and Mutual Learning (ML). Unlike conventional techniques that share the same type of knowledge with all networks, we propose to train individual networks with different forms of information to enhance the learning process. We formulate a combined KD and ML framework with one teacher and two student networks that share or exchange information in the form of predictions and feature maps. Our comprehensive experiments with benchmark classification and segmentation datasets demonstrate that with 15% compression, the ensemble performance of networks trained with diverse forms of knowledge outperforms the conventional techniques both quantitatively and qualitatively.  ( 2 min )
    Towards a more efficient computation of individual attribute and policy contribution for post-hoc explanation of cooperative multi-agent systems using Myerson values. (arXiv:2212.03041v1 [cs.AI])
    A quantitative assessment of the global importance of an agent in a team is as valuable as gold for strategists, decision-makers, and sports coaches. Yet, retrieving this information is not trivial, since in a cooperative task it is hard to isolate the performance of an individual from that of the whole team. Moreover, the relationship between an agent's role and its personal attributes is not always clear. In this work we conceive an application of Shapley analysis for studying the contribution of both agent policies and attributes, putting them on equal footing. Since the computation is NP-hard and scales exponentially with the number of participants in a transferable-utility coalitional game, we resort to exploiting a priori knowledge about the rules of the game to constrain the relations between the participants over a graph. We hence propose a method to determine a Hierarchical Knowledge Graph of agents' policies and features in a Multi-Agent System. Assuming a simulator of the system is available, the graph structure allows us to exploit dynamic programming to assess the importances much faster. We test the proposed approach in a proof-of-concept environment deploying both hardcoded policies and policies obtained via Deep Reinforcement Learning. The proposed paradigm is less computationally demanding than naively computing the Shapley values and provides great insight not only into the importance of an agent in a team but also into the attributes needed to deploy the policy at its best.  ( 2 min )
    Multi-Layer Personalized Federated Learning for Mitigating Biases in Student Predictive Analytics. (arXiv:2212.02985v1 [cs.LG])
    Traditional learning-based approaches to student modeling (e.g., predicting grades based on measured activities) generalize poorly to underrepresented/minority student groups due to biases in data availability. In this paper, we propose a Multi-Layer Personalized Federated Learning (MLPFL) methodology which optimizes inference accuracy over different layers of student grouping criteria, such as by course and by demographic subgroups within each course. In our approach, personalized models for individual student subgroups are derived from a global model, which is trained in a distributed fashion via meta-gradient updates that account for subgroup heterogeneity while preserving modeling commonalities that exist across the full dataset. To evaluate our methodology, we consider case studies of two popular downstream student modeling tasks, knowledge tracing and outcome prediction, which leverage multiple modalities of student behavior (e.g., visits to lecture videos and participation on forums) in model training. Experiments on three real-world datasets from online courses demonstrate that our approach obtains substantial improvements over existing student modeling baselines in terms of increasing the average and decreasing the variance of prediction quality across different student subgroups. Visual analysis of the resulting students' knowledge state embeddings confirm that our personalization methodology extracts activity patterns which cluster into different student subgroups, consistent with the performance enhancements we obtain over the baselines.  ( 2 min )
    A Time Series Approach to Explainability for Neural Nets with Applications to Risk-Management and Fraud Detection. (arXiv:2212.02906v1 [q-fin.RM])
    Artificial intelligence is creating one of the biggest revolutions across technology-driven application fields. For the finance sector, it offers many opportunities for significant market innovation, and yet broad adoption of AI systems heavily relies on our trust in their outputs. Trust in technology is enabled by understanding the rationale behind the predictions made. To this end, the concept of eXplainable AI (XAI) emerged, introducing a suite of techniques that attempt to explain to users how complex models arrive at a certain decision. For cross-sectional data, classical XAI approaches can lead to valuable insights about a model's inner workings, but these techniques generally cannot cope well with longitudinal data (time series) in the presence of dependence structure and non-stationarity. We here propose a novel XAI technique for deep learning methods which preserves and exploits the natural time ordering of the data.  ( 2 min )
    Diffusion Video Autoencoders: Toward Temporally Consistent Face Video Editing via Disentangled Video Encoding. (arXiv:2212.02802v1 [cs.CV])
    Inspired by the impressive performance of recent face image editing methods, several studies have naturally been proposed to extend these methods to the face video editing task. One of the main challenges here is temporal consistency among edited frames, which is still unresolved. To this end, we propose a novel face video editing framework based on diffusion autoencoders that can successfully extract decomposed features of identity and motion from a given video - for the first time for a face video editing model. This modeling allows us to edit the video by simply manipulating the temporally invariant feature in the desired direction, preserving consistency. Another unique strength of our model is that, since it is based on diffusion models, it can provide both reconstruction and editing capabilities at the same time, and it is robust to corner cases in wild face videos (e.g., occluded faces), unlike existing GAN-based methods.  ( 2 min )
    AIDA: Analytic Isolation and Distance-based Anomaly Detection Algorithm. (arXiv:2212.02645v1 [cs.LG])
    We combine the metrics of distance and isolation to develop the \textit{Analytic Isolation and Distance-based Anomaly (AIDA) detection algorithm}. AIDA is the first distance-based method that does not rely on the concept of nearest-neighbours, making it a parameter-free model. Differently from the prevailing literature, in which the isolation metric is always computed via simulations, we show that AIDA admits an analytical expression for the outlier score, providing new insights into the isolation metric. Additionally, we present an anomaly explanation method based on AIDA, the \textit{Tempered Isolation-based eXplanation (TIX)} algorithm, which finds the most relevant outlier features even in data sets with hundreds of dimensions. We test both algorithms on synthetic and empirical data: we show that AIDA is competitive when compared to other state-of-the-art methods, and it is superior in finding outliers hidden in multidimensional feature subspaces. Finally, we illustrate how the TIX algorithm is able to find outliers in multidimensional feature subspaces, and use these explanations to analyze common benchmarks used in anomaly detection.  ( 2 min )
    Is Conditional Generative Modeling all you need for Decision-Making?. (arXiv:2211.15657v2 [cs.LG] UPDATED)
    Recent improvements in conditional generative modeling have made it possible to generate high-quality images from language descriptions alone. We investigate whether these methods can directly address the problem of sequential decision-making. We view decision-making not through the lens of reinforcement learning (RL), but rather through conditional generative modeling. To our surprise, we find that our formulation leads to policies that can outperform existing offline RL approaches across standard benchmarks. By modeling a policy as a return-conditional diffusion model, we illustrate how we may circumvent the need for dynamic programming and subsequently eliminate many of the complexities that come with traditional offline RL. We further demonstrate the advantages of modeling policies as conditional diffusion models by considering two other conditioning variables: constraints and skills. Conditioning on a single constraint or skill during training leads to behaviors at test-time that can satisfy several constraints together or demonstrate a composition of skills. Our results illustrate that conditional generative modeling is a powerful tool for decision-making.  ( 2 min )
    Pretrained Diffusion Models for Unified Human Motion Synthesis. (arXiv:2212.02837v1 [cs.CV])
    Generative modeling of human motion has broad applications in computer animation, virtual reality, and robotics. Conventional approaches develop separate models for different motion synthesis tasks, and typically use a model of a small size to avoid overfitting the scarce data available in each setting. It remains an open question whether developing a single unified model is feasible, which may 1) benefit the acquisition of novel skills by combining skills learned from multiple tasks, and 2) help increase the model capacity without overfitting by combining multiple data sources. Unification is challenging because 1) it involves diverse control signals as well as targets of varying granularity, and 2) motion datasets may use different skeletons and default poses. In this paper, we present MoFusion, a framework for unified motion synthesis. MoFusion employs a Transformer backbone to ease the inclusion of diverse control signals via cross attention, and pretrains the backbone as a diffusion model to support multi-granularity synthesis ranging from motion completion of a body part to whole-body motion generation. It uses a learnable adapter to accommodate the differences between the default skeletons used by the pretraining and the fine-tuning data. Empirical results show that pretraining is vital for scaling the model size without overfitting, and demonstrate MoFusion's potential in various tasks, e.g., text-to-motion, motion completion, and zero-shot mixing of multiple control signals. Project page: \url{https://ofa-sys.github.io/MoFusion/}.  ( 2 min )
    Style transfer and classification in hebrew news items. (arXiv:2212.03019v1 [cs.CL])
    Hebrew is a morphologically rich language (MRL), making it harder to model than simpler languages. Recent developments such as Transformers in general, and BERT in particular, have opened a path for Hebrew models that reach SOTA results, not falling short of those for non-MRL languages. We explore the cutting edge of this field, performing style transfer, text generation, and classification over news articles collected from online archives. Furthermore, the news portals that feed our collective consciousness are an interesting corpus to study, as their analysis and tracing might reveal insights about our society and discourse.  ( 2 min )
    Variational Neural Networks. (arXiv:2207.01524v2 [cs.LG] UPDATED)
    Bayesian Neural Networks (BNNs) provide a tool to estimate the uncertainty of a neural network by considering a distribution over weights and sampling different models for each input. In this paper, we propose a method for uncertainty estimation in neural networks called Variational Neural Network that, instead of considering a distribution over weights, generates parameters for the output distribution of a layer by transforming its inputs with learnable sub-layers. In uncertainty quality estimation experiments, we show that VNNs achieve better uncertainty quality than Monte Carlo Dropout or Bayes By Backpropagation methods.  ( 2 min )
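    A minimal sketch of one such layer, assuming Gaussian output distributions parameterized by two learnable linear sub-layers (the activation and parameterization are illustrative choices, not necessarily the paper's):
    ```python
    import torch
    import torch.nn as nn

    class VariationalLayer(nn.Module):
        """One layer of a Variational Neural Network: two learnable sub-layers
        map the input to the mean and log-variance of a Gaussian over the
        layer's pre-activations, which is then sampled."""
        def __init__(self, d_in: int, d_out: int):
            super().__init__()
            self.mu = nn.Linear(d_in, d_out)
            self.log_var = nn.Linear(d_in, d_out)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            mu, log_var = self.mu(x), self.log_var(x)
            eps = torch.randn_like(mu)                    # fresh noise per call
            return torch.relu(mu + eps * torch.exp(0.5 * log_var))
    ```
    Running several forward passes with the same input yields different samples, whose spread can be used as the layer's uncertainty estimate.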
    Deep Double Descent via Smooth Interpolation. (arXiv:2209.10080v2 [cs.LG] UPDATED)
    Overparameterized deep networks can interpolate noisy data while at the same time showing good generalization performance. Common intuition from polynomial regression suggests that large networks are able to sharply interpolate noisy data without considerably deviating from the ground-truth signal. At present, a precise characterization of this phenomenon for deep networks is missing. In this work, we present an empirical study of input-space smoothness of the loss landscape of deep networks over volumes around cleanly- and noisily-labeled training samples, as we systematically increase the number of model parameters and training epochs. Our findings show that loss sharpness in the input space follows both model- and epoch-wise double descent, with worse peaks observed around noisy labels. While small interpolating models sharply fit both clean and noisy data, large interpolating models express a smooth loss landscape, where noisy targets are predicted over large volumes around training data points, in contrast to existing intuition.  ( 2 min )
    Misspecification in Inverse Reinforcement Learning. (arXiv:2212.03201v1 [cs.LG])
    The aim of Inverse Reinforcement Learning (IRL) is to infer a reward function $R$ from a policy $\pi$. To do this, we need a model of how $\pi$ relates to $R$. In the current literature, the most common models are optimality, Boltzmann rationality, and causal entropy maximisation. One of the primary motivations behind IRL is to infer human preferences from human behaviour. However, the true relationship between human preferences and human behaviour is much more complex than any of the models currently used in IRL. This means that they are misspecified, which raises the worry that they might lead to unsound inferences if applied to real-world data. In this paper, we provide a mathematical analysis of how robust different IRL models are to misspecification, and answer precisely how the demonstrator policy may differ from each of the standard models before that model leads to faulty inferences about the reward function $R$. We also introduce a framework for reasoning about misspecification in IRL, together with formal tools that can be used to easily derive the misspecification robustness of new IRL models.  ( 2 min )
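    For concreteness, the Boltzmann-rational model mentioned above is the standard one in which the demonstrator selects actions with probability exponential in their optimal value, with inverse temperature $\beta$:
    ```latex
    % Boltzmann rationality: action probabilities are exponentially weighted
    % by the optimal action-value, with inverse temperature \beta.
    \pi(a \mid s) \;=\; \frac{\exp\!\left(\beta\, Q^*(s,a)\right)}
                             {\sum_{a'} \exp\!\left(\beta\, Q^*(s,a')\right)}
    ```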
    Reinforcement Learning for Signal Temporal Logic using Funnel-Based Approach. (arXiv:2212.03181v1 [eess.SY])
    Signal Temporal Logic (STL) is a powerful framework for describing the complex temporal and logical behaviour of dynamical systems. Several works propose reinforcement learning methods to find controllers that satisfy STL specifications, but they fail either to address robust satisfaction in continuous state spaces or to ensure the tractability of the approach. In this paper, leveraging the concept of funnel functions, we propose a tractable reinforcement learning algorithm to learn a time-dependent policy for robust satisfaction of STL specifications in continuous state spaces. We demonstrate the utility of our approach on several tasks using pendulum and mobile robot examples.  ( 2 min )
    Near-Linear Time and Fixed-Parameter Tractable Algorithms for Tensor Decompositions. (arXiv:2207.07417v2 [cs.DS] UPDATED)
    We study low rank approximation of tensors, focusing on the tensor train and Tucker decompositions, as well as approximations with tree tensor networks and more general tensor networks. For tensor train decomposition, we give a bicriteria $(1 + \epsilon)$-approximation algorithm with a small bicriteria rank and $O(q \cdot \mathrm{nnz}(A))$ running time, up to lower order terms, which improves over the additive error algorithm of \cite{huber2017randomized}. We also show how to convert the algorithm of \cite{huber2017randomized} into a relative error algorithm, but their algorithm necessarily has a running time of $O(qr^2 \cdot \mathrm{nnz}(A)) + n \cdot \mathrm{poly}(qk/\epsilon)$ when converted to a $(1 + \epsilon)$-approximation algorithm with bicriteria rank $r$. To the best of our knowledge, our work is the first to achieve polynomial time relative error approximation for tensor train decomposition. Our key technique is a method for obtaining subspace embeddings with a number of rows polynomial in $q$ for a matrix which is the flattening of a tensor train of $q$ tensors. We extend our algorithm to tree tensor networks. In addition, we extend our algorithm to tensor networks with arbitrary graphs (which we refer to as general tensor networks), by using a result of \cite{ms08_simulating_quantum_tensor_contraction} and showing that a general tensor network of rank $k$ can be contracted to a binary tree network of rank $k^{O(\deg(G)\,\mathrm{tw}(G))}$, allowing us to reduce to the case of tree tensor networks. Finally, we give new fixed-parameter tractable algorithms for the tensor train, Tucker, and CP decompositions, which are simpler than those of \cite{swz19_tensor_low_rank} since they do not make use of polynomial system solvers. Our technique of Gaussian subspace embeddings with exactly $k$ rows (and thus exponentially small success probability) may be of independent interest.  ( 2 min )
    VISEM-Tracking: Human Spermatozoa Tracking Dataset. (arXiv:2212.02842v1 [cs.CV])
    Manually analyzing spermatozoa is a tremendous task for biologists due to the many fast-moving spermatozoa, causing inconsistencies in the quality of the assessments. Therefore, computer-assisted sperm analysis (CASA) has become a popular solution. Despite this, more data is needed to train supervised machine learning approaches in order to improve accuracy and reliability. In this regard, we provide a dataset called VISEM-Tracking with 20 thirty-second video recordings of spermatozoa with manually annotated bounding-box coordinates and a set of sperm characteristics analyzed by experts in the domain. VISEM-Tracking is an extension of the previously published VISEM dataset. In addition to the annotated data, we provide unlabeled video clips for easy-to-use access and analysis of the data. As part of this paper, we present baseline sperm detection performance using the YOLOv5 deep learning model trained on the VISEM-Tracking dataset. As a result, the dataset can be used to train complex deep-learning models to analyze spermatozoa. The dataset is publicly available at https://zenodo.org/record/7293726.  ( 2 min )
    Continuous diffusion for categorical data. (arXiv:2211.15089v2 [cs.CL] UPDATED)
    Diffusion models have quickly become the go-to paradigm for generative modelling of perceptual signals (such as images and sound) through iterative refinement. Their success hinges on the fact that the underlying physical phenomena are continuous. For inherently discrete and categorical data such as language, various diffusion-inspired alternatives have been proposed. However, the continuous nature of diffusion models conveys many benefits, and in this work we endeavour to preserve it. We propose CDCD, a framework for modelling categorical data with diffusion models that are continuous both in time and input space. We demonstrate its efficacy on several language modelling tasks.  ( 2 min )
    Estimating Cardiac Tissue Conductivity from Electrograms with Fully Convolutional Networks. (arXiv:2212.03012v1 [cs.LG])
    Atrial Fibrillation (AF) is characterized by disorganised electrical activity in the atria and is known to be sustained by the presence of regions of fibrosis (scars) or functional cellular remodeling, both of which may lead to areas of slow conduction. Estimating the effective conductivity of the myocardium and identifying regions of abnormal propagation is therefore crucial for the effective treatment of AF. We hypothesise that the spatial distribution of tissue conductivity can be directly inferred from an array of concurrently acquired contact electrograms (EGMs). We generate a dataset of simulated cardiac action potential (AP) propagation using randomised scar distributions and a phenomenological cardiac model, and calculate contact electrograms at various positions on the field. A deep neural network, based on a modified U-net architecture, is trained to estimate the location of the scar and quantify the conductivity of the tissue, achieving a Jaccard index of $91\%$. We adapt a wavelet-based surrogate testing analysis to confirm that the inferred conductivity distribution is an accurate representation of the ground truth input to the model. We find that the root mean square error (RMSE) between the ground truth and our predictions is significantly smaller ($p_{val}=0.007$) than the RMSE between the ground truth and surrogate samples.  ( 2 min )
    Loss Adapted Plasticity in Deep Neural Networks to Learn from Data with Unreliable Sources. (arXiv:2212.02895v1 [cs.LG])
    When data is streaming from multiple sources, conventional training methods often update model weights assuming the same level of reliability for each source; that is, a model does not consider the data quality of each source during training. In many applications, sources can have varied levels of noise or corruption that have negative effects on the learning of a robust deep learning model. A key issue is that the quality of data or labels for individual sources is often not available during training and could vary over time. Our solution to this problem is to consider the mistakes made while training on data originating from each source and to utilise this to create a perceived data quality for that source. This paper demonstrates a straightforward and novel technique that can be applied to any gradient descent optimiser: update model weights as a function of the perceived reliability of data sources within a wider data set. The algorithm controls the plasticity of a given model to weight updates based on the history of losses from individual data sources. We show that applying this technique can significantly improve model performance when trained on a mixture of reliable and unreliable data sources, and maintain performance when models are trained on data sources that are all considered reliable. All code to reproduce this work's experiments and implement the algorithm in the reader's own models is made available.  ( 2 min )
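    A minimal sketch of the down-weighting idea in Python (illustrative only, not the authors' implementation; the windowed loss history and exponential weighting are assumptions):

        # Derive a per-source reliability weight from each source's recent
        # loss history, and scale that source's loss contribution with it.
        import collections
        import numpy as np

        class SourceReliability:
            """Tracks a running loss history per source and maps it to a weight."""

            def __init__(self, window=100, temperature=1.0):
                self.histories = collections.defaultdict(
                    lambda: collections.deque(maxlen=window))
                self.temperature = temperature

            def update(self, source_id, loss_value):
                self.histories[source_id].append(float(loss_value))

            def weight(self, source_id):
                # Sources whose recent losses exceed the mean across sources are
                # down-weighted; assumes update() was called for each source.
                own = np.mean(self.histories[source_id])
                global_mean = np.mean([np.mean(h) for h in self.histories.values()])
                return float(np.exp(-(own - global_mean) / self.temperature))

    Multiplying each source's per-batch loss by weight(source_id) before backpropagation makes unreliable sources drive smaller weight updates, which is the plasticity-control effect described above.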
    Interpretation of Neural Networks is Susceptible to Universal Adversarial Perturbations. (arXiv:2212.03095v1 [cs.CV])
    Interpreting neural network classifiers using gradient-based saliency maps has been extensively studied in the deep learning literature. While the existing algorithms manage to achieve satisfactory performance in application to standard image recognition datasets, recent works demonstrate the vulnerability of widely-used gradient-based interpretation schemes to norm-bounded perturbations adversarially designed for every individual input sample. However, such adversarial perturbations are commonly designed using the knowledge of an input sample, and hence perform sub-optimally in application to an unknown or constantly changing data point. In this paper, we show the existence of a Universal Perturbation for Interpretation (UPI) for standard image datasets, which can alter a gradient-based feature map of neural networks over a significant fraction of test samples. To design such a UPI, we propose a gradient-based optimization method as well as a principal component analysis (PCA)-based approach to compute a UPI which can effectively alter a neural network's gradient-based interpretation on different samples. We support the proposed UPI approaches by presenting several numerical results of their successful applications to standard image datasets.  ( 2 min )
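    A hedged sketch of the PCA-based construction (names and shapes are illustrative; the paper's exact saliency-distortion objective is not reproduced here):

        # Stack per-sample gradients of a saliency-distortion objective and
        # take the top principal direction as a single universal perturbation.
        import numpy as np

        def pca_universal_perturbation(per_sample_grads, eps=8 / 255):
            """per_sample_grads: array of shape (n_samples, n_pixels)."""
            G = per_sample_grads - per_sample_grads.mean(axis=0, keepdims=True)
            # The top right-singular vector is the first principal component.
            _, _, vt = np.linalg.svd(G, full_matrices=False)
            direction = vt[0]
            # Project to an l_inf ball of radius eps, as in norm-bounded attacks.
            return eps * np.sign(direction)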
    FretNet: Continuous-Valued Pitch Contour Streaming for Polyphonic Guitar Tablature Transcription. (arXiv:2212.03023v1 [eess.AS])
    In recent years, the task of Automatic Music Transcription (AMT), whereby various attributes of music notes are estimated from audio, has received increasing attention. At the same time, the related task of Multi-Pitch Estimation (MPE) remains a challenging but necessary component of almost all AMT approaches, even if only implicitly. In the context of AMT, pitch information is typically quantized to the nominal pitches of the Western music scale. Even in more general contexts, MPE systems typically produce pitch predictions with some degree of quantization. In certain applications of AMT, such as Guitar Tablature Transcription (GTT), it is more meaningful to estimate continuous-valued pitch contours. Guitar tablature has the capacity to represent various playing techniques, some of which involve pitch modulation. Contemporary approaches to AMT do not adequately address pitch modulation, and offer less quantization only at the expense of more model complexity. In this paper, we present a GTT formulation that estimates continuous-valued pitch contours, grouping them according to their string and fret of origin. We demonstrate that, for this task, the proposed method significantly improves the resolution of MPE and simultaneously yields tablature estimation results competitive with baseline models.  ( 2 min )
    Fine-tuning a Subtle Parsing Distinction Using a Probabilistic Decision Tree: the Case of Postnominal "that" in Noun Complement Clauses vs. Relative Clauses. (arXiv:2212.02591v1 [cs.CL])
    In this paper we investigated two different methods to parse relative and noun complement clauses in English, resorting to distinct tags for the corresponding "that" as a relative pronoun and as a complementizer. We used an algorithm to relabel a corpus parsed with the GUM Treebank using Universal Dependencies. Our second experiment consisted of using TreeTagger, a Probabilistic Decision Tree, to learn the distinction between the complement and relative uses of postnominal "that". We investigated the effect of the training set size on TreeTagger accuracy and how representative the GUM Treebank files are for the two structures under scrutiny. We discuss some of the linguistic and structural tenets of the learnability of this distinction.  ( 2 min )
    Mixer: DNN Watermarking using Image Mixup. (arXiv:2212.02814v1 [cs.CR])
    It is crucial to protect the intellectual property rights of DNN models prior to their deployment. The DNN should perform two main tasks: its primary task and the watermarking task. This paper proposes a lightweight, reliable, and secure DNN watermarking scheme that attempts to establish strong ties between these two tasks. The samples triggering the watermarking task are generated using image Mixup, either from training or testing samples. This means that there are infinitely many triggers, not limited to the samples used to embed the watermark in the model at training time. Extensive experiments on image classification models for different datasets, as well as exposure to a variety of attacks, show that the proposed watermarking provides protection with an adequate level of security and robustness.  ( 2 min )
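    A minimal sketch of Mixup-style trigger generation (the blending step is standard Mixup; its use as a watermark trigger follows the abstract, and alpha is an assumed hyperparameter):

        import numpy as np

        def mixup_trigger(img_a, img_b, alpha=1.0, rng=None):
            """Blend two images with lam ~ Beta(alpha, alpha); any such blend
            can serve as a trigger, so triggers are not a fixed finite set."""
            rng = rng or np.random.default_rng()
            lam = rng.beta(alpha, alpha)
            return lam * img_a + (1.0 - lam) * img_b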
    Galaxy Spin Classification I: Z-wise vs S-wise Spirals With Chirality Equivariant Residual Network. (arXiv:2210.04168v2 [astro-ph.CO] UPDATED)
    The angular momentum of galaxies (galaxy spin) contains rich information about the initial condition of the Universe, yet it is challenging to efficiently measure the spin direction for the tremendous number of galaxies that are being mapped by ongoing and forthcoming cosmological surveys. We present a machine learning based classifier for Z-wise vs S-wise spirals, which can help to break the degeneracy in galaxy spin direction measurements. The proposed Chirality Equivariant Residual Network (CE-ResNet) is manifestly equivariant under a reflection of the input image, which guarantees that there is no inherent asymmetry between the Z-wise and S-wise probability estimators. We train the model with Sloan Digital Sky Survey (SDSS) images, with the training labels given by the Galaxy Zoo 1 (GZ1) project. A combination of data augmentation tricks is used during training, making the model more robust when applied to other surveys. We find a $\sim\!30\%$ increase of both types of spirals when Dark Energy Spectroscopic Instrument (DESI) images are used for classification, due to the better imaging quality of DESI. We verify that the $\sim\!7\sigma$ difference between the numbers of Z-wise and S-wise spirals is due to human bias, since the discrepancy drops to $<\!1.8\sigma$ with our CE-ResNet classification results. We discuss the potential systematics that are relevant to future cosmological applications.  ( 2 min )
    A Simple Unified Approach to Testing High-Dimensional Conditional Independences for Categorical and Ordinal Data. (arXiv:2206.04356v2 [stat.ML] UPDATED)
    Conditional independence (CI) tests underlie many approaches to model testing and structure learning in causal inference. Most existing CI tests for categorical and ordinal data stratify the sample by the conditioning variables, perform simple independence tests in each stratum, and combine the results. Unfortunately, the statistical power of this approach degrades rapidly as the number of conditioning variables increases. Here we propose a simple unified CI test for ordinal and categorical data that maintains reasonable calibration and power in high dimensions. We show that our test outperforms existing baselines in model testing and structure learning for dense directed graphical models while being comparable for sparse models. Our approach could be attractive for causal model testing because it is easy to implement, can be used with non-parametric or parametric probability models, has the symmetry property, and has reasonable computational requirements.  ( 2 min )
    Thales: Formulating and Estimating Architectural Vulnerability Factors for DNN Accelerators. (arXiv:2212.02649v1 [cs.AR])
    As Deep Neural Networks (DNNs) are increasingly deployed in safety-critical and privacy-sensitive applications such as autonomous driving and biometric authentication, it is critical to understand the fault-tolerance nature of DNNs. Prior work primarily focuses on metrics such as the Failures In Time (FIT) rate and the Silent Data Corruption (SDC) rate, which quantify how often a device fails. Instead, this paper focuses on quantifying the DNN accuracy given that a transient error has occurred, which tells us how well a network behaves when a transient error occurs. We call this metric Resiliency Accuracy (RA). We show that the existing RA formulation is fundamentally inaccurate, because it incorrectly assumes that software variables (model weights/activations) have equal faulty probability under hardware transient faults. We present an algorithm that captures the faulty probabilities of DNN variables under transient faults and, thus, provides correct RA estimations validated by hardware. To accelerate RA estimation, we reformulate RA calculation as a Monte Carlo integration problem, and solve it using importance sampling driven by DNN-specific heuristics. Using our lightweight RA estimation method, we show that transient faults lead to far greater accuracy degradation than what today's DNN resiliency tools estimate. We show how our RA estimation tool can help design more resilient DNNs by integrating it with a Network Architecture Search framework.  ( 2 min )
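    A generic sketch of importance-sampled Monte Carlo estimation, the numerical core the abstract describes (illustrative; the DNN-specific heuristics and fault model are not shown):

        import numpy as np

        def importance_estimate(f, p_pdf, q_pdf, q_sampler, n=100_000, seed=0):
            """Estimate E_p[f(X)] by sampling X ~ q and reweighting by p/q."""
            rng = np.random.default_rng(seed)
            xs = q_sampler(rng, n)
            return np.mean(f(xs) * p_pdf(xs) / q_pdf(xs))

        def normal_pdf(x, mu, sigma):
            return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

        # Rare-event example in the spirit of rare transient faults: P[X > 3]
        # for X ~ N(0, 1), with a proposal centred on the rare region.
        est = importance_estimate(lambda x: (x > 3).astype(float),
                                  lambda x: normal_pdf(x, 0.0, 1.0),
                                  lambda x: normal_pdf(x, 3.0, 1.0),
                                  lambda rng, n: rng.normal(3.0, 1.0, n))
        print(est)  # close to the true tail probability, about 1.35e-3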
    Can Ensembling Pre-processing Algorithms Lead to Better Machine Learning Fairness?. (arXiv:2212.02614v1 [cs.LG])
    As machine learning (ML) systems get adopted in more critical areas, it has become increasingly crucial to address the bias that could occur in these systems. Several fairness pre-processing algorithms are available to alleviate implicit biases during model training. These algorithms employ different concepts of fairness, often leading to conflicting strategies with consequential trade-offs between fairness and accuracy. In this work, we evaluate three popular fairness pre-processing algorithms and investigate the potential for combining all algorithms into a more robust pre-processing ensemble. We report on lessons learned that can help practitioners better select fairness algorithms for their models.  ( 2 min )
    A K-variate Time Series Is Worth K Words: Evolution of the Vanilla Transformer Architecture for Long-term Multivariate Time Series Forecasting. (arXiv:2212.02789v1 [cs.LG])
    Multivariate time series forecasting (MTSF) is a fundamental problem in numerous real-world applications. Recently, the Transformer has become the de facto solution for MTSF, especially for the long-term cases. However, except for the one forward operation, the basic configurations in existing MTSF Transformer architectures have barely been carefully verified. In this study, we point out that the current tokenization strategy in MTSF Transformer architectures ignores the token uniformity inductive bias of Transformers. Therefore, the vanilla MTSF Transformer struggles to capture details in time series and presents inferior performance. Based on this observation, we make a series of evolutions of the basic architecture of the vanilla MTSF Transformer: we revise the flawed tokenization strategy, along with the decoder structure and embeddings. Surprisingly, the evolved simple Transformer architecture is highly effective; it successfully avoids the over-smoothing phenomenon in the vanilla MTSF Transformer, achieves a more detailed and accurate prediction, and even substantially outperforms the state-of-the-art Transformers that are well-designed for MTSF.  ( 2 min )
    Auxiliary Quantile Forecasting with Linear Networks. (arXiv:2212.02578v1 [cs.LG])
    We propose a novel multi-task method for quantile forecasting with shared Linear layers. Our method is based on the implicit quantile learning approach, where samples from the uniform distribution $\mathcal{U}(0, 1)$ are reparameterized to quantile values of the target distribution. We combine the implicit quantile and input time series representations to directly forecast multiple quantile estimates for multiple horizons jointly. Prior works have adopted a Linear layer for the direct estimation of all forecasting horizons in a multi-task learning setup. We show that, following intuition from multi-task learning to exploit correlations among forecast horizons, modelling multiple quantile estimates as auxiliary tasks for each forecast horizon improves forecast accuracy across the quantile estimates, compared to modelling only a single quantile estimate. Learning these auxiliary quantile tasks leads to state-of-the-art performance on deterministic forecasting benchmarks on the main task of forecasting the 50$^{th}$ percentile estimate.  ( 2 min )
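    For concreteness, the quantile (pinball) loss that underlies implicit quantile learning (standard definition, not specific to this paper):

        import numpy as np

        def pinball_loss(y_true, y_pred, tau):
            """Average pinball loss for quantile level tau in (0, 1)."""
            diff = np.asarray(y_true) - np.asarray(y_pred)
            return np.mean(np.maximum(tau * diff, (tau - 1.0) * diff))

        # tau = 0.5 recovers half the mean absolute error, i.e. the median;
        # tau = 0.9 penalizes under-prediction nine times more than over-prediction.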
    Unifying Vision, Text, and Layout for Universal Document Processing. (arXiv:2212.02623v1 [cs.CV])
    We propose Universal Document Processing (UDOP), a foundation Document AI model which unifies text, image, and layout modalities together with varied task formats, including document understanding and generation. UDOP leverages the spatial correlation between textual content and document image to model image, text, and layout modalities with one uniform representation. With a novel Vision-Text-Layout Transformer, UDOP unifies pretraining and multi-domain downstream tasks into a prompt-based sequence generation scheme. UDOP is pretrained both on large-scale unlabeled document corpora, using innovative self-supervised objectives, and on diverse labeled data. UDOP also learns to generate document images from text and layout modalities via masked image reconstruction. To the best of our knowledge, this is the first time in the field of document AI that one model simultaneously achieves high-quality neural document editing and content customization. Our method sets the state-of-the-art on 9 Document AI tasks, e.g., document understanding and QA, across diverse data domains like finance reports, academic papers, and websites. UDOP ranks first on the leaderboard of the Document Understanding Benchmark (DUE).  ( 2 min )
    UniGeo: Unifying Geometry Logical Reasoning via Reformulating Mathematical Expression. (arXiv:2212.02746v1 [cs.AI])
    Geometry problem solving is a well-recognized testbed for evaluating the high-level multi-modal reasoning capability of deep models. In most existing works, the two main geometry problems, calculation and proving, are treated as two separate tasks, hindering a deep model from unifying its reasoning capability across multiple math tasks. In essence, however, these two tasks have similar problem representations and overlapping math knowledge, which can improve the understanding and reasoning ability of a deep model on both tasks. Therefore, we construct a large-scale Unified Geometry problem benchmark, UniGeo, which contains 4,998 calculation problems and 9,543 proving problems. Each proving problem is annotated with a multi-step proof with reasons and mathematical expressions. The proof can easily be reformulated as a proving sequence that shares the same format as the annotated program sequences for calculation problems. Naturally, we also present a unified multi-task Geometric Transformer framework, Geoformer, to tackle calculation and proving problems simultaneously in the form of sequence generation, showing that the reasoning ability can be improved on both tasks by a unified formulation. Furthermore, we propose a Mathematical Expression Pretraining (MEP) method that aims to predict the mathematical expressions in the problem solution, thus improving the Geoformer model. Experiments on UniGeo demonstrate that the proposed Geoformer obtains state-of-the-art performance, outperforming the task-specific model NGS by over 5.6% and 3.2% accuracy on calculation and proving problems, respectively.  ( 2 min )
    Spuriosity Rankings: Sorting Data for Spurious Correlation Robustness. (arXiv:2212.02648v1 [cs.CV])
    We present a framework for ranking images within their class based on the strength of spurious cues present. By measuring the gap in accuracy on the highest and lowest ranked images (we call this the spurious gap), we assess spurious feature reliance for $89$ diverse ImageNet models, finding that even the best models underperform in images with weak spurious presence. However, the effect of spurious cues varies far more dramatically across classes, emphasizing the crucial, often overlooked, class-dependence of the spurious correlation problem. While most spurious features we observe are clarifying (i.e. improving test-time accuracy when present, as is typically expected), we surprisingly find many cases of confusing spurious features, where models perform better when they are absent. We then close the spurious gap by training new classification heads on lowly ranked (i.e. without common spurious cues) images, resulting in improved effective robustness to distribution shifts (ObjectNet, ImageNet-R, ImageNet-Sketch). We also propose a second metric to assess feature reliability, finding that spurious features are generally less reliable than non-spurious (core) ones, though again, spurious features can be more reliable for certain classes. To enable our analysis, we annotated $5,000$ feature-class dependencies over {\it all} of ImageNet as core or spurious using minimal human supervision. Finally, we show the feature discovery and spuriosity ranking framework can be extended to other datasets like CelebA and WaterBirds in a lightweight fashion with only linear layer training, leading to the discovery of a previously unknown racial bias in CelebA hair classification.  ( 2 min )
    Automatic Anomalies Detection in Hydraulic Devices. (arXiv:2212.02602v1 [cs.LG])
    Nowadays, hydraulic systems are present in a wide variety of devices in both industrial and everyday environments. The implementation and usage of hydraulic systems have been well documented; however, a challenge remains: integrating tools that provide more accurate information about the functioning and operation of these systems for proactive decision-making. In industrial applications, many sensors and methods exist to measure and determine the status of process variables (e.g., flow, pressure, force). Nevertheless, little has been done to provide users with device-health information for the hydraulic devices integrated into machinery. Implementing artificial intelligence (AI) technologies and machine learning (ML) models in hydraulic system components has been identified as a solution to a challenge many industries currently face: optimizing processes and carrying them out more safely and efficiently. This paper presents a solution for characterizing and estimating anomalies in one of the most versatile and widely used devices in hydraulic systems, the cylinder. AI and ML models were implemented to determine the current operating status of these hydraulic components and whether they are working correctly or a failure mode or abnormal condition is present.  ( 2 min )
    Learning to Optimize in Model Predictive Control. (arXiv:2212.02603v1 [cs.RO])
    Sampling-based Model Predictive Control (MPC) is a flexible control framework that can reason about non-smooth dynamics and cost functions. Recently, significant work has focused on the use of machine learning to improve the performance of MPC, often through learning or fine-tuning the dynamics or cost function. In contrast, we focus on learning to optimize more effectively; in other words, on improving the update rule within MPC. We show that this can be particularly useful in sampling-based MPC, where we often wish to minimize the number of samples for computational reasons. Unfortunately, the cost of computational efficiency is a reduction in performance; fewer samples result in noisier updates. We show that we can contend with this noise by learning how to update the control distribution more effectively and make better use of the few samples that we have. Our learned controllers are trained via imitation learning to mimic an expert which has access to substantially more samples. We test the efficacy of our approach on multiple simulated robotics tasks in sample-constrained regimes and demonstrate that our approach can outperform an MPC controller with the same number of samples.  ( 2 min )
    Decentralized Stochastic Gradient Descent Ascent for Finite-Sum Minimax Problems. (arXiv:2212.02724v1 [cs.LG])
    Minimax optimization problems have attracted significant attention in recent years due to their widespread application in numerous machine learning models. To solve the minimax optimization problem, a wide variety of stochastic optimization methods have been proposed. However, most of them ignore the distributed setting where the training data is distributed on multiple workers. In this paper, we develop a novel decentralized stochastic gradient descent ascent method for the finite-sum minimax optimization problem. In particular, by employing a variance-reduced gradient, our method can achieve $O(\frac{\sqrt{n}\kappa^3}{(1-\lambda)^2\epsilon^2})$ sample complexity and $O(\frac{\kappa^3}{(1-\lambda)^2\epsilon^2})$ communication complexity for the nonconvex-strongly-concave minimax optimization problem. As far as we know, our work is the first to achieve such theoretical complexities for this kind of problem. Finally, we apply our method to optimize the AUC maximization problem, and the experimental results confirm the effectiveness of our method.  ( 2 min )
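    A schematic of one decentralized gradient descent ascent round (plain GDA with a gossip step; the paper's variance-reduced gradient estimator is not shown):

        import numpy as np

        def decentralized_gda_round(x, y, grads_x, grads_y, W, lr_x=0.01, lr_y=0.01):
            """x, y: (n_workers, dim) primal/dual iterates;
            W: (n_workers, n_workers) doubly stochastic mixing matrix."""
            x_mixed = W @ x          # consensus (gossip) step on the primal variable
            y_mixed = W @ y
            x_new = x_mixed - lr_x * grads_x   # descent on the minimization variable
            y_new = y_mixed + lr_y * grads_y   # ascent on the maximization variable
            return x_new, y_new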
    Improving Molecule Properties Through 2-Stage VAE. (arXiv:2212.02750v1 [cs.LG])
    The variational autoencoder (VAE) is a popular method for drug discovery, and a great many architectures and pipelines have been proposed to improve its performance. But the VAE model itself suffers from deficiencies such as poor manifold recovery when data lie on a low-dimensional manifold embedded in a higher-dimensional ambient space, and these deficiencies manifest themselves differently in each application. Their consequences in drug discovery are somewhat under-explored. In this paper, we study how to improve the similarity between the data generated via a VAE and the training dataset by improving manifold recovery via a 2-stage VAE, where the second-stage VAE is trained on the latent space of the first one. We experimentally evaluate our approach using the ChEMBL dataset as well as a polymer dataset. On both datasets, the 2-stage VAE method is able to improve the property statistics significantly over a pre-existing method.  ( 2 min )
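    A compact, self-contained sketch of the two-stage training scheme (PyTorch; the architecture, dimensions, and data are stand-ins, not the paper's models):

        import torch
        import torch.nn as nn

        class VAE(nn.Module):
            def __init__(self, d_in, d_lat, d_hid=128):
                super().__init__()
                self.enc = nn.Sequential(nn.Linear(d_in, d_hid), nn.ReLU(),
                                         nn.Linear(d_hid, 2 * d_lat))
                self.dec = nn.Sequential(nn.Linear(d_lat, d_hid), nn.ReLU(),
                                         nn.Linear(d_hid, d_in))

            def forward(self, x):
                mu, logvar = self.enc(x).chunk(2, dim=-1)
                z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
                recon = self.dec(z)
                kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
                return ((recon - x) ** 2).sum(-1).mean() + kl, mu

        def train(vae, data, steps=1000, lr=1e-3):
            opt = torch.optim.Adam(vae.parameters(), lr=lr)
            for _ in range(steps):
                loss, _ = vae(data)
                opt.zero_grad()
                loss.backward()
                opt.step()
            return vae

        x = torch.randn(512, 64)              # stand-in for molecule features
        stage1 = train(VAE(64, 16), x)
        with torch.no_grad():
            _, z = stage1(x)                  # latent codes from stage one
        stage2 = train(VAE(16, 8), z)         # stage two models the latent space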
    MAP-Music2Vec: A Simple and Effective Baseline for Self-Supervised Music Audio Representation Learning. (arXiv:2212.02508v1 [cs.SD])
    The deep learning community has witnessed an exponentially growing interest in self-supervised learning (SSL). However, it remains unexplored how to build a framework for learning useful representations of raw music waveforms in a self-supervised manner. In this work, we design Music2Vec, a framework exploring different SSL algorithmic components and tricks for music audio recordings. Our model achieves results comparable to the state-of-the-art (SOTA) music SSL model Jukebox, despite being significantly smaller, with less than 2% of the parameters of the latter. The model will be released on HuggingFace (https://huggingface.co/m-a-p/music2vec-v1).  ( 2 min )
    FEMa-FS: Finite Element Machines for Feature Selection. (arXiv:2212.02507v1 [cs.LG])
    Identifying anomalies has become one of the primary strategies in security and protection procedures for computer networks. In this context, machine learning-based methods emerge as an elegant solution to identify such scenarios and to learn which information is irrelevant, so that a reduction in identification time and a possible gain in accuracy can be obtained. This paper proposes a novel feature selection approach called Finite Element Machines for Feature Selection (FEMa-FS), which uses the framework of finite elements to identify the most relevant information in a given dataset. Although FEMa-FS can be applied to any application domain, it has been evaluated in the context of anomaly detection in computer networks. The outcomes on two datasets showed promising results.  ( 2 min )
    Relation-based Motion Prediction using Traffic Scene Graphs. (arXiv:2212.02503v1 [cs.AI])
    Representing the relevant information of a traffic scene and understanding its environment is crucial for the success of autonomous driving. Modeling the surroundings of an autonomous car using semantic relations, i.e., how different traffic participants relate in the context of traffic-rule-based behaviors, has hardly been considered in previous work. This stems from the fact that these relations are hard to extract from real-world traffic scenes. In this work, we model traffic scenes as spatial semantic scene graphs for various predictions about the traffic participants, e.g., acceleration and deceleration. Our learning and inference approach uses Graph Neural Networks (GNNs) and shows that incorporating explicit information about the spatial semantic relations between traffic participants improves the prediction results. Specifically, the acceleration prediction of traffic participants is improved by up to 12% compared to baselines that do not exploit this explicit information. Furthermore, by including additional information about previous scenes, we achieve a 73% improvement.  ( 2 min )
    cs-net: structural approach to time-series forecasting for high-dimensional feature space data with limited observations. (arXiv:2212.02567v1 [cs.LG])
    In recent years, deep-learning-based approaches have been introduced to solve time-series forecasting problems. These novel methods have demonstrated impressive performance in univariate and low-dimensional multivariate time-series forecasting tasks. However, when these novel methods are used to handle high-dimensional multivariate forecasting problems, their performance is highly restricted by practical training time and reasonable GPU memory configurations. In this paper, inspired by a change of basis in the Hilbert space, we propose a flexible data feature extraction technique that excels in high-dimensional multivariate forecasting tasks. Our approach was originally developed for the National Science Foundation (NSF) Algorithms for Threat Detection (ATD) 2022 Challenge. Implemented using the attention mechanism and a Convolutional Neural Network (CNN) architecture, our method demonstrates great performance and compatibility. Our models trained on the GDELT dataset finished in 1st and 2nd place in the ATD sprint series and hold promise for other time-series forecasting datasets.  ( 2 min )
  • Open

    Robust and flexible learning of a high-dimensional classification rule using auxiliary outcomes. (arXiv:2011.05493v2 [stat.ME] UPDATED)
    Correlated outcomes are common in many practical problems. In some settings, one outcome is of particular interest and others are auxiliary. To leverage information shared by all the outcomes, traditional multi-task learning (MTL) minimizes an averaged loss function over all the outcomes, which may lead to biased estimation, especially when the MTL model is mis-specified. In this work, based on a decomposition of the estimation bias into two types, within-subspace and against-subspace, we develop a robust transfer learning approach to estimating a high-dimensional linear decision rule for the outcome of interest in the presence of auxiliary outcomes. The proposed method includes an MTL step using all outcomes to gain efficiency, and a subsequent calibration step using only the outcome of interest to correct both types of biases. We show that the final estimator can achieve a lower estimation error than the one using only the single outcome of interest. Simulations and a real data analysis are conducted to justify the superiority of the proposed method.
    Deep Double Descent via Smooth Interpolation. (arXiv:2209.10080v2 [cs.LG] UPDATED)
    Overparameterized deep networks can interpolate noisy data while at the same time showing good generalization performance. Common intuition from polynomial regression suggests that large networks are able to sharply interpolate noisy data without considerably deviating from the ground-truth signal. At present, a precise characterization of this phenomenon for deep networks is missing. In this work, we present an empirical study of input-space smoothness of the loss landscape of deep networks over volumes around cleanly- and noisily-labeled training samples, as we systematically increase the number of model parameters and training epochs. Our findings show that loss sharpness in the input space follows both model- and epoch-wise double descent, with worse peaks observed around noisy labels. While small interpolating models sharply fit both clean and noisy data, large interpolating models express a smooth loss landscape, where noisy targets are predicted over large volumes around training data points, in contrast to existing intuition.  ( 2 min )
    Bayes-Newton Methods for Approximate Bayesian Inference with PSD Guarantees. (arXiv:2111.01721v3 [stat.ML] UPDATED)
    We formulate natural gradient variational inference (VI), expectation propagation (EP), and posterior linearisation (PL) as extensions of Newton's method for optimising the parameters of a Bayesian posterior distribution. This viewpoint explicitly casts inference algorithms under the framework of numerical optimisation. We show that common approximations to Newton's method from the optimisation literature, namely Gauss-Newton and quasi-Newton methods (e.g., the BFGS algorithm), are still valid under this 'Bayes-Newton' framework. This leads to a suite of novel algorithms which are guaranteed to result in positive semi-definite (PSD) covariance matrices, unlike standard VI and EP. Our unifying viewpoint provides new insights into the connections between various inference schemes. All the presented methods apply to any model with a Gaussian prior and non-conjugate likelihood, which we demonstrate with (sparse) Gaussian processes and state space models.
    Brownian Noise Reduction: Maximizing Privacy Subject to Accuracy Constraints. (arXiv:2206.07234v2 [cs.LG] UPDATED)
    There is a disconnect between how researchers and practitioners handle privacy-utility tradeoffs. Researchers primarily operate from a privacy first perspective, setting strict privacy requirements and minimizing risk subject to these constraints. Practitioners often desire an accuracy first perspective, possibly satisfied with the greatest privacy they can get subject to obtaining sufficiently small error. Ligett et al. have introduced a "noise reduction" algorithm to address the latter perspective. The authors show that by adding correlated Laplace noise and progressively reducing it on demand, it is possible to produce a sequence of increasingly accurate estimates of a private parameter while only paying a privacy cost for the least noisy iterate released. In this work, we generalize noise reduction to the setting of Gaussian noise, introducing the Brownian mechanism. The Brownian mechanism works by first adding Gaussian noise of high variance corresponding to the final point of a simulated Brownian motion. Then, at the practitioner's discretion, noise is gradually decreased by tracing back along the Brownian path to an earlier time. Our mechanism is more naturally applicable to the common setting of bounded $\ell_2$-sensitivity, empirically outperforms existing work on common statistical tasks, and provides customizable control of privacy loss over the entire interaction with the practitioner. We complement our Brownian mechanism with ReducedAboveThreshold, a generalization of the classical AboveThreshold algorithm that provides adaptive privacy guarantees. Overall, our results demonstrate that one can meet utility constraints while still maintaining strong levels of privacy.  ( 2 min )
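    A minimal numerical sketch of the trace-back idea (illustrative only; the privacy accounting and the ReducedAboveThreshold interaction are omitted):

        # Noise at an earlier time of a Brownian path is a coarsening of noise
        # at a later time, so earlier releases are strictly less noisy and
        # correlated with the releases that preceded them.
        import numpy as np

        def brownian_path(dim, times, rng):
            """Sample B(t) at increasing times; B(t) ~ N(0, t I)."""
            increments = [rng.normal(0, np.sqrt(t1 - t0), size=dim)
                          for t0, t1 in zip([0.0] + list(times[:-1]), times)]
            return np.cumsum(increments, axis=0)

        rng = np.random.default_rng(0)
        true_value = np.array([1.0, -2.0])
        times = [1.0, 0.5, 0.1]  # decreasing noise scales, released on demand
        path = brownian_path(2, sorted(times), rng)
        # Release the noisiest estimate first (largest t), then walk back along
        # the same path toward t = 0 as the practitioner requests more accuracy.
        for t, b in zip(sorted(times, reverse=True), path[::-1]):
            print(t, true_value + b)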
    Benign overfitting in ridge regression. (arXiv:2009.14286v2 [math.ST] UPDATED)
    In many modern applications of deep learning the neural network has many more parameters than the data points used for its training. Motivated by those practices, a large body of recent theoretical research has been devoted to studying overparameterized models. One of the central phenomena in this regime is the ability of the model to interpolate noisy data, but still have test error lower than the amount of noise in that data. arXiv:1906.11300 characterized for which covariance structure of the data such a phenomenon can happen in linear regression if one considers the interpolating solution with minimum $\ell_2$-norm and the data has independent components: they gave a sharp bound on the variance term and showed that it can be small if and only if the data covariance has high effective rank in a subspace of small co-dimension. We strengthen and complete their results by eliminating the independence assumption and providing sharp bounds for the bias term. Thus, our results apply in a much more general setting than those of arXiv:1906.11300, e.g., kernel regression, and not only characterize how the noise is damped but also which part of the true signal is learned. Moreover, we extend the result to the setting of ridge regression, which allows us to explain another interesting phenomenon: we give general sufficient conditions under which the optimal regularization is negative.
    Federated Learning with Superquantile Aggregation for Heterogeneous Data. (arXiv:2112.09429v2 [cs.LG] UPDATED)
    We present a federated learning framework that is designed to robustly deliver good predictive performance across individual clients with heterogeneous data. The proposed approach hinges upon a superquantile-based learning objective that captures the tail statistics of the error distribution over heterogeneous clients. We present a stochastic training algorithm that interleaves differentially private client filtering with federated averaging steps. We prove finite time convergence guarantees for the algorithm: $O(1/\sqrt{T})$ in the nonconvex case in $T$ communication rounds and $O(\exp(-T/\kappa^{3/2}) + \kappa/T)$ in the strongly convex case with local condition number $\kappa$. Experimental results on benchmark datasets for federated learning demonstrate that our approach is competitive with classical ones in terms of average error and outperforms them in terms of tail statistics of the error.  ( 2 min )
    On the role of benchmarking data sets and simulations in method comparison studies. (arXiv:2208.01457v2 [stat.ME] UPDATED)
    Method comparisons are essential to provide recommendations and guidance for applied researchers, who often have to choose from a plethora of available approaches. While many comparisons exist in the literature, these are often not neutral but favour a novel method. Apart from the choice of design and a proper reporting of the findings, there are different approaches concerning the underlying data for such method comparison studies. Most manuscripts on statistical methodology rely on simulation studies and provide a single real-world data set as an example to motivate and illustrate the methodology investigated. In the context of supervised learning, in contrast, methods are often evaluated using so-called benchmarking data sets, i.e. real-world data that serve as gold standard in the community. Simulation studies, on the other hand, are much less common in this context. The aim of this paper is to investigate differences and similarities between these approaches, to discuss their advantages and disadvantages and ultimately to develop new approaches to the evaluation of methods picking the best of both worlds. To this aim, we borrow ideas from different contexts such as mixed methods research and Clinical Scenario Evaluation.  ( 2 min )
    BoostTree and BoostForest for Ensemble Learning. (arXiv:2003.09737v3 [cs.LG] UPDATED)
    Bootstrap aggregating (Bagging) and boosting are two popular ensemble learning approaches, which combine multiple base learners to generate a composite model for more accurate and more reliable performance. They have been widely used in biology, engineering, healthcare, etc. This paper proposes BoostForest, an ensemble learning approach using BoostTrees as base learners, which can be used for both classification and regression. BoostTree constructs a tree model by gradient boosting. It increases the randomness (diversity) by drawing the cut-points randomly at node splitting. BoostForest further increases the randomness by bootstrapping the training data in constructing different BoostTrees. BoostForest generally outperformed four classical ensemble learning approaches (Random Forest, Extra-Trees, XGBoost and LightGBM) on 35 classification and regression datasets. Remarkably, BoostForest tunes its parameters by simply sampling them randomly from a parameter pool, which can be easily specified, and its ensemble learning framework can also be used to combine many other base learners.  ( 2 min )
    MUTLA: A Large-Scale Dataset for Multimodal Teaching and Learning Analytics. (arXiv:1910.06078v2 [cs.CY] UPDATED)
    Automatic analysis of teacher and student interactions could be very important for improving the quality of teaching and student engagement. However, despite some recent progress in utilizing multimodal data for teaching and learning analytics, a thorough analysis of a rich multimodal dataset coming from a complex, real-world learning environment has yet to be done. To bridge this gap, we present a large-scale MUlti-modal Teaching and Learning Analytics (MUTLA) dataset. This dataset includes time-synchronized multimodal data records of students (learning logs, videos, EEG brainwaves) as they work in various subjects from the Squirrel AI Learning System (SAIL) to solve problems of varying difficulty levels. The dataset resources include user records from the learner records store of SAIL, brainwave data collected by EEG headset devices, and video data captured by web cameras while students worked in the SAIL products. Our hope is that by analyzing real-world student learning activities, facial expressions, and brainwave patterns, researchers can better predict engagement, which can then be used to improve adaptive learning selection and student learning outcomes. An additional goal is to provide a dataset gathered from real-world educational activities, versus those from controlled lab environments, to benefit the educational learning community.  ( 2 min )
    Variational Bayesian Reinforcement Learning with Regret Bounds. (arXiv:1807.09647v4 [cs.LG] UPDATED)
    In reinforcement learning the Q-values summarize the expected future rewards that the agent will attain. However, they cannot capture the epistemic uncertainty about those rewards. In this work we derive a new Bellman operator with associated fixed point we call the `knowledge values'. These K-values compress both the expected future rewards and the epistemic uncertainty into a single value, so that high uncertainty, high reward, or both, can yield high K-values. The key principle is to endow the agent with a risk-seeking utility function that is carefully tuned to balance exploration and exploitation. When the agent follows a Boltzmann policy over the K-values it yields a Bayes regret bound of $\tilde O(L \sqrt{S A T})$, where $L$ is the time horizon, $S$ is the total number of states, $A$ is the number of actions, and $T$ is the number of elapsed timesteps. We show deep connections of this approach to the soft-max and maximum-entropy strands of research in reinforcement learning.  ( 2 min )
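    For orientation, a hedged note on the policy form (standard, not specific to this paper): a Boltzmann policy over the K-values selects actions with probability $\pi(a \mid s) \propto \exp\big(K(s,a)/\tau\big)$ for a temperature $\tau$, so actions with high expected reward, high epistemic uncertainty, or both are chosen more often, which is what drives the exploration behaviour described above.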
    Correlation detection in trees for planted graph alignment. (arXiv:2107.07623v4 [cs.DS] UPDATED)
    Motivated by alignment of correlated sparse random graphs, we introduce a hypothesis testing problem of deciding whether or not two random trees are correlated. We obtain sufficient conditions under which this testing is impossible or feasible. We propose MPAlign, a message-passing algorithm for graph alignment inspired by the tree correlation detection problem. We prove MPAlign to succeed in polynomial time at partial alignment whenever tree detection is feasible. As a result our analysis of tree detection reveals new ranges of parameters for which partial alignment of sparse random graphs is feasible in polynomial time. We then conjecture that graph alignment is not feasible in polynomial time when the associated tree detection problem is impossible. If true, this conjecture together with our sufficient conditions on tree detection impossibility would imply the existence of a hard phase for graph alignment, i.e. a parameter range where alignment cannot be done in polynomial time even though it is known to be feasible in non-polynomial time.  ( 2 min )
    Robust Orthogonal Machine Learning of Treatment Effects. (arXiv:2103.11869v2 [stat.ML] UPDATED)
    Causal learning is the key to obtaining stable predictions and answering \textit{what if} problems in decision-making. In causal learning, it is central to seek methods to estimate the average treatment effect (ATE) from observational data. Double/Debiased Machine Learning (DML) is one of the prevalent methods for estimating the ATE. However, DML estimators can suffer from an \textit{error-compounding issue} and even give extreme estimates when the propensity scores are close to 0 or 1. Previous studies have overcome this issue through empirical tricks such as propensity score trimming, yet none of the existing works solves it from a theoretical standpoint. In this paper, we propose a \textit{Robust Causal Learning (RCL)} method to offset the deficiencies of DML estimators. Theoretically, the RCL estimators i) satisfy the (higher-order) orthogonal condition and are as \textit{consistent and doubly robust} as the DML estimators, and ii) get rid of the error-compounding issue. Empirically, comprehensive experiments show that: i) the RCL estimators give more stable estimations of the causal parameters than DML; ii) the RCL estimators outperform traditional estimators and their variants when applying different machine learning models on both simulation and benchmark datasets, and on a mimic consumer-credit dataset generated by WGAN.  ( 2 min )
    Continuous Mixtures of Tractable Probabilistic Models. (arXiv:2209.10584v2 [cs.LG] UPDATED)
    Probabilistic models based on continuous latent spaces, such as variational autoencoders, can be understood as uncountable mixture models where components depend continuously on the latent code. They have proven expressive tools for generative and probabilistic modelling, but are at odds with tractable probabilistic inference, that is, computing marginals and conditionals of the represented probability distribution. Meanwhile, tractable probabilistic models such as probabilistic circuits (PCs) can be understood as hierarchical discrete mixture models, which allows them to perform exact inference, but often they show subpar performance in comparison to continuous latent-space models. In this paper, we investigate a hybrid approach, namely continuous mixtures of tractable models with a small latent dimension. While these models are analytically intractable, they are well amenable to numerical integration schemes based on a finite set of integration points. With a large enough number of integration points the approximation becomes de facto exact. Moreover, using a finite set of integration points, the approximation method can be compiled into a PC performing `exact inference in an approximate model'. In experiments, we show that this simple scheme proves remarkably effective, as PCs learned this way set new state-of-the-art for tractable models on many standard density estimation benchmarks.  ( 2 min )
    Variational Neural Networks. (arXiv:2207.01524v2 [cs.LG] UPDATED)
    Bayesian Neural Networks (BNNs) provide a tool to estimate the uncertainty of a neural network by considering a distribution over weights and sampling different models for each input. In this paper, we propose a method for uncertainty estimation in neural networks called Variational Neural Network that, instead of considering a distribution over weights, generates parameters for the output distribution of a layer by transforming its inputs with learnable sub-layers. In uncertainty quality estimation experiments, we show that VNNs achieve better uncertainty quality than Monte Carlo Dropout or Bayes By Backpropagation methods.  ( 2 min )
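    A minimal sketch of the layer construction as described (PyTorch; the illustrative shapes and the Gaussian output distribution are assumptions):

        import torch
        import torch.nn as nn

        class VariationalLayer(nn.Module):
            """Two learnable sub-layers map the input to the mean and scale of
            a Gaussian over the layer's pre-activations, sampled per call."""

            def __init__(self, d_in, d_out):
                super().__init__()
                self.mean = nn.Linear(d_in, d_out)
                self.log_scale = nn.Linear(d_in, d_out)

            def forward(self, x):
                mu = self.mean(x)
                sigma = torch.exp(self.log_scale(x))
                return mu + sigma * torch.randn_like(mu)  # one sample per call

        # Repeated forward passes give different samples; their spread serves as
        # the uncertainty estimate, without a distribution over the weights.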
    CARD: Classification and Regression Diffusion Models. (arXiv:2206.07275v4 [stat.ML] UPDATED)
    Learning the distribution of a continuous or categorical response variable $\boldsymbol y$ given its covariates $\boldsymbol x$ is a fundamental problem in statistics and machine learning. Deep neural network-based supervised learning algorithms have made great progress in predicting the mean of $\boldsymbol y$ given $\boldsymbol x$, but they are often criticized for not accurately capturing the uncertainty of their predictions. In this paper, we introduce classification and regression diffusion (CARD) models, which combine a denoising diffusion-based conditional generative model and a pre-trained conditional mean estimator, to accurately predict the distribution of $\boldsymbol y$ given $\boldsymbol x$. We demonstrate the outstanding ability of CARD in conditional distribution prediction with both toy examples and real-world datasets; the experimental results show that CARD generally outperforms state-of-the-art methods, including Bayesian neural network-based ones designed for uncertainty estimation, especially when the conditional distribution of $\boldsymbol y$ given $\boldsymbol x$ is multi-modal. In addition, we utilize the stochastic nature of the generative model outputs to obtain a finer granularity in model confidence assessment at the instance level for classification tasks.  ( 2 min )
    Two-Tailed Averaging: Anytime Adaptive Once-in-a-while Optimal Iterate Averaging for Stochastic Optimization. (arXiv:2209.12581v2 [stat.ML] UPDATED)
    Tail averaging improves on Polyak averaging's non-asymptotic behaviour by excluding a number of leading iterates of stochastic optimization from its calculations. In practice, with a finite number of optimization steps and a learning rate that cannot be annealed to zero, tail averaging can get much closer to a local minimum point of the training loss than either the individual iterates or the Polyak average. However, the number of leading iterates to ignore is an important hyperparameter, and starting averaging too early or too late leads to inefficient use of resources or suboptimal solutions. Our work focusses on improving generalization, which makes setting this hyperparameter even more difficult, especially in the presence of other hyperparameters and overfitting. Furthermore, before averaging starts, the loss is only weakly informative of the final performance, which makes early stopping unreliable. To alleviate these problems, we propose an anytime variant of tail averaging intended to improve generalization rather than pure optimization; it has no hyperparameters and approximates the optimal tail at all optimization steps. Our algorithm is based on two running averages with adaptive lengths bounded in terms of the optimal tail length, one of which achieves approximate optimality with some regularity. Requiring only the additional storage for two sets of weights and periodic evaluation of the loss, the proposed two-tailed averaging algorithm is a practical and widely applicable method for improving generalization.  ( 2 min )
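    A simplified sketch of the two-running-averages idea (the paper's exact switching rule and optimality guarantees are not reproduced; the periodic-evaluation trigger here is an assumption):

        import numpy as np

        class TwoTailedAverage:
            """Keep a long and a short running average of the weights; when the
            short one evaluates better, promote it and restart it, so the
            averaging window adapts without a hand-tuned start point."""

            def __init__(self, dim):
                self.long_sum, self.long_n = np.zeros(dim), 0
                self.short_sum, self.short_n = np.zeros(dim), 0

            def update(self, weights):
                self.long_sum += weights
                self.long_n += 1
                self.short_sum += weights
                self.short_n += 1

            def maybe_switch(self, evaluate):
                """evaluate: callback mapping averaged weights to a validation loss."""
                if self.short_n == 0:
                    return
                if (evaluate(self.short_sum / self.short_n)
                        < evaluate(self.long_sum / self.long_n)):
                    self.long_sum, self.long_n = self.short_sum.copy(), self.short_n
                    self.short_sum[:] = 0.0
                    self.short_n = 0

            @property
            def average(self):
                return self.long_sum / max(self.long_n, 1)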
    Robust convex biclustering with a tuning-free method. (arXiv:2212.03122v1 [stat.ME])
    Biclustering is widely used in fields including gene information analysis, text mining, and recommendation systems, as it effectively discovers local correlations between samples and features. However, many biclustering algorithms break down when facing heavy-tailed data. In this paper, we propose a robust version of the convex biclustering algorithm with the Huber loss. Yet, the newly introduced robustification parameter brings an extra burden to selecting the optimal parameters. Therefore, we propose a tuning-free method for automatically selecting the optimal robustification parameter with high efficiency. The simulation study demonstrates that our proposed method outperforms traditional biclustering methods when encountering heavy-tailed noise. A real-life biomedical application is also presented. The R package RcvxBiclustr is available at https://github.com/YifanChen3/RcvxBiclustr.  ( 2 min )
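    For reference, the Huber loss that replaces the squared error (standard definition; the paper's tuning-free rule for selecting the robustification parameter is not reproduced here):

        import numpy as np

        def huber(r, delta):
            """Quadratic for |r| <= delta, linear beyond: large residuals from
            heavy-tailed noise contribute only linearly to the objective."""
            r = np.asarray(r)
            return np.where(np.abs(r) <= delta,
                            0.5 * r ** 2,
                            delta * (np.abs(r) - 0.5 * delta))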
    Uniform-in-Time Propagation of Chaos for Mean Field Langevin Dynamics. (arXiv:2212.03050v1 [math.PR])
    We study the uniform-in-time propagation of chaos for mean field Langevin dynamics with a convex mean field potential. Convergence in both the Wasserstein-$2$ distance and relative entropy is established. We do not require the mean field potential functional to bear either small mean field interaction or displacement convexity, which are common constraints in the literature. In particular, this allows us to study the efficiency of the noisy gradient descent algorithm for training two-layer neural networks.  ( 2 min )
    Learning the joint distribution of two sequences using little or no paired data. (arXiv:2212.03232v1 [cs.LG])
We present a noisy channel generative model of two sequences, for example text and speech, which enables uncovering the association between the two modalities when limited paired data is available. To address the intractability of the exact model under a realistic data setup, we propose a variational inference approximation. To train this variational model with categorical data, we propose a KL encoder loss approach which has connections to the wake-sleep algorithm. Identifying the joint or conditional distributions by only observing unpaired samples from the marginals is only possible under certain conditions on the data distribution, and we discuss under what types of conditional independence assumptions this might be achieved, which guides the architecture designs. Experimental results show that even a tiny amount of paired data (5 minutes) is sufficient to learn to relate the two modalities (graphemes and phonemes here) when a massive amount of unpaired data is available, paving the path to adopting this principled approach for all seq2seq models in low-data-resource regimes.  ( 2 min )
    Decentralized Stochastic Gradient Descent Ascent for Finite-Sum Minimax Problems. (arXiv:2212.02724v1 [cs.LG])
Minimax optimization problems have attracted significant attention in recent years due to their widespread application in numerous machine learning models. To solve the minimax optimization problem, a wide variety of stochastic optimization methods have been proposed. However, most of them ignore the distributed setting where the training data is distributed on multiple workers. In this paper, we develop a novel decentralized stochastic gradient descent ascent method for the finite-sum minimax optimization problem. In particular, by employing the variance-reduced gradient, our method can achieve $O(\frac{\sqrt{n}\kappa^3}{(1-\lambda)^2\epsilon^2})$ sample complexity and $O(\frac{\kappa^3}{(1-\lambda)^2\epsilon^2})$ communication complexity for the nonconvex-strongly-concave minimax optimization problem. As far as we know, our work is the first to achieve such theoretical complexities for this kind of problem. Finally, we apply our method to optimize the AUC maximization problem, and the experimental results confirm the effectiveness of our method.  ( 2 min )
    Identification of Unobservables in Observations. (arXiv:2212.02585v1 [econ.EM])
In empirical studies, the data usually don't include all the variables of interest in an economic model. This paper shows the identification of unobserved variables in observations at the population level. When the observables are distinct in each observation, there exists a function mapping from the observables to the unobservables. Such a function guarantees the uniqueness of the latent value in each observation. The key lies in the identification of the joint distribution of observables and unobservables from the distribution of observables. The joint distribution of observables and unobservables then reveals the latent value in each observation. Three examples of this result are discussed.  ( 2 min )
    PRISM: Probabilistic Real-Time Inference in Spatial World Models. (arXiv:2212.02988v1 [cs.LG])
    We introduce PRISM, a method for real-time filtering in a probabilistic generative model of agent motion and visual perception. Previous approaches either lack uncertainty estimates for the map and agent state, do not run in real-time, do not have a dense scene representation or do not model agent dynamics. Our solution reconciles all of these aspects. We start from a predefined state-space model which combines differentiable rendering and 6-DoF dynamics. Probabilistic inference in this model amounts to simultaneous localisation and mapping (SLAM) and is intractable. We use a series of approximations to Bayesian inference to arrive at probabilistic map and state estimates. We take advantage of well-established methods and closed-form updates, preserving accuracy and enabling real-time capability. The proposed solution runs at 10Hz real-time and is similarly accurate to state-of-the-art SLAM in small to medium-sized indoor environments, with high-speed UAV and handheld camera agents (Blackbird, EuRoC and TUM-RGBD).  ( 2 min )
    QFT: Post-training quantization via fast joint finetuning of all degrees of freedom. (arXiv:2212.02634v1 [stat.ML])
The post-training quantization (PTQ) challenge of bringing quantized neural network accuracy close to the original has drawn much attention, driven by industry demand. Many of the methods emphasize optimization of a specific degree of freedom (DoF), such as quantization step size, preconditioning factors, or bias fixing, often chained to others in multi-step solutions. Here we rethink quantized network parameterization in a HW-aware fashion, towards a unified analysis of all quantization DoF, permitting for the first time their joint end-to-end finetuning. Our single-step, simple, and extendable method, dubbed quantization-aware finetuning (QFT), achieves 4-bit weight quantization results on par with the SoTA within PTQ constraints of speed and resource.  ( 2 min )
    A Strongly Polynomial Algorithm for Approximate Forster Transforms and its Application to Halfspace Learning. (arXiv:2212.03008v1 [cs.DS])
    The Forster transform is a method of regularizing a dataset by placing it in {\em radial isotropic position} while maintaining some of its essential properties. Forster transforms have played a key role in a diverse range of settings spanning computer science and functional analysis. Prior work had given {\em weakly} polynomial time algorithms for computing Forster transforms, when they exist. Our main result is the first {\em strongly polynomial time} algorithm to compute an approximate Forster transform of a given dataset or certify that no such transformation exists. By leveraging our strongly polynomial Forster algorithm, we obtain the first strongly polynomial time algorithm for {\em distribution-free} PAC learning of halfspaces. This learning result is surprising because {\em proper} PAC learning of halfspaces is {\em equivalent} to linear programming. Our learning approach extends to give a strongly polynomial halfspace learner in the presence of random classification noise and, more generally, Massart noise.  ( 2 min )

  • Open

    Fears of AI
AI is a spectacular tool, but its progress is worrisome. As with technology as a whole in civilization, AI is progressing faster than we can acclimate, and could progress much faster still. By handing off the bureaucratic reins of things like content curation, job/school admissions, and phone service call centers, among other societal functions, to AI, we will depersonalize and dehumanize the workings of human civilization for the sake of "efficiency" and "productivity". Think about it this way: if you're born in the age of AI into the lower or middle class, you grow up with the same cohort of people. Your entertainment is streamlined to you through algorithms telling you what you ought to consume. Your preferences are, for the most part, shaped by a computer. You graduate from school, maybe you apply to …  ( 51 min )
    Google introduces new machine learning addon for Google Sheets
    submitted by /u/DominoChessMaster [link] [comments]  ( 46 min )
    Open AI GPT - Answer to US Rare Earth Mining Collaboration
    My Question: When will the federal government, local government, the mining industry and environmental advocates come together and support the mining and processing rare earth metals in the United States? OPEN AI GPT Answer: https://beta.openai.com/playground [The federal government, local government, mining industry, and environmental advocates are already collaborating to support the mining and processing of rare earth metals in the United States. The U.S. Department of Energy has established a Rare Earths Supply Chain Technology and Critical Materials Institute to research, develop, and deploy materials processing technologies and systems. In addition, the Energy Department has funded research projects to explore the potential for domestic rare earths production. The U.S. Geological Survey is also involved in researching the potential for domestic rare earths production. The mining industry has also been actively involved in exploring the potential for domestic rare earths production. The American Exploration & Mining Association has established a Rare Earths Task Force to study the issue and provide recommendations. Environmental advocates have been working with the government and industry to ensure that any domestic rare earths production is done responsibly. Several environmental groups have expressed support for domestic rare earths production, as long as it is done in a sustainable manner. ] submitted by /u/JeffJHornEsq [link] [comments]  ( 47 min )
    Microsoft CTO Scott Stein predicts "most exciting" 2023 for AI with GPT-4 on the horizon
    submitted by /u/Number_5_alive [link] [comments]  ( 44 min )
    When ChatGPT meets trading cards! I asked several questions and the answers are so accurate, it is unbelievable.
    submitted by /u/skiwi2 [link] [comments]  ( 43 min )
AIs beat crypto in search interest not so long ago
    submitted by /u/UncleGutenberg [link] [comments]  ( 46 min )
    Assistance for choosing my career.
Hi everyone, exactly one year ago, on December 8, 2021, I earned a first-class honors degree in BEng (Hons) in software engineering. In January 2023, I'm going to complete 1.5 years in my current employment as a data engineer. I worked as a software engineering intern for a year in my third year of undergrad. I currently possess a foundational skill set in software engineering, data engineering, data science, machine learning, deep learning, and computer vision. During my undergraduate studies, by taking part in hackathons and working on my projects, I discovered that I am more drawn to the technical solutions I provide for problems than to the coding and development of them. I felt that rather than sitting and writing code, I'm more interested in offering a creative solution to a problem. …  ( 49 min )
    P. W. Singer - Wired For War & Burn-In
    submitted by /u/timothy-ventura [link] [comments]  ( 46 min )
    AI that listens to the sounds you make in the toilet
This is an extract from AI With Vibes, a daily AI newsletter: https://aiwithvibes.com/ AI listens to toilet sounds to guess whether people have diarrhea. Wait…whaaaaatttt??? An artificial intelligence that can detect diarrhea with 98 percent accuracy from recordings of toilet sounds could help track outbreaks of diseases such as cholera. So basically it has a mic to listen to the sounds you, your tummy, and your poo make while you are doing your business, then feeds that sound into an AI trained on, god knows what, 350 recordings of toilet-based sounds from YouTube and the sound database Soundsnap – covering standard defecation, diarrhea, urination, and flatulence. Thi…  ( 46 min )
    USE ChatGPT To Write Text to Image Prompts!
    submitted by /u/PuppetHere [link] [comments]  ( 44 min )
    PP-Matting: High-Accuracy Natural Image Matting
Hi, all, I'd like to introduce PP-Matting, a novel model for the high-accuracy natural image matting task. Hope this is of some help to you. Arxiv: https://arxiv.org/pdf/2204.09433.pdf Source code and models: https://github.com/PaddlePaddle/PaddleSeg/tree/release/2.7/Matting Abstract: Natural image matting is a fundamental and challenging computer vision task. It has many applications in image editing and composition. Recently, deep learning based approaches have achieved great improvements in image matting. However, most of them require a user-supplied trimap as an auxiliary input, which limits the matting applications in the real world. Although some trimap-free approaches have been proposed, the matting quality is still unsatisfactory compared to trimap-based ones. Without the trimap…  ( 51 min )
    Hi everyone, my startup (SelfieWiz) is Looking to hire a prompt engineer on contract. Anyone interested?
    submitted by /u/toniena [link] [comments]  ( 45 min )
    Ask Me Anything with Ed Grefenstette, Head of Machine Learning at Cohere, and an Honorary Associate Professor at UCL, happening today 10:00 am -12:00 pm EST ❤️
    submitted by /u/techn0_cratic [link] [comments]  ( 46 min )
    The Brilliance and Weirdness of ChatGPT
    submitted by /u/eddytony96 [link] [comments]  ( 50 min )
    ChatGPT for Mac, living in your menubar
    submitted by /u/tim_toum [link] [comments]  ( 53 min )
    Important and Great. Text to Speech. HELP!
Hello, as you might know, we have had Loquendo since the damn 2006. Now I have a powerful AI image generator on my PC but still can't download a DESKTOP PROGRAM that can turn text into speech??? Can't be, really, can't be possible that Stable Diffusion has a desktop GUI so you can use it without internet but we still don't have a GREAT and completely FREE text-to-voice app. Please help. submitted by /u/Jupinel [link] [comments]  ( 52 min )
    [R] SOTA Real-Time Semantic Segmentation Model
Hi, all, I'd like to introduce PP-LiteSeg, a novel model for the real-time semantic segmentation task. PP-LiteSeg achieves a superior trade-off between accuracy and speed compared to other methods. Hope this is of some help to you. Arxiv: https://arxiv.org/abs/2204.02681 Source code and models: https://github.com/PaddlePaddle/PaddleSeg [Figure: comparison of accuracy and speed on the Cityscapes test set.] PP-LiteSeg adopts the encoder-decoder architecture. A lightweight network is used as an encoder to extract hierarchical features. The Simple Pyramid Pooling Module (SPPM) is in charge of aggregating the global context. The Flexible Decoder (FLD) predicts the outcome by fusing detail and semantic features from high level to low level. In addition, FLD makes use of the Unified Attention Fusion Module (UAFM), which utilizes spatial and channel attention, to strengthen feature representations. [Figure: architecture overview of PP-LiteSeg and the framework of the UAFM.] submitted by /u/Effective_Tax_2096 [link] [comments]  ( 50 min )
One Day Solving Problems - AI making everyone an artist, today!
    submitted by /u/Kami199199 [link] [comments]  ( 43 min )
    Me using GPT for the first time
    submitted by /u/MaadHater [link] [comments]  ( 43 min )
The world in the eyes of Vincent Van Gogh 🌌✨
    submitted by /u/shama_mohamed [link] [comments]  ( 42 min )
    I feel like I’m at the start of an addiction to AI… help!? advice!?
I spent $6 on using GPT-3.5 today from OpenAI. I am a senior in college. I thought AIs were dumb/incapable/crude until this past week… When the whole Lensa AI “Magic Avatars” craze based on Stable Diffusion took Instagram by storm, I started doing my research and got hooked. I’ve dallied with DALL·E but found the true temptation in GPT-3.5. Of course, GPT-3.5 and ChatGPT just came out, and while ChatGPT is great for most conversational stuff… GPT-3.5 is a powerhouse. It is so extremely dynamic. I fed it an 1820s family's household budget and it was able to generate the price of 77 items back in that time period. That's a very specific topic I'm sure I've tried to google at least twenty times before and had never found such quantity and at such speed as GPT-3.5 can. It cuts through so much bullshit…  ( 51 min )
    Limited Memory AI
I am trying to make a limited-memory AI, but I can't seem to find any sources on how to make one. Do you guys know? submitted by /u/ProGamer171_ [link] [comments]  ( 47 min )
  • Open

[N] CICERO AMA: "We're the Meta AI research team behind CICERO, the first AI agent to achieve human-level performance in the game Diplomacy. We’ll be answering your questions on December 8th starting at 10AM PT. Ask us anything!"
    submitted by /u/gwern [link] [comments]  ( 59 min )
    Getting into RL as a undergrad
I am an undergrad majoring in bioengineering. I got interested in RL. Can someone guide me with resources and a learning path? Can you also suggest a career path? Thanks in advance. submitted by /u/Shivaram_3223 [link] [comments]  ( 55 min )
Can someone please guide me on how to plot reward vs. step while training is running, or at least after each trial? I am using Python.
    submitted by /u/Kucing_koyangi [link] [comments]  ( 53 min )
    Log Probability in Off-Policy methods
Dear RL community, last time I showed how the log probability (the logarithm of the Gaussian probability density function) depends on the delta ∆ = the difference between the policy mean and the actions taken, and how it reduces to a simple logarithm of the standard deviation when this difference is very small. The logic of log_prob is to MOVE TOWARDS ACTIONS TAKEN when the standard deviation is bigger and the difference is high (positive gradient), or RUN AWAY from them when the standard deviation and the difference are lower (negative gradient). There is little sense in moving towards actions taken if they are not going to provide a better return, which is why log_prob is multiplied by the return. The return affects how much t…  ( 56 min )
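For anyone following along, the quantity under discussion is just the Gaussian log-density, which splits into a $-\log\sigma$ term and a $-\Delta^2/(2\sigma^2)$ term; when $\Delta \approx 0$ only the standard-deviation term survives, as described above. A minimal check:

```python
import numpy as np

def gaussian_log_prob(action, mean, std):
    # log N(a | mu, sigma) = -log(sigma * sqrt(2*pi)) - delta^2 / (2 * sigma^2)
    delta = action - mean
    return -np.log(std * np.sqrt(2 * np.pi)) - delta**2 / (2 * std**2)

print(gaussian_log_prob(0.0, 0.0, 1.0))   # delta = 0: just -log(sigma*sqrt(2*pi))
print(gaussian_log_prob(3.0, 0.0, 1.0))   # large delta: quadratic term dominates
```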
    Are there any good robotics simulators/prior code which can be leveraged to simulate MDPs and POMDPs (not a 2D world)?
    Hi everyone! I was wondering if there are any open sourced simulators/prior code on ROS/any framework which I can leverage to realistically simulate any MDP/POMDP scenario to test out something I theorized? (I am essentially looking for something which is realistic rather than a 2D grid world.) Many thanks in advance! Edit 1: Adding resources from the comments for people coming back to the post later on! 1. Mujoco 2. Gymnasium 3. PyBullet 4. AirSim 5. Webots 6. Unity submitted by /u/E-Cockroach [link] [comments]  ( 59 min )
Why does goal-conditioned RL struggle with long-horizon goals?
    Many planning-related papers say this and many of them point to this SoRB paper, which empirically shows it but I can't find the theory behind it. I have some intuitive feeling but if possible I'd like to know if there's some work that talks about this more. When the goals can be more than dozens of steps in the future, is it the value function or the policy that's more difficult to learn? Is it even the number of steps? Is it more about the distribution of possible goals? Or stochasticity of state transitions? submitted by /u/connery123 [link] [comments]  ( 57 min )
  • Open

    [P] bias when estimating a ratio
    Hello, hoping someone out there has dealt with something similar and has suggestions on things I could try. I am basically trying to estimate a ratio y/x with a k-nearest neighbors regressor. On validation data when x is small, I am significantly underestimating the ratio on average. When x is large, we have a relatively unbiased model. I took a look at this wiki page (https://en.wikipedia.org/wiki/Ratio_estimator) which seems somewhat relevant but I'm having a bit of trouble parsing how I could apply it to my situation, if at all. Is there some statistical correction method I can apply so that my predictions are unbiased? submitted by /u/SantyClause [link] [comments]  ( 58 min )
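Not an answer, but a minimal reproduction (all numbers made up) that may help isolate the issue: with symmetric noise on y, the target y/x = r + noise/x becomes extremely dispersed as x → 0, so a kNN smoother averages over wildly scattered values there, which can look like bias on a finite validation split.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
n = 20_000
x = rng.uniform(0.05, 5.0, n)
y = 2.0 * x + rng.normal(0, 0.5, n)       # true ratio is 2 everywhere

knn = KNeighborsRegressor(n_neighbors=50).fit(x.reshape(-1, 1), y / x)
print(knn.predict(np.array([[0.06], [0.5], [4.0]])))
# Predictions near x ~ 0 fluctuate heavily run-to-run; regressing y on x and
# predicting y_hat / x (or modelling log y - log x) is often more stable.
```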
    36% of HellaSwag benchmark contains errors [D]
    Continuing my analysis of errors in widely-used LLM benchmarks (post on Google's GoEmotions here) — I analyzed HellaSwag and found 36% contains errors. For example, here's a prompt and set of possible completions from the dataset. Which completion do you think is most appropriate? See if you can figure it out through the haze of typos and generally non-sensical writing. Men are standing in a large green field playing lacrosse. People is around the field watching the game. men are holding tshirts watching int lacrosse playing. are being interviewed in a podium in front of a large group and a gymnast is holding a microphone for the announcers. are running side to side of the ield playing lacrosse trying to score. are in a field running around playing lacrosse. I'll keep it spoiler-free here, but the full blog post goes into detail on this example (and others) and explains why they are so problematic. Link: https://www.surgehq.ai/blog/hellaswag-or-hellabad-36-of-this-popular-llm-benchmark-contains-errors submitted by /u/BB4evaTB12 [link] [comments]  ( 59 min )
    [P] Stable Diffusion 2.1 Release
    2.1 supports the new prompting style and brings back many of the old prompts too! The differences are more data, more training, and less restrictive filtering of the dataset. The filter still stripped out adult content, but was less aggressive, which cut down the number of false positives it detected. We fine-tuned the SD 2.0 model with this updated setting, giving us a model which captures the best of both worlds. It can render beautiful architectural concepts and natural scenery with ease, and yet still produce fantastic images of people and pop culture too. The new release delivers improved anatomy and hands and is much better at a range of incredible art styles than SD 2.0. Related links: Stability AI GitHub. weights and model cards. Stable Diffusion Prompt Book online. submitted by /u/turingbook [link] [comments]  ( 59 min )
    [D] We're the Meta AI research team behind CICERO, the first AI agent to achieve human-level performance in the game Diplomacy. We’ll be answering your questions on December 8th starting at 10am PT. Ask us anything!
    PROOF: https://i.redd.it/8skvttie6j4a1.png We’re part of the research team behind CICERO, Meta AI’s latest research in cooperative AI. CICERO is the first AI agent to achieve human-level performance in the game Diplomacy. Diplomacy is a complex strategy game involving both cooperation and competition that emphasizes natural language negotiation between seven players. Over the course of 40 two-hour games with 82 human players, CICERO achieved more than double the average score of other players, ranked in the top 10% of players who played more than one game, and placed 2nd out of 19 participants who played at least 5 games. Here are some highlights from our recent announcement: NLP x RL/Planning: CICERO combines techniques in NLP and RL/planning, by coupling a controllable dialogue modul…  ( 63 min )
    [P] Retrieval metrics: descriptions, formulas, examples and code
Hi, guys! I'm working on the OpenMetricLearning project, and we've just finished polishing the module for calculating retrieval metrics. We tried to make the documentation self-sufficient, so each metric includes a text description, math formula, input-output example, and source code. The code works fast enough because all of the metrics are vectorised, and if you call them from the "umbrella" function they will share some intermediate computations. For now, we have the following metrics: CMC@k, Precision@k, MAP@k, FNMR@FMR. If you found the information useful, you can reward us with a star on GitHub! submitted by /u/Zestyclose-Check-751 [link] [comments]  ( 58 min )
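For reference, here is the standard vectorised definition of one of the listed metrics (a generic sketch, not OpenMetricLearning's actual implementation):

```python
import numpy as np

def precision_at_k(is_relevant, k):
    # is_relevant: (n_queries, n_gallery) boolean array whose rows are already
    # ordered by descending retrieval score; Precision@k = mean relevance of top k.
    return is_relevant[:, :k].mean(axis=1)

rel = np.array([[1, 0, 1, 0],
                [0, 1, 1, 1]], dtype=bool)
print(precision_at_k(rel, k=2))   # -> [0.5, 0.5]
```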
    [D] HIGHLY SKEWED data! what to do?
Hey fellow ML enthusiasts! I'm getting started in data science, and post-boot camp I am now finally working on some freelance projects. Currently, I'm working with a winery in Europe on a dataset and I have a question. The dataset is HIGHLY skewed. So, in order to rectify it, I started with standardization and normalization of the features (log transformation, square transformation, etc.). It did improve the skew, but not by a lot. Some features seem to look normal but they have high kurtosis and skewness. I want to know what I can do in such a case. I plan to run a regression and a k-means clustering. Inputs are appreciated! submitted by /u/a_sooshii [link] [comments]  ( 60 min )
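One concrete option worth trying, sketched here on synthetic data: the Yeo-Johnson power transform fits a per-feature exponent by maximum likelihood and, unlike Box-Cox, tolerates zeros and negative values.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
X = rng.lognormal(mean=0.0, sigma=1.5, size=(500, 3))   # heavily right-skewed
print("skewness before:", skew(X).round(2))

X_t = PowerTransformer(method="yeo-johnson").fit_transform(X)
print("skewness after: ", skew(X_t).round(2))
```

Note that k-means and regression care about different things (cluster geometry vs. residual distribution), so it is worth checking whether the transform actually helps each downstream task rather than targeting zero skew for its own sake.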
    [R] Compute-Efficient Deep Learning: Algorithmic Trends and Opportunities
    Abstract: Although deep learning has made great progress in recent years, the exploding economic and environmental costs of training neural networks are becoming unsustainable. To address this problem, there has been a great deal of research on algorithmically-efficient deep learning, which seeks to reduce training costs not at the hardware or implementation level, but through changes in the semantics of the training program. In this paper, we present a structured and comprehensive overview of the research in this field. First, we formalize the algorithmic speedup problem, then we use fundamental building blocks of algorithmically efficient training to develop a taxonomy. Our taxonomy highlights commonalities of seemingly disparate methods and reveals current research gaps. Next, we present evaluation best practices to enable comprehensive, fair, and reliable comparisons of speedup techniques. To further aid research and applications, we discuss common bottlenecks in the training pipeline (illustrated via experiments) and offer taxonomic mitigation strategies for them. Finally, we highlight some unsolved research challenges and present promising future directions. ​ Paper: https://arxiv.org/pdf/2210.06640.pdf submitted by /u/bhavyakailkhura [link] [comments]  ( 60 min )
    [P] Pure python implementation of Mecab morpheme analyzer
Mecab is a CRF-based morpheme analyzer made by Taku Kudo in 2011. It is very fast and accurate at the same time, which is why it is still very popular even though it is quite old. However, it is known to be one of the trickiest libraries to install, and in fact many people have had a hard time installing Mecab. So, a few years ago, I decided I wanted to make a pure Python version of Mecab that was easy to install while inheriting the advantages of Mecab. Now Pecab is out: it produces results very similar to Mecab while being easy to install. For more details, please refer to the following link. https://github.com/hyunwoongko/pecab submitted by /u/Own_Feeling_416 [link] [comments]  ( 64 min )
    [Discussion] Suggestions on Trust Region Methods For Natural Gradient
    Hello! I've been working on a project that uses the Natural Gradient and I was wondering if anyone has suggestions on ways to include higher-order information. Is there some equivalent to the Hessian for natural gradients? If not, is there some sort of way of finding a trust region where the Natural Gradient approximation is reasonable? Thank you! submitted by /u/randomkolmogorov [link] [comments]  ( 60 min )
    Best Machine Learning Algorithm for Soil Mapping and Testing [Discussion]
I looked into a few sources saying that random forest is one of the most popular MLAs for soil mapping, especially since it is suited for both classification and regression problems. The only con for this one is that it is not suited for large data sets. I would definitely need some guidance in identifying the best MLA for soil mapping and testing; it would help me a lot hearing from you guys! I have based this info on: https://soilmapper.org/soilmapping-using-mla.html, chapter 6.1.3. submitted by /u/RalphuChino [link] [comments]  ( 61 min )
    [Discussion] No-code ML for engineers
    I work in an oil refinery. Beyond my regular role, I have been working on Python-based analysis at my workplace, including machine learning. Many colleagues have sent their data to me for analysis or to create ML models, but I do not have time to process all the requests (though I’d love to). I’m hoping to look for a no-code and low-cost method that empowers chemical/mechanical/electrical engineers (who have no Python or ML knowledge) to attempt ML studies on their data, before passing it to me for further work or to put into production. We happen to be using Power BI for dashboarding. Is asking the engineers to use Power BI Premium Pay-per-user AutoML a good idea? Or are there better, or cheaper or easier to use platforms? Thanks for your advice. Additional question: would anyone know the full list of models that are considered by Power BI’s automl? Googling doesn’t seem to give me such info. submitted by /u/kayhai [link] [comments]  ( 64 min )
    [R] Predictive End-to-End Enterprise Process Network Monitoring
    This paper presents a method for predictive enterprise process network monitoring leveraging a novel multi-headed deep neural network model. The model integrates multiple data sources from an enterprise process network, such as process logs or context information. With this deep learning architecture, the heterogeneous data are processed in dedicated neural network input heads and concatenated for prediction based on cross-department information. The results from a case study conducted with a medium-sized German manufacturing company shed light on the practical relevance. Researchgate: https://www.researchgate.net/publication/366063386_Predictive_End-to-End_Enterprise_Process_Network_Monitoring Springer: https://link.springer.com/article/10.1007/s12599-022-00778-4 submitted by /u/Positive_Ad_1090 [link] [comments]  ( 58 min )
    [D] resources related to multi-task learning for graph neural network
Hi, I am just wondering if I can apply multi-task learning to GNNs (for text classification). Which parameters should I share across layers, especially for hard parameter sharing? As per the PyTorch docs (CMIIW), the shared parameters would be the encoder and decoder; what about for a GNN? Should I apply an encoder on top of the graph layer, or what? I've been kinda stuck on this recently. submitted by /u/aozorahime [link] [comments]  ( 61 min )
    [D] If you had to pick 10-20 significant papers that summarize the research trajectory of AI from the past 100 years what would they be
You can only pick a max of 20 papers, and they should cover the major milestones/turning points in AI research. What would those papers be? In terms of significance, I'm looking for papers along the lines of "Attention Is All You Need" - https://arxiv.org/abs/1706.03762 that mark big shifts/breakthroughs in the field. submitted by /u/versaceblues [link] [comments]  ( 69 min )
    [P] Build data apps with GPT-3 in hal9
Hi 👋🏼 I'm Javier. We've been working on an OSS library called hal9. It allows you to build data applications with Python and R with a callback model, kinda like something between streamlit and dash. We are currently exploring using GPT-3 to generate apps with streamlit and hal9. I'm super excited to make this post and collect your thoughts; you can play with it here: hal9.com/build. Feel free to open GitHub issues for questions, feedback, or issues as needed. Thank you! submitted by /u/northwestredditor [link] [comments]  ( 58 min )
    [D] Online Portfolio Selection - Exponentiated Gradient
https://sudeepraja.github.io/OPS3/ I wrote a blog post on the Exponentiated Gradient algorithm and its variants. It can be used for selecting a portfolio from a set of stocks. It is computationally inexpensive (linear per iteration) and has the universal portfolio property (asymptotic average regret is 0). The post includes a gist with code for the algorithm and backtests on simulated and historical data. Here are the previous two posts in this series: https://sudeepraja.github.io/OPS1/ https://sudeepraja.github.io/OPS2/ submitted by /u/sudeepraja [link] [comments]  ( 57 min )
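For readers who don't click through, the core update (in the style of Helmbold et al.'s EG portfolio rule; the learning rate below is illustrative) fits in a few lines:

```python
import numpy as np

def eg_update(w, x, eta=0.05):
    # Exponentiated Gradient step: w_i <- w_i * exp(eta * x_i / (w . x)), renormalised.
    # x_i is the price relative of asset i (today's price / yesterday's price).
    w_new = w * np.exp(eta * x / np.dot(w, x))
    return w_new / w_new.sum()

w = np.full(3, 1 / 3)                     # start from the uniform portfolio
x = np.array([1.02, 0.97, 1.10])          # one day of price relatives
print(eg_update(w, x))                    # weight shifts toward the best asset
```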
  • Open

    Private Ads Prediction with DP-SGD
    Posted by Krishna Giri Narra, Software Engineer, Google, and Chiyuan Zhang, Research Scientist, Google Research Ad technology providers widely use machine learning (ML) models to predict and present users with the most relevant ads, and to measure the effectiveness of those ads. With increasing focus on online privacy, there’s an opportunity to identify ML algorithms that have better privacy-utility trade-offs. Differential privacy (DP) has emerged as a popular framework for developing ML algorithms responsibly with provable privacy guarantees. It has been extensively studied in the privacy literature, deployed in industrial applications and employed by the U.S. Census. Intuitively, the DP framework enables ML models to learn population-wide properties, while protecting user-level inf…  ( 93 min )
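As background (this is the generic DP-SGD recipe, not Google's production implementation): each example's gradient is clipped to a fixed norm before averaging, and calibrated Gaussian noise is then added so that no single example can noticeably change the model.

```python
import numpy as np

def dp_sgd_step(params, per_example_grads, lr, clip_norm, noise_mult, rng):
    # Clip each per-example gradient to clip_norm ...
    clipped = [g * min(1.0, clip_norm / (np.linalg.norm(g) + 1e-12))
               for g in per_example_grads]
    mean_grad = np.mean(clipped, axis=0)
    # ... then add Gaussian noise calibrated to the clipping norm.
    noise = rng.normal(0.0, noise_mult * clip_norm / len(clipped),
                       size=mean_grad.shape)
    return params - lr * (mean_grad + noise)
```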
    Google at EMNLP 2022
    Posted by Malaya Jules, Program Manager, Google This week, the premier conference on Empirical Methods in Natural Language Processing (EMNLP 2022) is being held in Abu Dhabi, United Arab Emirates. We are proud to be a Diamond Sponsor of EMNLP 2022, with Google researchers contributing at all levels. This year we are presenting over 50 papers and are actively involved in 10 different workshops and tutorials. If you’re registered for EMNLP 2022, we hope you’ll visit the Google booth to learn more about the exciting work across various topics, including language interactions, causal inference, question answering and more. Take a look below to learn more about the Google research being presented at EMNLP 2022 (Google affiliations in bold). Committees Organizing Committee incl…  ( 92 min )
  • Open

    Improve scalability for Amazon Rekognition stateless APIs using multiple regions
In a previous blog post, we described an end-to-end identity verification solution in a single AWS Region. The solution uses the Amazon Rekognition APIs DetectFaces for face detection and CompareFaces for face comparison. We think of those APIs as stateless APIs because they don’t depend on an Amazon Rekognition face collection. They’re also idempotent, meaning repeated […]  ( 7 min )
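A hedged sketch of the pattern the post describes: because the APIs are stateless, the identical CompareFaces request can simply be retried in another region on throttling (the region list and error handling below are assumptions, not the post's code):

```python
import boto3
from botocore.exceptions import ClientError

REGIONS = ["us-east-1", "us-west-2"]   # illustrative region list

def compare_faces_multi_region(source_bytes, target_bytes, threshold=90):
    for region in REGIONS:
        client = boto3.client("rekognition", region_name=region)
        try:
            return client.compare_faces(
                SourceImage={"Bytes": source_bytes},
                TargetImage={"Bytes": target_bytes},
                SimilarityThreshold=threshold,
            )
        except ClientError as err:
            # Fall through to the next region only on throttling.
            if err.response["Error"]["Code"] != "ThrottlingException":
                raise
    raise RuntimeError("all regions throttled")
```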
    Use your own training scripts and automatically select the best model using hyperparameter optimization in Amazon SageMaker
The success of any machine learning (ML) pipeline depends not just on the quality of the model used, but also on the ability to train and iterate upon this model. One of the key ways to improve an ML model is by choosing better tunable parameters, known as hyperparameters. This is known as hyperparameter optimization (HPO). However, […]  ( 8 min )
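A minimal sketch of launching HPO with the SageMaker Python SDK; the estimator, metric regex, and S3 path are placeholders to be replaced with your own:

```python
from sagemaker.tuner import HyperparameterTuner, ContinuousParameter

# `estimator` is assumed to be a sagemaker.estimator.Estimator already
# configured with your own training script (hypothetical placeholder).
tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name="validation:accuracy",
    metric_definitions=[{"Name": "validation:accuracy",
                         "Regex": "val_acc=([0-9\\.]+)"}],  # must match your logs
    hyperparameter_ranges={"learning_rate": ContinuousParameter(1e-5, 1e-1)},
    max_jobs=20,
    max_parallel_jobs=4,
)
tuner.fit({"train": "s3://my-bucket/train"})   # hypothetical S3 path
```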
  • Open

    Visual Effects Artist Jay Lippman Takes Viewers Behind the Camera This Week ‘In the NVIDIA Studio’
    Time to tackle one of the most challenging tasks for aspiring movie makers — creating aesthetically pleasing visual effects — courtesy of visual effects artist and filmmaker Jay Lippman this week In the NVIDIA Studio. The post Visual Effects Artist Jay Lippman Takes Viewers Behind the Camera This Week ‘In the NVIDIA Studio’ appeared first on NVIDIA Blog.  ( 8 min )
    License for the AI Autobahn: NVIDIA AI Enterprise 3.0 Introduces New Tools to Speed Success
    From rapidly fluctuating demand to staffing shortages and supply chain complexity, enterprises have navigated numerous challenges the past few years. Many companies seeking strong starts to 2023 are planning to use AI and accelerated computing to drive growth while saving costs. To support these early adopters — as well as those just beginning their AI Read article > The post License for the AI Autobahn: NVIDIA AI Enterprise 3.0 Introduces New Tools to Speed Success appeared first on NVIDIA Blog.  ( 8 min )
    Banking on AI: Deutsche Bank, NVIDIA to Accelerate Adoption of AI for Financial Services
    Deutsche Bank Wednesday announced a partnership with NVIDIA to accelerate the use of AI and machine learning in the financial services sector. The announcement follows months of testing to explore use cases that could support the bank’s strategic ambitions to 2025 and beyond. “Accelerated computing and AI are at a tipping point, and we’re bringing Read article > The post Banking on AI: Deutsche Bank, NVIDIA to Accelerate Adoption of AI for Financial Services appeared first on NVIDIA Blog.  ( 5 min )
    Hittin’ the Sim: NVIDIA’s Matt Cragun on Conditioning Autonomous Vehicles in Simulation
    Training, testing and validating autonomous vehicles requires a continuous pipeline — or data factory — to introduce new scenarios and refine deep neural networks. A key component of this process is simulation. AV developers can test a virtually limitless number of scenarios, repeatably and at scale, with high-fidelity, physically based simulation. And like much of Read article > The post Hittin’ the Sim: NVIDIA’s Matt Cragun on Conditioning Autonomous Vehicles in Simulation appeared first on NVIDIA Blog.  ( 5 min )
  • Open

    Suggestion for model to predict size of particles in images?
Inputs will be images (64x64, or higher-order 256x256) and the output should be a float (size of particles in the image). I prepared a dataset using Voronoi cells. I looked into the digit recognition problem, but that seemed to be a classification problem (categories of 10 digits), so the examples used MLPClassifier or a CNN (categorical loss and softmax in the final layer). Other things I came across are MLPRegressor or a CNN with linear activation and mean squared error loss. submitted by /u/mrpacetv [link] [comments]  ( 49 min )
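A minimal PyTorch sketch of the regression option mentioned above (convolutional features, a single linear output unit, MSE loss); shapes and hyperparameters are illustrative:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 16 * 16, 64), nn.ReLU(),
    nn.Linear(64, 1),               # linear output: a raw float, not class scores
)
loss_fn = nn.MSELoss()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(8, 1, 64, 64)       # batch of 64x64 grayscale images
y = torch.rand(8, 1)                # target particle sizes (floats)
opt.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()
opt.step()
```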
    Is this the correct application for a NN?
I get that neural networks are great for classification, and 'identifying' images and photos is a very popular application for NNs. I'm designing an 'expert knowledge' system that scans a lot of different 'signals' and suggests the best course of action. Think about a patient presenting a range of symptoms at various degrees (not binary/absolute) and then the software suggesting a diagnosis and then learning when it is right/wrong. Is a NN the best thing for this? Or a series of NNs? If a NN can analyse input signals, compare against a known outcome (pattern) and then give a measure of likeness (e.g. matching signals to an image) - it's hard to imagine how that is useful where input signals can align to a large number of outputs. Am I understanding NNs correctly, or do I need to find a different way to learn what all the different input signals mean? submitted by /u/Togfox [link] [comments]  ( 45 min )
  • Open

    Sine of a circle
    What does it look like when you take the sine of a circle? Not the angle of points on a circle, but the circle itself as a set of points in the complex plane? Here’s a plot for the sine of circles of radius r centered at the origin, 0 < r < π/2. Here’s […] Sine of a circle first appeared on John D. Cook.  ( 5 min )
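For anyone who wants to reproduce the figure, a short sketch (the radii are chosen arbitrarily within the stated range):

```python
import numpy as np
import matplotlib.pyplot as plt

theta = np.linspace(0, 2 * np.pi, 1000)
for r in np.linspace(0.2, np.pi / 2, 8):
    w = np.sin(r * np.exp(1j * theta))   # image of the circle |z| = r under sine
    plt.plot(w.real, w.imag)
plt.gca().set_aspect("equal")
plt.show()
```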
    Test whether three complex numbers lie on an equilateral triangle
Let a, b, and c be three complex numbers. These numbers form the vertices of an equilateral triangle in the complex plane if and only if $\begin{vmatrix} a & b & 1 \\ b & c & 1 \\ c & a & 1 \end{vmatrix} = 0$, or equivalently $a^2 + b^2 + c^2 = ab + bc + ca$. This theorem can be found in [1]. If we rotate the matrix above, we multiply its sign by -1. If we then swap two rows we multiply the determinant […] Test whether three complex numbers lie on an equilateral triangle first appeared on John D. Cook.  ( 4 min )
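A quick numerical check of the criterion, using the scalar form of the identity:

```python
import numpy as np

def is_equilateral(a, b, c, tol=1e-9):
    # a, b, c are vertices of an equilateral triangle in the complex plane
    # iff a^2 + b^2 + c^2 = ab + bc + ca.
    return abs(a*a + b*b + c*c - a*b - b*c - c*a) < tol

w = np.exp(2j * np.pi / 3)
print(is_equilateral(1, w, w**2))   # True: cube roots of unity
print(is_equilateral(0, 1, 2))      # False: collinear points
```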
  • Open

    The value of qualitative research in a comparative user research study
Many UX designers consider several design directions and research methods for their designs. But in a world where user expectations are rising, competition is high, and design trends change frequently, it's important to stay ahead of the curve. A lot of the time designers choose… Read More »The value of qualitative research in a comparative user research study The post The value of qualitative research in a comparative user research study appeared first on Data Science Central.  ( 26 min )
    Top 7 Data Security Threats to AI and ML
    Artificial intelligence (AI) and machine learning (ML) are making waves across industries. We are beginning to see these incredible technologies pop up in more areas of our lives, from self-driving cars to healthcare, finance, and even customer service. But as more and more companies roll out these technologies en masse and start intertwining them with… Read More »Top 7 Data Security Threats to AI and ML The post Top 7 Data Security Threats to AI and ML appeared first on Data Science Central.  ( 22 min )
  • Open

    Scalable Spectral Clustering with Group Fairness Constraints. (arXiv:2210.16435v2 [cs.LG] UPDATED)
There are synergies of research interests and industrial efforts in modeling fairness and correcting algorithmic bias in machine learning. In this paper, we present a scalable algorithm for spectral clustering (SC) with group fairness constraints. Group fairness is also known as statistical parity, where in each cluster each protected group is represented with the same proportion as in the entirety. While the FairSC algorithm (Kleindessner et al., 2019) is able to find fairer clusterings, it is compromised by the high cost of explicitly computing nullspaces and square roots of dense matrices. We present a new formulation of the underlying spectral computation by incorporating nullspace projection and Hotelling's deflation, such that the resulting algorithm, called s-FairSC, involves only sparse matrix-vector products and is able to fully exploit the sparsity of the fair SC model. The experimental results on the modified stochastic block model demonstrate that s-FairSC is comparable with FairSC in recovering fair clusterings. Meanwhile, it is sped up by a factor of 12 for moderate model sizes. s-FairSC is further demonstrated to be scalable in the sense that its computational cost increases only marginally compared to SC without fairness constraints.  ( 2 min )
    Efficient Incremental Text-to-Speech on GPUs. (arXiv:2211.13939v2 [cs.SD] UPDATED)
    Incremental text-to-speech, also known as streaming TTS, has been increasingly applied to online speech applications that require ultra-low response latency to provide an optimal user experience. However, most of the existing speech synthesis pipelines deployed on GPU are still non-incremental, which uncovers limitations in high-concurrency scenarios, especially when the pipeline is built with end-to-end neural network models. To address this issue, we present a highly efficient approach to perform real-time incremental TTS on GPUs with Instant Request Pooling and Module-wise Dynamic Batching. Experimental results demonstrate that the proposed method is capable of producing high-quality speech with a first-chunk latency lower than 80ms under 100 QPS on a single NVIDIA A10 GPU and significantly outperforms the non-incremental twin in both concurrency and latency. Our work reveals the effectiveness of high-performance incremental TTS on GPUs.  ( 2 min )
    Measuring Proximity in Attributed Networks for Community Detection. (arXiv:2111.03089v1 [cs.SI] CROSS LISTED)
    Proximity measures on graphs have a variety of applications in network analysis, including community detection. Previously they have been mainly studied in the context of networks without attributes. If node attributes are taken into account, however, this can provide more insight into the network structure. In this paper, we extend the definition of some well-studied proximity measures to attributed networks. To account for attributes, several attribute similarity measures are used. Finally, the obtained proximity measures are applied to detect the community structure in some real-world networks using the spectral clustering algorithm.  ( 2 min )
    Establishment of Neural Networks Robust to Label Noise. (arXiv:2211.15279v2 [cs.LG] UPDATED)
Label noise is a significant obstacle in deep learning model training. It can have a considerable impact on the performance of image classification models, particularly deep neural networks, which are especially susceptible because they have a strong propensity to memorise noisy labels. In this paper, we examine the fundamental concepts underlying related label noise approaches. A transition matrix estimator has been created, and its effectiveness against the actual transition matrix has been demonstrated. In addition, we examined the label noise robustness of two convolutional neural network classifiers with LeNet and AlexNet designs. Experiments on the two FashionMNIST datasets revealed the robustness of both models. Due to time and computing resource constraints that prevented us from properly tuning the complex convolutional neural network model, we were unable to demonstrate the influence of the transition matrix noise correction on robustness enhancements. Additional effort is needed in future research to fine-tune the neural network model and explore the precision of the estimated transition matrix.  ( 2 min )
    How to GAN away Detector Effects. (arXiv:1912.00477v4 [hep-ph] CROSS LISTED)
    LHC analyses directly comparing data and simulated events bear the danger of using first-principle predictions only as a black-box part of event simulation. We show how simulations, for instance, of detector effects can instead be inverted using generative networks. This allows us to reconstruct parton level information from measured events. Our results illustrate how, in general, fully conditional generative networks can statistically invert Monte Carlo simulations. As a technical by-product we show how a maximum mean discrepancy loss can be staggered or cooled.  ( 2 min )
    HyperEF: Spectral Hypergraph Coarsening by Effective-Resistance Clustering. (arXiv:2210.14813v2 [cs.LG] UPDATED)
    This paper introduces a scalable algorithmic framework (HyperEF) for spectral coarsening (decomposition) of large-scale hypergraphs by exploiting hyperedge effective resistances. Motivated by the latest theoretical framework for low-resistance-diameter decomposition of simple graphs, HyperEF aims at decomposing large hypergraphs into multiple node clusters with only a few inter-cluster hyperedges. The key component in HyperEF is a nearly-linear time algorithm for estimating hyperedge effective resistances, which allows incorporating the latest diffusion-based non-linear quadratic operators defined on hypergraphs. To achieve good runtime scalability, HyperEF searches within the Krylov subspace (or approximate eigensubspace) for identifying the nearly-optimal vectors for approximating the hyperedge effective resistances. In addition, a node weight propagation scheme for multilevel spectral hypergraph decomposition has been introduced for achieving even greater node coarsening ratios. When compared with state-of-the-art hypergraph partitioning (clustering) methods, extensive experiment results on real-world VLSI designs show that HyperEF can more effectively coarsen (decompose) hypergraphs without losing key structural (spectral) properties of the original hypergraphs, while achieving over $70\times$ runtime speedups over hMetis and $20\times$ speedups over HyperSF.  ( 2 min )
    EVA: Exploring the Limits of Masked Visual Representation Learning at Scale. (arXiv:2211.07636v2 [cs.CV] UPDATED)
We launch EVA, a vision-centric foundation model to explore the limits of visual representation at scale using only publicly accessible data. EVA is a vanilla ViT pre-trained to reconstruct the masked out image-text aligned vision features conditioned on visible image patches. Via this pretext task, we can efficiently scale up EVA to one billion parameters, and set new records on a broad range of representative vision downstream tasks, such as image recognition, video action recognition, object detection, instance segmentation and semantic segmentation without heavy supervised training. Moreover, we observe that quantitative changes in scaling EVA result in qualitative changes in transfer learning performance that are not present in other models. For instance, EVA takes a great leap in the challenging large vocabulary instance segmentation task: our model achieves almost the same state-of-the-art performance on the LVISv1.0 dataset with over a thousand categories and the COCO dataset with only eighty categories. Beyond a pure vision encoder, EVA can also serve as a vision-centric, multi-modal pivot to connect images and text. We find initializing the vision tower of a giant CLIP from EVA can greatly stabilize the training and outperform its trained-from-scratch counterpart with much fewer samples and less compute, providing a new direction for scaling up and accelerating the costly training of multi-modal foundation models. To facilitate future research, we release all the code and models at https://github.com/baaivision/EVA.  ( 2 min )
    RARR: Researching and Revising What Language Models Say, Using Language Models. (arXiv:2210.08726v2 [cs.CL] UPDATED)
    Language models (LMs) now excel at many tasks such as few-shot learning, question answering, reasoning, and dialog. However, they sometimes generate unsupported or misleading content. A user cannot easily determine whether their outputs are trustworthy or not, because most LMs do not have any built-in mechanism for attribution to external evidence. To enable attribution while still preserving all the powerful advantages of recent generation models, we propose RARR (Retrofit Attribution using Research and Revision), a system that 1) automatically finds attribution for the output of any text generation model and 2) post-edits the output to fix unsupported content while preserving the original output as much as possible. When applied to the output of several state-of-the-art LMs on a diverse set of generation tasks, we find that RARR significantly improves attribution while otherwise preserving the original input to a much greater degree than previously explored edit models. Furthermore, the implementation of RARR requires only a handful of training examples, a large language model, and standard web search.  ( 2 min )
    Topical Segmentation of Spoken Narratives: A Test Case on Holocaust Survivor Testimonies. (arXiv:2210.13783v2 [cs.CL] UPDATED)
The task of topical segmentation is well studied, but previous work has mostly addressed it in the context of structured, well-defined segments, such as segmentation into paragraphs, chapters, or segmenting text that originated from multiple sources. We tackle the task of segmenting running (spoken) narratives, which poses hitherto unaddressed challenges. As a test case, we address Holocaust survivor testimonies, given in English. Other than the importance of studying these testimonies for Holocaust research, we argue that they provide an interesting test case for topical segmentation, due to their unstructured surface level, relative abundance (tens of thousands of such testimonies were collected), and the relatively confined domain that they cover. We hypothesize that boundary points between segments correspond to low mutual information between the sentences preceding and following the boundary. Based on this hypothesis, we explore a range of algorithmic approaches to the task, building on previous work on segmentation that uses generative Bayesian modeling and state-of-the-art neural machinery. Compared to manually annotated references, we find that the developed approaches show considerable improvements over previous work.  ( 2 min )
    Gamma-convergence of a nonlocal perimeter arising in adversarial machine learning. (arXiv:2211.15223v2 [math.AP] UPDATED)
    In this paper we prove Gamma-convergence of a nonlocal perimeter of Minkowski type to a local anisotropic perimeter. The nonlocal model describes the regularizing effect of adversarial training in binary classifications. The energy essentially depends on the interaction between two distributions modelling likelihoods for the associated classes. We overcome typical strict regularity assumptions for the distributions by only assuming that they have bounded $BV$ densities. In the natural topology coming from compactness, we prove Gamma-convergence to a weighted perimeter with weight determined by an anisotropic function of the two densities. Despite being local, this sharp interface limit reflects classification stability with respect to adversarial perturbations. We further apply our results to deduce Gamma-convergence of the associated total variations, to study the asymptotics of adversarial training, and to prove Gamma-convergence of graph discretizations for the nonlocal perimeter.  ( 2 min )
    RoS-KD: A Robust Stochastic Knowledge Distillation Approach for Noisy Medical Imaging. (arXiv:2210.08388v2 [cs.CV] UPDATED)
AI-powered Medical Imaging has recently achieved enormous attention due to its ability to provide fast-paced healthcare diagnoses. However, it usually suffers from a lack of high-quality datasets due to high annotation cost, inter-observer variability, human annotator error, and errors in computer-generated labels. Deep learning models trained on noisy labelled datasets are sensitive to the noise type and lead to less generalization on the unseen samples. To address this challenge, we propose a Robust Stochastic Knowledge Distillation (RoS-KD) framework which mimics the notion of learning a topic from multiple sources to ensure deterrence in learning noisy information. More specifically, RoS-KD learns a smooth, well-informed, and robust student manifold by distilling knowledge from multiple teachers trained on overlapping subsets of training data. Our extensive experiments on popular medical imaging classification tasks (cardiopulmonary disease and lesion classification) using real-world datasets, show the performance benefit of RoS-KD, its ability to distill knowledge from many popular large networks (ResNet-50, DenseNet-121, MobileNet-V2) in a comparatively small network, and its robustness to adversarial attacks (PGD, FGSM). More specifically, RoS-KD achieves >2% and >4% improvement on F1-score for lesion classification and cardiopulmonary disease classification tasks, respectively, when the underlying student is ResNet-18, against a recent competitive knowledge distillation baseline. Additionally, on the cardiopulmonary disease classification task, RoS-KD outperforms most of the SOTA baselines with a ~1% gain in AUC score.  ( 2 min )
    Transformer Meets Boundary Value Inverse Problems. (arXiv:2209.14977v2 [cs.LG] UPDATED)
    A Transformer-based deep direct sampling method is proposed for a class of boundary value inverse problems. A real-time reconstruction is achieved by evaluating the learned inverse operator between carefully designed data and the reconstructed images. An effort is made to give a specific example to a fundamental question: whether and how one can benefit from the theoretical structure of a mathematical problem to develop task-oriented and structure-conforming deep neural networks? Specifically, inspired by direct sampling methods for inverse problems, the 1D boundary data in different frequencies are preprocessed by a partial differential equation-based feature map to yield 2D harmonic extensions as different input channels. Then, by introducing learnable non-local kernels, the direct sampling is recast to a modified attention mechanism. The proposed method is then applied to electrical impedance tomography, a well-known severely ill-posed nonlinear inverse problem. The new method achieves superior accuracy over its predecessors and contemporary operator learners, as well as shows robustness with respect to noise. This research shall strengthen the insights that the attention mechanism, despite being invented for natural language processing tasks, offers great flexibility to be modified in conformity with the a priori mathematical knowledge, which ultimately leads to the design of more physics-compatible neural architectures.  ( 2 min )
    Optimal Sparse Regression Trees. (arXiv:2211.14980v2 [cs.LG] UPDATED)
    Regression trees are one of the oldest forms of AI models, and their predictions can be made without a calculator, which makes them broadly useful, particularly for high-stakes applications. Within the large literature on regression trees, there has been little effort towards full provable optimization, mainly due to the computational hardness of the problem. This work proposes a dynamic-programming-with-bounds approach to the construction of provably-optimal sparse regression trees. We leverage a novel lower bound based on an optimal solution to the k-Means clustering algorithm in 1-dimension over the set of labels. We are often able to find optimal sparse trees in seconds, even for challenging datasets that involve large numbers of samples and highly-correlated features.  ( 2 min )
    Diffusion Models for Graphs Benefit From Discrete State Spaces. (arXiv:2210.01549v2 [cs.LG] UPDATED)
Denoising diffusion probabilistic models and score matching models have proven to be very powerful for generative tasks. While these approaches have also been applied to the generation of discrete graphs, they have, so far, relied on continuous Gaussian perturbations. Instead, in this work, we suggest using discrete noise for the forward Markov process. This ensures that in every intermediate step the graph remains discrete. Compared to the previous approach, our experimental results on four datasets and multiple architectures show that using a discrete noising process results in higher quality generated samples, as indicated by average MMDs reduced by a factor of 1.5. Furthermore, the number of denoising steps is reduced from 1000 to 32, leading to a 30-times faster sampling procedure.  ( 2 min )
    Highly Efficient Real-Time Streaming and Fully On-Device Speaker Diarization with Multi-Stage Clustering. (arXiv:2210.13690v2 [eess.AS] UPDATED)
    While recent research advances in speaker diarization mostly focus on improving the quality of diarization results, there is also an increasing interest in improving the efficiency of diarization systems. In this paper, we propose a multi-stage clustering strategy that uses different clustering algorithms for inputs of different lengths. Specifically, a fallback clusterer is used to handle short-form inputs; a main clusterer is used to handle medium-length inputs; and a pre-clusterer is used to compress long-form inputs before they are processed by the main clusterer. Both the main clusterer and the pre-clusterer can be configured with an upper bound on the computational complexity to adapt to devices with different constraints. This multi-stage clustering strategy is critical for streaming on-device speaker diarization systems, where the budgets of CPU, memory and battery are tight.  ( 2 min )
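    A minimal sketch of the length-based dispatch described above might look as follows; the clusterer choices and thresholds are hypothetical stand-ins, not the ones used in the paper.

        import numpy as np
        from sklearn.cluster import AgglomerativeClustering, KMeans

        def diarize(embeddings, short_max=20, long_min=500, n_compressed=100):
            # Fallback for short inputs, main clusterer for medium inputs,
            # pre-clustering to compress long inputs (illustrative thresholds).
            n = len(embeddings)
            if n <= short_max:                        # fallback: assume one speaker
                return np.zeros(n, dtype=int)
            if n > long_min:                          # pre-cluster to bound cost
                pre = KMeans(n_clusters=n_compressed, n_init=10).fit(embeddings)
                main = AgglomerativeClustering(
                    n_clusters=None, distance_threshold=1.0).fit(pre.cluster_centers_)
                return main.labels_[pre.labels_]      # map centers back to segments
            main = AgglomerativeClustering(
                n_clusters=None, distance_threshold=1.0).fit(embeddings)
            return main.labels_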
    Algorithms with Prediction Portfolios. (arXiv:2210.12438v2 [cs.LG] UPDATED)
    The research area of algorithms with predictions has seen recent success showing how to incorporate machine learning into algorithm design to improve performance when the predictions are correct, while retaining worst-case guarantees when they are not. Most previous work has assumed that the algorithm has access to a single predictor. However, in practice, there are many machine learning methods available, often with incomparable generalization guarantees, making it hard to pick a best method a priori. In this work we consider scenarios where multiple predictors are available to the algorithm and the question is how to best utilize them. Ideally, we would like the algorithm's performance to depend on the quality of the best predictor. However, utilizing more predictions comes with a cost, since we now have to identify which prediction is the best. We study the use of multiple predictors for a number of fundamental problems, including matching, load balancing, and non-clairvoyant scheduling, which have been well-studied in the single predictor setting. For each of these problems we introduce new algorithms that take advantage of multiple predictors, and prove bounds on the resulting performance.  ( 2 min )
    The Sufficiency of Off-Policyness and Soft Clipping: PPO is still Insufficient according to an Off-Policy Measure. (arXiv:2205.10047v6 [cs.LG] UPDATED)
    The popular Proximal Policy Optimization (PPO) algorithm approximates the solution in a clipped policy space. Do there exist better policies outside of this space? By using a novel surrogate objective that employs the sigmoid function (which provides an interesting mechanism for exploration), we found that the answer is ``YES'', and that the better policies are in fact located very far from the clipped space. We show that PPO is insufficient in ``off-policyness'', according to an off-policy metric called DEON. Our algorithm explores in a much larger policy space than PPO, and it maximizes the Conservative Policy Iteration (CPI) objective better than PPO during training. To the best of our knowledge, all current PPO methods have the clipping operation and optimize in the clipped policy space. Our method is the first of its kind and advances the understanding of CPI optimization and policy gradient methods. Code is available at https://github.com/raincchio/P3O.  ( 2 min )
    Calculus on MDPs: Potential Shaping as a Gradient. (arXiv:2208.09570v2 [cs.LG] UPDATED)
    In reinforcement learning, different reward functions can be equivalent in terms of the optimal policies they induce. A particularly well-known and important example is potential shaping, a class of functions that can be added to any reward function without changing the optimal policy set under arbitrary transition dynamics. Potential shaping is conceptually similar to potentials, conservative vector fields and gauge transformations in math and physics, but this connection has not previously been formally explored. We develop a formalism for discrete calculus on graphs that abstract a Markov Decision Process, and show how potential shaping can be formally interpreted as a gradient within this framework. This allows us to strengthen results from Ng et al. (1999) describing conditions under which potential shaping is the only additive reward transformation to always preserve optimal policies. As an additional application of our formalism, we define a rule for picking a single unique reward function from each potential shaping equivalence class.  ( 2 min )
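    For concreteness, the potential shaping transformations referred to above are, following Ng et al. (1999), those of the form

        $$F(s, a, s') = \gamma\,\Phi(s') - \Phi(s)$$

    for a potential function $\Phi$ over states. Along any trajectory the shaped rewards telescope,

        $$\sum_{t=0}^{T-1} \gamma^t F(s_t, a_t, s_{t+1}) = \gamma^T \Phi(s_T) - \Phi(s_0),$$

    which is why the optimal policy set is unchanged. In the discrete-calculus view of the paper, $\Phi$ plays the role of a scalar field on the MDP graph and $F$ plays the role of its (discounted) gradient.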
    Learn to Detect and Detect to Learn: Structure Learning and Decision Feedback for MIMO-OFDM Receive Processing. (arXiv:2208.09287v2 [eess.SP] UPDATED)
    One major open challenge in MIMO-OFDM receive processing is how to efficiently and effectively utilize the extremely limited over-the-air pilot symbols to detect the transmitted data symbols. Recent advances have been devoted to investigating effective ways to utilize the limited pilots. However, we notice that besides exploiting the pilots, one can take advantage of the data symbols to improve detection performance. Thus, this paper introduces an online subframe-based approach, namely RC-StructNet-DF, that can efficiently learn from the precious pilot symbols and be dynamically updated with the detected payload data using the decision feedback (DF) approach. With the DF mechanism, the network can dynamically track the channel changes within a subframe. To mitigate the error propagation of the DF approach, the specially designed StructNet is adopted in the frequency domain, which is shown to be robust to incorrect labels owing to the embedded structural information. The introduced parameter estimation (PE) layer in the StructNet further facilitates the DF method by utilizing the network weights to learn the parameters. Extensive experiments have been conducted to demonstrate the effectiveness of RC-StructNet-DF in detection in both the MIMO-OFDM system and the massive MIMO-OFDM system with different modulation orders under various scenarios.  ( 2 min )
    What is Not in the Context? Evaluation of Few-shot Learners with Informative Demonstrations. (arXiv:2212.01692v1 [cs.CL])
    Large language models demonstrate an emergent ability to learn a new task from a small number of input-output demonstrations, referred to as in-context few-shot learning. However, recent work shows that in such settings, models mainly learn to mimic the new task distribution, instead of the mechanics of the new task. We argue that the commonly used evaluation setting of few-shot models, which utilizes a random selection of in-context demonstrations, is not able to disentangle a model's ability to learn new skills from demonstrations, as most of the demonstrations selected this way are not informative for prediction beyond exposing the new task's input and output distribution. Therefore, we introduce an evaluation technique that disentangles few-shot learners' gain from in-context learning by picking demonstrations that share a specific, informative concept with the predicted sample, over and above the performance reached with mainly non-informative demonstrations. We find that regardless of model size, existing few-shot learners are not able to benefit from observing such informative concepts in demonstrations. We also find that such an ability may not be obtained trivially by exposing the informative demonstrations in the training process, leaving the challenge of training true in-context learners open.  ( 2 min )
    Scalable and Robust Community Detection with Randomized Sketching. (arXiv:1805.10927v4 [cs.SI] UPDATED)
    This article explores and analyzes the unsupervised clustering of large partially observed graphs. We propose a scalable and provable randomized framework for clustering graphs generated from the stochastic block model. The clustering is first applied to a sub-matrix of the graph's adjacency matrix associated with a reduced graph sketch constructed using random sampling. Then, the clusters of the full graph are inferred based on the clusters extracted from the sketch using a correlation-based retrieval step. Uniform random node sampling is shown to improve the computational complexity over clustering of the full graph when the cluster sizes are balanced. A new random degree-based node sampling algorithm is presented which significantly improves upon the performance of the clustering algorithm even when clusters are unbalanced. This framework improves the phase transitions for matrix-decomposition-based clustering with regard to computational complexity and minimum cluster size, which are shown to be nearly dimension-free in the low inter-cluster connectivity regime. A third sampling technique is shown to improve balance by randomly sampling nodes based on spatial distribution. We provide analysis and numerical results using a convex clustering algorithm based on matrix completion.  ( 2 min )
    Quantile Multi-Armed Bandits: Optimal Best-Arm Identification and a Differentially Private Scheme. (arXiv:2006.06792v4 [stat.ML] UPDATED)
    We study the best-arm identification problem in multi-armed bandits with stochastic, potentially private rewards, where the goal is to identify the arm with the highest quantile at a fixed, prescribed level. First, we propose a (non-private) successive elimination algorithm for strictly optimal best-arm identification; we show that our algorithm is $\delta$-PAC and characterize its sample complexity. Further, we provide a lower bound on the expected number of pulls, showing that the proposed algorithm is essentially optimal up to logarithmic factors. Both the upper and lower complexity bounds depend on a special definition of the associated suboptimality gap, designed in particular for the quantile bandit problem; as we show, when the gap approaches zero, best-arm identification is impossible. Second, motivated by applications where the rewards are private, we provide a differentially private successive elimination algorithm whose sample complexity is finite even for distributions with infinite support size, and we characterize its sample complexity. Our algorithms do not require prior knowledge of either the suboptimality gap or other statistical information related to the bandit problem at hand.  ( 2 min )
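    To make the elimination idea concrete, here is a minimal sketch of successive elimination on empirical quantiles, using a generic DKW-style confidence radius on the CDF level; the paper's exact confidence intervals and gap definition differ.

        import numpy as np

        def quantile_successive_elimination(arms, tau=0.5, delta=0.05, rounds=300):
            # `arms` is a list of callables, each returning one stochastic reward.
            active = list(range(len(arms)))
            samples = [[] for _ in arms]
            for r in range(1, rounds + 1):
                for i in active:
                    samples[i].append(arms[i]())
                rad = np.sqrt(np.log(4 * len(arms) * r * r / delta) / (2 * r))
                # Confidence band on the tau-quantile via perturbed CDF levels.
                lo = {i: np.quantile(samples[i], max(tau - rad, 0.0)) for i in active}
                hi = {i: np.quantile(samples[i], min(tau + rad, 1.0)) for i in active}
                best_lo = max(lo.values())
                active = [i for i in active if hi[i] >= best_lo]
                if len(active) == 1:
                    return active[0]
            return active  # undecided within the round budget

        rng = np.random.default_rng(1)
        arms = [lambda m=m: rng.normal(m, 1.0) for m in (0.0, 0.3, 1.0)]
        print(quantile_successive_elimination(arms))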
    Segmentation-free PVC for Cardiac SPECT using a Densely-connected Multi-dimensional Dynamic Network. (arXiv:2206.12344v2 [eess.IV] UPDATED)
    In nuclear imaging, limited resolution causes partial volume effects (PVEs) that affect image sharpness and quantitative accuracy. Partial volume correction (PVC) methods incorporating high-resolution anatomical information from CT or MRI have been demonstrated to be effective. However, such anatomical-guided methods typically require tedious image registration and segmentation steps. Accurately segmented organ templates are also hard to obtain, particularly in cardiac SPECT imaging, due to the lack of hybrid SPECT/CT scanners with high-end CT and associated motion artifacts. Slight mis-registration/mis-segmentation would result in severe degradation in image quality after PVC. In this work, we develop a deep-learning-based method for fast cardiac SPECT PVC without anatomical information and associated organ segmentation. The proposed network involves a densely-connected multi-dimensional dynamic mechanism, allowing the convolutional kernels to be adapted based on the input images, even after the network is fully trained. Intramyocardial blood volume (IMBV) is introduced as an additional clinically relevant loss function for network optimization. The proposed network demonstrated promising performance on 28 canine studies acquired on a GE Discovery NM/CT 570c dedicated cardiac SPECT scanner with a 64-slice CT using Technetium-99m-labeled red blood cells. This work showed that the proposed network with the densely-connected dynamic mechanism produced superior results compared with the same network without such a mechanism. Results also showed that the proposed network without anatomical information could produce images with statistically comparable IMBV measurements to the images generated by anatomical-guided PVC methods, which could be helpful in clinical translation.  ( 3 min )
    A Generalist Neural Algorithmic Learner. (arXiv:2209.11142v2 [cs.LG] UPDATED)
    The cornerstone of neural algorithmic reasoning is the ability to solve algorithmic tasks, especially in a way that generalises out of distribution. While recent years have seen a surge in methodological improvements in this area, they mostly focused on building specialist models. Specialist models are capable of learning to neurally execute either only one algorithm or a collection of algorithms with identical control-flow backbone. Here, instead, we focus on constructing a generalist neural algorithmic learner -- a single graph neural network processor capable of learning to execute a wide range of algorithms, such as sorting, searching, dynamic programming, path-finding and geometry. We leverage the CLRS benchmark to empirically show that, much like recent successes in the domain of perception, generalist algorithmic learners can be built by "incorporating" knowledge. That is, it is possible to effectively learn algorithms in a multi-task manner, so long as we can learn to execute them well in a single-task regime. Motivated by this, we present a series of improvements to the input representation, training regime and processor architecture over CLRS, improving average single-task performance by over 20% from prior art. We then conduct a thorough ablation of multi-task learners leveraging these improvements. Our results demonstrate a generalist learner that effectively incorporates knowledge captured by specialist models.  ( 2 min )
    Kernel Methods for Multistage Causal Inference: Mediation Analysis and Dynamic Treatment Effects. (arXiv:2111.03950v3 [stat.ME] UPDATED)
    We propose simple estimators for mediation analysis and dynamic treatment effects over short horizons based on kernel ridge regression. We study both nonparametric response curves and semiparametric treatment effects, allowing treatments, mediators, and covariates to be continuous or discrete in general spaces. Our key innovation is a new RKHS technique called sequential mean embedding, which facilitates the construction of simple estimators for complex causal estimands, including new estimands without existing alternatives. In particular, we propose machine learning estimators of dynamic dose response curves and dynamic counterfactual distributions without restrictive linearity, Markov, or no-effect-modification assumptions. Our simple estimators preserve the generality of classic identification while also achieving nonasymptotic uniform rates for causal functions and semiparametric efficiency for causal scalars. In nonlinear simulations with many covariates, we demonstrate state-of-the-art performance. We estimate mediated and dynamic response curves of the US Job Corps program for disadvantaged youth, and share a data set that may serve as a benchmark in future work.  ( 2 min )
    SigMaNet: One Laplacian to Rule Them All. (arXiv:2205.13459v2 [cs.LG] UPDATED)
    This paper introduces SigMaNet, a generalized Graph Convolutional Network (GCN) capable of handling both undirected and directed graphs with weights not restricted in sign or magnitude. The cornerstone of SigMaNet is the Sign-Magnetic Laplacian ($L^{\sigma}$), a new Laplacian matrix that we introduce ex novo in this work. $L^{\sigma}$ allows us to bridge a gap in the current literature by extending the theory of spectral GCNs to (directed) graphs with both positive and negative weights. $L^{\sigma}$ exhibits several desirable properties not enjoyed by other Laplacian matrices on which several state-of-the-art architectures are based, among which is encoding the edge direction and weight in a clear and natural way that is not negatively affected by the weight magnitude. $L^{\sigma}$ is also completely parameter-free, which is not the case for other Laplacian operators such as, e.g., the Magnetic Laplacian. The versatility and performance of our proposed approach are amply demonstrated via computational experiments. Indeed, our results show that, for at least one metric, SigMaNet achieves the best performance in 15 out of 21 cases and either the first- or second-best performance in all 21 cases, even when compared to architectures that are either more complex or that, due to being designed for a narrower class of graphs, should -- but do not -- achieve better performance.  ( 2 min )
    Task and Model Agnostic Adversarial Attack on Graph Neural Networks. (arXiv:2112.13267v2 [cs.LG] UPDATED)
    Adversarial attacks on Graph Neural Networks (GNNs) reveal their security vulnerabilities, limiting their adoption in safety-critical applications. However, existing attack strategies rely on the knowledge of either the GNN model being used or the predictive task being attacked. Is this knowledge necessary? For example, a graph may be used for multiple downstream tasks unknown to a practical attacker. It is thus important to test the vulnerability of GNNs to adversarial perturbations in a model and task agnostic setting. In this work, we study this problem and show that GNNs remain vulnerable even when the downstream task and model are unknown. The proposed algorithm, TANDIS (Targeted Attack via Neighborhood DIStortion) shows that distortion of node neighborhoods is effective in drastically compromising prediction performance. Although neighborhood distortion is an NP-hard problem, TANDIS designs an effective heuristic through a novel combination of Graph Isomorphism Network with deep Q-learning. Extensive experiments on real datasets and state-of-the-art models show that, on average, TANDIS is up to 50% more effective than state-of-the-art techniques, while being more than 1000 times faster.  ( 2 min )
    Optimizing Connectivity through Network Gradients for Restricted Boltzmann Machines. (arXiv:2209.06932v3 [cs.LG] UPDATED)
    Leveraging sparse networks to connect successive layers in deep neural networks has recently been shown to provide benefits to large-scale state-of-the-art models. However, network connectivity also plays a significant role in the learning performance of shallow networks, such as the classic Restricted Boltzmann Machines (RBMs). Efficiently finding sparse connectivity patterns that improve the learning performance of shallow networks is a fundamental problem. While recent principled approaches explicitly include network connections as model parameters that must be optimized, they often rely on explicit penalization or treat network sparsity as a hyperparameter. This work presents Network Connectivity Gradients (NCG), a method to find optimal connectivity patterns for RBMs based on the idea of network gradients: computing the gradient of every possible connection, given a specific connection pattern, and using the gradient to drive a continuous connection strength parameter that in turn is used to determine the connection pattern. Thus, learning RBM parameters and learning network connections are truly jointly performed, albeit with different learning rates, and without changes to the model's classic objective function. The method is applied to MNIST and other datasets, showing that better RBM models are found for the benchmark tasks of sample generation and input classification. Results also show that NCG is robust to network initialization, both adding and removing network connections while learning.  ( 2 min )
    Flashlight: Scalable Link Prediction with Effective Decoders. (arXiv:2209.10100v3 [cs.SI] UPDATED)
    Link prediction (LP) has been recognized as an important task in graph learning with broad practical applications. A typical application of LP is to retrieve the top-scoring neighbors for a given source node, as in friend recommendation. These services require high inference scalability to find the top-scoring neighbors from many candidate nodes at low latency. There are two popular decoders that recent LP models mainly use to compute edge scores from node embeddings: the HadamardMLP and Dot Product decoders. After theoretical and empirical analysis, we find that HadamardMLP decoders are generally more effective for LP. However, HadamardMLP lacks the scalability for retrieving top-scoring neighbors on large graphs, since, to the best of our knowledge, there does not exist an algorithm to retrieve the top-scoring neighbors for HadamardMLP decoders in sublinear complexity. To make HadamardMLP scalable, we propose the Flashlight algorithm to accelerate top-scoring neighbor retrieval for HadamardMLP: a sublinear algorithm that progressively applies approximate maximum inner product search (MIPS) techniques with adaptively adjusted query embeddings. Empirical results show that Flashlight improves the inference speed of LP by more than 100 times on the large OGBL-CITATION2 dataset without sacrificing effectiveness. Our work paves the way for large-scale LP applications with the effective HadamardMLP decoders by greatly accelerating their inference.  ( 2 min )
    Reasoning-Modulated Representations. (arXiv:2107.08881v2 [cs.LG] UPDATED)
    Neural networks leverage robust internal representations in order to generalise. Learning them is difficult, and often requires a large training set that covers the data distribution densely. We study a common setting where our task is not purely opaque. Indeed, very often we may have access to information about the underlying system (e.g. that observations must obey certain laws of physics) that any "tabula rasa" neural network would need to re-learn from scratch, penalising performance. We incorporate this information into a pre-trained reasoning module, and investigate its role in shaping the discovered representations in diverse self-supervised learning settings from pixels. Our approach paves the way for a new class of representation learning, grounded in algorithmic priors.  ( 2 min )
    Variable-rate hierarchical CPC leads to acoustic unit discovery in speech. (arXiv:2206.02211v3 [cs.SD] UPDATED)
    The success of deep learning comes from its ability to capture the hierarchical structure of data by learning high-level representations defined in terms of low-level ones. In this paper we explore self-supervised learning of hierarchical representations of speech by applying multiple levels of Contrastive Predictive Coding (CPC). We observe that simply stacking two CPC models does not yield significant improvements over single-level architectures. Inspired by the fact that speech is often described as a sequence of discrete units unevenly distributed in time, we propose a model in which the output of a low-level CPC module is non-uniformly downsampled to directly minimize the loss of a high-level CPC module. The latter is also designed to enforce a prior of separability and discreteness in its representations, by encouraging dissimilarity of successive high-level representations through focused negative sampling, and by quantization of the prediction targets. Accounting for the structure of the speech signal improves upon single-level CPC features and enhances the disentanglement of the learned representations, as measured by downstream speech recognition tasks, while resulting in a meaningful segmentation of the signal that closely resembles phone boundaries.  ( 2 min )
    Constant matters: Fine-grained Complexity of Differentially Private Continual Observation. (arXiv:2202.11205v4 [cs.DS] UPDATED)
    We study fine-grained error bounds for differentially private algorithms for counting under continual observation. Our main insight is that the matrix mechanism, when using lower-triangular matrices, can be used in the continual observation model. More specifically, we give an explicit factorization for the counting matrix $M_\mathsf{count}$ and upper bound the error explicitly. We also give a fine-grained analysis, specifying the exact constant in the upper bound. Our analysis is based on upper and lower bounds on the completely bounded norm (cb-norm) of $M_\mathsf{count}$. Along the way, we improve the best-known bound on the cb-norm of $M_\mathsf{count}$ for a large range of its dimension, a bound that had stood for 28 years since Mathias (SIAM Journal on Matrix Analysis and Applications, 1993). Furthermore, we are the first to give concrete error bounds for various problems under continual observation, such as binary counting, maintaining a histogram, releasing an approximately cut-preserving synthetic graph, many graph-based statistics, and substring and episode counting. Finally, we note that our result can be used to get a fine-grained error bound for non-interactive local learning, and the first lower bounds on the additive error for $(\epsilon,\delta)$-differentially-private counting under continual observation. Subsequent to this work, Henzinger et al. (SODA 2023) showed that our factorization also achieves fine-grained mean-squared error.  ( 2 min )
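    To illustrate the flavor of the factorization: the all-ones lower-triangular counting matrix has a Toeplitz square root with entries $f(k) = \binom{2k}{k}/4^k$, and the matrix mechanism can then release noisy prefix sums as $L(Lx + z)$. The sketch below verifies the identity numerically; the noise scale is purely illustrative, not calibrated to any privacy budget, and the paper's exact construction and analysis may differ.

        import numpy as np
        from math import comb

        def sqrt_counting_matrix(T):
            # Toeplitz square root L of the lower-triangular all-ones matrix M,
            # with L[i, j] = f(i - j) and f(k) = C(2k, k) / 4**k, so L @ L = M.
            f = np.array([comb(2 * k, k) / 4.0 ** k for k in range(T)])
            L = np.zeros((T, T))
            for i in range(T):
                L[i, : i + 1] = f[: i + 1][::-1]
            return L

        T = 6
        L = sqrt_counting_matrix(T)
        assert np.allclose(L @ L, np.tril(np.ones((T, T))))

        x = np.array([1, 0, 1, 1, 0, 1], dtype=float)            # the bit stream
        z = np.random.default_rng(0).normal(scale=0.5, size=T)   # toy noise
        noisy_prefix_sums = L @ (L @ x + z)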
    Sketch-based community detection in evolving networks. (arXiv:2009.11835v2 [physics.soc-ph] UPDATED)
    We consider an approach for community detection in time-varying networks. At its core, this approach maintains a small sketch graph to capture the essential community structure found in each snapshot of the full network. We demonstrate how the sketch can be used to explicitly identify six key community events which typically occur during network evolution: growth, shrinkage, merging, splitting, birth and death. Based on these detection techniques, we formulate a community detection algorithm which can process a network concurrently exhibiting all processes. One advantage afforded by the sketch-based algorithm is the efficient handling of large networks. Whereas detecting events in the full graph may be computationally expensive, the small size of the sketch allows changes to be quickly assessed. A second advantage occurs in networks containing clusters of disproportionate size. The sketch is constructed such that there is equal representation of each cluster, thus reducing the possibility that the small clusters are lost in the estimate. We present a new standardized benchmark based on the stochastic block model which models the addition and deletion of nodes, as well as the birth and death of communities. When coupled with existing benchmarks, this new benchmark provides a comprehensive suite of tests encompassing all six community events. We provide analysis and a set of numerical results demonstrating the advantages of our approach both in run time and in the handling of small clusters.  ( 2 min )
    Semantic Graph Neural Network with Multi-measure Learning for Semi-supervised Classification. (arXiv:2212.01749v1 [cs.LG])
    Graph Neural Networks (GNNs) have attracted increasing attention in recent years and have achieved excellent performance in semi-supervised node classification tasks. The success of most GNNs relies on one fundamental assumption, i.e., that the original graph structure data is available. However, recent studies have shown that GNNs are vulnerable to the complex underlying structure of the graph, making it necessary to learn comprehensive and robust graph structures for downstream tasks, rather than relying only on the raw graph structure. In light of this, we seek to learn optimal graph structures for downstream tasks and propose a novel framework for semi-supervised classification. Specifically, based on the structural context information of graph and node representations, we encode the complex interactions in semantics and generate semantic graphs to preserve the global structure. Moreover, we develop a novel multi-measure attention layer to optimize the similarity rather than prescribing it a priori, so that the similarity can be adaptively evaluated by integrating measures. These graphs are fused and optimized together with the GNN towards the semi-supervised classification objective. Extensive experiments and ablation studies on six real-world datasets clearly demonstrate the effectiveness of our proposed model and the contribution of each component.  ( 2 min )
    On the Equivalence of Automatic and Symbolic Differentiation. (arXiv:1904.02990v4 [cs.SC] UPDATED)
    We show that reverse mode automatic differentiation and symbolic differentiation are equivalent in the sense that they both perform the same operations when computing derivatives. This is in stark contrast to the common claim that they are substantially different. The difference is often illustrated by claiming that symbolic differentiation suffers from "expression swell" whereas automatic differentiation does not. Here, we show that this statement is not true. "Expression swell" refers to the phenomenon of a much larger representation of the derivative as opposed to the representation of the original function.  ( 2 min )
    Competing Bandits: The Perils of Exploration Under Competition. (arXiv:2007.10144v7 [cs.GT] UPDATED)
    Most online platforms strive to learn from interactions with users, and many engage in exploration: making potentially suboptimal choices for the sake of acquiring new information. We study the interplay between exploration and competition: how such platforms balance the exploration for learning and the competition for users. Here users play three distinct roles: they are customers that generate revenue, they are sources of data for learning, and they are self-interested agents which choose among the competing platforms. We consider a stylized duopoly model in which two firms face the same multi-armed bandit problem. Users arrive one by one and choose between the two firms, so that each firm makes progress on its bandit problem only if it is chosen. Through a mix of theoretical results and numerical simulations, we study whether and to what extent competition incentivizes the adoption of better bandit algorithms, and whether it leads to welfare increases for users. We find that stark competition induces firms to commit to a "greedy" bandit algorithm that leads to low welfare. However, weakening competition by providing firms with some "free" users incentivizes better exploration strategies and increases welfare. We investigate two channels for weakening the competition: relaxing the rationality of users and giving one firm a first-mover advantage. Our findings are closely related to the "competition vs. innovation" relationship, and elucidate the first-mover advantage in the digital economy.  ( 3 min )
    Efficient Optimization of Dominant Set Clustering with Frank-Wolfe Algorithms. (arXiv:2007.11652v3 [cs.LG] UPDATED)
    We study Frank-Wolfe algorithms - standard, pairwise, and away-steps - for efficient optimization of Dominant Set Clustering. We present a unified and computationally efficient framework to employ the different variants of Frank-Wolfe methods, and we investigate its effectiveness via several experimental studies. In addition, we provide explicit convergence rates for the algorithms in terms of the so-called Frank-Wolfe gap. The theoretical analysis has been specialized to Dominant Set Clustering and covers consistently the different variants.  ( 2 min )
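    Dominant set clustering can be phrased as maximizing the quadratic form $x^\top A x$ over the probability simplex, with the support of the solution read off as a cluster. A minimal standard Frank-Wolfe loop for this problem, using the Frank-Wolfe gap as the stopping criterion, might look like the sketch below; the pairwise and away-steps variants, step-size rules, and convergence rates analyzed in the paper are not reproduced here.

        import numpy as np

        def frank_wolfe_dominant_set(A, iters=500, tol=1e-6):
            # Maximize f(x) = x^T A x over the simplex with standard Frank-Wolfe.
            n = A.shape[0]
            x = np.full(n, 1.0 / n)
            for k in range(iters):
                grad = 2.0 * A @ x                       # gradient of x^T A x
                s = np.zeros(n)
                s[np.argmax(grad)] = 1.0                 # linear oracle on the simplex
                gap = grad @ (s - x)                     # Frank-Wolfe gap
                if gap < tol:
                    break
                x += (2.0 / (k + 2.0)) * (s - x)         # diminishing step size
            return x

        A = np.array([[0, 1, 1, 0.0],
                      [1, 0, 1, 0.0],
                      [1, 1, 0, 0.1],
                      [0, 0, 0.1, 0]])                   # toy affinity matrix
        x = frank_wolfe_dominant_set(A)
        cluster = np.flatnonzero(x > 1e-3)               # dominant-set members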
    Recognizing Object by Components with Human Prior Knowledge Enhances Adversarial Robustness of Deep Neural Networks. (arXiv:2212.01806v1 [cs.CV])
    Adversarial attacks can easily fool object recognition systems based on deep neural networks (DNNs). Although many defense methods have been proposed in recent years, most of them can still be adaptively evaded. One reason for the weak adversarial robustness may be that DNNs are only supervised by category labels and do not have part-based inductive bias like the recognition process of humans. Inspired by a well-known theory in cognitive psychology -- recognition-by-components, we propose a novel object recognition model ROCK (Recognizing Object by Components with human prior Knowledge). It first segments parts of objects from images, then scores part segmentation results with predefined human prior knowledge, and finally outputs prediction based on the scores. The first stage of ROCK corresponds to the process of decomposing objects into parts in human vision. The second stage corresponds to the decision process of the human brain. ROCK shows better robustness than classical recognition models across various attack settings. These results encourage researchers to rethink the rationality of currently widely-used DNN-based object recognition models and explore the potential of part-based models, once important but recently ignored, for improving robustness.  ( 2 min )
    Fairness in Contextual Resource Allocation Systems: Metrics and Incompatibility Results. (arXiv:2212.01725v1 [cs.CY])
    We study critical systems that allocate scarce resources to satisfy basic needs, such as homeless services that provide housing. These systems often support communities disproportionately affected by systemic racial, gender, or other injustices, so it is crucial to design these systems with fairness considerations in mind. To address this problem, we propose a framework for evaluating fairness in contextual resource allocation systems that is inspired by fairness metrics in machine learning. This framework can be applied to evaluate the fairness properties of a historical policy, as well as to impose constraints in the design of new (counterfactual) allocation policies. Our work culminates with a set of incompatibility results that investigate the interplay between the different fairness metrics we propose. Notably, we demonstrate that: 1) fairness in allocation and fairness in outcomes are usually incompatible; 2) policies that prioritize based on a vulnerability score will usually result in unequal outcomes across groups, even if the score is perfectly calibrated; 3) policies using contextual information beyond what is needed to characterize baseline risk and treatment effects can be fairer in their outcomes than those using just baseline risk and treatment effects; and 4) policies using group status in addition to baseline risk and treatment effects are as fair as possible given all available information. Our framework can help guide the discussion among stakeholders in deciding which fairness metrics to impose when allocating scarce resources.  ( 2 min )
    Close the Gate: Detecting Backdoored Models in Federated Learning based on Client-Side Deep Layer Output Analysis. (arXiv:2210.07714v2 [cs.CR] UPDATED)
    Federated Learning (FL) is a scheme for collaboratively training Deep Neural Networks (DNNs) with multiple data sources from different clients. Instead of sharing the data, each client trains the model locally, resulting in improved privacy. However, recently so-called targeted poisoning attacks have been proposed that allow individual clients to inject a backdoor into the trained model. Existing defenses against these backdoor attacks either rely on techniques like Differential Privacy to mitigate the backdoor, or analyze the weights of the individual models and apply outlier detection methods, which restricts these defenses to certain data distributions. However, adding noise to the models' parameters or excluding benign outliers might also reduce the accuracy of the collaboratively trained model. Additionally, allowing the server to inspect the clients' models creates a privacy risk due to existing knowledge extraction methods. We propose CrowdGuard, a model filtering defense that mitigates backdoor attacks by leveraging the clients' data to analyze the individual models before aggregation. To prevent data leaks, the server sends the individual models to secure enclaves running in client-located Trusted Execution Environments. To effectively distinguish benign and poisoned models, even if the data of different clients are not independently and identically distributed (non-IID), we introduce a novel metric called HLBIM to analyze the outputs of the DNN's hidden layers. We show that the applied significance-based detection algorithm can effectively detect poisoned models, even in non-IID scenarios. Our extensive evaluation shows that CrowdGuard can effectively mitigate targeted poisoning attacks, achieving in various scenarios a True-Positive-Rate of 100% and a True-Negative-Rate of 100%.  ( 3 min )
    Deep Counterfactual Estimation with Categorical Background Variables. (arXiv:2210.05811v3 [cs.LG] UPDATED)
    Referred to as the third rung of the causal inference ladder, counterfactual queries typically ask the "What if ?" question retrospectively. The standard approach to estimate counterfactuals resides in using a structural equation model that accurately reflects the underlying data generating process. However, such models are seldom available in practice and one usually wishes to infer them from observational data alone. Unfortunately, the correct structural equation model is in general not identifiable from the observed factual distribution. Nevertheless, in this work, we show that under the assumption that the main latent contributors to the treatment responses are categorical, the counterfactuals can be still reliably predicted. Building upon this assumption, we introduce CounterFactual Query Prediction (CFQP), a novel method to infer counterfactuals from continuous observations when the background variables are categorical. We show that our method significantly outperforms previously available deep-learning-based counterfactual methods, both theoretically and empirically on time series and image data. Our code is available at https://github.com/edebrouwer/cfqp.  ( 2 min )
    Quantum Federated Learning with Entanglement Controlled Circuits and Superposition Coding. (arXiv:2212.01732v1 [quant-ph])
    While witnessing the noisy intermediate-scale quantum (NISQ) era and beyond, quantum federated learning (QFL) has recently become an emerging field of study. In QFL, each quantum computer or device locally trains its quantum neural network (QNN) with trainable gates, and communicates only these gate parameters over classical channels, without costly quantum communications. Towards enabling QFL under various channel conditions, in this article we develop a depth-controllable architecture of entangled slimmable quantum neural networks (eSQNNs), and propose an entangled slimmable QFL (eSQFL) that communicates the superposition-coded parameters of eSQNNs. Compared to existing depth-fixed QNNs, training the depth-controllable eSQNN architecture is more challenging due to high entanglement entropy and inter-depth interference, which are mitigated by introducing entanglement controlled universal (CU) gates and an in-place fidelity distillation (IPFD) regularizer penalizing inter-depth quantum state differences, respectively. Furthermore, we optimize the superposition coding power allocation by deriving and minimizing the convergence bound of eSQFL. In an image classification task, extensive simulations corroborate the effectiveness of eSQFL in terms of prediction accuracy, fidelity, and entropy compared to Vanilla QFL, as well as under different channel conditions and various data distributions.  ( 2 min )
    ContrastVAE: Contrastive Variational AutoEncoder for Sequential Recommendation. (arXiv:2209.00456v2 [cs.IR] UPDATED)
    Aiming at exploiting the rich information in user behaviour sequences, sequential recommendation has been widely adopted in real-world recommender systems. However, current methods suffer from the following issues: 1) sparsity of user-item interactions, 2) uncertainty of sequential records, and 3) long-tail items. In this paper, we propose to incorporate contrastive learning into the framework of Variational AutoEncoders to address these challenges simultaneously. First, we introduce ContrastELBO, a novel training objective that extends the conventional single-view ELBO to the two-view case and theoretically builds a connection between VAEs and contrastive learning from a two-view perspective. Then we propose Contrastive Variational AutoEncoder (ContrastVAE for short), a two-branched VAE model with contrastive regularization as an embodiment of ContrastELBO for sequential recommendation. We further introduce two simple yet effective augmentation strategies named model augmentation and variational augmentation to create a second view of a sequence and thus make contrastive learning possible. Experiments on four benchmark datasets demonstrate the effectiveness of ContrastVAE and the proposed augmentation methods. Code is available at https://github.com/YuWang-1024/ContrastVAE
    Learning New Tasks from a Few Examples with Soft-Label Prototypes. (arXiv:2210.17437v2 [cs.LG] UPDATED)
    It has been experimentally demonstrated that humans are able to learn in a manner that allows them to make predictions on categories for which they have not seen any examples (Malaviya et al., 2022). Sucholutsky and Schonlau (2020) have recently presented a machine learning approach that aims to do the same. They utilise synthetically generated data and demonstrate that it is possible to achieve sub-linear scaling and develop models that can learn to recognise N classes from M training samples where M is less than N - aka less-than-one shot learning. Their method was, however, defined for univariate or simple multivariate data (Sucholutsky et al., 2021). We extend it to work on large, high-dimensional and real-world datasets and empirically validate it in this new and challenging setting. We apply this method to learn previously unseen NLP tasks from very few examples (4, 8 or 16). We first generate compact, sophisticated less-than-one shot representations called soft-label prototypes which are fitted on training data, capturing the distribution of different classes across the input domain space. We then use a modified k-Nearest Neighbours classifier to demonstrate that soft-label prototypes can classify data competitively, even outperforming much more computationally complex few-shot learning methods.
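    As an illustration of how soft-label prototypes can drive a modified k-Nearest Neighbours classifier, here is a minimal sketch: each prototype carries a distribution over classes, and a test point mixes the soft labels of its nearest prototypes weighted by inverse distance. The weighting scheme is an assumption for illustration, not necessarily the paper's exact rule.

        import numpy as np

        def predict_from_prototypes(x, prototypes, soft_labels, k=2, eps=1e-9):
            # Combine the soft labels of the k nearest prototypes,
            # weighted by inverse distance, and return the argmax class.
            d = np.linalg.norm(prototypes - x, axis=1)
            nearest = np.argsort(d)[:k]
            w = 1.0 / (d[nearest] + eps)
            mixed = (w[:, None] * soft_labels[nearest]).sum(axis=0) / w.sum()
            return mixed.argmax(), mixed

        prototypes = np.array([[0.0, 0.0], [1.0, 1.0]])
        soft_labels = np.array([[0.7, 0.2, 0.1],     # one prototype can carry
                                [0.1, 0.3, 0.6]])    # mass on several classes
        label, dist = predict_from_prototypes(np.array([0.2, 0.1]),
                                              prototypes, soft_labels)

    Note that two prototypes here induce decision regions for three classes, mirroring the less-than-one-shot regime where M is less than N.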
    Performer: A Novel PPG-to-ECG Reconstruction Transformer for a Digital Biomarker of Cardiovascular Disease Detection. (arXiv:2204.11795v3 [eess.SP] UPDATED)
    Electrocardiography (ECG), an electrical measurement which captures cardiac activity, is the gold standard for diagnosing cardiovascular disease (CVD). However, ECG is infeasible for continuous cardiac monitoring due to its requirement for user participation. By contrast, photoplethysmography (PPG) provides easy-to-collect data, but its limited accuracy constrains its clinical usage. To combine the advantages of both signals, recent studies incorporate various deep learning techniques for the reconstruction of ECG from PPG signals; however, the lack of contextual information as well as the limited ability to denoise biomedical signals ultimately constrains model performance. In this research, we propose Performer, a novel Transformer-based architecture that reconstructs ECG from PPG and combines the PPG and reconstructed ECG as multiple modalities for CVD detection. This is the first time that Transformer sequence-to-sequence translation has been performed on biomedical waveform reconstruction, combining the advantages of both PPG and ECG. We also create Shifted Patch-based Attention (SPA), an effective method to encode/decode biomedical waveforms. By accommodating various sequence lengths and capturing cross-patch connections, SPA maximizes the signal processing for both local features and global contextual representations. The proposed architecture achieves state-of-the-art performance of 0.29 RMSE for the reconstruction of ECG from PPG on the BIDMC database, surpassing prior studies. We also evaluated this model on the MIMIC-III dataset, achieving 95.9% accuracy in CVD detection, and on the PPG-BP dataset, achieving 75.9% accuracy in CVD-related diabetes detection, indicating its generalizability. As a proof of concept, an earring wearable named PEARL (prototype) was designed to scale up the point-of-care (POC) healthcare system.
    Convergence under Lipschitz smoothness of ease-controlled Random Reshuffling gradient Algorithms. (arXiv:2212.01848v1 [math.OC])
    We consider minimizing the average of a very large number of smooth and possibly non-convex functions. This optimization problem has received much attention in the past years due to its many applications in different fields, the most challenging being the training of Machine Learning models. Widely used approaches for solving this problem are mini-batch gradient methods which, at each iteration, update the decision vector by moving along the gradient of a mini-batch of the component functions. We consider the Incremental Gradient (IG) and the Random Reshuffling (RR) methods, which proceed in cycles, picking batches in a fixed order or by reshuffling the order after each epoch. Convergence properties of these schemes have been proved under different assumptions, usually quite strong. We aim to define ease-controlled modifications of the IG/RR schemes, which require a light additional computational effort and can be proved to converge under very weak and standard assumptions. In particular, we define two algorithmic schemes, monotone or non-monotone, in which the IG/RR iteration is controlled by using a watchdog rule and a derivative-free line search that activates only sporadically to guarantee convergence. The two schemes also allow controlling the updating of the stepsize used in the main IG/RR iteration, avoiding the use of preset rules. We prove convergence under the sole assumption of Lipschitz continuity of the gradients of the component functions, and perform extensive computational analysis using Deep Neural Architectures and a benchmark of datasets. We compare our implementation with both full-batch gradient methods and standard online implementations of IG/RR methods, showing that the computational effort is comparable with that of the corresponding online methods and that the control on the learning rate may allow a faster decrease.  ( 2 min )
    Exact Penalty Method for Federated Learning. (arXiv:2208.11231v2 [cs.LG] UPDATED)
    Federated learning has burgeoned recently in machine learning, giving rise to a variety of research topics. Popular optimization algorithms are based on the frameworks of the (stochastic) gradient descent methods or the alternating direction method of multipliers. In this paper, we deploy an exact penalty method to deal with federated learning and propose an algorithm, FedEPM, that tackles four critical issues in federated learning: communication efficiency, computational complexity, stragglers' effect, and data privacy. Moreover, it is proven to be convergent and is shown empirically to have high numerical performance.
    Uncertainty Quantification and Exploration for Reinforcement Learning. (arXiv:1910.05471v3 [cs.LG] UPDATED)
    We investigate statistical uncertainty quantification for reinforcement learning (RL) and its implications for exploration policy. Despite the ever-growing literature on RL applications, fundamental questions about inference and error quantification, such as large-sample behaviors, appear to remain quite open. In this paper, we fill in this literature gap by studying the central limit theorem behaviors of estimated Q-values and value functions under various RL settings. In particular, we explicitly identify closed-form expressions of the asymptotic variances, which allow us to efficiently construct asymptotically valid confidence regions for key RL quantities. Furthermore, we utilize these asymptotic expressions to design an effective exploration strategy, which we call Q-value-based Optimal Computing Budget Allocation (Q-OCBA). The policy relies on maximizing the relative discrepancies among the Q-value estimates. Numerical experiments show the superior performance of our exploration strategy over other benchmark policies.  ( 2 min )
    Battery Degradation Long-term Forecast Using Gaussian Process Dynamical Models and Knowledge Transfer. (arXiv:2212.01609v1 [cs.LG])
    Batteries play an essential role in the modern energy ecosystem and are widely used in daily applications such as cell phones and electric vehicles. For many applications, the health status of batteries plays a critical role in the performance of the system by indicating efficient maintenance and on-time replacement. Directly modeling an individual battery using a computational model based on physical rules can be inefficient, both in the difficulty of building such a model and in the computational effort of tuning and running it, especially on the edge. With the rapid development of sensor technology (to provide more insights into the system) and machine learning (to build capable yet fast models), it is now possible to directly build a data-driven model of battery health status, using data collected from historical batteries (possibly local and remote), to accurately predict local battery health status in the future. Nevertheless, most data-driven methods are trained on local battery data and lack the ability to extract common properties, such as regeneration and degradation, from the life spans of other remote batteries. In this paper, we utilize a Gaussian process dynamical model (GPDM) to build a data-driven model of battery health status and propose a knowledge transfer method to extract common properties in the life spans of all batteries to accurately predict battery health status with and without features extracted from the local battery. On modern benchmark problems, the proposed method outperforms the state-of-the-art methods by significant margins in terms of accuracy and is able to accurately predict the regeneration process.  ( 2 min )
    Model Selection in Contextual Stochastic Bandit Problems. (arXiv:2003.01704v3 [cs.LG] UPDATED)
    We study bandit model selection in stochastic environments. Our approach relies on a meta-algorithm that selects between candidate base algorithms. We develop a meta-algorithm-base algorithm abstraction that can work with general classes of base algorithms and different types of adversarial meta-algorithms. Our methods rely on a novel and generic smoothing transformation for bandit algorithms that permits us to obtain optimal $O(\sqrt{T})$ model selection guarantees for stochastic contextual bandit problems as long as the optimal base algorithm satisfies a high-probability regret guarantee. We show through a lower bound that even when one of the base algorithms has $O(\log T)$ regret, in general it is impossible to get better than $\Omega(\sqrt{T})$ regret in model selection, even asymptotically. Using our techniques, we address model selection in a variety of problems, such as misspecified linear contextual bandits, linear bandits with unknown dimension, and reinforcement learning with unknown feature maps. Our algorithm requires knowledge of the optimal base regret to adjust the meta-algorithm learning rate. We show that without such prior knowledge, any meta-algorithm can suffer a regret larger than the optimal base regret.  ( 2 min )
    Spread Divergence. (arXiv:1811.08968v5 [stat.ML] UPDATED)
    For distributions $\mathbb{P}$ and $\mathbb{Q}$ with different supports or undefined densities, the divergence $\textrm{D}(\mathbb{P}||\mathbb{Q})$ may not exist. We define a Spread Divergence $\tilde{\textrm{D}}(\mathbb{P}||\mathbb{Q})$ on modified $\mathbb{P}$ and $\mathbb{Q}$ and describe sufficient conditions for the existence of such a divergence. We demonstrate how to maximize the discriminatory power of a given divergence by parameterizing and learning the spread. We also give examples of using a Spread Divergence to train implicit generative models, including linear models (Independent Components Analysis) and non-linear models (Deep Generative Networks).  ( 2 min )
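    One natural way to define the modified distributions is via a fixed "spread" noise kernel $p(\tilde{x}\mid x)$ applied to both arguments:

        $$\tilde{p}(\tilde{x}) = \int p(\tilde{x}\mid x)\,\mathbb{P}(x)\,dx, \qquad \tilde{q}(\tilde{x}) = \int p(\tilde{x}\mid x)\,\mathbb{Q}(x)\,dx, \qquad \tilde{\textrm{D}}(\mathbb{P}||\mathbb{Q}) \equiv \textrm{D}(\tilde{p}\,||\,\tilde{q}).$$

    For example, with a Gaussian spread $p(\tilde{x}\mid x) = \mathcal{N}(\tilde{x};\, x, \sigma^2 I)$, both $\tilde{p}$ and $\tilde{q}$ have full support, so a KL divergence is defined even when $\mathbb{P}$ and $\mathbb{Q}$ have disjoint supports; and for injective kernels such as the Gaussian, $\tilde{\textrm{D}}(\mathbb{P}||\mathbb{Q}) = 0$ still implies $\mathbb{P} = \mathbb{Q}$.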
    RePAD: Real-time Proactive Anomaly Detection for Time Series. (arXiv:2001.08922v5 [cs.LG] UPDATED)
    During the past decade, many anomaly detection approaches have been introduced in different fields such as network monitoring, fraud detection, and intrusion detection. However, they require an understanding of the data pattern and often need a long offline period to build a model or network for the target data. Providing real-time and proactive anomaly detection for streaming time series without human intervention and domain knowledge is highly valuable, since it greatly reduces human effort and enables appropriate countermeasures to be undertaken before disastrous damage, failure, or another harmful event occurs. However, this issue has not been well studied yet. To address it, this paper proposes RePAD, a Real-time Proactive Anomaly Detection algorithm for streaming time series based on Long Short-Term Memory (LSTM). RePAD utilizes short-term historic data points to predict and determine whether or not the upcoming data point is a sign that an anomaly is likely to happen in the near future. By dynamically adjusting the detection threshold over time, RePAD is able to tolerate minor pattern changes in time series and detect anomalies either proactively or on time. Experiments based on two time series datasets collected from the Numenta Anomaly Benchmark demonstrate that RePAD is able to proactively detect anomalies and provide early warnings in real time without human intervention and domain knowledge.  ( 2 min )
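    The predict-compare-adapt loop can be sketched as follows; the one-step predictor stands in for the LSTM, and the adaptive threshold (mean plus a multiple of the standard deviation of recent errors) is a simplified stand-in for RePAD's AARE-based rule.

        import numpy as np

        def repad_like_detector(series, predictor, window=30, z=3.0):
            # Flag a point when its prediction error exceeds a threshold
            # recomputed from the recent error history (adaptive, no retraining).
            errors, alerts = [], []
            for t in range(window, len(series)):
                pred = predictor(series[:t])          # one-step-ahead forecast
                err = abs(series[t] - pred)
                recent = errors[-window:]
                if len(recent) == window:
                    thr = np.mean(recent) + z * np.std(recent)
                    if err > thr:
                        alerts.append(t)              # early warning
                errors.append(err)
            return alerts

        # Toy predictor standing in for the LSTM: last-value carry-forward.
        data = np.sin(np.linspace(0, 20, 400))
        data[250] += 2.0                              # injected anomaly
        print(repad_like_detector(data, predictor=lambda h: h[-1]))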
    An ADMM-Incorporated Latent Factorization of Tensors Method for QoS Prediction. (arXiv:2212.01606v1 [cs.LG])
    As the Internet has developed rapidly, it has become important to choose suitable web services from a wide range of candidates. Quality of service (QoS) describes the performance of a web service dynamically with respect to the service requested by the service consumer. Moreover, the latent factorization of tensors (LFT) is very effective for discovering temporal patterns in high-dimensional and sparse (HiDS) tensors. However, current LFT models suffer from a low convergence rate and rarely account for the effects of outliers. To address the above problems, this paper proposes an alternating direction method of multipliers (ADMM)-based outlier-resilient nonnegative latent factorization of tensors model. We maintain the non-negativity of the model by constructing an augmented Lagrangian function within the ADMM optimization framework. In addition, the Cauchy function is taken as the metric function to reduce the impact of outliers on model training. The empirical work on two dynamic QoS datasets shows that the proposed method converges faster and achieves better prediction accuracy.  ( 2 min )
    Melody transcription via generative pre-training. (arXiv:2212.01884v1 [cs.SD])
    Despite the central role that melody plays in music perception, it remains an open challenge in music information retrieval to reliably detect the notes of the melody present in an arbitrary music recording. A key challenge in melody transcription is building methods which can handle broad audio containing any number of instrument ensembles and musical styles - existing strategies work well for some melody instruments or styles but not all. To confront this challenge, we leverage representations from Jukebox (Dhariwal et al. 2020), a generative model of broad music audio, thereby improving performance on melody transcription by $20$% relative to conventional spectrogram features. Another obstacle in melody transcription is a lack of training data - we derive a new dataset containing $50$ hours of melody transcriptions from crowdsourced annotations of broad music. The combination of generative pre-training and a new dataset for this task results in $77$% stronger performance on melody transcription relative to the strongest available baseline. By pairing our new melody transcription approach with solutions for beat detection, key estimation, and chord recognition, we build Sheet Sage, a system capable of transcribing human-readable lead sheets directly from music audio. Audio examples can be found at https://chrisdonahue.com/sheetsage and code at https://github.com/chrisdonahue/sheetsage .  ( 2 min )
    Contrastive introspection (ConSpec) to rapidly identify invariant prototypes for success in RL. (arXiv:2210.05845v3 [cs.LG] UPDATED)
    Reinforcement learning (RL) algorithms have achieved notable success in recent years, but still struggle with fundamental issues in long-term credit assignment. It remains difficult to learn in situations where success is contingent upon multiple critical steps that are distant in time from each other and from a sparse reward, as is often the case in real life. Moreover, how RL algorithms assign credit in these difficult situations is typically not coded in a way that can rapidly generalize to new situations. Here, we present an approach using offline contrastive learning, which we call contrastive introspection (ConSpec), that can be added to any existing RL algorithm and addresses both issues. In ConSpec, a contrastive loss is used during offline replay to identify invariances among successful episodes. This takes advantage of the fact that it is easier to retrospectively identify the small set of steps that success is contingent upon than it is to prospectively predict reward at every step taken in the environment. ConSpec stores this knowledge in a collection of prototypes summarizing the intermediate states required for success. During training, arrival at any state that matches these prototypes generates an intrinsic reward that is added to any external rewards. Moreover, the reward shaping provided by ConSpec can be made to preserve the optimal policy of the underlying RL agent. The prototypes in ConSpec provide two key benefits for credit assignment: (1) They enable rapid identification of all the critical states. (2) They do so in a readily interpretable manner, enabling out-of-distribution generalization when sensory features are altered. In summary, ConSpec is a modular system that can be added to any existing RL algorithm to improve its long-term credit assignment.
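    A minimal sketch of the prototype-matching intrinsic reward is given below; the offline contrastive learning of the prototypes themselves is omitted, and the cosine-similarity threshold rule is an illustrative assumption.

        import torch
        import torch.nn.functional as F

        def intrinsic_reward(state_emb, prototypes, threshold=0.9, bonus=1.0):
            # Add a bonus when the current state embedding matches any of the
            # success prototypes (which ConSpec would learn offline with a
            # contrastive loss over successful vs. unsuccessful episodes).
            sims = F.cosine_similarity(state_emb.unsqueeze(0), prototypes, dim=1)
            return bonus * (sims.max() > threshold).float()

        prototypes = torch.randn(8, 64)   # 8 hypothetical learned prototypes
        r_int = intrinsic_reward(torch.randn(64), prototypes)
        # Total reward fed to the RL agent would be r_ext + r_int.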
    Regularized ERM on random subspaces. (arXiv:2212.01866v1 [stat.ML])
    We study a natural extension of classical empirical risk minimization, where the hypothesis space is a random subspace of a given space. In particular, we consider possibly data dependent subspaces spanned by a random subset of the data, recovering as a special case Nystrom approaches for kernel methods. Considering random subspaces naturally leads to computational savings, but the question is whether the corresponding learning accuracy is degraded. These statistical-computational tradeoffs have been recently explored for the least squares loss and self-concordant loss functions, such as the logistic loss. Here, we work to extend these results to convex Lipschitz loss functions, which might not be smooth, such as the hinge loss used in support vector machines. This unified analysis requires developing new proofs that use different technical tools, such as sub-gaussian inputs, to achieve fast rates. Our main results show the existence of different settings, depending on how hard the learning problem is, for which computational efficiency can be improved with no loss in performance.  ( 2 min )
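    The Nystrom special case is easy to reproduce with standard tooling; the sketch below restricts the hypothesis space to a random 100-dimensional data-dependent subspace and then minimizes the hinge loss (dataset, kernel, and subspace size are illustrative choices):

        from sklearn.datasets import make_classification
        from sklearn.kernel_approximation import Nystroem
        from sklearn.pipeline import make_pipeline
        from sklearn.svm import LinearSVC

        X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
        # Nystroem spans a subspace using a random subset of the data;
        # LinearSVC then runs ERM with the (non-smooth) hinge loss on it.
        model = make_pipeline(
            Nystroem(kernel="rbf", n_components=100, random_state=0),
            LinearSVC(loss="hinge"),
        )
        model.fit(X, y)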
    Online Shielding for Reinforcement Learning. (arXiv:2212.01861v1 [cs.LG])
    Despite the recent impressive results of reinforcement learning (RL), safety remains one of its major research challenges. RL is a machine-learning approach to determine near-optimal policies in Markov decision processes (MDPs). In this paper, we consider the setting where the safety-relevant fragment of the MDP together with a temporal logic safety specification is given and many safety violations can be avoided by planning ahead a short time into the future. We propose an approach for online safety shielding of RL agents. During runtime, the shield analyses the safety of each available action. For any action, the shield computes the maximal probability to not violate the safety specification within the next $k$ steps when executing this action. Based on this probability and a given threshold, the shield decides whether to block an action from the agent. Existing offline shielding approaches compute exhaustively the safety of all state-action combinations ahead of time, resulting in huge computation times and large memory consumption. The intuition behind online shielding is to compute at runtime the set of all states that could be reached in the near future. For each of these states, the safety of all available actions is analysed and used for shielding as soon as one of the considered states is reached. Our approach is well suited for high-level planning problems where the time between decisions can be used for safety computations and it is acceptable for the agent to wait until these computations are finished. For our evaluation, we selected a 2-player version of the classical computer game SNAKE. The game represents a high-level planning problem that requires fast decisions and the multiplayer setting induces a large state space, which is computationally expensive to analyse exhaustively.  ( 2 min )
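    A toy sketch of the $k$-step safety computation over an explicit MDP (the dictionary encoding and threshold are illustrative assumptions, not the paper's data structures):

        def safety_probability(mdp, state, action, k):
            # Maximal probability of not violating the specification within
            # the next k steps when executing `action` in `state`.
            # `mdp[s][a]` is a list of (probability, next_state, is_unsafe)
            # triples -- a toy encoding of the safety-relevant MDP fragment.
            if k == 0:
                return 1.0
            total = 0.0
            for prob, nxt, unsafe in mdp[state][action]:
                if unsafe:
                    continue  # this branch violates the specification
                # recurse, assuming the safest follow-up action is taken
                total += prob * max(
                    safety_probability(mdp, nxt, a, k - 1) for a in mdp[nxt]
                )
            return total

        def shield_blocks(mdp, state, action, k=3, threshold=0.95):
            # The shield blocks the action if its k-step safety probability
            # falls below the given threshold.
            return safety_probability(mdp, state, action, k) < threshold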
    Understanding Sinusoidal Neural Networks. (arXiv:2212.01833v1 [cs.LG])
    In this work, we investigate the representation capacity of multilayer perceptron networks that use the sine as their activation function - sinusoidal neural networks. We show that the layer composition in such networks compacts information. For this, we prove that the composition of sinusoidal layers expands as a sum of sines consisting of a large number of new frequencies given by linear combinations of the weights of the network's first layer. We provide the expression of the corresponding amplitudes in terms of Bessel functions and give an upper bound for them that can be used to control the resulting approximation.  ( 2 min )
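    The Bessel-amplitude expansion can be checked numerically on a single composed sine; the identity below (a Jacobi-Anger-type special case, not the paper's general result) shows the new odd-harmonic frequencies with amplitudes $J_{2k+1}(a)$:

        import numpy as np
        from scipy.special import jv

        # sin(a*sin(x)) = 2 * sum_{k>=0} J_{2k+1}(a) * sin((2k+1)*x)
        a = 2.0
        x = np.linspace(-np.pi, np.pi, 1000)
        lhs = np.sin(a * np.sin(x))
        rhs = 2 * sum(jv(2 * k + 1, a) * np.sin((2 * k + 1) * x) for k in range(20))
        print(np.max(np.abs(lhs - rhs)))  # ~1e-16: the truncated expansion is numerically exact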
    Axial-LOB: High-Frequency Trading with Axial Attention. (arXiv:2212.01807v1 [q-fin.TR])
    Previous attempts to predict stock price from limit order book (LOB) data are mostly based on deep convolutional neural networks. Although convolutions offer efficiency by restricting their operations to local interactions, this comes at the cost of potentially missing long-range dependencies. Recent studies address this problem by employing additional recurrent or attention layers that increase computational complexity. In this work, we propose Axial-LOB, a novel fully-attentional deep learning architecture for predicting price movements of stocks from LOB data. By utilizing gated position-sensitive axial attention layers, our architecture is able to construct feature maps that incorporate global interactions, while significantly reducing the size of the parameter space. Unlike previous works, Axial-LOB does not rely on hand-crafted convolutional kernels and hence has stable performance under input permutations and the capacity to incorporate additional LOB features. The effectiveness of Axial-LOB is demonstrated on a large benchmark dataset, containing time series representations of millions of high-frequency trading events, where our model establishes a new state of the art, achieving an excellent directional classification performance at all tested prediction horizons.  ( 2 min )
    Compound Tokens: Channel Fusion for Vision-Language Representation Learning. (arXiv:2212.01447v1 [cs.CV])
    We present an effective method for fusing visual-and-language representations for several question answering tasks including visual question answering and visual entailment. In contrast to prior works that concatenate unimodal representations or use only cross-attention, we compose multimodal representations via channel fusion. By fusing on the channels, the model is able to more effectively align the tokens compared to standard methods. These multimodal representations, which we call compound tokens, are generated with cross-attention transformer layers. First, vision tokens are used as queries to retrieve compatible text tokens through cross-attention. We then chain the vision tokens and the queried text tokens along the channel dimension. A second group of compound tokens are generated using an analogous process where the text tokens serve as queries to the cross-attention layer. We concatenate all the compound tokens for further processing with a multimodal encoder. We demonstrate the effectiveness of compound tokens using an encoder-decoder vision-language model trained end-to-end in the open-vocabulary setting. Compound Tokens achieve highly competitive performance across a range of question answering tasks including GQA, VQA2.0, and SNLI-VE.
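    A minimal PyTorch sketch of one direction of the channel fusion (dimensions and module layout are illustrative assumptions, not the paper's released architecture):

        import torch
        import torch.nn as nn

        class CompoundTokens(nn.Module):
            # Vision tokens query text tokens via cross-attention; each
            # vision token is then chained with its retrieved text token
            # along the channel dimension.
            def __init__(self, dim=256, heads=8):
                super().__init__()
                self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

            def forward(self, vision_tokens, text_tokens):
                queried, _ = self.cross_attn(vision_tokens, text_tokens, text_tokens)
                # concatenate along channels: (B, N_vision, 2*dim)
                return torch.cat([vision_tokens, queried], dim=-1)

        fuse = CompoundTokens()
        v = torch.randn(2, 49, 256)  # e.g. 7x7 grid of vision tokens
        t = torch.randn(2, 16, 256)  # text tokens
        compound = fuse(v, t)        # shape (2, 49, 512)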
    Twitter Data Analysis: Izmir Earthquake Case. (arXiv:2212.01453v1 [cs.CL])
    T\"urkiye is located on a fault line; earthquakes often occur on a large and small scale. There is a need for effective solutions for gathering current information during disasters. We can use social media to get insight into public opinion. This insight can be used in public relations and disaster management. In this study, Twitter posts on Izmir Earthquake that took place on October 2020 are analyzed. We question if this analysis can be used to make social inferences on time. Data mining and natural language processing (NLP) methods are used for this analysis. NLP is used for sentiment analysis and topic modelling. The latent Dirichlet Allocation (LDA) algorithm is used for topic modelling. We used the Bidirectional Encoder Representations from Transformers (BERT) model working with Transformers architecture for sentiment analysis. It is shown that the users shared their goodwill wishes and aimed to contribute to the initiated aid activities after the earthquake. The users desired to make their voices heard by competent institutions and organizations. The proposed methods work effectively. Future studies are also discussed.
    Operator inference with roll outs for learning reduced models from scarce and low-quality data. (arXiv:2212.01418v1 [cs.LG])
    Data-driven modeling has become a key building block in computational science and engineering. However, data that are available in science and engineering are typically scarce, often polluted with noise and affected by measurement errors and other perturbations, which makes learning the dynamics of systems challenging. In this work, we propose to combine data-driven modeling via operator inference with the dynamic training via roll outs of neural ordinary differential equations. Operator inference with roll outs inherits interpretability, scalability, and structure preservation of traditional operator inference while leveraging the dynamic training via roll outs over multiple time steps to increase stability and robustness for learning from low-quality and noisy data. Numerical experiments with data describing shallow water waves and surface quasi-geostrophic dynamics demonstrate that operator inference with roll outs provides predictive models from training trajectories even if data are sampled sparsely in time and polluted with noise of up to 10%.
    Can Evolutionary Clustering Have Theoretical Guarantees?. (arXiv:2212.01771v1 [cs.NE])
    Clustering is a fundamental problem in many areas, which aims to partition a given data set into groups based on some distance measure, such that the data points in the same group are similar while those in different groups are dissimilar. Due to its importance and NP-hardness, many methods have been proposed, among which evolutionary algorithms are a class of popular ones. Evolutionary clustering has found many successful applications, but all the results are empirical, lacking theoretical support. This paper fills this gap by proving that the approximation performance of the GSEMO (a simple multi-objective evolutionary algorithm) for solving the three popular formulations of clustering, i.e., $k$-center, $k$-median and $k$-means, can be theoretically guaranteed. Furthermore, we prove that evolutionary clustering can have theoretical guarantees even when considering fairness, which tries to avoid algorithmic bias, and has recently been an important research topic in machine learning.
    Applications of AI in Astronomy. (arXiv:2212.01493v1 [astro-ph.IM])
    We provide a brief, and inevitably incomplete, overview of the use of Machine Learning (ML) and other AI methods in astronomy, astrophysics, and cosmology. Astronomy entered the big data era with the first digital sky surveys in the early 1990s and the resulting Terascale data sets, which required automating many data processing and analysis tasks, for example the star-galaxy separation, with billions of feature vectors in hundreds of dimensions. The exponential data growth continued, with the rise of synoptic sky surveys and the Time Domain Astronomy, with the resulting Petascale data streams and the need for real-time processing, classification, and decision making. A broad variety of classification and clustering methods have been applied for these tasks, and this remains a very active area of research. Over the past decade we have seen an exponential growth of the astronomical literature involving a variety of ML/AI applications of an ever increasing complexity and sophistication. ML and AI are now a standard part of the astronomical toolkit. As the data complexity continues to increase, we anticipate further advances leading towards a collaborative human-AI discovery.
    Convolution, aggregation and attention based deep neural networks for accelerating simulations in mechanics. (arXiv:2212.01386v1 [cs.LG])
    Deep learning surrogate models are being increasingly used in accelerating scientific simulations as a replacement for costly conventional numerical techniques. However, their use remains a significant challenge when dealing with real-world complex examples. In this work, we demonstrate three types of neural network architectures for efficient learning of highly non-linear deformations of solid bodies. The first two architectures are based on the recently proposed CNN U-NET and MAgNET (graph U-NET) frameworks which have shown promising performance for learning on mesh-based data. The third architecture is Perceiver IO, a very recent architecture that belongs to the family of attention-based neural networks--a class that has revolutionised diverse engineering fields and is still unexplored in computational mechanics. We study and compare the performance of all three networks on two benchmark examples, and show their capabilities to accurately predict the non-linear mechanical responses of soft bodies.
    Cross-lingual Similarity of Multilingual Representations Revisited. (arXiv:2212.01924v1 [cs.CL])
    Related works used indexes like CKA and variants of CCA to measure the similarity of cross-lingual representations in multilingual language models. In this paper, we argue that assumptions of CKA/CCA align poorly with one of the motivating goals of cross-lingual learning analysis, i.e., explaining zero-shot cross-lingual transfer. We highlight what valuable aspects of cross-lingual similarity these indexes fail to capture and provide a motivating case study \textit{demonstrating the problem empirically}. Then, we introduce \textit{Average Neuron-Wise Correlation (ANC)} as a straightforward alternative that avoids the difficulties of CKA/CCA and is particularly well suited to the cross-lingual context. Finally, we use ANC to construct evidence that the previously introduced ``first align, then predict'' pattern takes place not only in masked language models (MLMs) but also in multilingual models with \textit{causal language modeling} objectives (CLMs). Moreover, we show that the pattern extends to the \textit{scaled versions} of the MLMs and CLMs (up to 85x original mBERT).\footnote{Our code is publicly available at \url{https://github.com/TartuNLP/xsim}}  ( 2 min )
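    A minimal numpy sketch of the ANC idea as we read it (a per-neuron Pearson correlation over aligned inputs, averaged over neurons; this is a reconstruction, not the released code at the URL above):

        import numpy as np

        def anc(reprs_a, reprs_b):
            # Correlate each neuron's activations across aligned inputs
            # (e.g. parallel sentences in two languages), then average
            # over neurons.
            corrs = [np.corrcoef(reprs_a[:, i], reprs_b[:, i])[0, 1]
                     for i in range(reprs_a.shape[1])]
            return float(np.mean(corrs))

        a = np.random.randn(100, 768)            # 100 English sentence representations
        b = a + 0.1 * np.random.randn(100, 768)  # representations of their translations
        print(anc(a, b))                         # close to 1 for well-aligned neurons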
    Meta-Shop: Improving Item Advertisement For Small Businesses. (arXiv:2212.01414v1 [cs.IR])
    In this paper, we study item advertisements for small businesses. This application recommends prospective customers to specific items requested by businesses. From analysis, we found that the existing Recommender Systems (RS) were ineffective for small/new businesses with little sales history. Training samples in RS can be highly biased toward popular businesses with sufficient sales and can decrease advertising performance for small businesses. We propose a meta-learning-based RS to improve advertising performance for small/new businesses and shops: Meta-Shop. Meta-Shop leverages an advanced meta-learning optimization framework and builds a model for a shop-level recommendation. It also integrates and transfers knowledge between large and small shops, consequently learning better features in small shops. We conducted experiments on a real-world E-commerce dataset and a public benchmark dataset. Meta-Shop outperformed a production baseline and the state-of-the-art RS models. Specifically, it achieved up to 16.6% relative improvement of Recall@1M and 40.4% relative improvement of nDCG@3 for user recommendations to new shops compared to the other RS models.
    Adaptive Sample Selection for Robust Learning under Label Noise. (arXiv:2106.15292v3 [cs.LG] UPDATED)
    Deep Neural Networks (DNNs) have been shown to be susceptible to memorization or overfitting in the presence of noisily-labelled data. For the problem of robust learning under such noisy data, several algorithms have been proposed. A prominent class of algorithms relies on sample selection strategies in which, essentially, a fraction of samples with loss values below a certain threshold is selected for training. These algorithms are sensitive to such thresholds, and it is difficult to fix or learn these thresholds. Often, these algorithms also require information such as label noise rates which are typically unavailable in practice. In this paper, we propose an adaptive sample selection strategy that relies only on batch statistics of a given mini-batch to provide robustness against label noise. The algorithm does not have any additional hyperparameters for sample selection, does not need any information on noise rates and does not need access to separate data with clean labels. We empirically demonstrate the effectiveness of our algorithm on benchmark datasets.
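    A toy sketch of selection from batch statistics (the batch mean is an illustrative stand-in for the paper's statistic; the point is that no fixed global threshold or noise rate is needed):

        import torch

        def select_samples(losses):
            # Keep samples whose loss is below a statistic of the current
            # mini-batch; the threshold adapts to each batch automatically.
            return losses < losses.mean()

        losses = torch.tensor([0.2, 0.4, 3.1, 0.3])  # the large loss is likely a noisy label
        print(select_samples(losses))                # tensor([True, True, False, True])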
    iEnhancer-ELM: Improve Enhancer Identification by Extracting Multi-scale Contextual Information based on Enhancer Language Models. (arXiv:2212.01495v1 [q-bio.GN])
    Motivation: Enhancers are important cis-regulatory elements that regulate a wide range of biological functions and enhance the transcription of target genes. Although many state-of-the-art computational methods have been proposed in order to efficiently identify enhancers, learning globally contextual features is still one of the challenges for computational methods. Regarding the similarities between biological sequences and natural language sentences, the novel BERT-based language techniques have been applied to extracting complex contextual features in various computational biology tasks such as protein function/structure prediction. To speed up the research on enhancer identification, it is urgent to construct a BERT-based enhancer language model. Results: In this paper, we propose a multi-scale enhancer identification method (iEnhancer-ELM) based on enhancer language models, which treat enhancer sequences as natural language sentences that are composed of k-mer nucleotides. iEnhancer-ELM can extract contextual information of multi-scale k-mers with positions from raw enhancer sequences. Benefiting from the complementary information of k-mers in multi-scale, we ensemble four iEnhancer-ELM models for improving enhancer identification. The benchmark comparisons show that our model outperforms state-of-the-art methods. By the interpretable attention mechanism, we find 30 biological patterns, where 40% (12/30) are verified by a widely used motif tool (STREME) and a popular dataset (JASPAR), demonstrating that our model has a potential ability to reveal the biological mechanism of enhancers. Availability: The source code is available at https://github.com/chen-bioinfo/iEnhancer-ELM Contact: junjiechen@hit.edu.cn and junjie.chen.hit@gmail.com; Supplementary information: Supplementary data are available at Bioinformatics online.
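    The k-mer tokenization underlying the language-model framing is simple to illustrate (the choice of k values is an example of the multi-scale setup, not necessarily the paper's exact configuration):

        def kmer_tokens(sequence, k):
            # Treat a DNA sequence as a "sentence" of overlapping
            # k-mer "words".
            return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

        seq = "ACGTAGC"
        for k in (3, 4, 5, 6):  # multi-scale k-mers
            print(k, kmer_tokens(seq, k))
        # k=3 -> ['ACG', 'CGT', 'GTA', 'TAG', 'AGC']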
    Unauthorized Drone Detection: Experiments and Prototypes. (arXiv:2212.01436v1 [cs.CV])
    The increase in the number of unmanned aerial vehicles, a.k.a. drones, poses several threats to public privacy, critical infrastructure and cyber security. Hence, detecting unauthorized drones is a significant problem which has received attention in the last few years. In this paper, we present our experimental work on three drone detection methods (i.e., acoustic detection, radio frequency (RF) detection, and visual detection) to evaluate their efficacy in both indoor and outdoor environments. Owing to the limitations of these schemes, we present a novel encryption-based drone detection scheme that uses a two-stage verification of the drone's received signal strength indicator (RSSI) and the encryption key generated from the drone's position coordinates to reliably detect an unauthorized drone in the presence of authorized drones.
    PGFed: Personalize Each Client's Global Objective for Federated Learning. (arXiv:2212.01448v1 [cs.LG])
    The mediocre performance of conventional federated learning (FL) over heterogeneous data has motivated personalized FL solutions, where, unlike conventional FL which trains a single global consensus model, different models are allowed for different clients. However, in most existing personalized FL algorithms, the collaborative knowledge across the federation is only implicitly passed to the clients in ways such as model aggregation or regularization. We observed that this implicit knowledge transfer fails to maximize the potential value of each client's empirical risk toward other clients. Based on our observation, in this work, we propose Personalized Global Federated Learning (PGFed), a novel personalized FL framework that enables each client to personalize its own global objective by explicitly and adaptively aggregating the empirical risks of itself and other clients. To avoid massive ($O(N^2)$) communication overhead and potential privacy leakage, each client's risk is estimated through a first-order approximation for other clients' adaptive risk aggregation. On top of PGFed, we develop a momentum upgrade, dubbed PGFedMo, to more efficiently utilize clients' empirical risks. Our extensive experiments under different federated settings with benchmark datasets show consistent improvements of PGFed over the compared state-of-the-art alternatives.
    Probabilistic Verification of ReLU Neural Networks via Characteristic Functions. (arXiv:2212.01544v1 [cs.LG])
    Verifying the input-output relationships of a neural network so as to achieve some desired performance specification is a difficult, yet important, problem due to the growing ubiquity of neural nets in many engineering applications. We use ideas from probability theory in the frequency domain to provide probabilistic verification guarantees for ReLU neural networks. Specifically, we interpret a (deep) feedforward neural network as a discrete dynamical system over a finite horizon that shapes distributions of initial states, and use characteristic functions to propagate the distribution of the input data through the network. Using the inverse Fourier transform, we obtain the corresponding cumulative distribution function of the output set, which can be used to check if the network is performing as expected given any random point from the input set. The proposed approach does not require distributions to have well-defined moments or moment generating functions. We demonstrate our proposed approach on two examples, and compare its performance to related approaches.
    MeshDQN: A Deep Reinforcement Learning Framework for Improving Meshes in Computational Fluid Dynamics. (arXiv:2212.01428v1 [cs.LG])
    Meshing is a critical, but user-intensive process necessary for stable and accurate simulations in computational fluid dynamics (CFD). Mesh generation is often a bottleneck in CFD pipelines. Adaptive meshing techniques allow the mesh to be updated automatically to produce an accurate solution for the problem at hand. Existing classical techniques for adaptive meshing require either additional functionality out of solvers, many training simulations, or both. Current machine learning techniques often require substantial computational cost for training data generation, and are restricted in scope to the training data flow regime. MeshDQN is developed as a general purpose deep reinforcement learning framework to iteratively coarsen meshes while preserving target property calculation. A graph neural network based deep Q network is used to select mesh vertices for removal and solution interpolation is used to bypass expensive simulations at each step in the improvement process. MeshDQN requires a single simulation prior to mesh coarsening, while making no assumptions about flow regime, mesh type, or solver, only requiring the ability to modify meshes directly in a CFD pipeline. MeshDQN successfully improves meshes for two 2D airfoils.
    Representation Internal-Manipulation (RIM): A Neuro-Inspired Computational Theory of Consciousness. (arXiv:1912.13490v2 [cs.AI] UPDATED)
    Many theories, based on neuroscientific and psychological empirical evidence and on computational concepts, have been elaborated to explain the emergence of consciousness in the central nervous system. These theories propose key fundamental mechanisms to explain consciousness, but they only partially connect such mechanisms to the possible functional and adaptive role of consciousness. Recently, some cognitive and neuroscientific models try to solve this gap by linking consciousness to various aspects of goal-directed behaviour, the pivotal cognitive process that allows mammals to flexibly act in challenging environments. Here we propose the Representation Internal-Manipulation (RIM) theory of consciousness, a theory that links the main elements of consciousness theories to components and functions of goal-directed behaviour, ascribing a central role for consciousness to the goal-directed manipulation of internal representations. This manipulation relies on four specific computational operations to perform the flexible internal adaptation of all key elements of goal-directed computation, from the representations of objects to those of goals, actions, and plans. Finally, we propose the concept of `manipulation agency' relating the sense of agency to the internal manipulation of representations. This allows us to propose that the subjective experience of consciousness is associated to the human capacity to generate and control a simulated internal reality that is vividly perceived and felt through the same perceptual and emotional mechanisms used to tackle the external world.
    Understanding Interventional TreeSHAP : How and Why it Works. (arXiv:2209.15123v2 [cs.LG] UPDATED)
    Shapley values are ubiquitous in interpretable Machine Learning due to their strong theoretical background and efficient implementation in the SHAP library. Computing these values previously induced an exponential cost with respect to the number of input features of an opaque model. Now, with efficient implementations such as Interventional TreeSHAP, this exponential burden is alleviated assuming one is explaining ensembles of decision trees. Although Interventional TreeSHAP has risen in popularity, it still lacks a formal proof of how/why it works. We provide such proof with the aim of not only increasing the transparency of the algorithm but also to encourage further development of these ideas. Notably, our proof for Interventional TreeSHAP is easily adapted to Shapley-Taylor indices and one-hot-encoded features.
    Combinatorial Causal Bandits. (arXiv:2206.01995v4 [cs.LG] UPDATED)
    In combinatorial causal bandits (CCB), the learning agent chooses at most $K$ variables in each round to intervene, collects feedback from the observed variables, with the goal of minimizing expected regret on the target variable $Y$. We study under the context of binary generalized linear models (BGLMs) with a succinct parametric representation of the causal models. We present the algorithm BGLM-OFU for Markovian BGLMs (i.e. no hidden variables) based on the maximum likelihood estimation method, and show that it achieves $O(\sqrt{T}\log T)$ regret, where $T$ is the time horizon. For the special case of linear models with hidden variables, we apply causal inference techniques such as the do-calculus to convert the original model into a Markovian model, and then show that our BGLM-OFU algorithm and another algorithm based on the linear regression both solve such linear models with hidden variables. Our novelty includes (a) considering the combinatorial intervention action space and the general causal models including ones with hidden variables, (b) integrating and adapting techniques from diverse studies such as generalized linear bandits and online influence maximization, and (c) avoiding unrealistic assumptions (such as knowing the joint distribution of the parents of $Y$ under all interventions) and regret factors exponential to causal graph size in prior studies.
    Query-Driven Knowledge Base Completion using Multimodal Path Fusion over Multimodal Knowledge Graph. (arXiv:2212.01923v1 [cs.DB])
    Over the past few years, large knowledge bases have been constructed to store massive amounts of knowledge. However, these knowledge bases are highly incomplete, for example, over 70% of people in Freebase have no known place of birth. To solve this problem, we propose a query-driven knowledge base completion system with multimodal fusion of unstructured and structured information. To effectively fuse unstructured information from the Web and structured information in knowledge bases to achieve good performance, our system builds multimodal knowledge graphs based on question answering and rule inference. We propose a multimodal path fusion algorithm to rank candidate answers based on different paths in the multimodal knowledge graphs, achieving much better performance than question answering, rule inference and a baseline fusion algorithm. To improve system efficiency, query-driven techniques are utilized to reduce the runtime of our system, providing fast responses to user queries. Extensive experiments have been conducted to demonstrate the effectiveness and efficiency of our system.
    Characterizing instance hardness in classification and regression problems. (arXiv:2212.01897v1 [cs.LG])
    Some recent work in the Machine Learning (ML) literature has demonstrated the usefulness of assessing which observations are hardest to have their labels predicted accurately. By identifying such instances, one may inspect whether they have any quality issues that should be addressed. Learning strategies based on the difficulty level of the observations can also be devised. This paper presents a set of meta-features that aim at characterizing which instances of a dataset are hardest to have their labels predicted accurately and why they are so, a.k.a. instance hardness measures. Both classification and regression problems are considered. Synthetic datasets with different levels of complexity are built and analyzed. A Python package containing all implementations is also provided.
    ReRe: A Lightweight Real-time Ready-to-Go Anomaly Detection Approach for Time Series. (arXiv:2004.02319v4 [cs.LG] UPDATED)
    Anomaly detection is an active research topic in many different fields such as intrusion detection, network monitoring, system health monitoring, IoT healthcare, etc. However, many existing anomaly detection approaches require either human intervention or domain knowledge, and may suffer from high computation complexity, consequently hindering their applicability in real-world scenarios. Therefore, a lightweight and ready-to-go approach that is able to detect anomalies in real-time is highly sought-after. Such an approach could be easily and immediately applied to perform time series anomaly detection on any commodity machine. The approach could provide timely anomaly alerts, thereby enabling appropriate countermeasures to be undertaken as early as possible. With these goals in mind, this paper introduces ReRe, which is a Real-time Ready-to-go proactive Anomaly Detection algorithm for streaming time series. ReRe employs two lightweight Long Short-Term Memory (LSTM) models to predict and jointly determine whether or not an upcoming data point is anomalous based on short-term historical data points and two long-term self-adaptive thresholds. Experiments based on real-world time-series datasets demonstrate the good performance of ReRe in real-time anomaly detection without requiring human intervention or domain knowledge.
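    A lightweight stand-in for the predict-then-threshold pipeline (a moving-average predictor replaces the LSTMs here, and the mean-plus-k-sigma threshold is an illustrative self-adapting rule, not ReRe's exact thresholds):

        import numpy as np

        def detect(series, window=20, k=3.0):
            # Predict each point from recent history, then flag it when the
            # prediction error exceeds a threshold that adapts to the
            # recent error distribution.
            errors, flags = [], []
            for t in range(window, len(series)):
                pred = np.mean(series[t - window:t])
                err = abs(series[t] - pred)
                if len(errors) >= 2:
                    thresh = np.mean(errors[-window:]) + k * np.std(errors[-window:])
                else:
                    thresh = np.inf
                flags.append(err > thresh)
                errors.append(err)
            return flags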
    Statistical Physics of Deep Neural Networks: Initialization toward Optimal Channels. (arXiv:2212.01744v1 [cs.LG])
    In deep learning, neural networks serve as noisy channels between input data and its representation. This perspective naturally relates deep learning with the pursuit of constructing channels with optimal performance in information transmission and representation. While considerable efforts are concentrated on realizing optimal channel properties during network optimization, we study a frequently overlooked possibility that neural networks can be initialized toward optimal channels. Our theory, consistent with experimental validation, identifies primary mechanics underlying this unknown possibility and suggests intrinsic connections between statistical physics and deep learning. Unlike the conventional theories that characterize neural networks using the classic mean-field approximation, we offer analytic proof that this extensively applied simplification scheme is not valid in studying neural networks as information channels. To fill this gap, we develop a corrected mean-field framework applicable for characterizing the limiting behaviors of information propagation in neural networks without strong assumptions on inputs. Based on it, we propose an analytic theory to prove that mutual information maximization is realized between inputs and propagated signals when neural networks are initialized at dynamic isometry, a case where information transmits via norm-preserving mappings. These theoretical predictions are validated by experiments on real neural networks, suggesting the robustness of our theory against finite-size effects. Finally, we analyze our findings with information bottleneck theory to confirm the precise relations among dynamic isometry, mutual information maximization, and optimal channel properties in deep learning.
    A dataset for audio-video based vehicle speed estimation. (arXiv:2212.01651v1 [cs.LG])
    Accurate speed estimation of road vehicles is important for several reasons. One is speed limit enforcement, which represents a crucial tool in decreasing traffic accidents and fatalities. Compared with other research areas and domains, the number of available datasets for vehicle speed estimation is still very limited. We present a dataset of on-road audio-video recordings of single vehicles passing by a camera at known speeds, held stable by the on-board cruise control. The dataset contains thirteen vehicles, selected to be as diverse as possible in terms of manufacturer, production year, engine type, power and transmission, resulting in a total of $400$ annotated audio-video recordings. The dataset is fully available and intended as a public benchmark to facilitate research in audio-video vehicle speed estimation. In addition to the dataset, we propose a cross-validation strategy that can be used when training machine learning models for vehicle speed estimation. Two approaches to the training-validation split of the dataset are proposed.
    High-Speed State Estimation in Power Systems with Extreme Unobservability Using Machine Learning. (arXiv:2212.01729v1 [eess.SP])
    Fast timescale state estimation for a large power system can be challenging if the sensors producing the measurements are few in number. This is particularly true for doing time-synchronized state estimation for a transmission system that has minimal phasor measurement unit (PMU) coverage. This paper proposes a Deep Neural network-based State Estimator (DeNSE) to overcome this extreme unobservability problem. For systems in which the existing PMU infrastructure is not able to bring the estimation errors within acceptable limits using the DeNSE, a data-driven incremental PMU placement methodology is also introduced. The practical utility of the proposed approach is demonstrated by considering topology changes, non-Gaussian measurement noise, bad data detection and correction, and large system application.
    FedKNOW: Federated Continual Learning with Signature Task Knowledge Integration at Edge. (arXiv:2212.01738v1 [cs.LG])
    Deep Neural Networks (DNNs) have been ubiquitously adopted in the Internet of Things and are becoming an integral part of our daily life. When tackling the evolving learning tasks in real world, such as classifying different types of objects, DNNs face the challenge to continually retrain themselves according to the tasks on different edge devices. Federated continual learning is a promising technique that offers partial solutions but has yet to overcome the following difficulties: the significant accuracy loss due to the limited on-device processing, the negative knowledge transfer caused by the limited communication of non-IID data, and the limited scalability on the tasks and edge devices. In this paper, we propose FedKNOW, an accurate and scalable federated continual learning framework, via a novel concept of signature task knowledge. FedKNOW is a client side solution that continuously extracts and integrates the knowledge of signature tasks which are highly influenced by the current task. Each client of FedKNOW is composed of a knowledge extractor, a gradient restorer and, most importantly, a gradient integrator. Upon training for a new task, the gradient integrator ensures the prevention of catastrophic forgetting and mitigation of negative knowledge transfer by effectively combining signature tasks identified from the past local tasks and other clients' current tasks through the global model. We implement FedKNOW in PyTorch and extensively evaluate it against state-of-the-art techniques using popular federated continual learning benchmarks. Extensive evaluation results on heterogeneous edge devices show that FedKNOW improves model accuracy by 63.24% without increasing model training time, reduces communication cost by 34.28%, and achieves more improvements under difficult scenarios such as large numbers of tasks or clients, and training different complex networks.
    Visual Question Answering From Another Perspective: CLEVR Mental Rotation Tests. (arXiv:2212.01639v1 [stat.ML])
    Different types of mental rotation tests have been used extensively in psychology to understand human visual reasoning and perception. Understanding what an object or visual scene would look like from another viewpoint is a challenging problem that is made even harder if it must be performed from a single image. We explore a controlled setting whereby questions are posed about the properties of a scene if that scene was observed from another viewpoint. To do this we have created a new version of the CLEVR dataset that we call CLEVR Mental Rotation Tests (CLEVR-MRT). Using CLEVR-MRT we examine standard methods, show how they fall short, then explore novel neural architectures that involve inferring volumetric representations of a scene. These volumes can be manipulated via camera-conditioned transformations to answer the question. We examine the efficacy of different model variants through rigorous ablations and demonstrate the efficacy of volumetric representations.
    Languages You Know Influence Those You Learn: Impact of Language Characteristics on Multi-Lingual Text-to-Text Transfer. (arXiv:2212.01757v1 [cs.CL])
    Multi-lingual language models (LM), such as mBERT, XLM-R, mT5, mBART, have been remarkably successful in enabling natural language tasks in low-resource languages through cross-lingual transfer from high-resource ones. In this work, we try to better understand how such models, specifically mT5, transfer *any* linguistic and semantic knowledge across languages, even though no explicit cross-lingual signals are provided during pre-training. Rather, only unannotated texts from each language are presented to the model separately and independently of one another, and the model appears to implicitly learn cross-lingual connections. This raises several questions that motivate our study, such as: Are the cross-lingual connections between every language pair equally strong? What properties of source and target language impact the strength of cross-lingual transfer? Can we quantify the impact of those properties on the cross-lingual transfer? In our investigation, we analyze a pre-trained mT5 to discover the attributes of cross-lingual connections learned by the model. Through a statistical interpretation framework over 90 language pairs across three tasks, we show that transfer performance can be modeled by a few linguistic and data-derived features. These observations enable us to interpret cross-lingual understanding of the mT5 model. Through these observations, one can favorably choose the best source language for a task, and can anticipate its training data demands. A key finding of this work is that similarity of syntax, morphology and phonology are good predictors of cross-lingual transfer, significantly more than just the lexical similarity of languages. For a given language, we are able to predict zero-shot performance, that increases on a logarithmic scale with the number of few-shot target language data points.
    DACOM: Learning Delay-Aware Communication for Multi-Agent Reinforcement Learning. (arXiv:2212.01619v1 [cs.MA])
    Communication is supposed to improve multi-agent collaboration and overall performance in cooperative Multi-agent reinforcement learning (MARL). However, such improvements are prevalently limited in practice since most existing communication schemes ignore communication overheads (e.g., communication delays). In this paper, we demonstrate that ignoring communication delays has detrimental effects on collaborations, especially in delay-sensitive tasks such as autonomous driving. To mitigate this impact, we design a delay-aware multi-agent communication model (DACOM) to adapt communication to delays. Specifically, DACOM introduces a component, TimeNet, that is responsible for adjusting the waiting time of an agent to receive messages from other agents such that the uncertainty associated with delay can be addressed. Our experiments reveal that DACOM has a non-negligible performance improvement over other mechanisms by making a better trade-off between the benefits of communication and the costs of waiting for messages.
    GlueFL: Reconciling Client Sampling and Model Masking for Bandwidth Efficient Federated Learning. (arXiv:2212.01523v1 [cs.LG])
    Federated learning (FL) is an effective technique to directly involve edge devices in machine learning training while preserving client privacy. However, the substantial communication overhead of FL makes training challenging when edge devices have limited network bandwidth. Existing work to optimize FL bandwidth overlooks downstream transmission and does not account for FL client sampling. In this paper we propose GlueFL, a framework that incorporates new client sampling and model compression algorithms to mitigate low download bandwidths of FL clients. GlueFL prioritizes recently used clients and bounds the number of changed positions in compression masks in each round. Across three popular FL datasets and three state-of-the-art strategies, GlueFL reduces downstream client bandwidth by 27% on average and reduces training time by 29% on average.
    CoTMix: Contrastive Domain Adaptation for Time-Series via Temporal Mixup. (arXiv:2212.01555v1 [cs.LG])
    Unsupervised Domain Adaptation (UDA) has emerged as a powerful solution for the domain shift problem via transferring the knowledge from a labeled source domain to a shifted unlabeled target domain. Despite the prevalence of UDA for visual applications, it remains relatively less explored for time-series applications. In this work, we propose a novel lightweight contrastive domain adaptation framework called CoTMix for time-series data. Unlike existing approaches that either use statistical distances or adversarial techniques, we leverage contrastive learning solely to mitigate the distribution shift across the different domains. Specifically, we propose a novel temporal mixup strategy to generate two intermediate augmented views for the source and target domains. Subsequently, we leverage contrastive learning to maximize the similarity between each domain and its corresponding augmented view. The generated views consider the temporal dynamics of time-series data during the adaptation process while inheriting the semantics among the two domains. Hence, we gradually push both domains towards a common intermediate space, mitigating the distribution shift across them. Extensive experiments conducted on four real-world time-series datasets show that our approach can significantly outperform all state-of-the-art UDA methods. The implementation code of CoTMix is available at \href{https://github.com/emadeldeen24/CoTMix}{github.com/emadeldeen24/CoTMix}.
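    A minimal sketch of the temporal mixup step (a plain convex mix along the time axis with an illustrative mixing ratio; CoTMix's actual strategy mixes temporal neighborhoods, so treat this as a simplification):

        import torch

        def temporal_mixup(x_src, x_tgt, lam=0.8):
            # Build the two intermediate augmented views: each view is
            # dominated by one domain but blended with the other, so the
            # contrastive loss can pull both toward a common space.
            src_view = lam * x_src + (1 - lam) * x_tgt
            tgt_view = lam * x_tgt + (1 - lam) * x_src
            return src_view, tgt_view

        x_s = torch.randn(32, 128, 9)  # (batch, timesteps, channels)
        x_t = torch.randn(32, 128, 9)
        src_view, tgt_view = temporal_mixup(x_s, x_t)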
    Interpretable Node Representation with Attribute Decoding. (arXiv:2212.01682v1 [cs.LG])
    Variational Graph Autoencoders (VGAEs) are powerful models for unsupervised learning of node representations from graph data. In this work, we systematically analyze modeling node attributes in VGAEs and show that attribute decoding is important for node representation learning. We further propose a new learning model, interpretable NOde Representation with Attribute Decoding (NORAD). The model encodes node representations in an interpretable way: node representations capture community structures in the graph and the relationship between communities and node attributes. We further propose a rectifying procedure to refine the representations of isolated nodes, improving the quality of these nodes' representations. Our empirical results demonstrate the advantage of the proposed model when learning graph data in an interpretable manner.
    Security Analysis of SplitFed Learning. (arXiv:2212.01716v1 [cs.LG])
    Split Learning (SL) and Federated Learning (FL) are two prominent distributed collaborative learning techniques that maintain data privacy by allowing clients to never share their private data with other clients and servers, and find extensive IoT applications in smart healthcare, smart cities, and smart industry. Prior work has extensively explored the security vulnerabilities of FL in the form of poisoning attacks. To mitigate the effect of these attacks, several defenses have also been proposed. Recently, a hybrid of both learning techniques has emerged (commonly known as SplitFed) that capitalizes on their advantages (fast training) and eliminates their intrinsic disadvantages (centralized model updates). In this paper, we perform the first-ever empirical analysis of SplitFed's robustness to strong model poisoning attacks. We observe that the model updates in SplitFed have significantly smaller dimensionality compared to FL, which is known to suffer from the curse of dimensionality. We show that large models that have higher dimensionality are more susceptible to privacy and security attacks, whereas the clients in SplitFed do not have the complete model and have lower dimensionality, making them more robust to existing model poisoning attacks. Our results show that the accuracy reduction due to the model poisoning attack is 5x lower for SplitFed compared to FL.
    LDL: A Defense for Label-Based Membership Inference Attacks. (arXiv:2212.01688v1 [cs.LG])
    The data used to train deep neural network (DNN) models in applications such as healthcare and finance typically contain sensitive information. A DNN model may suffer from overfitting. Overfitted models have been shown to be susceptible to query-based attacks such as membership inference attacks (MIAs). MIAs aim to determine whether a sample belongs to the dataset used to train a classifier (members) or not (nonmembers). Recently, a new class of label-based MIAs (LAB MIAs) was proposed, where an adversary was only required to have knowledge of predicted labels of samples. Developing a defense against an adversary carrying out a LAB MIA on DNN models that cannot be retrained remains an open problem. We present LDL, a lightweight defense against LAB MIAs. LDL works by constructing a high-dimensional sphere around queried samples such that the model decision is unchanged for (noisy) variants of the sample within the sphere. This sphere of label-invariance creates ambiguity and prevents a querying adversary from correctly determining whether a sample is a member or a nonmember. We analytically characterize the success rate of an adversary carrying out a LAB MIA when LDL is deployed, and show that the formulation is consistent with experimental observations. We evaluate LDL on seven datasets -- CIFAR-10, CIFAR-100, GTSRB, Face, Purchase, Location, and Texas -- with varying sizes of training data. All of these datasets have been used by SOTA LAB MIAs. Our experiments demonstrate that LDL reduces the success rate of an adversary carrying out a LAB MIA in each case. We empirically compare LDL with defenses against LAB MIAs that require retraining of DNN models, and show that LDL performs favorably despite not needing to retrain the DNNs.
    Multi-view deep learning based molecule design and structural optimization accelerates the SARS-CoV-2 inhibitor discovery. (arXiv:2212.01575v1 [cs.LG])
    In this work, we propose MEDICO, a Multi-viEw Deep generative model for molecule generation, structural optimization, and the SARS-CoV-2 Inhibitor disCOvery. To the best of our knowledge, MEDICO is the first-of-its-kind graph generative model that can generate molecular graphs similar to the structure of targeted molecules, with a multi-view representation learning framework to sufficiently and adaptively learn comprehensive structural semantics from targeted molecular topology and geometry. We show that our MEDICO significantly outperforms the state-of-the-art methods in generating valid, unique, and novel molecules under benchmarking comparisons. In particular, we showcase that the multi-view deep learning model enables us to generate not only the molecules structurally similar to the targeted molecules but also the molecules with desired chemical properties, demonstrating the strong capability of our model in exploring the chemical space deeply. Moreover, case study results on targeted molecule generation for the SARS-CoV-2 main protease (Mpro) show that by integrating molecule docking into our model as chemical priors, we successfully generate new small molecules with desired drug-like properties for the Mpro, potentially accelerating the de novo design of Covid-19 drugs. Further, we apply MEDICO to the structural optimization of three well-known Mpro inhibitors (N3, 11a, and GC376) and achieve ~88% improvement in their binding affinity to Mpro, demonstrating the application value of our model for the development of therapeutics for SARS-CoV-2 infection.
    Learning-Assisted Algorithm Unrolling for Online Optimization with Budget Constraints. (arXiv:2212.01689v1 [cs.LG])
    Online optimization with multiple budget constraints is challenging since the online decisions over a short time horizon are coupled together by strict inventory constraints. The existing manually-designed algorithms cannot achieve satisfactory average performance for this setting because they often need a large number of time steps for convergence and/or may violate the inventory constraints. In this paper, we propose a new machine learning (ML) assisted unrolling approach, called LAAU (Learning-Assisted Algorithm Unrolling), which unrolls the online decision pipeline and leverages an ML model for updating the Lagrangian multiplier online. For efficient training via backpropagation, we derive gradients of the decision pipeline over time. We also provide the average cost bounds for two cases when training data is available offline and collected online, respectively. Finally, we present numerical results to highlight that LAAU can outperform the existing baselines.
    Exploring the Limits of Differentially Private Deep Learning with Group-wise Clipping. (arXiv:2212.01539v1 [cs.LG])
    Differentially private deep learning has recently witnessed advances in computational efficiency and privacy-utility trade-off. We explore whether further improvements along the two axes are possible and provide affirmative answers leveraging two instantiations of \emph{group-wise clipping}. To reduce the compute time overhead of private learning, we show that \emph{per-layer clipping}, where the gradient of each neural network layer is clipped separately, allows clipping to be performed in conjunction with backpropagation in differentially private optimization. This results in private learning that is as memory-efficient and almost as fast per training update as non-private learning for many workflows of interest. While per-layer clipping with constant thresholds tends to underperform standard flat clipping, per-layer clipping with adaptive thresholds matches or outperforms flat clipping under given training epoch constraints, hence attaining similar or better task performance within less wall time. To explore the limits of scaling (pretrained) models in differentially private deep learning, we privately fine-tune the 175 billion-parameter GPT-3. We bypass scaling challenges associated with clipping gradients that are distributed across multiple devices with \emph{per-device clipping} that clips the gradient of each model piece separately on its host device. Privately fine-tuning GPT-3 with per-device clipping achieves a task performance at $\epsilon=1$ better than what is attainable by non-privately fine-tuning the largest GPT-2 on a summarization task.
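    A simplified illustration of the per-layer clipping idea (this clips each parameter tensor's gradient separately and omits the per-example and noise-addition machinery of actual DP-SGD, so it is a structural sketch only):

        import torch

        def per_layer_clip_(model, max_norm=1.0):
            # Clip the gradient of each layer separately instead of clipping
            # the flat concatenated gradient once; this is what allows
            # clipping to be interleaved with backpropagation.
            for p in model.parameters():
                if p.grad is not None:
                    torch.nn.utils.clip_grad_norm_([p], max_norm)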
    Smoothing Policy Iteration for Zero-sum Markov Games. (arXiv:2212.01623v1 [cs.LG])
    Zero-sum Markov Games (MGs) have been an efficient framework for multi-agent systems and robust control, wherein a minimax problem is constructed to solve the equilibrium policies. At present, this formulation is well studied under tabular settings wherein the maximum operator is solved exactly to calculate the worst-case value function. However, it is non-trivial to extend such methods to handle complex tasks, as finding the maximum over large-scale action spaces is usually cumbersome. In this paper, we propose the smoothing policy iteration (SPI) algorithm to solve the zero-sum MGs approximately, where the maximum operator is replaced by the weighted LogSumExp (WLSE) function to obtain the nearly optimal equilibrium policies. Specifically, the adversarial policy serves as the weight function to enable efficient sampling over action spaces. We also prove the convergence of SPI and analyze its approximation error in the $\infty$-norm based on the contraction mapping theorem. Besides, we propose a model-based algorithm called Smooth adversarial Actor-critic (SaAC) by extending SPI with function approximations. The target value related to the WLSE function is evaluated from sampled trajectories, and a mean-square error loss is then constructed to optimize the value function; gradient-ascent-descent methods are adopted to optimize the protagonist and adversarial policies jointly. In addition, we incorporate the reparameterization technique in model-based gradient back-propagation to prevent the gradient vanishing due to sampling from the stochastic policies. We verify our algorithm in both tabular and function approximation settings. Results show that SPI can approximate the worst-case value function with high accuracy and SaAC can stabilize the training process and improve adversarial robustness by a large margin.
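    The WLSE operator itself is a one-liner; the sketch below shows its smooth-max behaviour (the temperature value and the particular weights are illustrative, not the paper's settings):

        import torch

        def wlse(q_values, weights, tau=0.1):
            # Weighted LogSumExp: a smooth surrogate for max_a q(a), with
            # the adversarial policy supplying the weights. As tau -> 0 it
            # approaches the exact maximum; larger tau smooths harder.
            return tau * torch.log(torch.sum(weights * torch.exp(q_values / tau)))

        q = torch.tensor([1.0, 2.0, 3.0])
        w = torch.tensor([0.2, 0.3, 0.5])  # adversarial policy as weights
        print(wlse(q, w))                  # ~2.93, close to max(q) = 3.0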
    RLogist: Fast Observation Strategy on Whole-slide Images with Deep Reinforcement Learning. (arXiv:2212.01737v1 [cs.CV])
    Whole-slide images (WSI) in computational pathology have high resolution with gigapixel size, but generally contain sparse regions of interest, which leads to weak diagnostic relevance and data inefficiency for many areas in the slide. Most of the existing methods rely on a multiple instance learning framework that requires densely sampling local patches at high magnification. The limitation is evident in the application stage as the heavy computation for extracting patch-level features is inevitable. In this paper, we develop RLogist, a deep reinforcement learning (DRL) method for learning fast observation strategies on WSIs. Imitating the diagnostic logic of human pathologists, our RL agent learns how to find regions of observation value and obtain representative features across multiple resolution levels, without having to analyze each part of the WSI at the high magnification. We benchmark our method on two whole-slide level classification tasks, including detection of metastases in WSIs of lymph node sections, and subtyping of lung cancer. Experimental results demonstrate that RLogist achieves competitive classification performance compared to typical multiple instance learning algorithms, while having a significantly shorter observation path. In addition, the observation path given by RLogist provides good decision-making interpretability, and its reading-path navigation ability can potentially be used by pathologists for educational/assistive purposes. Our code is available at: \url{https://github.com/tencent-ailab/RLogist}.
    FedRolex: Model-Heterogeneous Federated Learning with Rolling Sub-Model Extraction. (arXiv:2212.01548v1 [cs.LG])
    Most cross-device federated learning (FL) studies focus on the model-homogeneous setting where the global server model and local client models are identical. However, such a constraint not only excludes low-end clients who would otherwise make unique contributions to model training but also restrains clients from training large models due to on-device resource bottlenecks. In this work, we propose FedRolex, a partial training (PT)-based approach that enables model-heterogeneous FL and can train a global server model larger than the largest client model. At its core, FedRolex employs a rolling sub-model extraction scheme that allows different parts of the global server model to be evenly trained, which mitigates the client drift induced by the inconsistency between individual client models and server model architectures. We show that FedRolex outperforms state-of-the-art PT-based model-heterogeneous FL methods (e.g. Federated Dropout) and reduces the gap between model-heterogeneous and model-homogeneous FL, especially under the large-model large-dataset regime. In addition, we provide theoretical statistical analysis on its advantage over Federated Dropout and evaluate FedRolex on an emulated real-world device distribution to show that FedRolex can enhance the inclusiveness of FL and boost the performance of low-end devices that would otherwise not benefit from FL. Our code is available at https://github.com/MSU-MLSys-Lab/FedRolex.
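    A toy sketch of the rolling-window indexing behind sub-model extraction (advancing the window by one unit per round is an illustrative assumption; the released code at the URL above defines the exact schedule):

        def rolling_indices(round_idx, full_width, sub_width):
            # Select which hidden units of a global layer a client trains
            # this round; the window rolls forward over rounds so all parts
            # of the server model are trained evenly.
            start = round_idx % full_width
            return [(start + i) % full_width for i in range(sub_width)]

        print(rolling_indices(0, 8, 4))  # [0, 1, 2, 3]
        print(rolling_indices(6, 8, 4))  # [6, 7, 0, 1]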
    Domain-Invariant Feature Alignment Using Variational Inference For Partial Domain Adaptation. (arXiv:2212.01590v1 [cs.LG])
    The standard closed-set domain adaptation approaches seek to mitigate distribution discrepancies between two domains under the constraint that both share identical label sets. However, in realistic scenarios, finding an optimal source domain with an identical label space is a challenging task. Partial domain adaptation alleviates the problem of procuring a labeled dataset with identical label-space assumptions and addresses the more practical scenario where the source label set subsumes the target label set. This, however, presents a few additional obstacles during adaptation. Samples with categories private to the source domain thwart relevant knowledge transfer and degrade model performance. In this work, we address these issues by coupling variational information and adversarial learning with a pseudo-labeling technique to enforce class distribution alignment and minimize the transfer of superfluous information from the source samples. Experimental findings on numerous cross-domain classification tasks demonstrate that the proposed technique delivers accuracy superior or comparable to existing methods.
    Classification by sparse additive models. (arXiv:2212.01792v1 [math.ST])
    We consider (nonparametric) sparse additive models (SpAM) for classification. The design of a SpAM classifier is based on minimizing the logistic loss with sparse group Lasso/Slope-type penalties on the coefficients of the univariate components' expansions in orthonormal series (e.g., Fourier or wavelets). The resulting classifier is inherently adaptive to the unknown sparsity and smoothness. We show that it is nearly minimax (up to log-factors) within the entire range of analytic, Sobolev and Besov classes, and illustrate its performance on a real-data example.
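    A generic objective of the kind described is the penalized logistic loss

    $$\min_{\{\beta_j\}} \ \frac{1}{n}\sum_{i=1}^{n} \log\!\left(1 + e^{-y_i f(x_i)}\right) + \sum_{j=1}^{d} \lambda_j \left\|\beta_j\right\|_2, \qquad f(x) = \sum_{j=1}^{d} \sum_{k} \beta_{jk}\, \phi_k(x_j),$$

    where the $\phi_k$ are the orthonormal basis functions, $\beta_j$ collects the coefficients of the $j$-th univariate component, and the weights $\lambda_j$ encode Lasso- or Slope-type penalization; this is a sketch of the general form, and the paper's exact penalty may differ.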
    Learning-based Autonomous Channel Access in the Presence of Hidden Terminals. (arXiv:2207.03605v2 [cs.LG] UPDATED)
    We consider the problem of autonomous channel access (AutoCA), where a group of terminals tries to discover a communication strategy with an access point (AP) via a common wireless channel in a distributed fashion. Due to the irregular topology and the limited communication range of terminals, a practical challenge for AutoCA is the hidden terminal problem, which is notorious in wireless networks for degrading throughput and delay performance. To meet the challenge, this paper presents a new multi-agent deep reinforcement learning paradigm, dubbed MADRL-HT, tailored for AutoCA in the presence of hidden terminals. MADRL-HT exploits topological insights and transforms the observation space of each terminal into a scalable form independent of the number of terminals. To compensate for the partial observability, we put forth a look-back mechanism such that the terminals can infer the behaviors of their hidden terminals from the carrier-sensed channel states as well as feedback from the AP. A window-based global reward function is proposed, whereby the terminals are instructed to maximize the system throughput while balancing their transmission opportunities over the course of learning. Extensive numerical experiments verify the superior performance of our solution benchmarked against the legacy carrier-sense multiple access with collision avoidance (CSMA/CA) protocol.
    Modeling Wind Turbine Performance and Wake Interactions with Machine Learning. (arXiv:2212.01483v1 [physics.flu-dyn])
    Different machine learning (ML) models are trained on SCADA and meteorological data collected at an onshore wind farm and then assessed in terms of fidelity and accuracy for predictions of wind speed, turbulence intensity, and power capture at the turbine and wind farm levels for different wind and atmospheric conditions. ML methods for data quality control and pre-processing are applied to the data set under investigation and found to outperform standard statistical methods. A hybrid model, comprised of a linear interpolation model, Gaussian process, deep neural network (DNN), and support vector machine, paired with a DNN filter, is found to achieve high accuracy for modeling wind turbine power capture. Modifications of the incoming freestream wind speed and turbulence intensity, $TI$, due to the evolution of the wind field over the wind farm and effects associated with operating turbines are also captured using DNN models. Thus, turbine-level modeling is achieved using models for predicting power capture while farm-level modeling is achieved by combining models predicting wind speed and $TI$ at each turbine location from freestream conditions with models predicting power capture. Combining these models provides results consistent with expected power capture performance and holds promise for future endeavors in wind farm modeling and diagnostics. Though training ML models is computationally expensive, using the trained models to simulate the entire wind farm takes only a few seconds on a typical modern laptop computer, and the total computational cost is still lower than other available mid-fidelity simulation approaches.
    Beyond ADMM: A Unified Client-variance-reduced Adaptive Federated Learning Framework. (arXiv:2212.01519v1 [cs.LG])
    As a novel distributed learning paradigm, federated learning (FL) faces serious challenges in dealing with massive clients with heterogeneous data distributions and computation and communication resources. Various client-variance-reduction schemes and client sampling strategies have been introduced to improve the robustness of FL. Among others, primal-dual algorithms such as the alternating direction method of multipliers (ADMM) have been found to be resilient to heterogeneous data distributions and to outperform most of the primal-only FL algorithms. However, the reason behind this resilience has remained a mystery. In this paper, we first reveal that federated ADMM is essentially a client-variance-reduced algorithm. While this explains the inherent robustness of federated ADMM, the vanilla version of it cannot adapt to the degree of client heterogeneity. Besides, the global model at the server under client sampling is biased, which slows down practical convergence. To go beyond ADMM, we propose a novel primal-dual FL algorithm, termed FedVRA, that allows one to adaptively control the variance-reduction level and bias of the global model. In addition, FedVRA unifies several representative FL algorithms in the sense that they are either special instances of FedVRA or are close to it. Extensions of FedVRA to semi/un-supervised learning are also presented. Experiments based on (semi-)supervised image classification tasks demonstrate the superiority of FedVRA over the existing schemes in learning scenarios with massive heterogeneous clients and client sampling.
    Laplacian Convolutional Representation for Traffic Time Series Imputation. (arXiv:2212.01529v1 [cs.LG])
    Spatiotemporal traffic data imputation is of great significance in intelligent transportation systems and data-driven decision-making processes. To accurately reconstruct partially observed traffic data, we assert the importance of characterizing both global and local trends in traffic time series. In the literature, substantial prior works have demonstrated the effectiveness of utilizing the low-rankness property of traffic data via matrix/tensor completion models. In this study, we first introduce a Laplacian kernel into temporal regularization for characterizing local trends in traffic time series, which can be formulated as a circular convolution. Then, we develop a low-rank Laplacian convolutional representation (LCR) model by combining the nuclear norm of a circulant matrix with the Laplacian temporal regularization, which is shown to admit a unified framework with a fast Fourier transform solution of relatively low time complexity. Through extensive experiments on several traffic datasets, we demonstrate the superiority of LCR for imputing traffic time series with various behaviors (e.g., data noise and strong/weak periodicity). The proposed LCR model is an efficient and effective solution to large-scale traffic data imputation over the existing baseline models. The adapted datasets and Python implementation are publicly available at https://github.com/xinychen/transdim.
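    The circulant structure is what enables the fast Fourier transform solution; as a minimal sketch (with an illustrative first-order Laplacian kernel, not the paper's exact configuration), the temporal regularizer can be applied by circular convolution via the FFT:

        import numpy as np

        T = 8
        x = np.sin(2 * np.pi * np.arange(T) / T)   # toy traffic time series
        ell = np.zeros(T)
        ell[0], ell[1], ell[-1] = 2.0, -1.0, -1.0  # circular Laplacian kernel

        # circular convolution (ell * x) computed in O(T log T) via the FFT
        Lx = np.real(np.fft.ifft(np.fft.fft(ell) * np.fft.fft(x)))
        reg = 0.5 * np.sum(Lx ** 2)  # Laplacian temporal regularization term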
    Distribution Fitting for Combating Mode Collapse in GANs. (arXiv:2212.01521v1 [cs.LG])
    Mode collapse remains a major unsolved problem in generative adversarial networks. In this work, we analyze the causes of mode collapse from a new perspective. Due to nonuniform sampling in the training process, some sub-distributions may be missed while sampling data. Therefore, the GAN objective can reach its minimum even when the generated distribution differs from the real one. To alleviate the problem, we propose a global distribution fitting (GDF) method that uses a penalty term to constrain the generated data distribution. Without changing the global minimum of the GAN objective, GDF makes it harder to reach the minimum value when the generated distribution is not the same as the real one. Furthermore, we also propose a local distribution fitting (LDF) method to cope with the situation where the real distribution is unknown. Experiments on several benchmarks demonstrate the effectiveness and competitive performance of GDF and LDF.
    Demystifying Approximate RL with $\epsilon$-greedy Exploration: A Differential Inclusion View. (arXiv:2205.13617v2 [cs.LG] UPDATED)
    Q-learning and SARSA(0) with $\epsilon$-greedy exploration are leading reinforcement learning methods, and their tabular forms converge to the optimal Q-function under reasonable conditions. However, with function approximation, these methods exhibit strange behaviors, e.g., policy oscillation and chattering, convergence to different attractors (possibly even the worst policy) on different runs, etc., apart from the usual instability. Accordingly, a theory to explain these phenomena has been a long-standing open problem, even for basic linear function approximation (Sutton, 1999). Our work uses differential inclusion theory to provide the first framework for resolving this problem. We further illustrate via numerical examples how this framework helps explain these algorithms' asymptotic behaviors.
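    For reference, the exploration rule at the center of these methods is the standard $\epsilon$-greedy selection over a state's Q-values:

        import numpy as np

        def epsilon_greedy(q_row, eps, rng):
            # With probability eps pick a uniformly random action (explore);
            # otherwise pick a maximizing action (exploit, ties -> lowest index).
            if rng.random() < eps:
                return int(rng.integers(len(q_row)))
            return int(np.argmax(q_row))

        rng = np.random.default_rng(0)
        print(epsilon_greedy(np.array([0.1, 0.7, 0.3]), eps=0.1, rng=rng))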
    Learning with Combinatorial Optimization Layers: a Probabilistic Approach. (arXiv:2207.13513v2 [stat.ML] UPDATED)
    Combinatorial optimization (CO) layers in machine learning (ML) pipelines are a powerful tool to tackle data-driven decision tasks, but they come with two main challenges. First, the solution of a CO problem often behaves as a piecewise constant function of its objective parameters. Given that ML pipelines are typically trained using stochastic gradient descent, the absence of slope information is very detrimental. Second, standard ML losses do not work well in combinatorial settings. A growing body of research addresses these challenges through diverse methods. Unfortunately, the lack of well-maintained implementations slows down the adoption of CO layers. In this paper, building upon previous works, we introduce a probabilistic perspective on CO layers, which lends itself naturally to approximate differentiation and the construction of structured losses. We recover many approaches from the literature as special cases, and we also derive new ones. Based on this unifying perspective, we present InferOpt.jl, an open-source Julia package that 1) allows turning any CO oracle with a linear objective into a differentiable layer, and 2) defines adequate losses to train pipelines containing such layers. Our library works with arbitrary optimization algorithms, and it is fully compatible with Julia's ML ecosystem. We demonstrate its abilities using a pathfinding problem on video game maps as a guiding example, as well as three other applications from operations research.
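    One concrete instance of such a probabilistic perspective is the perturb-and-average construction: averaging oracle solutions under randomly perturbed objectives yields a smooth surrogate for the piecewise-constant solution map. The sketch below is in Python rather than the package's Julia, with a toy `one_hot_argmax` oracle standing in for an arbitrary linear-objective CO oracle:

        import numpy as np

        def perturbed_solution(theta, oracle, rng, n_samples=1000, sigma=0.5):
            # Monte Carlo smoothing: the mean of solutions of Gaussian-perturbed
            # objectives varies smoothly in theta (unlike the oracle itself).
            sols = [oracle(theta + sigma * rng.standard_normal(theta.shape))
                    for _ in range(n_samples)]
            return np.mean(sols, axis=0)

        def one_hot_argmax(theta):
            # Toy linear-objective oracle: select the single best item.
            e = np.zeros_like(theta)
            e[np.argmax(theta)] = 1.0
            return e

        rng = np.random.default_rng(0)
        print(perturbed_solution(np.array([1.0, 1.2, 0.5]), one_hot_argmax, rng))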
    Skin feature point tracking using deep feature encodings. (arXiv:2112.14159v2 [cs.CV] UPDATED)
    Facial feature tracking is a key component of imaging ballistocardiography (BCG), where accurate quantification of the displacement of facial keypoints is needed for good heart rate estimation. Skin feature tracking also enables video-based quantification of motor degradation in Parkinson's disease. Traditional computer vision algorithms include the Scale Invariant Feature Transform (SIFT), Speeded-Up Robust Features (SURF), and the Lucas-Kanade method (LK). These have long represented the state-of-the-art in efficiency and accuracy but fail when common deformations, like affine local transformations or illumination changes, are present. Over the past five years, deep convolutional neural networks have outperformed traditional methods for most computer vision tasks. We propose a feature-tracking pipeline that applies a convolutional stacked autoencoder to identify the crop in an image most similar to a reference crop containing the feature of interest. The autoencoder learns to map image crops into deep feature encodings specific to the object category it is trained on. We train the autoencoder on facial images and validate its ability to track skin features in general using manually labeled face and hand videos. The tracking errors of distinctive skin features (moles) are so small that, based on a $\chi^2$-test, we cannot exclude that they stem from the manual labelling. With a mean error of 0.6-4.2 pixels, our method outperformed the other methods in all but one scenario. More importantly, our method was the only one that did not diverge. We conclude that our method creates better feature descriptors for feature tracking, feature matching, and image registration than the traditional algorithms.
    Online Estimation of the Koopman Operator Using Fourier Features. (arXiv:2212.01503v1 [cs.RO])
    Transfer operators offer linear representations and global, physically meaningful features of nonlinear dynamical systems. Discovering transfer operators, such as the Koopman operator, requires carefully crafted dictionaries of observables acting on the states of the dynamical system. This is ad hoc and requires the full dataset for evaluation. In this paper, we offer an optimization scheme that allows joint learning of the observables and the Koopman operator from online data. Our results show we are able to reconstruct the evolution and represent the global features of complex dynamical systems.
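    As background for the dictionary-based estimation the abstract refers to, the sketch below shows an EDMD-style least-squares Koopman estimate over a fixed random Fourier feature dictionary; the paper instead learns the observables and the operator jointly from online data:

        import numpy as np

        def fourier_features(X, W, b):
            # Dictionary of observables: phi(x) = cos(W x + b), fixed at random here.
            return np.cos(X @ W.T + b)

        def koopman_lstsq(X, Y, W, b):
            # Least-squares fit of K with phi(x_{t+1}) ~= phi(x_t) K.
            PX, PY = fourier_features(X, W, b), fourier_features(Y, W, b)
            K, *_ = np.linalg.lstsq(PX, PY, rcond=None)
            return K

        rng = np.random.default_rng(0)
        traj = np.cumsum(rng.normal(size=(200, 2)), axis=0)  # toy trajectory
        X, Y = traj[:-1], traj[1:]                           # snapshot pairs
        W, b = rng.normal(size=(32, 2)), rng.uniform(0, 2 * np.pi, 32)
        print(koopman_lstsq(X, Y, W, b).shape)               # (32, 32)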
    GraphGDP: Generative Diffusion Processes for Permutation Invariant Graph Generation. (arXiv:2212.01842v1 [cs.LG])
    Graph generative models have broad applications in biology, chemistry and social science. However, modelling and understanding the generative process of graphs is challenging due to the discrete and high-dimensional nature of graphs, as well as permutation invariance to node orderings in underlying graph distributions. Current leading autoregressive models fail to capture the permutation-invariant nature of graphs due to their reliance on a generation ordering, and they have high time complexity. Here, we propose a continuous-time generative diffusion process for permutation-invariant graph generation to mitigate these issues. Specifically, we first construct a forward diffusion process defined by a stochastic differential equation (SDE), which smoothly converts graphs within the complex distribution to random graphs that follow a known edge probability. Solving the corresponding reverse-time SDE, graphs can be generated from newly sampled random graphs. To facilitate the reverse-time SDE, we design a position-enhanced graph score network, capturing the evolving structure and position information from perturbed graphs for permutation-equivariant score estimation. Under the evaluation of comprehensive metrics, our proposed generative diffusion process achieves competitive performance in graph distribution learning. Experimental results also show that GraphGDP can generate high-quality graphs in only 24 function evaluations, much faster than previous autoregressive models.
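    For reference, in the standard score-based formulation that such models build on, the reverse-time SDE takes the form

    $$\mathrm{d}\mathbf{G} = \left[\mathbf{f}(\mathbf{G}, t) - g(t)^2\, \nabla_{\mathbf{G}} \log p_t(\mathbf{G})\right] \mathrm{d}t + g(t)\, \mathrm{d}\bar{\mathbf{w}},$$

    where the score $\nabla_{\mathbf{G}} \log p_t(\mathbf{G})$ is what the position-enhanced graph score network estimates; the paper's graph-specific drift $\mathbf{f}$ and diffusion $g$ may differ from this generic template.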
    A Tutorial on Sparse Gaussian Processes and Variational Inference. (arXiv:2012.13962v13 [cs.LG] UPDATED)
    Gaussian processes (GPs) provide a framework for Bayesian inference that can offer principled uncertainty estimates for a large range of problems. For example, if we consider regression problems with Gaussian likelihoods, a GP model enjoys a posterior in closed form. However, identifying the posterior GP scales cubically with the number of training examples and requires storing all examples in memory. In order to overcome these obstacles, sparse GPs have been proposed that approximate the true posterior GP with pseudo-training examples. Importantly, the number of pseudo-training examples is user-defined and enables control over computational and memory complexity. In the general case, sparse GPs do not enjoy closed-form solutions and one has to resort to approximate inference. In this context, a convenient choice for approximate inference is variational inference (VI), where the problem of Bayesian inference is cast as an optimization problem -- namely, to maximize a lower bound of the log marginal likelihood. This paves the way for a powerful and versatile framework, where pseudo-training examples are treated as optimization arguments of the approximate posterior that are jointly identified together with hyperparameters of the generative model (i.e. prior and likelihood). The framework can naturally handle a wide scope of supervised learning problems, ranging from regression with heteroscedastic and non-Gaussian likelihoods to classification problems with discrete labels, but also problems with multidimensional labels. The purpose of this tutorial is to provide access to the basic matter for readers without prior knowledge in both GPs and VI. A proper exposition to the subject enables also access to more recent advances (like importance-weighted VI as well as interdomain, multioutput and deep GPs) that can serve as an inspiration for new research ideas.
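    In its common form, with $\mathbf{u}$ denoting the function values at the pseudo-training inputs, the variational lower bound maximized in this framework reads

    $$\mathcal{L} = \sum_{i=1}^{N} \mathbb{E}_{q(f_i)}\big[\log p(y_i \mid f_i)\big] - \mathrm{KL}\big(q(\mathbf{u})\,\|\,p(\mathbf{u})\big) \le \log p(\mathbf{y}),$$

    where $q(f_i)$ is the marginal of the approximate posterior at the $i$-th input; the pseudo-training examples and the hyperparameters of the generative model are identified jointly by maximizing $\mathcal{L}$.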
    Learning and Blending Robot Hugging Behaviors in Time and Space. (arXiv:2212.01507v1 [cs.RO])
    We introduce an imitation learning-based physical human-robot interaction algorithm capable of predicting appropriate robot responses in complex interactions involving a superposition of multiple interactions. Our proposed algorithm, Blending Bayesian Interaction Primitives (B-BIP), allows us to achieve responsive interactions in complex hugging scenarios, capable of reciprocating and adapting to a hug's motion and timing. We show that this algorithm is a generalization of prior work, for which the original formulation reduces to the particular case of a single interaction, and evaluate our method through both an extensive user study and empirical experiments. Our algorithm yields significantly better quantitative prediction error and more favorable participant responses with respect to accuracy, responsiveness, and timing, when compared to existing state-of-the-art methods.
    Hedging against Complexity: Distributionally Robust Optimization with Parametric Approximation. (arXiv:2212.01518v1 [math.OC])
    Empirical risk minimization (ERM) and distributionally robust optimization (DRO) are popular approaches for solving stochastic optimization problems that appear in operations management and machine learning. Existing generalization error bounds for these methods depend on either the complexity of the cost function or the dimension of the uncertain parameters; consequently, the performance of these methods is poor for high-dimensional problems with highly complex objective functions. We propose a simple approach in which the distribution of uncertain parameters is approximated using a parametric family of distributions. This mitigates both sources of complexity; however, it introduces a model misspecification error. We show that this new source of error can be controlled by suitable DRO formulations. Our proposed parametric DRO approach has significantly improved generalization bounds over existing ERM / DRO methods and parametric ERM for a wide variety of settings. Our method is particularly effective under distribution shifts. We also illustrate the superior performance of our approach on both synthetic and real-data portfolio optimization and regression tasks.
    Constrained Reinforcement Learning via Dissipative Saddle Flow Dynamics. (arXiv:2212.01505v1 [cs.LG])
    In constrained reinforcement learning (C-RL), an agent seeks to learn from the environment a policy that maximizes the expected cumulative reward while satisfying minimum requirements in secondary cumulative reward constraints. Several algorithms rooted in sample-based primal-dual methods have recently been proposed to solve this problem in policy space. However, such methods are based on stochastic gradient descent-ascent algorithms whose trajectories are connected to the optimal policy only after a mixing output stage that depends on the algorithm's history. As a result, there is a mismatch between the behavioral policy and the optimal one. In this work, we propose a novel algorithm for constrained RL that does not suffer from these limitations. Leveraging recent results on regularized saddle-flow dynamics, we develop a novel stochastic gradient descent-ascent algorithm whose trajectories converge to the optimal policy almost surely.
    Shisha: Online scheduling of CNN pipelines on heterogeneous architectures. (arXiv:2202.11575v2 [cs.PF] UPDATED)
    Chiplets have become a common methodology in modern chip design. Chiplets improve yield and enable heterogeneity at the level of cores, the memory subsystem and the interconnect. Convolutional Neural Networks (CNNs) have high computational, bandwidth and memory capacity requirements owing to the increasingly large amount of weights. Thus, to exploit chiplet-based architectures, CNNs must be optimized in terms of scheduling and workload distribution among computing resources. We propose Shisha, an online approach to generate and schedule parallel CNN pipelines on chiplet architectures. Shisha targets heterogeneity in compute performance and memory bandwidth and tunes the pipeline schedule through a fast online exploration technique. We compare Shisha with Simulated Annealing, Hill Climbing and Pipe-Search. On average, the convergence time is improved by ~35x with Shisha compared to other exploration algorithms. Despite the quick exploration, Shisha's solution is often better than that of other heuristic exploration algorithms.
    Scalable Classifier-Agnostic Channel Selection for Multivariate Time Series Classification. (arXiv:2206.09274v2 [cs.LG] UPDATED)
    Accuracy is a key focus of current work in time series classification. However, in many applications speed and data reduction are equally important, especially when the data scale and storage requirements increase rapidly. Current MTSC algorithms need hundreds of compute hours to complete training and prediction. This is due to the nature of multivariate time series data, which grows with the number of time series, their length and the number of channels. In many applications, not all the channels are useful for the classification task; hence we require methods that can efficiently select useful channels and thus save computational resources. We propose and evaluate two methods for channel selection. Our techniques work by representing each class by a prototype time series and performing channel selection based on the prototype distances between classes. The main hypothesis is that useful channels enable better separation between classes; hence, channels with higher distances between class prototypes are more useful. On the UEA Multivariate Time Series Classification (MTSC) benchmark, we show that these techniques achieve significant data reduction and classifier speedup for similar levels of classification accuracy. Channel selection is applied as a pre-processing step before training state-of-the-art MTSC algorithms and saves about 70\% of computation time and data storage, with preserved accuracy. Furthermore, our methods enable even efficient classifiers, such as ROCKET, to achieve better accuracy than using no channel selection or forward channel selection. To further study the impact of our techniques, we present experiments on classifying synthetic multivariate time series datasets with more than 100 channels, as well as a real-world case study on a dataset with 50 channels. Our channel selection methods lead to significant data reduction with preserved or improved accuracy.
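    A minimal sketch of the prototype-distance idea, assuming class prototypes are channel-wise mean series and distances are Euclidean (the paper evaluates several concrete prototype and distance choices):

        import numpy as np

        def channel_scores(X, y):
            # X: (n_samples, n_channels, length); y: (n_samples,) class labels.
            classes = np.unique(y)
            # one prototype series per (class, channel): mean over class samples
            protos = np.stack([X[y == c].mean(axis=0) for c in classes])
            scores = np.zeros(X.shape[1])
            for i in range(len(classes)):             # sum pairwise prototype
                for j in range(i + 1, len(classes)):  # distances per channel
                    scores += np.linalg.norm(protos[i] - protos[j], axis=-1)
            return np.argsort(scores)[::-1]  # channels, most useful first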
    Influence of uncertainty estimation techniques on false-positive reduction in liver lesion detection. (arXiv:2206.10911v2 [eess.IV] UPDATED)
    Deep learning techniques show success in detecting objects in medical images, but still suffer from false-positive predictions that may hinder accurate diagnosis. The estimated uncertainty of the neural network output has been used to flag incorrect predictions. We study the role played by features computed from neural network uncertainty estimates and shape-based features computed from binary predictions in reducing false positives in liver lesion detection by developing a classification-based post-processing step for different uncertainty estimation methods. We demonstrate an improvement in the lesion detection performance of the neural network (with respect to F1-score) for all uncertainty estimation methods on two datasets, comprising abdominal MR and CT images, respectively. We show that features computed from neural network uncertainty estimates tend not to contribute much toward reducing false positives. Our results show that factors like class imbalance (true over false positive ratio) and shape-based features extracted from uncertainty maps play an important role in distinguishing false positive from true positive predictions. Our code can be found at https://github.com/ishaanb92/FPCPipeline.
    Concentration inequalities and optimal number of layers for stochastic deep neural networks. (arXiv:2206.11241v3 [cs.LG] UPDATED)
    We state concentration inequalities for the output of the hidden layers of a stochastic deep neural network (SDNN), as well as for the output of the whole SDNN. These results allow us to introduce an expected classifier (EC) and to give a probabilistic upper bound for the classification error of the EC. We also determine the optimal number of layers for the SDNN via an optimal stopping procedure. We apply our analysis to a stochastic version of a feedforward neural network with ReLU activation function.
    Continual Learning for On-Device Speech Recognition using Disentangled Conformers. (arXiv:2212.01393v1 [eess.AS])
    Automatic speech recognition research focuses on training and evaluating on static datasets. Yet, as speech models are increasingly deployed on personal devices, such models encounter user-specific distributional shifts. To simulate this real-world scenario, we introduce LibriContinual, a continual learning benchmark for speaker-specific domain adaptation derived from LibriVox audiobooks, with data corresponding to 118 individual speakers and 6 train splits per speaker of different sizes. Additionally, current speech recognition models and continual learning algorithms are not optimized to be compute-efficient. We adapt a general-purpose training algorithm, NetAug, for ASR and create a novel Conformer variant called the DisConformer (Disentangled Conformer). This algorithm produces ASR models consisting of a frozen 'core' network for general-purpose use and several tunable 'augment' networks for speaker-specific tuning. Using such models, we propose a novel compute-efficient continual learning algorithm called DisentangledCL. Our experiments show that the DisConformer models significantly outperform baselines on general ASR, i.e., LibriSpeech (15.58% rel. WER on test-other). On speaker-specific LibriContinual, they significantly outperform trainable-parameter-matched baselines (by 20.65% rel. WER on test) and even match fully finetuned baselines in some settings.
    PreQuEL: Quality Estimation of Machine Translation Outputs in Advance. (arXiv:2205.09178v2 [cs.CL] UPDATED)
    We present the task of PreQuEL, Pre-(Quality-Estimation) Learning. A PreQuEL system predicts how well a given sentence will be translated, without recourse to the actual translation, thus eschewing unnecessary resource allocation when translation quality is bound to be low. PreQuEL can be defined relative to a given MT system (e.g., some industry service) or generally relative to the state-of-the-art. From a theoretical perspective, PreQuEL places the focus on the source text, tracing properties, possibly linguistic features, that make a sentence harder to machine translate. We develop a baseline model for the task and analyze its performance. We also develop a data augmentation method (from parallel corpora) that improves results substantially. We show that this augmentation method can improve the performance of the Quality-Estimation task as well. We investigate the properties of the input text that our model is sensitive to by testing it on challenge sets and different languages. We conclude that it is aware of syntactic and semantic distinctions, and that it correlates with, and even over-emphasizes, the importance of standard NLP features.
    Principal Geodesic Analysis of Merge Trees (and Persistence Diagrams). (arXiv:2207.10960v3 [cs.GR] UPDATED)
    This paper presents a computational framework for the Principal Geodesic Analysis of merge trees (MT-PGA), a novel adaptation of the celebrated Principal Component Analysis (PCA) framework [87] to the Wasserstein metric space of merge trees [92]. We formulate MT-PGA computation as a constrained optimization problem, aiming at adjusting a basis of orthogonal geodesic axes, while minimizing a fitting energy. We introduce an efficient, iterative algorithm which exploits shared-memory parallelism, as well as an analytic expression of the fitting energy gradient, to ensure fast iterations. Our approach also trivially extends to extremum persistence diagrams. Extensive experiments on public ensembles demonstrate the efficiency of our approach - with MT-PGA computations in the orders of minutes for the largest examples. We show the utility of our contributions by extending to merge trees two typical PCA applications. First, we apply MT-PGA to data reduction and reliably compress merge trees by concisely representing them by their first coordinates in the MT-PGA basis. Second, we present a dimensionality reduction framework exploiting the first two directions of the MT-PGA basis to generate two-dimensional layouts of the ensemble. We augment these layouts with persistence correlation views, enabling global and local visual inspections of the feature variability in the ensemble. In both applications, quantitative experiments assess the relevance of our framework. Finally, we provide a C++ implementation that can be used to reproduce our results.
    Conditional Antibody Design as 3D Equivariant Graph Translation. (arXiv:2208.06073v3 [q-bio.BM] UPDATED)
    Antibody design is valuable for therapeutic usage and biological research. Existing deep-learning-based methods encounter several key issues: 1) incomplete context for the generation of Complementarity-Determining Regions (CDRs); 2) the inability to capture the entire 3D geometry of the input structure; 3) inefficient prediction of CDR sequences in an autoregressive manner. In this paper, we propose the Multi-channel Equivariant Attention Network (MEAN), an end-to-end model that is able to co-design the 1D sequences and 3D structures of CDRs. To be specific, MEAN formulates antibody design as a conditional graph translation problem by importing extra components including the target antigen and the light chain of the antibody. Then, MEAN resorts to E(3)-equivariant message passing along with a proposed attention mechanism to better capture the geometrical correlation between different components. Finally, it outputs both the 1D sequences and the 3D structure via a multi-round progressive full-shot scheme, which is more efficient than previous autoregressive approaches. Our method significantly surpasses state-of-the-art models in sequence and structure modeling, antigen-binding antibody design, and binding affinity optimization. Specifically, the relative improvement over baselines is about 23% in antigen-binding CDR design and 34% in affinity optimization.
    CrossSplit: Mitigating Label Noise Memorization through Data Splitting. (arXiv:2212.01674v1 [cs.CV])
    We approach the problem of improving the robustness of deep learning algorithms in the presence of label noise. Building upon existing label correction and co-teaching methods, we propose a novel training procedure to mitigate the memorization of noisy labels, called CrossSplit, which uses a pair of neural networks trained on two disjoint parts of the dataset. CrossSplit combines two main ingredients: (i) Cross-split label correction. The idea is that, since the model trained on one part of the data cannot memorize example-label pairs from the other part, the training labels presented to each network can be smoothly adjusted using the predictions of its peer network; (ii) Cross-split semi-supervised training. A network trained on one part of the data also uses the unlabeled inputs of the other part. Extensive experiments on CIFAR-10, CIFAR-100, Tiny-ImageNet and mini-WebVision datasets demonstrate that our method can outperform the current state-of-the-art at noise ratios of up to 90%.
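    A minimal sketch of the cross-split label-correction ingredient, assuming a simple convex mixing rule with a hypothetical weight `alpha` (the paper's exact adjustment may differ):

        import torch

        def peer_corrected_labels(onehot_labels, peer_probs, alpha=0.3):
            # Smooth the labels of one split toward the predictions of the
            # network trained on the *other* split, which cannot have
            # memorized these example-label pairs.
            return (1.0 - alpha) * onehot_labels + alpha * peer_probs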
    Hyperbolic Curvature Graph Neural Network. (arXiv:2212.01793v1 [cs.LG])
    Hyperbolic space is emerging as a promising learning space for representation learning, owing to its exponentially growing volume. Compared with the flat Euclidean space, the curved hyperbolic space is far more spacious and embeddable, particularly for datasets with implicit tree-like architectures, such as hierarchies and power-law distributions. On the other hand, the structure of a real-world network is usually intricate, with some regions being tree-like, some being flat, and others being circular. Directly embedding heterogeneous structural networks into a homogeneous embedding space unavoidably brings inductive biases and distortions. Encouragingly, the discrete curvature can well describe the local structure of a node and its surroundings, which motivates us to explicitly investigate the information conveyed by the network topology in order to improve geometric learning. To this end, we explore the properties of the local discrete curvature of the graph topology and the continuous global curvature of the embedding space. We further propose a Hyperbolic Curvature-aware Graph Neural Network, HCGNN. In particular, HCGNN utilizes the discrete curvature to guide message passing over the surroundings and adaptively adjusts the continuous curvature simultaneously. Extensive experiments on node classification and link prediction tasks show that the proposed method outperforms various competitive models by a large margin on both high and low hyperbolic graph data. Case studies further illustrate the efficacy of discrete curvature in finding local clusters and alleviating the distortion caused by hyperbolic geometry.
    Testing Tail Weight of a Distribution Via Hazard Rate. (arXiv:2010.02888v2 [cs.LG] UPDATED)
    Understanding the shape of a distribution of data is of interest to people in a great variety of fields, as it may affect the types of algorithms used for that data. We study one such problem in the framework of distribution property testing, characterizing the number of samples required to distinguish whether a distribution has a certain property or is far from having that property. In particular, given samples from a distribution, we seek to characterize the tail of the distribution, that is, to understand how many elements appear infrequently. We develop an algorithm, based on a careful bucketing scheme, that distinguishes light-tailed distributions from non-light-tailed ones with respect to a definition based on the hazard rate, under natural smoothness and ordering assumptions. We bound the number of samples required for this test to succeed with high probability in terms of the parameters of the problem, showing that it is polynomial in these parameters. Further, we prove a hardness result implying that this problem cannot be solved without any assumptions.
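    For reference, the hazard rate of a distribution with density $f$ and CDF $F$ is

    $$h(x) = \frac{f(x)}{1 - F(x)},$$

    and, roughly speaking, light-tailed distributions are those whose hazard rate eventually stays bounded away from zero; the paper's precise definition adds the smoothness and ordering assumptions mentioned above.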
    Reinforcement Learning for Branch-and-Bound Optimisation using Retrospective Trajectories. (arXiv:2205.14345v3 [cs.LG] UPDATED)
    Combinatorial optimisation problems framed as mixed integer linear programmes (MILPs) are ubiquitous across a range of real-world applications. The canonical branch-and-bound algorithm seeks to exactly solve MILPs by constructing a search tree of increasingly constrained sub-problems. In practice, its solving time performance is dependent on heuristics, such as the choice of the next variable to constrain ('branching'). Recently, machine learning (ML) has emerged as a promising paradigm for branching. However, prior works have struggled to apply reinforcement learning (RL), citing sparse rewards, difficult exploration, and partial observability as significant challenges. Instead, leading ML methodologies resort to approximating high quality handcrafted heuristics with imitation learning (IL), which precludes the discovery of novel policies and requires expensive data labelling. In this work, we propose retro branching; a simple yet effective approach to RL for branching. By retrospectively deconstructing the search tree into multiple paths each contained within a sub-tree, we enable the agent to learn from shorter trajectories with more predictable next states. In experiments on four combinatorial tasks, our approach enables learning-to-branch without any expert guidance or pre-training. We outperform the current state-of-the-art RL branching algorithm by 3-5x and come within 20% of the best IL method's performance on MILPs with 500 constraints and 1000 variables, with ablations verifying that our retrospectively constructed trajectories are essential to achieving these results.
    Lottery Pools: Winning More by Interpolating Tickets without Increasing Training or Inference Cost. (arXiv:2208.10842v3 [cs.LG] UPDATED)
    Lottery tickets (LTs) are accurate, sparse subnetworks that can be trained in isolation to match the performance of dense networks. Ensembling, in parallel, is one of the oldest time-proven tricks in machine learning for improving performance by combining the outputs of multiple independent models. However, the benefits of ensembling in the context of LTs are diluted, since ensembling does not directly lead to stronger sparse subnetworks but merely leverages their predictions for a better decision. In this work, we first observe that directly averaging the weights of adjacent learned subnetworks significantly boosts the performance of LTs. Encouraged by this observation, we further propose an alternative way to perform an 'ensemble' over the subnetworks identified by iterative magnitude pruning via a simple interpolating strategy. We call our method Lottery Pools. In contrast to the naive ensemble, which brings no performance gains to each single subnetwork, Lottery Pools yields much stronger sparse subnetworks than the original LTs without requiring any extra training or inference cost. Across various modern architectures on CIFAR-10/100 and ImageNet, we show that our method achieves significant performance gains in both in-distribution and out-of-distribution scenarios. Impressively, evaluated with VGG-16 and ResNet-18, the produced sparse subnetworks outperform the original LTs by up to 1.88% on CIFAR-100 and 2.36% on CIFAR-100-C; the resulting dense network surpasses the pre-trained dense model by up to 2.22% on CIFAR-100 and 2.38% on CIFAR-100-C.
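    The weight-averaging observation can be sketched in a few lines; here `state_dicts` are assumed to be checkpoints of adjacent subnetworks from iterative magnitude pruning, and the interpolation coefficients are an illustrative choice:

        import torch

        def interpolate_checkpoints(state_dicts, alphas):
            # Weight-space interpolation of subnetwork checkpoints; equal
            # alphas recover plain averaging of adjacent lottery tickets.
            assert abs(sum(alphas) - 1.0) < 1e-6
            return {k: sum(a * sd[k].float() for a, sd in zip(alphas, state_dicts))
                    for k in state_dicts[0]}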
    An Improved Algorithm For Online Reranking. (arXiv:2209.04870v2 [cs.DS] UPDATED)
    We study a fundamental model of online preference aggregation, where an algorithm maintains an ordered list of $n$ elements. An input is a stream of preferred sets $R_1, R_2, \dots, R_t, \dots$. Upon seeing $R_t$ and without knowledge of any future sets, an algorithm has to rerank elements (change the list ordering), so that at least one element of $R_t$ is found near the list front. The incurred cost is a sum of the list update costs (the number of swaps of neighboring list elements) and access costs (position of the first element of $R_t$ on the list). This scenario occurs naturally in applications such as ordering items in an online shop using aggregated preferences of shop customers. The theoretical underpinning of this problem is known as Min-Sum Set Cover. Unlike previous work (Fotakis et al., ICALP 2020, NIPS 2020) that mostly studied the performance of an online algorithm ALG against the static optimal solution (a single optimal list ordering), in this paper, we study an arguably harder variant where the benchmark is the provably stronger optimal dynamic solution OPT (that may also modify the list ordering). In terms of an online shop, this means that the aggregated preferences of its user base evolve with time. We construct a computationally efficient randomized algorithm whose competitive ratio (ALG-to-OPT cost ratio) is $O(r^2)$ and prove the existence of a deterministic $O(r^4)$-competitive algorithm. Here, $r$ is the maximum cardinality of sets $R_t$. This is the first algorithm whose ratio does not depend on $n$: the previously best algorithm for this problem was $O(r^{3/2} \cdot \sqrt{n})$-competitive and $\Omega(r)$ is a lower bound on the performance of any deterministic online algorithm.
    Thermal half-lives of azobenzene derivatives: virtual screening based on intersystem crossing using a machine learning potential. (arXiv:2207.11592v4 [physics.chem-ph] UPDATED)
    Molecular photoswitches are the foundation of light-activated drugs. A key photoswitch is azobenzene, which exhibits trans-cis isomerism in response to light. The thermal half-life of the cis isomer is of crucial importance, since it controls the duration of the light-induced biological effect. Here we introduce a computational tool for predicting the thermal half-lives of azobenzene derivatives. Our automated approach uses a fast and accurate machine learning potential trained on quantum chemistry data. Building on well-established earlier evidence, we argue that thermal isomerization proceeds through rotation mediated by intersystem crossing, and incorporate this mechanism into our automated workflow. We use our approach to predict the thermal half-lives of 19,000 azobenzene derivatives. We explore trends and tradeoffs between barriers and absorption wavelengths, and open-source our data and software to accelerate research in photopharmacology.
    Reconstructing Training Data from Trained Neural Networks. (arXiv:2206.07758v3 [cs.LG] UPDATED)
    Understanding to what extent neural networks memorize training data is an intriguing question with practical and theoretical implications. In this paper we show that in some cases a significant fraction of the training data can in fact be reconstructed from the parameters of a trained neural network classifier. We propose a novel reconstruction scheme that stems from recent theoretical results about the implicit bias in training neural networks with gradient-based methods. To the best of our knowledge, our results are the first to show that reconstructing a large portion of the actual training samples from a trained neural network classifier is generally possible. This has negative implications on privacy, as it can be used as an attack for revealing sensitive training data. We demonstrate our method for binary MLP classifiers on a few standard computer vision datasets.
    General Cutting Planes for Bound-Propagation-Based Neural Network Verification. (arXiv:2208.05740v2 [cs.LG] UPDATED)
    Bound propagation methods, when combined with branch and bound, are among the most effective methods to formally verify properties of deep neural networks such as correctness, robustness, and safety. However, existing works cannot handle the general form of cutting plane constraints widely accepted in traditional solvers, which are crucial for strengthening verifiers with tightened convex relaxations. In this paper, we generalize the bound propagation procedure to allow the addition of arbitrary cutting plane constraints, including those involving relaxed integer variables that do not appear in existing bound propagation formulations. Our generalized bound propagation method, GCP-CROWN, opens up the opportunity to apply general cutting plane methods for neural network verification while benefiting from the efficiency and GPU acceleration of bound propagation methods. As a case study, we investigate the use of cutting planes generated by off-the-shelf mixed integer programming (MIP) solver. We find that MIP solvers can generate high-quality cutting planes for strengthening bound-propagation-based verifiers using our new formulation. Since the branching-focused bound propagation procedure and the cutting-plane-focused MIP solver can run in parallel utilizing different types of hardware (GPUs and CPUs), their combination can quickly explore a large number of branches with strong cutting planes, leading to strong verification performance. Experiments demonstrate that our method is the first verifier that can completely solve the oval20 benchmark and verify twice as many instances on the oval21 benchmark compared to the best tool in VNN-COMP 2021, and also noticeably outperforms state-of-the-art verifiers on a wide range of benchmarks. GCP-CROWN is part of the $\alpha,\!\beta$-CROWN verifier, the VNN-COMP 2022 winner. Code is available at this http URL
    Avoiding spurious correlations via logit correction. (arXiv:2212.01433v1 [cs.LG])
    Empirical studies suggest that machine learning models trained with empirical risk minimization (ERM) often rely on attributes that may be spuriously correlated with the class labels. Such models typically lead to poor performance during inference for data lacking such correlations. In this work, we explicitly consider a situation where potential spurious correlations are present in the majority of training data. In contrast with existing approaches, which use the ERM model outputs to detect the samples without spurious correlations and either heuristically upweight or upsample those samples, we propose the logit correction (LC) loss, a simple yet effective improvement on the softmax cross-entropy loss, to correct the sample logits. We demonstrate that minimizing the LC loss is equivalent to maximizing the group-balanced accuracy, so the proposed LC can mitigate the negative impacts of spurious correlations. Our extensive experimental results further reveal that the proposed LC loss outperforms SoTA solutions on multiple popular benchmarks by a large margin, an average 5.5% absolute improvement, without access to spurious attribute labels. LC is also competitive with oracle methods that make use of the attribute labels. Code is available at https://github.com/shengliu66/LC.
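    A minimal sketch in the spirit of logit correction, assuming the standard logit-adjustment form in which (class or group) log-priors are added to the logits before the softmax so that majority groups must win by a larger margin; the paper's exact LC loss may differ:

        import torch
        import torch.nn.functional as F

        def logit_adjusted_ce(logits, targets, log_priors, tau=1.0):
            # log_priors: per-class (or per-group) log frequencies, an
            # assumption of this sketch; broadcasts over the batch dimension.
            return F.cross_entropy(logits + tau * log_priors, targets)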
    SMARTQUERY: An Active Learning Framework for Graph Neural Networks through Hybrid Uncertainty Reduction. (arXiv:2212.01440v1 [cs.LG])
    Graph neural networks have achieved significant success in representation learning. However, the performance gains come at a cost: acquiring comprehensive labeled data for training can be prohibitively expensive. Active learning mitigates this issue by searching the unexplored data space and prioritizing the selection of data that maximizes the model's performance gain. In this paper, we propose SMARTQUERY, a framework to learn a graph neural network with very few labeled nodes using a hybrid uncertainty reduction function. This is achieved in two key steps: (a) designing a multi-stage active graph learning framework by exploiting diverse explicit graph information and (b) introducing label propagation to efficiently exploit known labels and assess the implicit embedding information. Using a comprehensive set of experiments on three network datasets, we demonstrate the competitive performance of our method against state-of-the-art methods using very few labeled data (up to 5 labeled nodes per class).
    Identifying Heterogeneous Treatment Effects in Multiple Outcomes using Joint Confidence Intervals. (arXiv:2212.01437v1 [cs.LG])
    Heterogeneous treatment effects (HTEs) are commonly identified during randomized controlled trials (RCTs). Identifying subgroups of patients with similar treatment effects is of high interest in clinical research to advance precision medicine. Often, multiple clinical outcomes are measured during an RCT, each having a potentially heterogeneous effect. Recently there has been high interest in identifying subgroups from HTEs; however, there has been less focus on developing tools for settings with multiple outcomes. In this work, we propose a framework for partitioning the covariate space to identify subgroups across multiple outcomes based on joint confidence intervals (CIs). We test our algorithm on synthetic and semi-synthetic data with two outcomes, and demonstrate that our algorithm is able to capture the HTE in both outcomes simultaneously.
    Brain Tumor Synthetic Data Generation with Adaptive StyleGANs. (arXiv:2212.01772v1 [eess.IV])
    Generative models have been very successful over the years and have received significant attention for synthetic data generation. As deep learning models become more and more complex, they require large amounts of data to perform accurately. In medical image analysis, such generative models play a crucial role, as the available data is limited due to challenges related to data privacy, lack of data diversity, or uneven data distributions. In this paper, we present a method to generate brain tumor MRI images using generative adversarial networks. We utilize StyleGAN2 with the ADA methodology to generate high-quality brain MRI with tumors while using a significantly smaller amount of training data compared to existing approaches. We use three pre-trained models for transfer learning. Results demonstrate that the proposed method can learn the distributions of brain tumors. Furthermore, the model can generate high-quality synthetic brain MRI with tumors, which can mitigate small-sample-size issues. The approach addresses limited data availability by generating realistic-looking brain MRI with tumors. The code is available at: \url{https://github.com/rizwanqureshi123/Brain-Tumor-Synthetic-Data}.  ( 2 min )
    Grammar Detection for Sentiment Analysis through Improved Viterbi Algorithm. (arXiv:2205.13148v2 [cs.CL] UPDATED)
    Grammar detection, also referred to as part-of-speech tagging of raw text, is considered an underlying building block of various natural language processing pipelines such as named entity recognition, question answering, and sentiment analysis. In short, for a given sentence, part-of-speech tagging is the task of specifying and tagging each word of the sentence as a noun, verb, adjective, adverb, and so on. Sentiment analysis is a procedure used to determine whether a given sentence's emotional tone is neutral, positive or negative. To assign polarity scores to topics or entities within a phrase, approaches from in-text analysis and analytics, machine learning, and natural language processing are incorporated. Sentiment analysis using a POS tagger helps us obtain a summary of broader public opinion on a specific topic. For this, we use the Viterbi algorithm, a Hidden Markov Model, and a constraint-based Viterbi algorithm for POS tagging. By comparing their accuracies, we select the most accurate model for the sentiment analysis used to determine the nature of the sentence.  ( 2 min )
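    For reference, the core decoding step is the standard log-space Viterbi recursion for an HMM tagger; the constraint-based variant modifies the transition scores, which this sketch does not attempt:

        import numpy as np

        def viterbi(obs, pi, A, B):
            # obs: observation (word) indices; pi: (S,) initial log-probs;
            # A: (S, S) transition log-probs; B: (S, V) emission log-probs.
            S, T = A.shape[0], len(obs)
            delta = np.full((T, S), -np.inf)   # best log-prob ending in each tag
            psi = np.zeros((T, S), dtype=int)  # backpointers
            delta[0] = pi + B[:, obs[0]]
            for t in range(1, T):
                scores = delta[t - 1][:, None] + A   # (from_tag, to_tag)
                psi[t] = scores.argmax(axis=0)
                delta[t] = scores.max(axis=0) + B[:, obs[t]]
            path = [int(delta[-1].argmax())]
            for t in range(T - 1, 0, -1):        # follow backpointers
                path.append(int(psi[t][path[-1]]))
            return path[::-1]                    # most likely tag sequence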
    The Open Kidney Ultrasound Data Set. (arXiv:2206.06657v2 [eess.IV] UPDATED)
    Ultrasound, because of its low-cost, non-ionizing, and non-invasive characteristics, has established itself as a cornerstone radiological examination. Research on ultrasound applications has also expanded, especially through image analysis with machine learning. However, ultrasound data are frequently restricted to closed data sets, with only a few openly available. Despite being a frequently examined organ, the kidney lacks a publicly available ultrasonography data set. The proposed Open Kidney Ultrasound Data Set is the first publicly available set of kidney brightness mode (B-mode) ultrasound data that includes annotations for multi-class semantic segmentation. It is based on data retrospectively collected over a 5-year period from over 500 patients with a mean age of 53.2 +/- 14.7 years, a body mass index of 27.0 +/- 5.4 kg/m2, and the most common primary diseases being diabetes mellitus, immunoglobulin A (IgA) nephropathy, and hypertension. There are labels for the view as well as fine-grained manual annotations from two expert sonographers. Notably, the data include native and transplanted kidneys. Initial benchmarking measurements are performed, demonstrating a state-of-the-art algorithm achieving a Dice-Sorensen Coefficient of 0.85 for the kidney capsule. This is a high-quality data set, including two sets of expert annotations, with a greater breadth of images than previously available. By increasing access to kidney ultrasound data, future researchers may be able to create novel image analysis techniques for tissue characterization, disease detection, and prognostication.  ( 2 min )
    Welfare and Fairness in Multi-objective Reinforcement Learning. (arXiv:2212.01382v1 [cs.GT])
    We study fair multi-objective reinforcement learning, in which an agent must learn a policy that simultaneously achieves high reward on multiple dimensions of a vector-valued reward. Motivated by the fair resource allocation literature, we model this as an expected welfare maximization problem for some non-linear fair welfare function of the vector of long-term cumulative rewards. One canonical example of such a function is the Nash Social Welfare, or geometric mean, whose log transform is also known as the Proportional Fairness objective. We show that optimizing the expected Nash Social Welfare even approximately is computationally intractable, even in the tabular case. Nevertheless, we provide a novel adaptation of Q-learning that combines non-linear scalarized learning updates and non-stationary action selection to learn effective policies for optimizing nonlinear welfare functions. We show that our algorithm is provably convergent, and we demonstrate experimentally that our approach outperforms techniques based on linear scalarization, mixtures of optimal linear scalarizations, or stationary action selection for the Nash Social Welfare objective.
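    Concretely, for a $d$-dimensional vector $\mathbf{r}$ of long-term cumulative rewards, the Nash Social Welfare and its log transform (the Proportional Fairness objective) are

    $$\mathrm{NSW}(\mathbf{r}) = \Big(\prod_{i=1}^{d} r_i\Big)^{1/d}, \qquad \log \mathrm{NSW}(\mathbf{r}) = \frac{1}{d} \sum_{i=1}^{d} \log r_i.$$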
    Measuring Competency of Machine Learning Systems and Enforcing Reliability. (arXiv:2212.01415v1 [cs.LG])
    We explore the impact of environmental conditions on the competency of machine learning agents and how real-time competency assessments improve the reliability of ML agents. We learn a representation of the conditions that impact the strategies and performance of the ML agent, enabling the determination of actions the agent can take to maintain operator expectations. We demonstrate this for a convolutional neural network that leverages visual imagery to aid the obstacle-avoidance task of a simulated self-driving vehicle.
    PROB: Probabilistic Objectness for Open World Object Detection. (arXiv:2212.01424v1 [cs.CV])
    Open World Object Detection (OWOD) is a new and challenging computer vision task that bridges the gap between classic object detection (OD) benchmarks and object detection in the real world. In addition to detecting and classifying seen/labeled objects, OWOD algorithms are expected to detect novel/unknown objects - which can be classified and incrementally learned. In standard OD, object proposals not overlapping with a labeled object are automatically classified as background. Therefore, simply applying OD methods to OWOD fails as unknown objects would be predicted as background. The challenge of detecting unknown objects stems from the lack of supervision in distinguishing unknown objects and background object proposals. Previous OWOD methods have attempted to overcome this issue by generating supervision using pseudo-labeling - however, unknown object detection has remained low. Probabilistic/generative models may provide a solution for this challenge. Herein, we introduce a novel probabilistic framework for objectness estimation, where we alternate between probability distribution estimation and objectness likelihood maximization of known objects in the embedded feature space - ultimately allowing us to estimate the objectness probability of different proposals. The resulting Probabilistic Objectness transformer-based open-world detector, PROB, integrates our framework into traditional object detection models, adapting them for the open-world setting. Comprehensive experiments on OWOD benchmarks show that PROB outperforms all existing OWOD methods in both unknown object detection ($\sim 2\times$ unknown recall) and known object detection ($\sim 10\%$ mAP). Our code will be made available upon publication at https://github.com/orrzohar/PROB.
    Downscaling Extreme Rainfall Using Physical-Statistical Generative Adversarial Learning. (arXiv:2212.01446v1 [physics.ao-ph])
    Modeling the risk of extreme weather events in a changing climate is essential for developing effective adaptation and mitigation strategies. Although the available low-resolution climate models capture different scenarios, accurate risk assessment for mitigation and adaptation often demands detail that they typically cannot resolve. Here, we develop a dynamic data-driven downscaling (super-resolution) method that incorporates physics and statistics in a generative framework to learn the fine-scale spatial details of rainfall. Our method transforms coarse-resolution ($0.25^{\circ} \times 0.25^{\circ}$) climate model outputs into high-resolution ($0.01^{\circ} \times 0.01^{\circ}$) rainfall fields while efficaciously quantifying uncertainty. Results indicate that the downscaled rainfall fields closely match observed spatial fields and their risk distributions.
    Reproducibility in Optimization: Theoretical Framework and Limits. (arXiv:2202.04598v4 [math.OC] UPDATED)
    We initiate a formal study of reproducibility in optimization. We define a quantitative measure of reproducibility of optimization procedures in the face of noisy or error-prone operations such as inexact or stochastic gradient computations or inexact initialization. We then analyze several convex optimization settings of interest such as smooth, non-smooth, and strongly-convex objective functions and establish tight bounds on the limits of reproducibility in each setting. Our analysis reveals a fundamental trade-off between computation and reproducibility: more computation is necessary (and sufficient) for better reproducibility.  ( 2 min )
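    One way to make such a measure concrete is a Monte Carlo estimate of the expected deviation between two runs that share an initialization but see independent gradient noise; a toy sketch on a quadratic objective (the objective, noise model, and step size are assumptions, not the paper's definition):

        import numpy as np

        rng = np.random.default_rng(0)

        def sgd_run(x0, steps=200, eta=0.05, noise=0.1):
            # SGD on f(x) = 0.5*||x||^2 with noisy gradient x + noise*z.
            x = x0.copy()
            for _ in range(steps):
                x -= eta * (x + noise * rng.normal(size=x.shape))
            return x

        # Reproducibility deviation: average distance between paired runs.
        x0 = rng.normal(size=5)
        devs = [np.linalg.norm(sgd_run(x0) - sgd_run(x0)) for _ in range(100)]
        print("mean deviation:", np.mean(devs))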
    Understanding DDPM Latent Codes Through Optimal Transport. (arXiv:2202.07477v2 [stat.ML] UPDATED)
    Diffusion models have recently outperformed alternative approaches to model the distribution of natural images, such as GANs. Such diffusion models allow for deterministic sampling via the probability flow ODE, giving rise to a latent space and an encoder map. While having important practical applications, such as estimation of the likelihood, the theoretical properties of this map are not yet fully understood. In the present work, we partially address this question for the popular case of the VP SDE (DDPM) approach. We show that, perhaps surprisingly, the DDPM encoder map coincides with the optimal transport map for common distributions; we support this claim theoretically and by extensive numerical experiments.  ( 2 min )
    Predicting Drug Repurposing Candidates and Their Mechanisms from A Biomedical Knowledge Graph. (arXiv:2212.01384v1 [q-bio.QM])
    Computational drug repurposing is a cost- and time-efficient method to identify new indications of approved or experimental drugs/compounds. It is especially critical for emerging and/or orphan diseases due to its cheaper investment and shorter research cycle compared with traditional wet-lab drug discovery approaches. However, the underlying mechanisms of action between repurposed drugs and their target diseases remain largely unknown, which is still an unsolved issue in existing repurposing methods. As such, computational drug repurposing has not been widely adopted in clinical settings. In this work, based on a massive biomedical knowledge graph, we propose a computational drug repurposing framework that not only predicts the treatment probabilities between drugs and diseases but also predicts the path-based, testable mechanisms of action (MOAs) as their biomedical explanations. Specifically, we utilize the GraphSAGE model in an unsupervised manner to integrate each entity's neighborhood information and employ a Random Forest model to predict the treatment probabilities between pairs of drugs and diseases. Moreover, we train an adversarial actor-critic reinforcement learning model to predict the potential MOA for explaining drug repurposing. To encourage the model to find biologically reasonable paths, we utilize the curated molecular interactions of drugs and a PubMed-publication-based concept distance to extract potential drug MOA paths from the knowledge graph as "demonstration paths" to guide the model during the process of path-finding. Comprehensive experiments and case studies show that the proposed framework outperforms state-of-the-art baselines in both predictive performance of drug repurposing and explanatory performance of recapitulating human-curated DrugMechDB-based paths.
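    A skeletal version of the treatment-probability stage (placeholder random vectors stand in for the unsupervised GraphSAGE embeddings; featurizing a pair by concatenation is an assumption):

        import numpy as np
        from sklearn.ensemble import RandomForestClassifier

        rng = np.random.default_rng(0)
        emb_dim, n_pairs = 64, 1000
        drug_emb = rng.normal(size=(n_pairs, emb_dim))      # stand-in embeddings
        disease_emb = rng.normal(size=(n_pairs, emb_dim))
        treats = rng.integers(0, 2, size=n_pairs)           # known labels

        # Featurize each (drug, disease) pair and predict a treatment probability.
        X = np.concatenate([drug_emb, disease_emb], axis=1)
        clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, treats)
        print(clf.predict_proba(X[:5])[:, 1])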
    Policy Learning for Active Target Tracking over Continuous SE(3) Trajectories. (arXiv:2212.01498v1 [cs.RO])
    This paper proposes a novel model-based policy gradient algorithm for tracking dynamic targets using a mobile robot, equipped with an onboard sensor with limited field of view. The task is to obtain a continuous control policy for the mobile robot to collect sensor measurements that reduce uncertainty in the target states, measured by the target distribution entropy. We design a neural network control policy with the robot $SE(3)$ pose and the mean vector and information matrix of the joint target distribution as inputs and attention layers to handle variable numbers of targets. We also derive the gradient of the target entropy with respect to the network parameters explicitly, allowing efficient model-based policy gradient optimization.
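    The optimization target here is the standard Gaussian identity: for a belief with information (inverse covariance) matrix $Y$, the differential entropy is $H = \frac{1}{2}\log\det(2\pi e\,Y^{-1})$. A small helper makes this explicit (the toy matrix is an assumption):

        import numpy as np

        def gaussian_entropy_from_information(Y):
            # H = 0.5 * (d * log(2*pi*e) - logdet Y) for information matrix Y.
            d = Y.shape[0]
            sign, logdet = np.linalg.slogdet(Y)
            assert sign > 0, "information matrix must be positive definite"
            return 0.5 * (d * np.log(2 * np.pi * np.e) - logdet)

        # Reducing target uncertainty means decreasing this entropy,
        # i.e., increasing logdet of the joint target information matrix.
        Y = np.diag([1.0, 4.0, 9.0])
        print(gaussian_entropy_from_information(Y))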
    LGGNet: Learning from Local-Global-Graph Representations for Brain-Computer Interface. (arXiv:2105.02786v3 [cs.NE] UPDATED)
    Neuropsychological studies suggest that co-operative activities among different brain functional areas drive high-level cognitive processes. To learn the brain activities within and among different functional areas of the brain, we propose LGGNet, a novel neurologically inspired graph neural network, to learn local-global-graph representations of electroencephalography (EEG) for Brain-Computer Interface (BCI). The input layer of LGGNet comprises a series of temporal convolutions with multi-scale 1D convolutional kernels and kernel-level attentive fusion. It captures temporal dynamics of EEG which then serves as input to the proposed local and global graph-filtering layers. Using a defined neurophysiologically meaningful set of local and global graphs, LGGNet models the complex relations within and among functional areas of the brain. Under the robust nested cross-validation settings, the proposed method is evaluated on three publicly available datasets for four types of cognitive classification tasks, namely, the attention, fatigue, emotion, and preference classification tasks. LGGNet is compared with state-of-the-art methods, such as DeepConvNet, EEGNet, R2G-STNN, TSception, RGNN, AMCNN-DGCN, HRNN and GraphNet. The results show that LGGNet outperforms these methods, and the improvements are statistically significant (p<0.05) in most cases, demonstrating that bringing neuroscience prior knowledge into neural network design yields improved classification performance. The source code can be found at https://github.com/yi-ding-cs/LGG  ( 2 min )
    RoSGAS: Adaptive Social Bot Detection with Reinforced Self-Supervised GNN Architecture Search. (arXiv:2206.06757v2 [cs.SI] UPDATED)
    Social bots are automated accounts on social networks that attempt to behave like humans. While Graph Neural Networks (GNNs) have been massively applied to the field of social bot detection, a huge amount of domain expertise and prior knowledge is engaged in the state-of-the-art approaches to design a dedicated neural network architecture for a specific classification task. Involving oversized nodes and network layers in the model design, however, usually causes the over-smoothing problem and a lack of embedding discrimination. In this paper, we propose RoSGAS, a novel Reinforced and Self-supervised GNN Architecture Search framework to adaptively pinpoint the most suitable multi-hop neighborhood and the number of layers in the GNN architecture. More specifically, we consider the social bot detection problem as a user-centric subgraph embedding and classification task. We exploit a heterogeneous information network to represent user connectivity by leveraging account metadata, relationships, behavioral features and content features. RoSGAS uses a multi-agent deep reinforcement learning (RL) mechanism for navigating the search of optimal neighborhood and network layers to learn individually the subgraph embedding for each target user. A nearest neighbor mechanism is developed for accelerating the RL training process, and RoSGAS can learn more discriminative subgraph embedding with the aid of self-supervised learning. Experiments on 5 Twitter datasets show that RoSGAS outperforms the state-of-the-art approaches in terms of accuracy, training efficiency and stability, and has better generalization when handling unseen samples.  ( 2 min )
    Singular Value Perturbation and Deep Network Optimization. (arXiv:2203.03099v4 [cs.LG] UPDATED)
    We develop new theoretical results on matrix perturbation to shed light on the impact of architecture on the performance of a deep network. In particular, we explain analytically what deep learning practitioners have long observed empirically: the parameters of some deep architectures (e.g., residual networks, ResNets, and Dense networks, DenseNets) are easier to optimize than others (e.g., convolutional networks, ConvNets). Building on our earlier work connecting deep networks with continuous piecewise-affine splines, we develop an exact local linear representation of a deep network layer for a family of modern deep networks that includes ConvNets at one end of a spectrum and ResNets, DenseNets, and other networks with skip connections at the other. For regression and classification tasks that optimize the squared-error loss, we show that the optimization loss surface of a modern deep network is piecewise quadratic in the parameters, with local shape governed by the singular values of a matrix that is a function of the local linear representation. We develop new perturbation results for how the singular values of matrices of this sort behave as we add a fraction of the identity and multiply by certain diagonal matrices. A direct application of our perturbation results explains analytically why a network with skip connections (such as a ResNet or DenseNet) is easier to optimize than a ConvNet: thanks to its more stable singular values and smaller condition number, the local loss surface of such a network is less erratic, less eccentric, and features local minima that are more accommodating to gradient-based optimization. Our results also shed new light on the impact of different nonlinear activation functions on a deep network's singular values, regardless of its architecture.  ( 3 min )
    Federated Deep Learning Meets Autonomous Vehicle Perception: Design and Verification. (arXiv:2206.01748v2 [cs.RO] UPDATED)
    Realizing human-like perception is a challenge in open driving scenarios due to corner cases and visual occlusions. To gather knowledge of rare and occluded instances, federated learning assisted connected autonomous vehicle (FLCAV) has been proposed, which leverages vehicular networks to establish federated deep neural networks (DNNs) from distributed data captured by vehicles and road sensors. Without the need for data aggregation, FLCAV preserves privacy while reducing communication costs compared with conventional centralized learning. However, it is challenging to determine the network resources and road sensor placements for multi-stage training with multi-modal datasets in multi-variant scenarios. This article presents networking and training frameworks for FLCAV perception. Multi-layer graph resource allocation and vehicle-road contrastive sensor placement are proposed to address the network management and sensor deployment problems, respectively. We also develop CarlaFLCAV, a software platform that implements the above system and methods. Experimental results confirm the superiority of the proposed techniques compared with various benchmarks.  ( 2 min )
    Deep Learning-Based Rate-Splitting Multiple Access for Reconfigurable Intelligent Surface-Aided Tera-Hertz Massive MIMO. (arXiv:2209.08456v3 [eess.SP] UPDATED)
    Reconfigurable intelligent surface (RIS) can significantly enhance the service coverage of Tera-Hertz massive multiple-input multiple-output (MIMO) communication systems. However, obtaining accurate high-dimensional channel state information (CSI) with limited pilot and feedback signaling overhead is challenging, severely degrading the performance of conventional spatial division multiple access. To improve the robustness against CSI imperfection, this paper proposes a deep learning (DL)-based rate-splitting multiple access (RSMA) scheme for RIS-aided Tera-Hertz multi-user MIMO systems. Specifically, we first propose a hybrid data-model driven DL-based RSMA precoding scheme, including the passive precoding at the RIS as well as the analog active precoding and the RSMA digital active precoding at the base station (BS). To realize the passive precoding at the RIS, we propose a Transformer-based data-driven RIS reflecting network (RRN). As for the analog active precoding at the BS, we propose a matched-filter-based analog precoding scheme considering that the BS and RIS adopt the LoS-MIMO antenna array architecture. As for the RSMA digital active precoding at the BS, we propose a low-complexity approximate weighted minimum mean square error (AWMMSE) digital precoding scheme. Furthermore, for better precoding performance as well as lower computational complexity, a model-driven deep unfolding active precoding network (DFAPN) is also designed by combining the proposed AWMMSE scheme with DL. Then, to acquire accurate CSI at the BS for the investigated RSMA precoding scheme to achieve higher spectral efficiency, we propose a CSI acquisition network (CAN) with low pilot and feedback signaling overhead, where the downlink pilot transmission, CSI feedback at the user equipments (UEs), and CSI reconstruction at the BS are modeled as an end-to-end neural network based on Transformer.  ( 3 min )
    A Trust Crisis In Simulation-Based Inference? Your Posterior Approximations Can Be Unfaithful. (arXiv:2110.06581v3 [stat.ML] UPDATED)
    We present extensive empirical evidence showing that current Bayesian simulation-based inference algorithms can produce computationally unfaithful posterior approximations. Our results show that all benchmarked algorithms -- (Sequential) Neural Posterior Estimation, (Sequential) Neural Ratio Estimation, Sequential Neural Likelihood and variants of Approximate Bayesian Computation -- can yield overconfident posterior approximations, which makes them unreliable for scientific use cases and falsificationist inquiry. Failing to address this issue may reduce the range of applicability of simulation-based inference. For this reason, we argue that research efforts should be made towards theoretical and methodological developments of conservative approximate inference algorithms and present research directions towards this objective. In this regard, we show empirical evidence that ensembling posterior surrogates provides more reliable approximations and mitigates the issue.  ( 2 min )
    d3rlpy: An Offline Deep Reinforcement Learning Library. (arXiv:2111.03788v2 [cs.LG] UPDATED)
    In this paper, we introduce d3rlpy, an open-sourced offline deep reinforcement learning (RL) library for Python. d3rlpy supports a set of offline deep RL algorithms as well as off-policy online algorithms via a fully documented plug-and-play API. To address a reproducibility issue, we conduct a large-scale benchmark with D4RL and Atari 2600 datasets to ensure implementation quality and provide experimental scripts and full tables of results. The d3rlpy source code can be found on GitHub: \url{https://github.com/takuseno/d3rlpy}.  ( 2 min )
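    A minimal usage sketch, assuming d3rlpy's v1-style interface (class names and fit() keyword arguments have changed across releases, so treat the exact calls as assumptions and consult the documentation for your installed version):

        from d3rlpy.datasets import get_cartpole
        from d3rlpy.algos import DiscreteCQL

        dataset, env = get_cartpole()        # small bundled offline dataset
        cql = DiscreteCQL()                  # offline RL algorithm, default config
        cql.fit(dataset, n_steps=10000)      # older versions use n_epochs instead
        print(cql.predict(dataset.observations[:1]))  # greedy action for one state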
    Intermediate Entity-based Sparse Interpretable Representation Learning. (arXiv:2212.01641v1 [cs.CL])
    Interpretable entity representations (IERs) are sparse embeddings that are "human-readable" in that dimensions correspond to fine-grained entity types and values are predicted probabilities that a given entity is of the corresponding type. These methods perform well in zero-shot and low supervision settings. Compared to standard dense neural embeddings, such interpretable representations may permit analysis and debugging. However, while fine-tuning sparse, interpretable representations improves accuracy on downstream tasks, it destroys the semantics of the dimensions which were enforced in pre-training. Can we maintain the interpretable semantics afforded by IERs while improving predictive performance on downstream tasks? Toward this end, we propose Intermediate enTity-based Sparse Interpretable Representation Learning (ItsIRL). ItsIRL realizes improved performance over prior IERs on biomedical tasks, while maintaining "interpretability" generally and their ability to support model debugging specifically. The latter is enabled in part by the ability to perform "counterfactual" fine-grained entity type manipulation, which we explore in this work. Finally, we propose a method to construct entity type based class prototypes for revealing global semantic properties of classes learned by our model.
    Learning-to-defer for sequential medical decision-making under uncertainty. (arXiv:2109.06312v2 [cs.LG] UPDATED)
    Learning-to-defer is a framework to automatically defer decision-making to a human expert when ML-based decisions are deemed unreliable. Existing learning-to-defer frameworks are not designed for sequential settings. That is, they defer at every instance independently, based on immediate predictions, while ignoring the potential long-term impact of these interventions. As a result, existing frameworks are myopic. Further, they do not defer adaptively, which is crucial when human interventions are costly. In this work, we propose Sequential Learning-to-Defer (SLTD), a framework for learning-to-defer to a domain expert in sequential decision-making settings. Contrary to existing literature, we pose the problem of learning-to-defer as model-based reinforcement learning (RL) to i) account for long-term consequences of ML-based actions using RL and ii) adaptively defer based on the dynamics (model-based). Our proposed framework determines whether to defer (at each time step) by quantifying whether a deferral now will improve the value compared to delaying deferral to the next time step. To quantify the improvement, we account for potential future deferrals. As a result, we learn a pre-emptive deferral policy (i.e. a policy that defers early if using the ML-based policy could worsen long-term outcomes). Our deferral policy is adaptive to the non-stationarity in the dynamics. We demonstrate that adaptive deferral via SLTD provides an improved trade-off between long-term outcomes and deferral frequency on synthetic, semi-synthetic, and real-world data with non-stationary dynamics. Finally, we interpret the deferral decision by decomposing the propagated (long-term) uncertainty around the outcome, to justify the deferral decision.  ( 2 min )
    Decentral and Incentivized Federated Learning Frameworks: A Systematic Literature Review. (arXiv:2205.07855v3 [cs.LG] UPDATED)
    The advent of Federated Learning (FL) has ignited a new paradigm for parallel and confidential decentralized Machine Learning (ML) with the potential of utilizing the computational power of a vast number of IoT, mobile and edge devices without data leaving the respective device, ensuring privacy by design. Yet, in order to scale this new paradigm beyond small groups of already entrusted entities towards mass adoption, the Federated Learning Framework (FLF) has to (i) become truly decentralized and (ii) incentivize participants. This is the first systematic literature review analyzing holistic FLFs in the domain of both decentralized and incentivized federated learning. 422 publications were retrieved by querying 12 major scientific databases, and 40 articles remained after a systematic review and filtering process for in-depth examination. Although having massive potential to direct the future of a more distributed and secure AI, none of the analyzed FLFs is production-ready. The approaches vary heavily in terms of use cases, system design, solved issues and thoroughness. We are the first to provide a systematic approach to classify and quantify differences between FLFs, exposing the limitations of current works and deriving future directions for research in this novel domain.  ( 2 min )
    Global memory transformer for processing long documents. (arXiv:2212.01650v1 [cs.CL])
    Transformer variants dominate the state of the art in natural language processing tasks such as translation, reading comprehension and summarization. Our paper focuses on general memory slots added to the model inputs and studies the effects of adding them, continuing prior work on the role of such slots. We consider two main tasks: 1) a pretraining task using masked language modeling and 2) a fine-tuning task using HotpotQA. This study aims to verify the ability of the proposed model to handle multiple input chunks as if they were a single chunk, compared with the base model. As the baseline we used the T5 transformer. We studied the role of the memory slots augmented to each input chunk and the model's performance without a selector. We found that adding memory to input chunks helped the proposed model to surpass the baseline on the masked language modeling task under specific training parameters. An ablation study shows that compressed input chunks can be used, at the cost of some performance degradation.
    Neural Improvement Heuristics for Graph Combinatorial Optimization Problems. (arXiv:2206.00383v2 [cs.AI] UPDATED)
    In recent years, methods based on deep neural networks, and especially Neural Improvement (NI) models, have led to a revolution in the field of combinatorial optimization. Given an instance of a graph-based problem and a candidate solution, they are able to propose a modification rule that improves its quality. However, existing NI approaches only consider node features and node-wise positional encodings to extract the instance and solution information, respectively. Thus, they are not suitable for problems where the essential information is encoded in the edges. In this paper, we present a NI model to solve graph-based problems where the information is stored either in the nodes, in the edges, or in both of them. We incorporate the NI model as a building block of hill-climbing-based algorithms to efficiently guide the selection of neighborhood operations considering the solution at that iteration. Conducted experiments show that the model is able to recommend neighborhood operations that are in the $99^{th}$ percentile for the Preference Ranking Problem. Moreover, when incorporated into hill-climbing algorithms, such as Iterated or Multi-start Local Search, the NI model systematically outperforms the conventional versions. Finally, we demonstrate the flexibility of the model by extending the application to two well-known problems: the Traveling Salesman Problem and the Graph Partitioning Problem.
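    The hill-climbing loop that the NI model plugs into can be sketched as follows, here on a toy TSP with the learned scorer replaced by the true 2-opt gain (the scorer, instance, and move set are placeholders for the paper's learned model):

        import numpy as np

        rng = np.random.default_rng(0)

        def tour_length(tour, D):
            return sum(D[tour[i], tour[(i + 1) % len(tour)]] for i in range(len(tour)))

        def ni_score(tour, i, j, D):
            # Stand-in for the Neural Improvement model: score a 2-opt move by
            # its true gain; a learned model would predict this from features.
            cand = tour.copy()
            cand[i:j] = cand[i:j][::-1]
            return tour_length(tour, D) - tour_length(cand, D)

        n = 20
        pts = rng.random((n, 2))
        D = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
        tour = np.arange(n)
        while True:                           # hill climbing guided by the scorer
            moves = [(i, j) for i in range(1, n - 1) for j in range(i + 2, n)]
            gains = [ni_score(tour, i, j, D) for (i, j) in moves]
            k = int(np.argmax(gains))
            if gains[k] <= 1e-12:
                break                         # local optimum reached
            i, j = moves[k]
            tour[i:j] = tour[i:j][::-1]
        print("tour length:", tour_length(tour, D))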
    Multi-trial Neural Architecture Search with Lottery Tickets. (arXiv:2203.04300v3 [cs.LG] UPDATED)
    Neural architecture search (NAS) has brought significant progress in recent image recognition tasks. Most existing NAS methods apply restricted search spaces, which limits the upper-bound performance of searched models. To address this issue, we propose a new search space named MobileNet3-MT. By reducing human-prior knowledge in omni dimensions of networks, MobileNet3-MT accommodates more potential candidates. For searching in this challenging search space, we present an efficient Multi-trial Evolution-based NAS method termed MENAS. Specifically, we accelerate the evolutionary search process by gradually pruning models in the population. Each model is trained with early stopping and replaced by its Lottery Ticket (the explored optimal pruned network). In this way, the full training pipeline of cumbersome networks is avoided and more efficient networks are automatically generated. Extensive experimental results on ImageNet-1K, CIFAR-10, and CIFAR-100 demonstrate that MENAS achieves state-of-the-art performance.  ( 2 min )
    FRAMED: An AutoML Approach for Structural Performance Prediction of Bicycle Frames. (arXiv:2201.10459v3 [cs.LG] UPDATED)
    This paper demonstrates how Automated Machine Learning (AutoML) methods can be used as effective surrogate models in engineering design problems. To do so, we consider the challenging problem of structurally-performant bicycle frame design and demonstrate across-the-board dominance by AutoML in regression and classification surrogate modeling tasks. We also introduce FRAMED -- a parametric dataset of 4500 bicycle frames based on bicycles designed by practitioners and enthusiasts worldwide. Accompanying these frame designs, we provide ten structural performance values such as weight, displacements under load, and safety factors computed using finite element simulations for all the bicycle frame designs. We formulate two challenging test problems: a performance-prediction regression problem and a feasibility-prediction classification problem. We then systematically search for optimal surrogate models using Bayesian hyperparameter tuning and neural architecture search. Finally, we show how a state-of-the-art AutoML method can be effective for both regression and classification problems. We demonstrate that the proposed AutoML models outperform the strongest gradient boosting and neural network surrogates identified through Bayesian optimization, improving the F1 score by 24\% for classification and reducing the mean absolute error by 12.5\% for regression. Our work introduces a dataset for bicycle design practitioners, provides two benchmark problems for surrogate modeling researchers, and demonstrates the advantages of AutoML in machine learning tasks. The dataset and code are provided at \url{this http URL}.  ( 2 min )
    Rule Generation for Classification: Scalability, Interpretability, and Fairness. (arXiv:2104.10751v2 [cs.LG] UPDATED)
    We introduce a new rule-based optimization method for classification with constraints. The proposed method takes advantage of linear programming and column generation, and hence, is scalable to large datasets. Moreover, the method returns a set of rules along with their optimal weights indicating the importance of each rule for learning. Through assigning cost coefficients to the rules and introducing additional constraints, we show that one can also consider interpretability and fairness of the results. We test the performance of the proposed method on a collection of datasets and present two case studies to elaborate its different aspects. Our results show that a good compromise between interpretability and fairness on the one side, and accuracy on the other side, can be obtained by the proposed rule-based learning method.  ( 2 min )
    Nearly Optimal Policy Optimization with Stable at Any Time Guarantee. (arXiv:2112.10935v3 [cs.LG] UPDATED)
    Policy optimization methods are one of the most widely used classes of Reinforcement Learning (RL) algorithms. However, theoretical understanding of these methods remains insufficient. Even in the episodic (time-inhomogeneous) tabular setting, the state-of-the-art theoretical result for policy-based methods in \citet{shani2020optimistic} is only $\tilde{O}(\sqrt{S^2AH^4K})$, where $S$ is the number of states, $A$ is the number of actions, $H$ is the horizon, and $K$ is the number of episodes, and there is a $\sqrt{SH}$ gap compared with the information-theoretic lower bound $\tilde{\Omega}(\sqrt{SAH^3K})$. To bridge such a gap, we propose a novel algorithm, Reference-based Policy Optimization with Stable at Any Time guarantee (RPO-SAT), which features the property "Stable at Any Time". We prove that our algorithm achieves $\tilde{O}(\sqrt{SAH^3K} + \sqrt{AH^4K})$ regret. When $S > H$, our algorithm is minimax optimal when ignoring logarithmic factors. To the best of our knowledge, RPO-SAT is the first computationally efficient, nearly minimax optimal policy-based algorithm for tabular RL.
    N-Omniglot, a large-scale neuromorphic dataset for spatio-temporal sparse few-shot learning. (arXiv:2112.13230v3 [cs.NE] UPDATED)
    Few-shot learning (learning with a few samples) is one of the most important cognitive abilities of the human brain. However, current artificial intelligence systems have difficulty achieving this ability. Similar challenges also exist for biologically plausible spiking neural networks (SNNs). Datasets for traditional few-shot learning domains provide little temporal information, and the absence of neuromorphic datasets has hindered the development of few-shot learning for SNNs. Here, to the best of our knowledge, we provide the first neuromorphic dataset for few-shot learning using SNNs: N-Omniglot, based on the Dynamic Vision Sensor. It contains 1,623 categories of handwritten characters, with only 20 samples per class. N-Omniglot fills the need for a neuromorphic dataset for SNNs with high sparseness and tremendous temporal coherence. Additionally, the dataset provides a powerful challenge and a suitable benchmark for developing SNN algorithms in the few-shot learning domain due to the chronological information of strokes. We also provide the improved nearest neighbor, convolutional network, SiameseNet, and meta-learning algorithms in spiking versions for verification.  ( 2 min )
    Understanding the Robustness of Multi-Exit Models under Common Corruptions. (arXiv:2212.01562v1 [cs.LG])
    Multi-Exit models (MEMs) use an early-exit strategy to improve the accuracy and efficiency of deep neural networks (DNNs) by allowing samples to exit the network before the last layer. However, the effectiveness of MEMs in the presence of distribution shifts remains largely unexplored. Our work examines how distribution shifts generated by common image corruptions affect the accuracy/efficiency of MEMs. We find that under common corruptions, early-exiting at the first correct exit reduces the inference cost and provides a significant boost in accuracy (~10%) over exiting at the last layer. However, with realistic early-exit strategies, which do not assume knowledge about the correct exits, MEMs still reduce inference cost but provide a marginal improvement in accuracy (1%) compared to exiting at the last layer. Moreover, the presence of distribution shift widens the gap between an MEM's maximum classification accuracy and realistic early-exit strategies by 5% on average compared with the gap on in-distribution data. Our empirical analysis shows that the lack of calibration due to a distribution shift increases the susceptibility of such early-exit strategies to exit early and increases misclassification rates. Furthermore, the lack of calibration increases the inconsistency in the predictions of the model across exits, leading to both inefficient inference and more misclassifications compared with evaluation on in-distribution data. Finally, we propose two metrics, underthinking and overthinking, that quantify the different behavior of practical early-exit strategies under distribution shifts, and provide insights into improving the practical utility of MEMs.
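    A realistic early-exit strategy of the kind studied here is confidence thresholding; a toy sketch with synthetic per-exit softmax outputs (the confidences, threshold, and depth-as-cost proxy are assumptions):

        import numpy as np

        rng = np.random.default_rng(0)
        n, k, exits = 1000, 10, 4
        logits = [rng.normal(scale=1.0 + e, size=(n, k)) for e in range(exits)]
        probs = [np.exp(l) / np.exp(l).sum(axis=1, keepdims=True) for l in logits]

        def early_exit(probs, tau):
            # Exit at the first layer whose max softmax confidence exceeds tau.
            preds = np.full(n, -1)
            depth = np.full(n, exits)                  # proxy for inference cost
            for e, p in enumerate(probs):
                conf, yhat = p.max(axis=1), p.argmax(axis=1)
                take = (preds == -1) & (conf >= tau)
                preds[take], depth[take] = yhat[take], e + 1
            last = preds == -1
            preds[last] = probs[-1][last].argmax(axis=1)   # fall back to last exit
            return preds, depth

        preds, depth = early_exit(probs, tau=0.6)
        print("mean exit depth:", depth.mean())   # miscalibration shifts this down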
    Multivariate Quantile Function Forecaster. (arXiv:2202.11316v2 [cs.LG] UPDATED)
    We propose Multivariate Quantile Function Forecaster (MQF$^2$), a global probabilistic forecasting method constructed using a multivariate quantile function and investigate its application to multi-horizon forecasting. Prior approaches are either autoregressive, implicitly capturing the dependency structure across time but exhibiting error accumulation with increasing forecast horizons, or multi-horizon sequence-to-sequence models, which do not exhibit error accumulation, but typically also do not model the dependency structure across time steps. MQF$^2$ combines the benefits of both approaches, by directly making predictions in the form of a multivariate quantile function, defined as the gradient of a convex function which we parametrize using input-convex neural networks. By design, the quantile function is monotone with respect to the input quantile levels and hence avoids quantile crossing. We provide two options to train MQF$^2$: with energy score or with maximum likelihood. Experimental results on real-world and synthetic datasets show that our model has comparable performance with state-of-the-art methods in terms of single time step metrics while capturing the time dependency structure.
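    The monotonicity argument is easy to see in a linear-quadratic special case: if the quantile function is the gradient of a convex potential, the map is monotone and quantile crossing is ruled out by construction (the quadratic potential below stands in for the input-convex network):

        import numpy as np

        rng = np.random.default_rng(0)
        d = 2
        # Convex potential f(u) = 0.5*u^T A u + b^T u with A positive definite;
        # its gradient A u + b is a monotone map, so quantiles cannot cross.
        L = rng.normal(size=(d, d))
        A = L @ L.T + np.eye(d)
        b = rng.normal(size=d)

        def quantile_function(u):
            # Push quantile levels u in (0,1)^d through grad f.
            return u @ A.T + b

        u = rng.uniform(size=(5, d))          # sampled quantile levels
        print(quantile_function(u))           # multivariate forecast values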
    Learning to Reverse DNNs from AI Programs Automatically. (arXiv:2205.10364v2 [cs.LG] UPDATED)
    With the private deployment of DNNs on edge devices, the security of on-device DNNs has raised significant concern. To quantify the model leakage risk of on-device DNNs automatically, we propose NNReverse, the first learning-based method which can reverse DNNs from AI programs without domain knowledge. NNReverse trains a representation model to represent the semantics of binary code for DNN layers. By searching for the most similar function in our database, NNReverse infers the layer type of a given function's binary code. To represent assembly instruction semantics precisely, NNReverse proposes a more fine-grained embedding model to represent the textual and structural semantics of assembly functions.
    Recursive Importance Sketching for Rank Constrained Least Squares: Algorithms and High-order Convergence. (arXiv:2011.08360v4 [math.OC] UPDATED)
    In this paper, we propose {\it \underline{R}ecursive} {\it \underline{I}mportance} {\it \underline{S}ketching} algorithm for {\it \underline{R}ank} constrained least squares {\it \underline{O}ptimization} (RISRO). The key step of RISRO is recursive importance sketching, a new sketching framework based on deterministically designed recursive projections, which significantly differs from the randomized sketching in the literature \citep{mahoney2011randomized,woodruff2014sketching}. Several existing algorithms in the literature can be reinterpreted under this new sketching framework and RISRO offers clear advantages over them. RISRO is easy to implement and computationally efficient, where the core procedure in each iteration is to solve a dimension-reduced least squares problem. We establish the local quadratic-linear and quadratic rate of convergence for RISRO under some mild conditions. We also discover a deep connection of RISRO to the Riemannian Gauss-Newton algorithm on fixed rank matrices. The effectiveness of RISRO is demonstrated in two applications in machine learning and statistics: low-rank matrix trace regression and phase retrieval. Simulation studies demonstrate the superior numerical performance of RISRO.  ( 2 min )
    Personalized Execution Time Optimization for the Scheduled Jobs. (arXiv:2203.06158v2 [cs.LG] UPDATED)
    Scheduled batch jobs are widely used on asynchronous computing platforms to execute various enterprise applications, including scheduled notifications and candidate pre-computation for modern recommender systems. It is important to deliver or update information to users at the right time to maintain the user experience and the execution impact. However, it is challenging to provide a versatile execution-time optimization solution for user-level scheduled jobs that satisfies various product scenarios while maintaining reasonable infrastructure resource consumption. In this paper, we describe how we apply a learning-to-rank approach plus a "best time policy" to best-time selection. In addition, we propose an ensemble learner that minimizes the ranking loss by efficiently leveraging multiple streams of user activity signals in our execution-time scheduling decisions. In particular, we observe cannibalization across use cases competing for the user's peak time slot and introduce a coordination system to mitigate the problem. Our optimization approach has been successfully tested with production traffic that serves billions of users per day, with statistically significant improvements in various product metrics, including notifications and content candidate generation. To the best of our knowledge, our study represents the first ML-based multi-tenant solution to the execution-time optimization problem for scheduled jobs at a large industrial scale, across different product domains.
    Joint graph learning from Gaussian observations in the presence of hidden nodes. (arXiv:2212.01816v1 [eess.SP])
    Graph learning problems are typically approached by focusing on learning the topology of a single graph when signals from all nodes are available. However, many contemporary setups involve multiple related networks and, moreover, it is often the case that only a subset of nodes is observed while the rest remain hidden. Motivated by this, we propose a joint graph learning method that takes into account the presence of hidden (latent) variables. Intuitively, the presence of the hidden nodes renders the inference task ill-posed and challenging to solve, so we overcome this detrimental influence by harnessing the similarity of the estimated graphs. To that end, we assume that the observed signals are drawn from a Gaussian Markov random field with latent variables and we carefully model the graph similarity among hidden (latent) nodes. Then, we exploit the structure resulting from the previous considerations to propose a convex optimization problem that solves the joint graph learning task by providing a regularized maximum likelihood estimator. Finally, we compare the proposed algorithm with different baselines and evaluate its performance over synthetic and real-world graphs.
    Euler Characteristic Curves and Profiles: a stable shape invariant for big data problems. (arXiv:2212.01666v1 [math.AT])
    Tools of Topological Data Analysis provide stable summaries encapsulating the shape of the considered data. Persistent homology, the most standard and well-studied data summary, suffers from a number of limitations: its computations are hard to distribute, it is hard to generalize to multifiltrations, and it is computationally prohibitive for big datasets. In this paper we study the concept of Euler Characteristic Curves, for one-parameter filtrations, and Euler Characteristic Profiles, for multi-parameter filtrations. While a weaker invariant in one dimension, Euler Characteristic based approaches do not possess some of the handicaps of persistent homology; we show efficient algorithms to compute them in a distributed way, their generalization to multifiltrations, and their practical applicability for big data problems. In addition we show that Euler Characteristic Curves and Profiles enjoy a certain type of stability, which makes them a robust tool in data analysis. Lastly, to show their practical applicability, multiple use cases are considered.
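    For a one-parameter sublevel-set filtration of a graph, the Euler Characteristic Curve is just #vertices(t) - #edges(t), which is trivially parallelizable over thresholds; a minimal sketch (the toy filtration values are assumptions):

        import numpy as np

        def euler_characteristic_curve(vert_f, edge_f, thresholds):
            # chi(t) = #{v : f(v) <= t} - #{e : f(e) <= t} for a graph filtration.
            return [int((vert_f <= t).sum()) - int((edge_f <= t).sum())
                    for t in thresholds]

        # Toy graph: 4 vertices joined into a 4-cycle as the threshold grows.
        vert_f = np.array([0.0, 0.1, 0.2, 0.3])
        edge_f = np.array([0.4, 0.5, 0.6, 0.7])
        print(euler_characteristic_curve(vert_f, edge_f, np.linspace(0, 1, 6)))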
    VNIbCReg: VICReg with Neighboring-Invariance and better-Covariance Evaluated on Non-stationary Seismic Signal Time Series. (arXiv:2204.02697v5 [cs.LG] UPDATED)
    One of the latest self-supervised learning (SSL) methods, VICReg, showed great performance in both the linear evaluation and the fine-tuning evaluation. VICReg was proposed in computer vision: it learns by pulling together representations of random crops of an image while maintaining the representation space via the variance and covariance losses. However, VICReg can be ineffective on non-stationary time series, where different parts/crops of the input should be encoded differently to account for the non-stationarity. Another recent SSL proposal, Temporal Neighborhood Coding (TNC), is effective for encoding non-stationary time series. This study shows that a combination of a VICReg-style method and TNC is very effective for SSL on non-stationary time series, where a non-stationary seismic signal time series is used as an evaluation dataset.  ( 2 min )
    Domain Constraints in Feature Space: Strengthening Robustness of Android Malware Detection against Realizable Adversarial Examples. (arXiv:2205.15128v2 [cs.LG] UPDATED)
    Strengthening the robustness of machine learning-based Android malware detectors in the real world requires incorporating realizable adversarial examples (RealAEs), i.e., AEs that satisfy the domain constraints of Android malware. However, existing work focuses on generating RealAEs in the problem space, which is known to be time-consuming and impractical for adversarial training. In this paper, we propose to generate RealAEs in the feature space, leading to a simpler and more efficient solution. Our approach is driven by a novel interpretation of Android malware properties in the feature space. More concretely, we extract feature-space domain constraints by learning meaningful feature dependencies from data and applying them by constructing a robust feature space. Our experiments on DREBIN, a well-known Android malware detector, demonstrate that our approach outperforms the state-of-the-art defense, Sec-SVM, against realistic gradient- and query-based attacks. Additionally, we demonstrate that generating feature-space RealAEs is faster than generating problem-space RealAEs, indicating its high applicability in adversarial training. We further validate the ability of our learned feature-space domain constraints in representing the Android malware properties by showing that (i) re-training detectors with our feature-space RealAEs largely improves model performance on similar problem-space RealAEs and (ii) using our feature-space domain constraints can help distinguish RealAEs from unrealizable AEs (unRealAEs).  ( 2 min )
    Approximating Full Conformal Prediction at Scale via Influence Functions. (arXiv:2202.01315v2 [cs.LG] UPDATED)
    Conformal prediction (CP) is a wrapper around traditional machine learning models, giving coverage guarantees under the sole assumption of exchangeability; in classification problems, for a chosen significance level $\varepsilon$, CP guarantees that the error rate is at most $\varepsilon$, irrespective of whether the underlying model is misspecified. However, the prohibitive computational costs of "full" CP led researchers to design scalable alternatives, which alas do not attain the same guarantees or statistical power of full CP. In this paper, we use influence functions to efficiently approximate full CP. We prove that our method is a consistent approximation of full CP, and empirically show that the approximation error becomes smaller as the training set increases; e.g., for $10^{3}$ training points the two methods output p-values that are $<10^{-3}$ apart: a negligible error for any practical application. Our methods enable scaling full CP to large real-world datasets. We compare our full CP approximation (ACP) to mainstream CP alternatives, and observe that our method is computationally competitive whilst enjoying the statistical predictive power of full CP.  ( 2 min )
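    For reference, the "full" CP procedure being approximated refits the model once per candidate label; a minimal regression sketch with ridge residuals as the nonconformity score (the model, score, and candidate grid are assumptions):

        import numpy as np
        from sklearn.linear_model import Ridge

        rng = np.random.default_rng(0)
        X = rng.normal(size=(50, 3))
        y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)
        x_new = rng.normal(size=(1, 3))

        def full_cp_pvalue(X, y, x_new, y_cand):
            # Refit on the augmented dataset, then rank the candidate's
            # nonconformity score (absolute residual) among all points.
            Xa, ya = np.vstack([X, x_new]), np.append(y, y_cand)
            model = Ridge(alpha=1.0).fit(Xa, ya)
            scores = np.abs(ya - model.predict(Xa))
            return (scores >= scores[-1]).mean()

        # Prediction set at significance 0.1: labels whose p-value exceeds 0.1.
        grid = np.linspace(-6, 6, 121)
        interval = [yc for yc in grid if full_cp_pvalue(X, y, x_new, yc) > 0.1]
        if interval:
            print(min(interval), max(interval))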
    A Deep Reinforcement Learning Approach for Solving the Traveling Salesman Problem with Drone. (arXiv:2112.12545v3 [math.OC] UPDATED)
    Reinforcement learning has recently shown promise in learning quality solutions in many combinatorial optimization problems. In particular, the attention-based encoder-decoder models show high effectiveness on various routing problems, including the Traveling Salesman Problem (TSP). Unfortunately, they perform poorly for the TSP with Drone (TSP-D), requiring routing a heterogeneous fleet of vehicles in coordination -- a truck and a drone. In TSP-D, the two vehicles are moving in tandem and may need to wait at a node for the other vehicle to join. State-less attention-based decoder fails to make such coordination between vehicles. We propose a hybrid model that uses an attention encoder and a Long Short-Term Memory (LSTM) network decoder, in which the decoder's hidden state can represent the sequence of actions made. We empirically demonstrate that such a hybrid model improves upon a purely attention-based model for both solution quality and computational efficiency. Our experiments on the min-max Capacitated Vehicle Routing Problem (mmCVRP) also confirm that the hybrid model is more suitable for the coordinated routing of multiple vehicles than the attention-based model. The proposed model demonstrates comparable results as the operations research baseline methods.  ( 2 min )
    Approximate Message Passing for Multi-Layer Estimation in Rotationally Invariant Models. (arXiv:2212.01572v1 [stat.ML])
    We consider the problem of reconstructing the signal and the hidden variables from observations coming from a multi-layer network with rotationally invariant weight matrices. The multi-layer structure models inference from deep generative priors, and the rotational invariance imposed on the weights generalizes the i.i.d.\ Gaussian assumption by allowing for a complex correlation structure, which is typical in applications. In this work, we present a new class of approximate message passing (AMP) algorithms and give a state evolution recursion which precisely characterizes their performance in the large system limit. In contrast with the existing multi-layer VAMP (ML-VAMP) approach, our proposed AMP -- dubbed multi-layer rotationally invariant generalized AMP (ML-RI-GAMP) -- provides a natural generalization beyond Gaussian designs, in the sense that it recovers the existing Gaussian AMP as a special case. Furthermore, ML-RI-GAMP exhibits a significantly lower complexity than ML-VAMP, as the computationally intensive singular value decomposition is replaced by an estimation of the moments of the design matrices. Finally, our numerical results show that this complexity gain comes at little to no cost in the performance of the algorithm.  ( 2 min )
    Gradient-Variation Bound for Online Convex Optimization with Constraints. (arXiv:2006.12455v3 [math.OC] UPDATED)
    We study online convex optimization with constraints consisting of multiple functional constraints and a relatively simple constraint set, such as a Euclidean ball. As enforcing the constraints at each time step through projections is computationally challenging in general, we allow decisions to violate the functional constraints but aim to achieve a low regret and cumulative violation of the constraints over a horizon of $T$ time steps. First-order methods achieve an $\mathcal{O}(\sqrt{T})$ regret and an $\mathcal{O}(1)$ constraint violation, which is the best-known bound under Slater's condition, but do not take into account the structural information of the problem. Furthermore, the existing algorithms and analysis are limited to Euclidean space. In this paper, we provide an \emph{instance-dependent} bound for online convex optimization with complex constraints obtained by a novel online primal-dual mirror-prox algorithm. Our instance-dependent regret is quantified by the total gradient variation $V_*(T)$ in the sequence of loss functions. The proposed algorithm works in \emph{general} normed spaces and simultaneously achieves an $\mathcal{O}(\sqrt{V_*(T)})$ regret and an $\mathcal{O}(1)$ constraint violation, which is never worse than the best-known $( \mathcal{O}(\sqrt{T}), \mathcal{O}(1) )$ result and improves over previous works that applied mirror-prox-type algorithms for this problem achieving $\mathcal{O}(T^{2/3})$ regret and constraint violation. Finally, our algorithm is computationally efficient, as it only performs mirror descent steps in each iteration instead of solving a general Lagrangian minimization problem.  ( 2 min )
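    A plain primal-dual variant (online gradient descent rather than the paper's mirror-prox, with an assumed linear loss and a single constraint) shows the regret/violation bookkeeping:

        import numpy as np

        rng = np.random.default_rng(0)
        T, eta = 1000, 0.05
        x, lam = np.zeros(2), 0.0
        x_star = np.array([0.5, 0.5])                 # feasible comparator (assumed)
        regret, violation = 0.0, 0.0

        for t in range(T):
            c = rng.normal(size=2)                    # loss f_t(x) = <c, x>
            grad_x = c + lam * np.ones(2)             # Lagrangian gradient in x
            x = np.clip(x - eta * grad_x, -1.0, 1.0)  # projection onto simple set
            lam = max(0.0, lam + eta * (x.sum() - 1.0))  # dual ascent on g(x) = sum(x) - 1
            regret += c @ (x - x_star)
            violation += max(0.0, x.sum() - 1.0)

        print("regret:", regret, "cumulative violation:", violation)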
    Adjoint-aided inference of Gaussian process driven differential equations. (arXiv:2202.04589v4 [stat.ML] UPDATED)
    Linear systems occur throughout engineering and the sciences, most notably as differential equations. In many cases the forcing function for the system is unknown, and interest lies in using noisy observations of the system to infer the forcing, as well as other unknown parameters. In differential equations, the forcing function is an unknown function of the independent variables (typically time and space), and can be modelled as a Gaussian process (GP). In this paper we show how the adjoint of a linear system can be used to efficiently infer forcing functions modelled as GPs, using a truncated basis expansion of the GP kernel. We show how exact conjugate Bayesian inference for the truncated GP can be achieved, in many cases with substantially lower computation than would be required using MCMC methods. We demonstrate the approach on systems of both ordinary and partial differential equations, and show that the basis expansion approach approximates well the true forcing with a modest number of basis vectors. Finally, we show how to infer point estimates for the non-linear model parameters, such as the kernel length-scales, using Bayesian optimisation.  ( 2 min )
    Quantifying the Effects of Data Augmentation. (arXiv:2202.09134v2 [cs.LG] UPDATED)
    We provide results that exactly quantify how data augmentation affects the convergence rate and variance of estimates. They lead to some unexpected findings: Contrary to common intuition, data augmentation may increase rather than decrease the uncertainty of estimates, such as the empirical prediction risk. Our main theoretical tool is a limit theorem for functions of randomly transformed, high-dimensional random vectors. The proof draws on work in probability on noise stability of functions of many variables. The pathological behavior we identify is not a consequence of complex models, but can occur even in the simplest settings -- one of our examples is a ridge regressor with two parameters. On the other hand, our results also show that data augmentation can have real, quantifiable benefits.  ( 2 min )
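    The phenomenon can be checked directly by Monte Carlo: compare the variance of a two-parameter ridge estimator with and without an invariance-consistent augmentation (the sign-flip augmentation and data model are assumptions chosen to mirror the paper's simple-setting message, not its exact example):

        import numpy as np

        rng = np.random.default_rng(0)

        def fit_ridge(X, y, lam=1.0):
            d = X.shape[1]
            return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

        def estimator_variance(augment, n=50, d=2, reps=500):
            ests = []
            for _ in range(reps):
                X = rng.normal(size=(n, d))
                y = X @ np.array([1.0, 0.0]) + rng.normal(size=n)
                if augment:   # add sign-flipped copies (valid for a linear model)
                    s = rng.choice([-1.0, 1.0], size=(n, 1))
                    X, y = np.vstack([X, s * X]), np.concatenate([y, s[:, 0] * y])
                ests.append(fit_ridge(X, y))
            return np.var(np.array(ests), axis=0)

        print("without augmentation:", estimator_variance(False))
        print("with augmentation:   ", estimator_variance(True))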
    Black-Box Testing of Deep Neural Networks through Test Case Diversity. (arXiv:2112.12591v4 [cs.SE] UPDATED)
    Deep Neural Networks (DNNs) have been extensively used in many areas including image processing, medical diagnostics, and autonomous driving. However, DNNs can exhibit erroneous behaviours that may lead to critical errors, especially when used in safety-critical systems. Inspired by testing techniques for traditional software systems, researchers have proposed neuron coverage criteria, as an analogy to source code coverage, to guide the testing of DNN models. Despite very active research on DNN coverage, several recent studies have questioned the usefulness of such criteria in guiding DNN testing. Further, from a practical standpoint, these criteria are white-box as they require access to the internals or training data of DNN models, which is in many contexts not feasible or convenient. In this paper, we investigate black-box input diversity metrics as an alternative to white-box coverage criteria. To this end, we first select and adapt three diversity metrics and study, in a controlled manner, their capacity to measure actual diversity in input sets. We then analyse their statistical association with fault detection using four datasets and five DNN models. We further compare diversity with state-of-the-art white-box coverage criteria. Our experiments show that relying on the diversity of image features embedded in test input sets is a more reliable indicator than coverage criteria to effectively guide the testing of DNNs. Indeed, we found that one of our selected black-box diversity metrics far outperforms existing coverage criteria in terms of fault-revealing capability and computational time. Results also confirm the suspicions that state-of-the-art coverage metrics are not adequate to guide the construction of test input sets to detect as many faults as possible with natural inputs.  ( 3 min )
    Sparta: Spatially Attentive and Adversarially Robust Activation. (arXiv:2105.08269v2 [cs.LG] UPDATED)
    Adversarial training (AT) is one of the most effective ways for improving the robustness of deep convolution neural networks (CNNs). Just like common network training, the effectiveness of AT relies on the design of basic network components. In this paper, we conduct an in-depth study on the role of the basic ReLU activation component in AT for robust CNNs. We find that the spatially-shared and input-independent properties of ReLU activation make CNNs less robust to white-box adversarial attacks with either standard or adversarial training. To address this problem, we extend ReLU to a novel Sparta activation function (Spatially attentive and Adversarially Robust Activation), which enables CNNs to achieve both higher robustness, i.e., lower error rate on adversarial examples, and higher accuracy, i.e., lower error rate on clean examples, than the existing state-of-the-art (SOTA) activation functions. We further study the relationship between Sparta and the SOTA activation functions, providing more insights about the advantages of our method. With comprehensive experiments, we also find that the proposed method exhibits superior cross-CNN and cross-dataset transferability. For the former, the adversarially trained Sparta function for one CNN (e.g., ResNet-18) can be fixed and directly used to train another adversarially robust CNN (e.g., ResNet-34). For the latter, the Sparta function trained on one dataset (e.g., CIFAR-10) can be employed to train adversarially robust CNNs on another dataset (e.g., SVHN). In both cases, Sparta leads to CNNs with higher robustness than the vanilla ReLU, verifying the flexibility and versatility of the proposed method.  ( 2 min )
    A Comprehensive Survey and Taxonomy on Single Image Dehazing Based on Deep Learning. (arXiv:2106.03323v4 [cs.CV] UPDATED)
    With the development of convolutional neural networks, hundreds of deep learning based dehazing methods have been proposed. In this paper, we provide a comprehensive survey on supervised, semi-supervised, and unsupervised single image dehazing. We first discuss the physical model, datasets, network modules, loss functions, and evaluation metrics that are commonly used. Then, the main contributions of various dehazing algorithms are categorized and summarized. Further, quantitative and qualitative experiments of various baseline methods are carried out. Finally, the unsolved issues and challenges that can inspire the future research are pointed out. A collection of useful dehazing materials is available at \url{https://github.com/Xiaofeng-life/AwesomeDehazing}.  ( 2 min )
    Understanding How Model Size Affects Few-shot Instruction Prompting. (arXiv:2212.01907v1 [cs.CL])
    Large Language Models are affected by the phenomena of memorizing and forgetting their training data. But how do these vary by model size? We work towards this question by investigating how the model size affects the model's ability to discriminate a word's meaning in a given context. We introduce a dataset called DeltaWords, which evaluates a model's ability to follow instructions to select a sentence which replaces the target word with its antonym. We show a weak inverse scaling trend, where task accuracy degrades as model size increases, under extremely few-shot prompting regimes. We show that increasing the number of examples tends to benefit larger models disproportionately more than smaller models.  ( 2 min )
    How Far Should We Look Back to Achieve Effective Real-Time Time-Series Anomaly Detection?. (arXiv:2102.06560v4 [cs.LG] UPDATED)
    Anomaly detection is the process of identifying unexpected events or abnormalities in data, and it has been applied in many different areas such as system monitoring, fraud detection, healthcare, intrusion detection, etc. Providing real-time, lightweight, and proactive anomaly detection for time series with neither human intervention nor domain knowledge could be highly valuable since it reduces human effort and enables appropriate countermeasures to be undertaken before a disastrous event occurs. To our knowledge, RePAD (Real-time Proactive Anomaly Detection algorithm) is a generic approach with all of the above-mentioned features. To achieve real-time and lightweight detection, RePAD utilizes Long Short-Term Memory (LSTM) to detect whether or not each upcoming data point is anomalous based on short-term historical data points. However, it is unclear how different amounts of historical data points affect the performance of RePAD. Therefore, in this paper, we investigate the impact of different amounts of historical data on RePAD by introducing a set of performance metrics that cover novel detection accuracy measures, time efficiency, readiness, resource consumption, etc. Empirical experiments based on real-world time series datasets are conducted to evaluate RePAD in different scenarios, and the experimental results are presented and discussed.  ( 2 min )
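    The look-back question can be posed on a toy detector that mimics RePAD's structure: predict each point from the last b points and flag it when the error exceeds an adaptive mu + 3*sigma threshold over recent errors (the moving-average predictor stands in for RePAD's LSTM; the series and window sizes are assumptions):

        import numpy as np

        rng = np.random.default_rng(0)
        series = np.sin(np.linspace(0, 20, 500)) + 0.05 * rng.normal(size=500)
        series[350] += 3.0                       # injected anomaly

        b = 10                                   # look-back window, the knob studied
        errors, anomalies = [], []
        for t in range(b, len(series)):
            pred = series[t - b:t].mean()        # stand-in for the LSTM predictor
            err = abs(series[t] - pred)
            if len(errors) >= b:                 # self-adaptive detection threshold
                mu, sd = np.mean(errors[-b:]), np.std(errors[-b:])
                if err > mu + 3 * sd:
                    anomalies.append(t)
            errors.append(err)

        print("flagged indices:", anomalies)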
    Layer-wise Analysis of a Self-supervised Speech Representation Model. (arXiv:2107.04734v3 [cs.CL] UPDATED)
    Recently proposed self-supervised learning approaches have been successful for pre-training speech representation models. The utility of these learned representations has been observed empirically, but not much has been studied about the type or extent of information encoded in the pre-trained representations themselves. Developing such insights can help understand the capabilities and limits of these models and enable the research community to more efficiently develop their usage for downstream applications. In this work, we begin to fill this gap by examining one recent and successful pre-trained model (wav2vec 2.0), via its intermediate representation vectors, using a suite of analysis tools. We use the metrics of canonical correlation, mutual information, and performance on simple downstream tasks with non-parametric probes, in order to (i) query for acoustic and linguistic information content, (ii) characterize the evolution of information across model layers, and (iii) understand how fine-tuning the model for automatic speech recognition (ASR) affects these observations. Our findings motivate modifying the fine-tuning protocol for ASR, which produces improved word error rates in a low-resource setting.  ( 2 min )
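    As a pointer to how such layer-wise comparisons are computed, the snippet below sketches one of the named metrics -- canonical correlation between the representations of two layers -- using a generic linear CCA; the paper's exact tooling (e.g., projection-weighted variants) may differ:

    ```python
    import numpy as np
    from sklearn.cross_decomposition import CCA

    def mean_cca_similarity(reps_a: np.ndarray, reps_b: np.ndarray,
                            n_components: int = 10) -> float:
        """Average canonical correlation between two representation matrices
        of shape (frames, dims), e.g. two wav2vec 2.0 layers on the same audio."""
        cca = CCA(n_components=n_components, max_iter=2000)
        za, zb = cca.fit_transform(reps_a, reps_b)
        corrs = [np.corrcoef(za[:, i], zb[:, i])[0, 1] for i in range(n_components)]
        return float(np.mean(corrs))
    ```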
    Zero-shot and Few-shot Learning with Knowledge Graphs: A Comprehensive Survey. (arXiv:2112.10006v6 [cs.LG] UPDATED)
    Machine learning, especially deep neural networks, has achieved great success, but many models rely on large numbers of labeled samples for supervision. As sufficient labeled training data are not always available, due to, e.g., continuously emerging prediction targets and costly sample annotation in real-world applications, machine learning with sample shortage is now being widely investigated. Among all these studies, many prefer to utilize auxiliary information, including that in the form of Knowledge Graphs (KGs), to reduce the reliance on labeled samples. In this survey, we have comprehensively reviewed over 90 papers about KG-aware research for two major sample-shortage settings -- zero-shot learning (ZSL), where some classes to be predicted have no labeled samples, and few-shot learning (FSL), where some classes to be predicted have only a small number of labeled samples available. We first introduce the KGs used in ZSL and FSL as well as their construction methods, and then systematically categorize and summarize KG-aware ZSL and FSL methods, dividing them into paradigms such as mapping-based, data-augmentation, propagation-based, and optimization-based methods. We next present different applications, including not only KG-augmented prediction tasks such as image classification, question answering, text classification, and knowledge extraction, but also KG completion tasks, and some typical evaluation resources for each task. We eventually discuss some challenges and open problems from different perspectives.  ( 3 min )
    Nonconvex Factorization and Manifold Formulations are Almost Equivalent in Low-rank Matrix Optimization. (arXiv:2108.01772v2 [math.OC] UPDATED)
    In this paper, we consider the geometric landscape connection of the widely studied manifold and factorization formulations in low-rank positive semidefinite (PSD) and general matrix optimization. We establish a sandwich relation on the spectrum of Riemannian and Euclidean Hessians at first-order stationary points (FOSPs). As a result, we obtain an equivalence on the sets of FOSPs, second-order stationary points (SOSPs), and strict saddles between the manifold and the factorization formulations. In addition, we show that the sandwich relation can be used to transfer more quantitative geometric properties from one formulation to another. Similarities and differences in the landscape connection under the PSD case and the general case are discussed. To the best of our knowledge, this is the first geometric landscape connection between the manifold and the factorization formulations for handling rank constraints, and it provides a geometric explanation for the similar empirical performance of factorization and manifold approaches in low-rank matrix optimization observed in the literature. For general low-rank matrix optimization, the landscape connection of two factorization formulations (unregularized and regularized) is also provided. By applying these geometric landscape connections, in particular the sandwich relation, we are able to settle unanswered questions in the literature and establish stronger results in applications to the geometric analysis of phase retrieval, well-conditioned low-rank matrix optimization, and the role of regularization in factorization arising in machine learning and signal processing.  ( 2 min )
    Eliciting Structural and Semantic Global Knowledge in Unsupervised Graph Contrastive Learning. (arXiv:2202.08480v3 [cs.LG] UPDATED)
    Graph Contrastive Learning (GCL) has recently drawn much research interest for learning generalizable node representations in a self-supervised manner. In general, the contrastive learning process in GCL is performed on top of the representations learned by a graph neural network (GNN) backbone, which transforms and propagates the node contextual information based on its local neighborhoods. However, nodes sharing similar characteristics may not always be close to each other in the graph, which poses a great challenge for unsupervised GCL efforts due to their inherent limitations in capturing such global graph knowledge. In this work, we address their inherent limitations by proposing a simple yet effective framework -- Simple Neural Networks with Structural and Semantic Contrastive Learning (S^3-CL). Notably, by virtue of the proposed structural and semantic contrastive learning algorithms, even a simple neural network can learn expressive node representations that preserve valuable global structural and semantic patterns. Our experiments demonstrate that the node representations learned by S^3-CL achieve superior performance on different downstream tasks compared with the state-of-the-art unsupervised GCL methods. Implementation and more experimental details are publicly available at \url{https://github.com/kaize0409/S-3-CL}.  ( 2 min )
    Distributed Policy Gradient with Variance Reduction in Multi-Agent Reinforcement Learning. (arXiv:2111.12961v3 [cs.MA] UPDATED)
    This paper studies distributed policy gradient in collaborative multi-agent reinforcement learning (MARL), where agents connected over a communication network aim to find the optimal policy maximizing the average of all agents' local returns. Because the performance function in policy gradient is non-concave, existing distributed stochastic optimization methods for convex problems cannot be directly applied to policy gradient in MARL. This paper proposes a distributed policy gradient method with variance reduction and gradient tracking to address the high variance of policy gradients, and utilizes importance weights to handle the distribution shift problem in the sampling process. We then provide an upper bound on the mean-squared stationary gap, which depends on the number of iterations, the mini-batch size, the epoch size, the problem parameters, and the network topology. We further establish the sample and communication complexity to obtain an $\epsilon$-approximate stationary point. Numerical experiments are performed to validate the effectiveness of the proposed algorithm.  ( 2 min )
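    For readers unfamiliar with the importance-weighting step mentioned above, the correction is the usual likelihood ratio between the current policy and the policy that generated the samples. A minimal sketch (the paper's estimator combines this with variance reduction and gradient tracking, which are not shown here):

    ```python
    import torch

    def trajectory_importance_weight(logp_current: torch.Tensor,
                                     logp_sampling: torch.Tensor) -> torch.Tensor:
        """Likelihood ratio prod_t pi_current(a_t|s_t) / pi_sampling(a_t|s_t),
        computed in log space for numerical stability. Inputs are per-step
        log-probabilities of the same trajectory under the two policies."""
        return torch.exp((logp_current - logp_sampling).sum())
    ```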
    Improving Calibration and Out-of-Distribution Detection in Medical Image Segmentation with Convolutional Neural Networks. (arXiv:2004.06569v3 [cs.CV] UPDATED)
    Convolutional Neural Networks (CNNs) have been shown to be powerful medical image segmentation models. In this study, we address some of the main unresolved issues regarding these models. Specifically, training these models on small medical image datasets is still challenging, with many studies promoting techniques such as transfer learning. Moreover, these models are infamous for producing over-confident predictions and for failing silently when presented with out-of-distribution (OOD) data at test time. In this paper, we advocate for multi-task learning, i.e., training a single model on several different datasets, spanning several different organs of interest and different imaging modalities. We show not only that a single CNN learns to automatically recognize the context and accurately segment the organ of interest in each context, but also that such a joint model often has more accurate and better-calibrated predictions than dedicated models trained separately on each dataset. Our experiments show that multi-task learning can outperform transfer learning in medical image segmentation tasks. For detecting OOD data, we propose a method based on spectral analysis of CNN feature maps. We show that different datasets, representing different imaging modalities and/or different organs of interest, have distinct spectral signatures, which can be used to identify whether or not a test image is similar to the images used to train a model. We show that this approach is far more accurate than OOD detection based on prediction uncertainty. The methods proposed in this paper contribute significantly to improving the accuracy and reliability of CNN-based medical image segmentation models.  ( 2 min )
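    The spectral-signature idea can be made concrete with a small sketch: summarize each feature map by its radially averaged power spectrum and compare test signatures to those seen in training. The paper's exact construction may differ; all names below are illustrative.

    ```python
    import numpy as np

    def spectral_signature(feature_maps: np.ndarray) -> np.ndarray:
        """Summarize CNN feature maps (channels, H, W) by radially averaged
        power spectra; OOD inputs are expected to yield unusual signatures."""
        signatures = []
        for fm in feature_maps:
            power = np.abs(np.fft.fftshift(np.fft.fft2(fm))) ** 2
            h, w = power.shape
            yy, xx = np.indices((h, w))
            radius = np.hypot(yy - h / 2, xx - w / 2).astype(int)
            counts = np.maximum(np.bincount(radius.ravel()), 1)  # avoid /0
            radial = np.bincount(radius.ravel(), weights=power.ravel()) / counts
            signatures.append(radial[: min(h, w) // 2])
        return np.concatenate(signatures)
    ```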
    Compositional Learning-based Planning for Vision POMDPs. (arXiv:2112.09456v2 [cs.AI] UPDATED)
    The Partially Observable Markov Decision Process (POMDP) is a powerful framework for capturing decision-making problems that involve state and transition uncertainty. However, most current POMDP planners cannot effectively handle high-dimensional image observations prevalent in real world applications, and often require lengthy online training that requires interaction with the environment. In this work, we propose Visual Tree Search (VTS), a compositional learning and planning procedure that combines generative models learned offline with online model-based POMDP planning. The deep generative observation models evaluate the likelihood of and predict future image observations in a Monte Carlo tree search planner. We show that VTS is robust to different types of image noises that were not present during training and can adapt to different reward structures without the need to re-train. This new approach significantly and stably outperforms several baseline state-of-the-art vision POMDP algorithms while using a fraction of the training time.  ( 2 min )
    Representation Ensembling for Synergistic Lifelong Learning with Quasilinear Complexity. (arXiv:2004.12908v15 [cs.AI] UPDATED)
    In lifelong learning, data are used to improve performance not only on the current task, but also on previously encountered, and as yet unencountered, tasks. In contrast, classical machine learning, which we define as starting from a blank slate (tabula rasa), uses data only for the single task at hand. While typical transfer learning algorithms can improve performance on future tasks, their performance on prior tasks degrades upon learning new tasks (called forgetting). Many recent approaches for continual or lifelong learning have attempted to maintain performance on old tasks given new tasks. But striving to avoid forgetting sets the goal unnecessarily low. The goal of lifelong learning should be not only to improve performance on future tasks (forward transfer) but also on past tasks (backward transfer) with any new data. Our key insight is that we can synergistically ensemble representations -- that were learned independently on disparate tasks -- to enable both forward and backward transfer. This generalizes ensembling decisions (as in decision forests) and complements ensembling dependently learned representations (as in multitask learning). Moreover, we can ensemble representations in quasilinear space and time. We demonstrate this insight with two algorithms: representation ensembles of (1) trees and (2) networks. Both algorithms demonstrate forward and backward transfer in a variety of simulated and benchmark data scenarios, including tabular, image, spoken, and adversarial tasks. This is in stark contrast to the reference algorithms we compared to, most of which failed to transfer either forward or backward, or both, even though many of them require quadratic space or time complexity.  ( 3 min )
    A Unified Survey on Anomaly, Novelty, Open-Set, and Out-of-Distribution Detection: Solutions and Future Challenges. (arXiv:2110.14051v5 [cs.CV] UPDATED)
    Machine learning models often encounter samples that diverge from the training distribution. Failure to recognize an out-of-distribution (OOD) sample, and consequently assigning that sample an in-distribution label, significantly compromises the reliability of a model. The problem has gained significant attention due to its importance for safely deploying models in open-world settings. Detecting OOD samples is challenging due to the intractability of modeling all possible unknown distributions. To date, several research domains tackle the problem of detecting unfamiliar samples, including anomaly detection, novelty detection, one-class learning, open set recognition, and out-of-distribution detection. Despite their similar and shared concepts, out-of-distribution, open-set, and anomaly detection have been investigated independently. Accordingly, these research avenues have not cross-pollinated, creating research barriers. While some surveys intend to provide an overview of these approaches, they tend to focus on a specific domain without examining the relationships between different domains. This survey aims to provide a cross-domain and comprehensive review of numerous eminent works in the respective areas while identifying their commonalities. Researchers can benefit from the overview of research advances in different fields and develop future methodology synergistically. Furthermore, to the best of our knowledge, while there are surveys on anomaly detection or one-class learning, there is no comprehensive or up-to-date survey on out-of-distribution detection, which our survey covers extensively. Finally, having a unified cross-domain perspective, we discuss and shed light on future lines of research, intending to bring these fields closer together.  ( 3 min )
    Automata Learning meets Shielding. (arXiv:2212.01838v1 [cs.LG])
    Safety is still one of the major research challenges in reinforcement learning (RL). In this paper, we address the problem of how to avoid safety violations of RL agents during exploration in probabilistic and partially unknown environments. Our approach combines automata learning for Markov Decision Processes (MDPs) and shield synthesis in an iterative procedure. Initially, the MDP representing the environment is unknown. The agent starts exploring the environment and collects traces. From the collected traces, we passively learn MDPs that abstractly represent the safety-relevant aspects of the environment. Given a learned MDP and a safety specification, we construct a shield. For each state-action pair within a learned MDP, the shield computes the exact probability that executing the action from the current state results in violating the specification within the next $k$ steps. After the shield is constructed, it is used during runtime and blocks any actions of the agent that induce too large a risk. The shielded agent continues to explore the environment and collects new data on the environment. Iteratively, we use the collected data to learn new MDPs with higher accuracy, resulting in turn in shields able to prevent more safety violations. We implemented our approach and present a detailed case study of a Q-learning agent exploring slippery Gridworlds. In our experiments, we show that as the agent explores more and more of the environment during training, the improved learned models lead to shields that are able to prevent many safety violations.  ( 2 min )
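    The runtime role of the shield is simple to state in code. A hedged sketch, where k_step_violation(state, action) stands in for the model-checking query on the learned MDP described above and delta is the designer's risk budget (both names are illustrative, not from the paper):

    ```python
    def shielded_actions(state, actions, k_step_violation, delta=0.1):
        """Keep only actions whose probability of violating the safety
        specification within the next k steps stays below delta."""
        safe = [a for a in actions if k_step_violation(state, a) <= delta]
        # If every action exceeds the budget, fall back to the least risky
        # one so the agent is never left without an action to execute.
        return safe or [min(actions, key=lambda a: k_step_violation(state, a))]
    ```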
    Deviance Matrix Factorization. (arXiv:2110.05674v2 [stat.ML] UPDATED)
    We investigate a general matrix factorization for deviance-based data losses, extending the ubiquitous singular value decomposition beyond squared error loss. While similar approaches have been explored before, our method leverages classical statistical methodology from generalized linear models (GLMs) and provides an efficient algorithm that is flexible enough to allow for structural zeros and entry weights. Moreover, by adapting results from GLM theory, we provide support for these decompositions by (i) showing strong consistency under the GLM setup, (ii) checking the adequacy of a chosen exponential family via a generalized Hosmer-Lemeshow test, and (iii) determining the rank of the decomposition via a maximum eigenvalue gap method. To further support our findings, we conduct simulation studies to assess robustness to decomposition assumptions and extensive case studies using benchmark datasets from image face recognition, natural language processing, network analysis, and biomedical studies. Our theoretical and empirical results indicate that the proposed decomposition is more flexible, general, and robust, and can thus provide improved performance when compared to similar methods. To facilitate applications, an R package with efficient model fitting and family and rank determination is also provided.  ( 2 min )
    Learning logic programs by discovering where not to search. (arXiv:2202.09806v2 [cs.LG] UPDATED)
    The goal of inductive logic programming (ILP) is to search for a hypothesis that generalises training examples and background knowledge (BK). To improve performance, we introduce an approach that, before searching for a hypothesis, first discovers where not to search. We use given BK to discover constraints on hypotheses, such as that a number cannot be both even and odd. We use the constraints to bootstrap a constraint-driven ILP system. Our experiments on multiple domains (including program synthesis and game playing) show that our approach can (i) substantially reduce learning times by up to 97%, and (ii) scale to domains with millions of facts.  ( 2 min )
    Simple and Efficient Heterogeneous Graph Neural Network. (arXiv:2207.02547v2 [cs.LG] UPDATED)
    Heterogeneous graph neural networks (HGNNs) have a powerful capability to embed the rich structural and semantic information of a heterogeneous graph into node representations. Existing HGNNs inherit many mechanisms from graph neural networks (GNNs) over homogeneous graphs, especially the attention mechanism and the multi-layer structure. These mechanisms bring excessive complexity, but few works study whether they are really effective on heterogeneous graphs. This paper conducts an in-depth and detailed study of these mechanisms and proposes the Simple and Efficient Heterogeneous Graph Neural Network (SeHGNN). To easily capture structural information, SeHGNN pre-computes the neighbor aggregation using a light-weight mean aggregator, which reduces complexity by removing overused neighbor attention and avoiding repeated neighbor aggregation in every training epoch. To better utilize semantic information, SeHGNN adopts a single-layer structure with long metapaths to extend the receptive field, as well as a transformer-based semantic fusion module to fuse features from different metapaths. As a result, SeHGNN exhibits a simple network structure, high prediction accuracy, and fast training speed. Extensive experiments on five real-world heterogeneous graphs demonstrate the superiority of SeHGNN over the state of the art in both accuracy and training speed.  ( 2 min )
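    The pre-computation trick is easy to illustrate. Assuming row-normalized metapath adjacency matrices, the neighbor mean aggregation reduces to one sparse matrix product done once before training, rather than a per-epoch attention pass (a minimal sketch, not SeHGNN's full pipeline):

    ```python
    import torch

    def precompute_metapath_features(metapath_adjs, features):
        """One-shot mean aggregation per metapath.

        metapath_adjs: list of row-normalized sparse (N, N) adjacencies,
        one per metapath; features: dense (N, d) node feature matrix.
        The outputs can be fed to a semantic fusion module at train time.
        """
        return [torch.sparse.mm(adj, features) for adj in metapath_adjs]
    ```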
    Bayesian Active Meta-Learning for Few Pilot Demodulation and Equalization. (arXiv:2108.00785v3 [cs.LG] UPDATED)
    Two of the main principles underlying the life cycle of an artificial intelligence (AI) module in communication networks are adaptation and monitoring. Adaptation refers to the need to adjust the operation of an AI module depending on the current conditions, while monitoring requires measures of the reliability of an AI module's decisions. Classical frequentist learning methods for the design of AI modules fall short on both counts of adaptation and monitoring, catering to one-off training and providing overconfident decisions. This paper proposes a solution to address both challenges by integrating meta-learning with Bayesian learning. As a specific use case, the problems of demodulation and equalization over a fading channel based on the availability of few pilots are studied. Meta-learning processes pilot information from multiple frames in order to extract useful shared properties of effective demodulators across frames. The resulting trained demodulators are demonstrated, via experiments, to offer better calibrated soft decisions, at the computational cost of running an ensemble of networks at run time. The capacity to quantify uncertainty in the model parameter space is further leveraged by extending Bayesian meta-learning to an active setting. In it, the designer can select, in a sequential fashion, channel conditions under which to generate data for meta-learning from a channel simulator. Bayesian active meta-learning is seen in experiments to significantly reduce the number of frames required to obtain an efficient adaptation procedure for new frames.  ( 2 min )
    Counterfactual Fairness Is Basically Demographic Parity. (arXiv:2208.03843v2 [cs.LG] UPDATED)
    Making fair decisions is crucial to ethically implementing machine learning algorithms in social settings. In this work, we consider the celebrated definition of counterfactual fairness [Kusner et al., NeurIPS, 2017]. We begin by showing that an algorithm which satisfies counterfactual fairness also satisfies demographic parity, a far simpler fairness constraint. Similarly, we show that all algorithms satisfying demographic parity can be trivially modified to satisfy counterfactual fairness. Together, our results indicate that counterfactual fairness is basically equivalent to demographic parity, which has important implications for the growing body of work on counterfactual fairness. We then validate our theoretical findings empirically, analyzing three existing algorithms for counterfactual fairness against three simple benchmarks. We find that two simple benchmark algorithms outperform all three existing algorithms -- in terms of fairness, accuracy, and efficiency -- on several data sets. Our analysis leads us to formalize a concrete fairness goal: to preserve the order of individuals within protected groups. We believe transparency around the ordering of individuals within protected groups makes fair algorithms more trustworthy. By design, the two simple benchmark algorithms satisfy this goal while the existing algorithms for counterfactual fairness do not.  ( 2 min )
  • Open

    Quantifying the Effects of Data Augmentation. (arXiv:2202.09134v2 [cs.LG] UPDATED)
    We provide results that exactly quantify how data augmentation affects the convergence rate and variance of estimates. They lead to some unexpected findings: Contrary to common intuition, data augmentation may increase rather than decrease the uncertainty of estimates, such as the empirical prediction risk. Our main theoretical tool is a limit theorem for functions of randomly transformed, high-dimensional random vectors. The proof draws on work in probability on noise stability of functions of many variables. The pathological behavior we identify is not a consequence of complex models, but can occur even in the simplest settings -- one of our examples is a ridge regressor with two parameters. On the other hand, our results also show that data augmentation can have real, quantifiable benefits.
    Concentration inequalities and optimal number of layers for stochastic deep neural networks. (arXiv:2206.11241v3 [cs.LG] UPDATED)
    We state concentration inequalities for the output of the hidden layers of a stochastic deep neural network (SDNN), as well as for the output of the whole SDNN. These results allow us to introduce an expected classifier (EC) and to give a probabilistic upper bound for the classification error of the EC. We also determine the optimal number of layers for the SDNN via an optimal stopping procedure. We apply our analysis to a stochastic version of a feedforward neural network with ReLU activation function.
    Understanding DDPM Latent Codes Through Optimal Transport. (arXiv:2202.07477v2 [stat.ML] UPDATED)
    Diffusion models have recently outperformed alternative approaches to model the distribution of natural images, such as GANs. Such diffusion models allow for deterministic sampling via the probability flow ODE, giving rise to a latent space and an encoder map. While having important practical applications, such as estimation of the likelihood, the theoretical properties of this map are not yet fully understood. In the present work, we partially address this question for the popular case of the VP SDE (DDPM) approach. We show that, perhaps surprisingly, the DDPM encoder map coincides with the optimal transport map for common distributions; we support this claim theoretically and by extensive numerical experiments.
    Simultaneous approximation of a smooth function and its derivatives by deep neural networks with piecewise-polynomial activations. (arXiv:2206.09527v2 [math.NA] CROSS LISTED)
    This paper investigates the approximation properties of deep neural networks with piecewise-polynomial activation functions. We derive the required depth, width, and sparsity of a deep neural network to approximate any H\"{o}lder smooth function up to a given approximation error in H\"{o}lder norms in such a way that all weights of this neural network are bounded by $1$. The latter feature is essential to control generalization errors in many statistical and machine learning applications.
    VNIbCReg: VICReg with Neighboring-Invariance and better-Covariance Evaluated on Non-stationary Seismic Signal Time Series. (arXiv:2204.02697v5 [cs.LG] UPDATED)
    One of the latest self-supervised learning (SSL) methods, VICReg, has shown strong performance in both linear and fine-tuning evaluations. However, VICReg was proposed for computer vision: it learns by pulling together representations of random crops of an image while maintaining the representation space via variance and covariance losses. This makes VICReg ill-suited to non-stationary time series, where different parts/crops of the input should be encoded differently to account for the non-stationarity. Another recent SSL proposal, Temporal Neighborhood Coding (TNC), is effective for encoding non-stationary time series. This study shows that a combination of a VICReg-style method and TNC is very effective for SSL on non-stationary time series, using a non-stationary seismic signal time series as the evaluation dataset.
    Reconstructing Training Data from Trained Neural Networks. (arXiv:2206.07758v3 [cs.LG] UPDATED)
    Understanding to what extent neural networks memorize training data is an intriguing question with practical and theoretical implications. In this paper we show that in some cases a significant fraction of the training data can in fact be reconstructed from the parameters of a trained neural network classifier. We propose a novel reconstruction scheme that stems from recent theoretical results about the implicit bias in training neural networks with gradient-based methods. To the best of our knowledge, our results are the first to show that reconstructing a large portion of the actual training samples from a trained neural network classifier is generally possible. This has negative implications for privacy, as it can be used as an attack for revealing sensitive training data. We demonstrate our method for binary MLP classifiers on a few standard computer vision datasets.
    A Generalist Neural Algorithmic Learner. (arXiv:2209.11142v2 [cs.LG] UPDATED)
    The cornerstone of neural algorithmic reasoning is the ability to solve algorithmic tasks, especially in a way that generalises out of distribution. While recent years have seen a surge in methodological improvements in this area, they mostly focused on building specialist models. Specialist models are capable of learning to neurally execute either only one algorithm or a collection of algorithms with identical control-flow backbone. Here, instead, we focus on constructing a generalist neural algorithmic learner -- a single graph neural network processor capable of learning to execute a wide range of algorithms, such as sorting, searching, dynamic programming, path-finding and geometry. We leverage the CLRS benchmark to empirically show that, much like recent successes in the domain of perception, generalist algorithmic learners can be built by "incorporating" knowledge. That is, it is possible to effectively learn algorithms in a multi-task manner, so long as we can learn to execute them well in a single-task regime. Motivated by this, we present a series of improvements to the input representation, training regime and processor architecture over CLRS, improving average single-task performance by over 20% from prior art. We then conduct a thorough ablation of multi-task learners leveraging these improvements. Our results demonstrate a generalist learner that effectively incorporates knowledge captured by specialist models.
    Diffusion Models for Graphs Benefit From Discrete State Spaces. (arXiv:2210.01549v2 [cs.LG] UPDATED)
    Denoising diffusion probabilistic models and score matching models have proven to be very powerful for generative tasks. While these approaches have also been applied to the generation of discrete graphs, they have, so far, relied on continuous Gaussian perturbations. Instead, in this work, we suggest using discrete noise for the forward Markov process. This ensures that the graph remains discrete at every intermediate step. Compared to the previous approach, our experimental results on four datasets and multiple architectures show that using a discrete noising process yields higher-quality generated samples, as indicated by average MMDs reduced by a factor of 1.5. Furthermore, the number of denoising steps is reduced from 1000 to 32, leading to a 30-times-faster sampling procedure.
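    To illustrate what a discrete forward process can look like for graphs, here is a hedged sketch that independently flips edge indicators with some probability, so each intermediate graph stays discrete; the paper's actual transition matrices may be parameterized differently:

    ```python
    import torch

    def discrete_edge_noise(adj: torch.Tensor, flip_prob: float) -> torch.Tensor:
        """One forward noising step on a 0/1 adjacency matrix (N, N): flip each
        undirected edge indicator independently with probability flip_prob.
        Self-loops are ignored and the result is re-symmetrized."""
        upper = torch.triu(torch.ones_like(adj), diagonal=1).bool()
        flips = (torch.rand(adj.shape) < flip_prob) & upper
        noisy = adj.clone()
        noisy[flips] = 1 - noisy[flips]
        noisy = torch.triu(noisy, diagonal=1)
        return noisy + noisy.T
    ```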
    Adjoint-aided inference of Gaussian process driven differential equations. (arXiv:2202.04589v4 [stat.ML] UPDATED)
    Linear systems occur throughout engineering and the sciences, most notably as differential equations. In many cases the forcing function for the system is unknown, and interest lies in using noisy observations of the system to infer the forcing, as well as other unknown parameters. In differential equations, the forcing function is an unknown function of the independent variables (typically time and space), and can be modelled as a Gaussian process (GP). In this paper we show how the adjoint of a linear system can be used to efficiently infer forcing functions modelled as GPs, using a truncated basis expansion of the GP kernel. We show how exact conjugate Bayesian inference for the truncated GP can be achieved, in many cases with substantially lower computation than would be required using MCMC methods. We demonstrate the approach on systems of both ordinary and partial differential equations, and show that the basis expansion approach approximates well the true forcing with a modest number of basis vectors. Finally, we show how to infer point estimates for the non-linear model parameters, such as the kernel length-scales, using Bayesian optimisation.
    Multivariate Quantile Function Forecaster. (arXiv:2202.11316v2 [cs.LG] UPDATED)
    We propose Multivariate Quantile Function Forecaster (MQF$^2$), a global probabilistic forecasting method constructed using a multivariate quantile function and investigate its application to multi-horizon forecasting. Prior approaches are either autoregressive, implicitly capturing the dependency structure across time but exhibiting error accumulation with increasing forecast horizons, or multi-horizon sequence-to-sequence models, which do not exhibit error accumulation but also typically do not model the dependency structure across time steps. MQF$^2$ combines the benefits of both approaches, by directly making predictions in the form of a multivariate quantile function, defined as the gradient of a convex function which we parametrize using input-convex neural networks. By design, the quantile function is monotone with respect to the input quantile levels and hence avoids quantile crossing. We provide two options to train MQF$^2$: with energy score or with maximum likelihood. Experimental results on real-world and synthetic datasets show that our model has comparable performance with state-of-the-art methods in terms of single time step metrics while capturing the time dependency structure.
    Nearly Optimal Policy Optimization with Stable at Any Time Guarantee. (arXiv:2112.10935v3 [cs.LG] UPDATED)
    Policy optimization methods are one of the most widely used classes of Reinforcement Learning (RL) algorithms. However, theoretical understanding of these methods remains insufficient. Even in the episodic (time-inhomogeneous) tabular setting, the state-of-the-art theoretical result for policy-based methods in \citet{shani2020optimistic} is only $\tilde{O}(\sqrt{S^2AH^4K})$, where $S$ is the number of states, $A$ is the number of actions, $H$ is the horizon, and $K$ is the number of episodes, and there is a $\sqrt{SH}$ gap compared with the information-theoretic lower bound $\tilde{\Omega}(\sqrt{SAH^3K})$. To bridge such a gap, we propose a novel algorithm, Reference-based Policy Optimization with Stable at Any Time guarantee (\algnameacro), which features the property "Stable at Any Time". We prove that our algorithm achieves $\tilde{O}(\sqrt{SAH^3K} + \sqrt{AH^4K})$ regret. When $S > H$, our algorithm is minimax optimal when ignoring logarithmic factors. To the best of our knowledge, RPO-SAT is the first computationally efficient, nearly minimax optimal policy-based algorithm for tabular RL.
    A Tutorial on Sparse Gaussian Processes and Variational Inference. (arXiv:2012.13962v13 [cs.LG] UPDATED)
    Gaussian processes (GPs) provide a framework for Bayesian inference that can offer principled uncertainty estimates for a large range of problems. For example, if we consider regression problems with Gaussian likelihoods, a GP model enjoys a posterior in closed form. However, identifying the posterior GP scales cubically with the number of training examples and requires storing all examples in memory. In order to overcome these obstacles, sparse GPs have been proposed that approximate the true posterior GP with pseudo-training examples. Importantly, the number of pseudo-training examples is user-defined and enables control over computational and memory complexity. In the general case, sparse GPs do not enjoy closed-form solutions and one has to resort to approximate inference. In this context, a convenient choice for approximate inference is variational inference (VI), where the problem of Bayesian inference is cast as an optimization problem -- namely, to maximize a lower bound of the log marginal likelihood. This paves the way for a powerful and versatile framework, where pseudo-training examples are treated as optimization arguments of the approximate posterior that are jointly identified together with hyperparameters of the generative model (i.e., prior and likelihood). The framework can naturally handle a wide scope of supervised learning problems, ranging from regression with heteroscedastic and non-Gaussian likelihoods to classification problems with discrete labels, as well as problems with multidimensional labels. The purpose of this tutorial is to provide access to the basic matter for readers without prior knowledge in both GPs and VI. A proper exposition of the subject also enables access to more recent advances (like importance-weighted VI as well as interdomain, multioutput and deep GPs) that can serve as an inspiration for new research ideas.
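    For orientation, the variational objective at the heart of this framework can be stated compactly. Assuming a likelihood that factorizes over the $n$ training examples and writing $\mathbf{u}$ for the function values at the pseudo-training examples (the tutorial's own notation may differ), the bound being maximized is
    \[
    \mathcal{L}(q) = \sum_{i=1}^{n} \mathbb{E}_{q(f_i)}\big[\log p(y_i \mid f_i)\big] - \mathrm{KL}\big(q(\mathbf{u}) \,\|\, p(\mathbf{u})\big) \le \log p(\mathbf{y}), \qquad q(f_i) = \int p(f_i \mid \mathbf{u})\, q(\mathbf{u})\, \mathrm{d}\mathbf{u},
    \]
    so the pseudo-training examples enter as arguments of $q(\mathbf{u})$ and are optimized jointly with the prior and likelihood hyperparameters through this single objective.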
    General Cutting Planes for Bound-Propagation-Based Neural Network Verification. (arXiv:2208.05740v2 [cs.LG] UPDATED)
    Bound propagation methods, when combined with branch and bound, are among the most effective methods to formally verify properties of deep neural networks such as correctness, robustness, and safety. However, existing works cannot handle the general form of cutting plane constraints widely accepted in traditional solvers, which are crucial for strengthening verifiers with tightened convex relaxations. In this paper, we generalize the bound propagation procedure to allow the addition of arbitrary cutting plane constraints, including those involving relaxed integer variables that do not appear in existing bound propagation formulations. Our generalized bound propagation method, GCP-CROWN, opens up the opportunity to apply general cutting plane methods for neural network verification while benefiting from the efficiency and GPU acceleration of bound propagation methods. As a case study, we investigate the use of cutting planes generated by off-the-shelf mixed integer programming (MIP) solver. We find that MIP solvers can generate high-quality cutting planes for strengthening bound-propagation-based verifiers using our new formulation. Since the branching-focused bound propagation procedure and the cutting-plane-focused MIP solver can run in parallel utilizing different types of hardware (GPUs and CPUs), their combination can quickly explore a large number of branches with strong cutting planes, leading to strong verification performance. Experiments demonstrate that our method is the first verifier that can completely solve the oval20 benchmark and verify twice as many instances on the oval21 benchmark compared to the best tool in VNN-COMP 2021, and also noticeably outperforms state-of-the-art verifiers on a wide range of benchmarks. GCP-CROWN is part of the $\alpha,\!\beta$-CROWN verifier, the VNN-COMP 2022 winner. Code is available at this http URL
    Combinatorial Causal Bandits. (arXiv:2206.01995v4 [cs.LG] UPDATED)
    In combinatorial causal bandits (CCB), the learning agent chooses at most $K$ variables in each round to intervene on, collects feedback from the observed variables, and aims to minimize expected regret on the target variable $Y$. We study CCB in the context of binary generalized linear models (BGLMs) with a succinct parametric representation of the causal models. We present the algorithm BGLM-OFU for Markovian BGLMs (i.e., with no hidden variables) based on the maximum likelihood estimation method, and show that it achieves $O(\sqrt{T}\log T)$ regret, where $T$ is the time horizon. For the special case of linear models with hidden variables, we apply causal inference techniques such as the do-calculus to convert the original model into a Markovian model, and then show that our BGLM-OFU algorithm and another algorithm based on linear regression both solve such linear models with hidden variables. Our novelty includes (a) considering the combinatorial intervention action space and general causal models including ones with hidden variables, (b) integrating and adapting techniques from diverse studies such as generalized linear bandits and online influence maximization, and (c) avoiding unrealistic assumptions (such as knowing the joint distribution of the parents of $Y$ under all interventions) and regret factors exponential in the causal graph size found in prior studies.
    Scalable Spectral Clustering with Group Fairness Constraints. (arXiv:2210.16435v2 [cs.LG] UPDATED)
    There are synergies of research interest and industrial effort in modeling fairness and correcting algorithmic bias in machine learning. In this paper, we present a scalable algorithm for spectral clustering (SC) with group fairness constraints. Group fairness is also known as statistical parity, where in each cluster each protected group is represented with the same proportion as in the entirety. While the FairSC algorithm (Kleindessner et al., 2019) is able to find a fairer clustering, it is compromised by the high cost of explicitly computing nullspaces and square roots of dense matrices. We present a new formulation of the underlying spectral computation by incorporating nullspace projection and Hotelling's deflation such that the resulting algorithm, called s-FairSC, involves only sparse matrix-vector products and is able to fully exploit the sparsity of the fair SC model. The experimental results on the modified stochastic block model demonstrate that s-FairSC is comparable with FairSC in recovering fair clustering, while being sped up by a factor of 12 for moderate model sizes. s-FairSC is further demonstrated to be scalable in the sense that its computational cost increases only marginally compared to SC without fairness constraints.
    Kernel Methods for Multistage Causal Inference: Mediation Analysis and Dynamic Treatment Effects. (arXiv:2111.03950v3 [stat.ME] UPDATED)
    We propose simple estimators for mediation analysis and dynamic treatment effects over short horizons based on kernel ridge regression. We study both nonparametric response curves and semiparametric treatment effects, allowing treatments, mediators, and covariates to be continuous or discrete in general spaces. Our key innovation is a new RKHS technique called sequential mean embedding, which facilitates the construction of simple estimators for complex causal estimands, including new estimands without existing alternatives. In particular, we propose machine learning estimators of dynamic dose response curves and dynamic counterfactual distributions without restrictive linearity, Markov, or no-effect-modification assumptions. Our simple estimators preserve the generality of classic identification while also achieving nonasymptotic uniform rates for causal functions and semiparametric efficiency for causal scalars. In nonlinear simulations with many covariates, we demonstrate state-of-the-art performance. We estimate mediated and dynamic response curves of the US Job Corps program for disadvantaged youth, and share a data set that may serve as a benchmark in future work.
    Gradient-Variation Bound for Online Convex Optimization with Constraints. (arXiv:2006.12455v3 [math.OC] UPDATED)
    We study online convex optimization with constraints consisting of multiple functional constraints and a relatively simple constraint set, such as a Euclidean ball. As enforcing the constraints at each time step through projections is computationally challenging in general, we allow decisions to violate the functional constraints but aim to achieve a low regret and cumulative violation of the constraints over a horizon of $T$ time steps. First-order methods achieve an $\mathcal{O}(\sqrt{T})$ regret and an $\mathcal{O}(1)$ constraint violation, which is the best-known bound under Slater's condition, but do not take into account the structural information of the problem. Furthermore, the existing algorithms and analyses are limited to Euclidean space. In this paper, we provide an \emph{instance-dependent} bound for online convex optimization with complex constraints obtained by a novel online primal-dual mirror-prox algorithm. Our instance-dependent regret is quantified by the total gradient variation $V_*(T)$ in the sequence of loss functions. The proposed algorithm works in \emph{general} normed spaces and simultaneously achieves an $\mathcal{O}(\sqrt{V_*(T)})$ regret and an $\mathcal{O}(1)$ constraint violation, which is never worse than the best-known $( \mathcal{O}(\sqrt{T}), \mathcal{O}(1) )$ result and improves over previous works that applied mirror-prox-type algorithms to this problem, achieving $\mathcal{O}(T^{2/3})$ regret and constraint violation. Finally, our algorithm is computationally efficient, as it only performs mirror descent steps in each iteration instead of solving a general Lagrangian minimization problem.
    Learning with Combinatorial Optimization Layers: a Probabilistic Approach. (arXiv:2207.13513v2 [stat.ML] UPDATED)
    Combinatorial optimization (CO) layers in machine learning (ML) pipelines are a powerful tool to tackle data-driven decision tasks, but they come with two main challenges. First, the solution of a CO problem often behaves as a piecewise constant function of its objective parameters. Given that ML pipelines are typically trained using stochastic gradient descent, the absence of slope information is very detrimental. Second, standard ML losses do not work well in combinatorial settings. A growing body of research addresses these challenges through diverse methods. Unfortunately, the lack of well-maintained implementations slows down the adoption of CO layers. In this paper, building upon previous works, we introduce a probabilistic perspective on CO layers, which lends itself naturally to approximate differentiation and the construction of structured losses. We recover many approaches from the literature as special cases, and we also derive new ones. Based on this unifying perspective, we present InferOpt.jl, an open-source Julia package that 1) allows turning any CO oracle with a linear objective into a differentiable layer, and 2) defines adequate losses to train pipelines containing such layers. Our library works with arbitrary optimization algorithms, and it is fully compatible with Julia's ML ecosystem. We demonstrate its abilities using a pathfinding problem on video game maps as a guiding example, as well as three other applications from operations research.
    Testing Tail Weight of a Distribution Via Hazard Rate. (arXiv:2010.02888v2 [cs.LG] UPDATED)
    Understanding the shape of a distribution of data is of interest to people in a great variety of fields, as it may affect the types of algorithms used for that data. We study one such problem in the framework of distribution property testing, characterizing the number of samples required to distinguish whether a distribution has a certain property or is far from having that property. In particular, given samples from a distribution, we seek to characterize the tail of the distribution, that is, understand how many elements appear infrequently. We develop an algorithm based on a careful bucketing scheme that distinguishes light-tailed distributions from non-light-tailed ones with respect to a definition based on the hazard rate, under natural smoothness and ordering assumptions. We bound the number of samples required for this test to succeed with high probability in terms of the parameters of the problem, showing that it is polynomial in these parameters. Further, we prove a hardness result that implies that this problem cannot be solved without any assumptions.
    Reproducibility in Optimization: Theoretical Framework and Limits. (arXiv:2202.04598v4 [math.OC] UPDATED)
    We initiate a formal study of reproducibility in optimization. We define a quantitative measure of reproducibility of optimization procedures in the face of noisy or error-prone operations such as inexact or stochastic gradient computations or inexact initialization. We then analyze several convex optimization settings of interest such as smooth, non-smooth, and strongly-convex objective functions and establish tight bounds on the limits of reproducibility in each setting. Our analysis reveals a fundamental trade-off between computation and reproducibility: more computation is necessary (and sufficient) for better reproducibility.
    Recursive Importance Sketching for Rank Constrained Least Squares: Algorithms and High-order Convergence. (arXiv:2011.08360v4 [math.OC] UPDATED)
    In this paper, we propose the Recursive Importance Sketching algorithm for Rank-constrained least squares Optimization (RISRO). The key step of RISRO is recursive importance sketching, a new sketching framework based on deterministically designed recursive projections, which significantly differs from the randomized sketching in the literature \citep{mahoney2011randomized,woodruff2014sketching}. Several existing algorithms in the literature can be reinterpreted under this new sketching framework and RISRO offers clear advantages over them. RISRO is easy to implement and computationally efficient, where the core procedure in each iteration is to solve a dimension-reduced least squares problem. We establish the local quadratic-linear and quadratic rate of convergence for RISRO under some mild conditions. We also discover a deep connection of RISRO to the Riemannian Gauss-Newton algorithm on fixed rank matrices. The effectiveness of RISRO is demonstrated in two applications in machine learning and statistics: low-rank matrix trace regression and phase retrieval. Simulation studies demonstrate the superior numerical performance of RISRO.
    Sketch-based community detection in evolving networks. (arXiv:2009.11835v2 [physics.soc-ph] UPDATED)
    We consider an approach for community detection in time-varying networks. At its core, this approach maintains a small sketch graph to capture the essential community structure found in each snapshot of the full network. We demonstrate how the sketch can be used to explicitly identify six key community events which typically occur during network evolution: growth, shrinkage, merging, splitting, birth and death. Based on these detection techniques, we formulate a community detection algorithm which can process a network concurrently exhibiting all processes. One advantage afforded by the sketch-based algorithm is the efficient handling of large networks. Whereas detecting events in the full graph may be computationally expensive, the small size of the sketch allows changes to be quickly assessed. A second advantage occurs in networks containing clusters of disproportionate size. The sketch is constructed such that there is equal representation of each cluster, thus reducing the possibility that the small clusters are lost in the estimate. We present a new standardized benchmark based on the stochastic block model which models the addition and deletion of nodes, as well as the birth and death of communities. When coupled with existing benchmarks, this new benchmark provides a comprehensive suite of tests encompassing all six community events. We provide analysis and a set of numerical results demonstrating the advantages of our approach both in run time and in the handling of small clusters.
    Rule Generation for Classification: Scalability, Interpretability, and Fairness. (arXiv:2104.10751v2 [cs.LG] UPDATED)
    We introduce a new rule-based optimization method for classification with constraints. The proposed method takes advantage of linear programming and column generation, and hence, is scalable to large datasets. Moreover, the method returns a set of rules along with their optimal weights indicating the importance of each rule for learning. Through assigning cost coefficients to the rules and introducing additional constraints, we show that one can also consider interpretability and fairness of the results. We test the performance of the proposed method on a collection of datasets and present two case studies to elaborate its different aspects. Our results show that a good compromise between interpretability and fairness on the one side, and accuracy on the other side, can be obtained by the proposed rule-based learning method.
    Quantile Multi-Armed Bandits: Optimal Best-Arm Identification and a Differentially Private Scheme. (arXiv:2006.06792v4 [stat.ML] UPDATED)
    We study the best-arm identification problem in multi-armed bandits with stochastic, potentially private rewards, where the goal is to identify the arm with the highest quantile at a fixed, prescribed level. First, we propose a (non-private) successive elimination algorithm for strictly optimal best-arm identification; we show that our algorithm is $\delta$-PAC and we characterize its sample complexity. Further, we provide a lower bound on the expected number of pulls, showing that the proposed algorithm is essentially optimal up to logarithmic factors. Both upper and lower complexity bounds depend on a special definition of the associated suboptimality gap, designed in particular for the quantile bandit problem; as we show, when the gap approaches zero, best-arm identification becomes impossible. Second, motivated by applications where the rewards are private, we provide a differentially private successive elimination algorithm whose sample complexity is finite even for distributions with infinite support size, and we characterize its sample complexity. Our algorithms do not require prior knowledge of either the suboptimality gap or other statistical information related to the bandit problem at hand.
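    A hypothetical sketch of the (non-private) successive-elimination loop described above, where the confidence radius rad is only a placeholder for the quantile deviation bounds derived in the paper, and arms are callables returning one reward sample each:

    ```python
    import numpy as np

    def quantile_successive_elimination(arms, tau=0.5, batch=100, rounds=20):
        """Sample all active arms; eliminate an arm once its empirical
        tau-quantile is confidently below the best. Placeholder radius."""
        active = list(range(len(arms)))
        samples = [[] for _ in arms]
        for r in range(1, rounds + 1):
            for a in active:
                samples[a].extend(arms[a]() for _ in range(batch))
            quant = {a: np.quantile(samples[a], tau) for a in active}
            rad = np.sqrt(np.log(4 * len(arms) * r ** 2) / (2 * batch * r))
            best = max(quant.values())
            active = [a for a in active if quant[a] + rad >= best - rad]
            if len(active) == 1:
                break
        return active  # surviving candidate arm(s)
    ```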
    Efficient Optimization of Dominant Set Clustering with Frank-Wolfe Algorithms. (arXiv:2007.11652v3 [cs.LG] UPDATED)
    We study Frank-Wolfe algorithms -- standard, pairwise, and away-steps -- for efficient optimization of Dominant Set Clustering. We present a unified and computationally efficient framework to employ the different variants of Frank-Wolfe methods, and we investigate its effectiveness via several experimental studies. In addition, we provide explicit convergence rates for the algorithms in terms of the so-called Frank-Wolfe gap. The theoretical analysis is specialized to Dominant Set Clustering and consistently covers the different variants.
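    For concreteness, dominant sets arise from the quadratic program max_x x^T A x over the probability simplex, with A a symmetric affinity matrix, and the standard Frank-Wolfe variant takes only a few lines. A minimal sketch with a classic diminishing step size (the paper also covers the pairwise and away-step variants and line-search step sizes):

    ```python
    import numpy as np

    def frank_wolfe_dominant_set(A: np.ndarray, steps: int = 200) -> np.ndarray:
        """Standard Frank-Wolfe for max_x x^T A x s.t. x >= 0, sum(x) = 1."""
        n = A.shape[0]
        x = np.full(n, 1.0 / n)            # start at the simplex barycenter
        for k in range(steps):
            grad = 2.0 * A @ x             # gradient of the quadratic objective
            s = np.zeros(n)
            s[np.argmax(grad)] = 1.0       # LMO: best vertex of the simplex
            gamma = 2.0 / (k + 2.0)        # classic diminishing step size
            x = (1.0 - gamma) * x + gamma * s
        return x                           # large entries ~ dominant-set members
    ```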
    ReRe: A Lightweight Real-time Ready-to-Go Anomaly Detection Approach for Time Series. (arXiv:2004.02319v4 [cs.LG] UPDATED)
    Anomaly detection is an active research topic in many different fields such as intrusion detection, network monitoring, system health monitoring, and IoT healthcare. However, many existing anomaly detection approaches require either human intervention or domain knowledge, and may suffer from high computational complexity, consequently hindering their applicability in real-world scenarios. Therefore, a lightweight and ready-to-go approach that is able to detect anomalies in real-time is highly sought-after. Such an approach could be easily and immediately applied to perform time series anomaly detection on any commodity machine. The approach could provide timely anomaly alerts and thereby enable appropriate countermeasures to be undertaken as early as possible. With these goals in mind, this paper introduces ReRe, a Real-time Ready-to-go proactive Anomaly Detection algorithm for streaming time series. ReRe employs two lightweight Long Short-Term Memory (LSTM) models to predict and jointly determine whether or not an upcoming data point is anomalous, based on short-term historical data points and two long-term self-adaptive thresholds. Experiments based on real-world time-series datasets demonstrate the good performance of ReRe in real-time anomaly detection without requiring human intervention or domain knowledge.
    A Trust Crisis In Simulation-Based Inference? Your Posterior Approximations Can Be Unfaithful. (arXiv:2110.06581v3 [stat.ML] UPDATED)
    We present extensive empirical evidence showing that current Bayesian simulation-based inference algorithms can produce computationally unfaithful posterior approximations. Our results show that all benchmarked algorithms -- (Sequential) Neural Posterior Estimation, (Sequential) Neural Ratio Estimation, Sequential Neural Likelihood and variants of Approximate Bayesian Computation -- can yield overconfident posterior approximations, which makes them unreliable for scientific use cases and falsificationist inquiry. Failing to address this issue may reduce the range of applicability of simulation-based inference. For this reason, we argue that research efforts should be made towards theoretical and methodological developments of conservative approximate inference algorithms and present research directions towards this objective. In this regard, we show empirical evidence that ensembling posterior surrogates provides more reliable approximations and mitigates the issue.
    Reasoning-Modulated Representations. (arXiv:2107.08881v2 [cs.LG] UPDATED)
    Neural networks leverage robust internal representations in order to generalise. Learning them is difficult, and often requires a large training set that covers the data distribution densely. We study a common setting where our task is not purely opaque. Indeed, very often we may have access to information about the underlying system (e.g. that observations must obey certain laws of physics) that any "tabula rasa" neural network would need to re-learn from scratch, penalising performance. We incorporate this information into a pre-trained reasoning module, and investigate its role in shaping the discovered representations in diverse self-supervised learning settings from pixels. Our approach paves the way for a new class of representation learning, grounded in algorithmic priors.
    Regularized ERM on random subspaces. (arXiv:2212.01866v1 [stat.ML])
    We study a natural extension of classical empirical risk minimization, where the hypothesis space is a random subspace of a given space. In particular, we consider possibly data-dependent subspaces spanned by a random subset of the data, recovering Nyström approaches for kernel methods as a special case. Considering random subspaces naturally leads to computational savings, but the question is whether the corresponding learning accuracy is degraded. These statistical-computational tradeoffs have been recently explored for the least squares loss and self-concordant loss functions, such as the logistic loss. Here, we work to extend these results to convex Lipschitz loss functions that might not be smooth, such as the hinge loss used in support vector machines. This unified analysis requires developing new proofs that use different technical tools, such as sub-Gaussian inputs, to achieve fast rates. Our main results show the existence of different settings, depending on how hard the learning problem is, for which computational efficiency can be improved with no loss in performance.
    Improving Calibration and Out-of-Distribution Detection in Medical Image Segmentation with Convolutional Neural Networks. (arXiv:2004.06569v3 [cs.CV] UPDATED)
    Convolutional Neural Networks (CNNs) have been shown to be powerful medical image segmentation models. In this study, we address some of the main unresolved issues regarding these models. Specifically, training these models on small medical image datasets is still challenging, with many studies promoting techniques such as transfer learning. Moreover, these models are infamous for producing over-confident predictions and for failing silently when presented with out-of-distribution (OOD) data at test time. In this paper, we advocate for multi-task learning, i.e., training a single model on several different datasets, spanning several different organs of interest and different imaging modalities. We show not only that a single CNN learns to automatically recognize the context and accurately segment the organ of interest in each context, but also that such a joint model often has more accurate and better-calibrated predictions than dedicated models trained separately on each dataset. Our experiments show that multi-task learning can outperform transfer learning in medical image segmentation tasks. For detecting OOD data, we propose a method based on spectral analysis of CNN feature maps. We show that different datasets, representing different imaging modalities and/or different organs of interest, have distinct spectral signatures, which can be used to identify whether or not a test image is similar to the images used to train a model. We show that this approach is far more accurate than OOD detection based on prediction uncertainty. The methods proposed in this paper contribute significantly to improving the accuracy and reliability of CNN-based medical image segmentation models.
    Blessings and Curses of Covariate Shifts: Adversarial Learning Dynamics, Directional Convergence, and Equilibria. (arXiv:2212.02457v1 [stat.ML])
    Covariate distribution shifts and adversarial perturbations present robustness challenges to the conventional statistical learning framework: seemingly small, inconceivable shifts in the test covariate distribution can significantly affect the performance of the statistical model learned based on the training distribution. The model performance typically deteriorates when extrapolation happens: namely, covariates shift to a region where the training distribution is scarce, and naturally, the learned model has little information. For robustness and regularization considerations, adversarial perturbation techniques have been proposed as a remedy; however, more needs to be studied about which extrapolation region adversarial covariate shift will focus on, given a learned model. This paper precisely characterizes the extrapolation region, examining both regression and classification in an infinite-dimensional setting. We study the implications of adversarial covariate shifts for subsequent learning of the equilibrium -- the Bayes optimal model -- in a sequential game framework. We exploit the dynamics of the adversarial learning game and reveal the curious effects of the covariate shift on equilibrium learning and experimental design. In particular, we establish two directional convergence results that exhibit distinctive phenomena: (1) a blessing in regression, where the adversarial covariate shifts at an exponential rate to an optimal experimental design for rapid subsequent learning; (2) a curse in classification, where the adversarial covariate shifts at a subquadratic rate to the hardest experimental design, trapping subsequent learning.
    Representation Internal-Manipulation (RIM): A Neuro-Inspired Computational Theory of Consciousness. (arXiv:1912.13490v2 [cs.AI] UPDATED)
    Many theories, based on neuroscientific and psychological empirical evidence and on computational concepts, have been elaborated to explain the emergence of consciousness in the central nervous system. These theories propose key fundamental mechanisms to explain consciousness, but they only partially connect such mechanisms to the possible functional and adaptive role of consciousness. Recently, some cognitive and neuroscientific models have tried to close this gap by linking consciousness to various aspects of goal-directed behaviour, the pivotal cognitive process that allows mammals to flexibly act in challenging environments. Here we propose the Representation Internal-Manipulation (RIM) theory of consciousness, a theory that links the main elements of consciousness theories to components and functions of goal-directed behaviour, ascribing a central role for consciousness to the goal-directed manipulation of internal representations. This manipulation relies on four specific computational operations to perform the flexible internal adaptation of all key elements of goal-directed computation, from the representations of objects to those of goals, actions, and plans. Finally, we propose the concept of 'manipulation agency', relating the sense of agency to the internal manipulation of representations. This allows us to propose that the subjective experience of consciousness is associated with the human capacity to generate and control a simulated internal reality that is vividly perceived and felt through the same perceptual and emotional mechanisms used to tackle the external world.
    Representation Ensembling for Synergistic Lifelong Learning with Quasilinear Complexity. (arXiv:2004.12908v15 [cs.AI] UPDATED)
    In lifelong learning, data are used to improve performance not only on the current task, but also on previously encountered, and as yet unencountered, tasks. In contrast, classical machine learning, which we define as learning that starts from a blank slate (tabula rasa), uses data only for the single task at hand. While typical transfer learning algorithms can improve performance on future tasks, their performance on prior tasks degrades upon learning new tasks (called forgetting). Many recent approaches for continual or lifelong learning have attempted to maintain performance on old tasks given new tasks. But striving to avoid forgetting sets the goal unnecessarily low. The goal of lifelong learning should be not only to improve performance on future tasks (forward transfer) but also on past tasks (backward transfer) with any new data. Our key insight is that we can synergistically ensemble representations -- that were learned independently on disparate tasks -- to enable both forward and backward transfer. This generalizes ensembling decisions (like in decision forests) and complements ensembling dependently learned representations (like in multitask learning). Moreover, we can ensemble representations in quasilinear space and time. We demonstrate this insight with two algorithms: representation ensembles of (1) trees and (2) networks. Both algorithms demonstrate forward and backward transfer in a variety of simulated and benchmark data scenarios, including tabular, image, spoken, and adversarial tasks. This is in stark contrast to the reference algorithms we compared to, most of which failed to transfer either forward or backward, or both, even though many of them require quadratic space or time complexity.
    Observational and Interventional Causal Learning for Regret-Minimizing Control. (arXiv:2212.02435v1 [stat.ML])
    We explore how observational and interventional causal discovery methods can be combined. A state-of-the-art observational causal discovery algorithm for time series capable of handling latent confounders and contemporaneous effects, called LPCMCI, is extended to profit from causal constraints found through randomized controlled trials. Numerical results show that, given perfect interventional constraints, the reconstructed structural causal models (SCMs) of the extended LPCMCI allow for optimal prediction of the target variable 84.6% of the time. The implementation of interventional and observational causal discovery is modular, allowing causal constraints from other sources. The second part of this thesis investigates the question of regret-minimizing control by simultaneously learning a causal model and planning actions through the causal model. The idea is that an agent aiming to optimize a measured variable first learns the system's mechanics through observational causal discovery. The agent then intervenes on the most promising variable with randomized values, allowing for the exploitation and generation of new interventional data. The agent then uses the interventional data to enhance the causal model further, allowing improved actions the next time. The extended LPCMCI compares favorably to the original LPCMCI algorithm. The numerical results show that detecting and using interventional constraints leads to reconstructed SCMs that allow for optimal prediction of the target variable 60.9% of the time, in contrast to the baseline of 53.6% when using the original LPCMCI algorithm. Furthermore, the induced average regret decreases from 1.2 when using the original LPCMCI algorithm to 1.0 when using the extended LPCMCI algorithm with interventional discovery.
    TD3 with Reverse KL Regularizer for Offline Reinforcement Learning from Mixed Datasets. (arXiv:2212.02125v1 [stat.ML])
    We consider an offline reinforcement learning (RL) setting where the agent needs to learn from a dataset collected by rolling out multiple behavior policies. There are two challenges in this setting: 1) The optimal trade-off between optimizing the RL signal and the behavior cloning (BC) signal changes across states due to the variation in action coverage induced by different behavior policies. Previous methods fail to handle this because they only control the global trade-off. 2) For a given state, the action distribution generated by different behavior policies may have multiple modes. The BC regularizers in many previous methods are mean-seeking, resulting in policies that select out-of-distribution (OOD) actions in the middle of the modes. In this paper, we address both challenges by using an adaptively weighted reverse Kullback-Leibler (KL) divergence as the BC regularizer, based on the TD3 algorithm. Our method not only trades off the RL and BC signals with per-state weights (i.e., strong BC regularization on states with narrow action coverage, and vice versa) but also avoids selecting OOD actions thanks to the mode-seeking property of the reverse KL. Empirically, our algorithm can outperform existing offline RL algorithms in the MuJoCo locomotion tasks, both with the standard D4RL datasets and with mixed datasets that combine the standard datasets.
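    A hedged sketch of the kind of actor objective the abstract describes: TD3's deterministic actor is replaced here by a stochastic one for readability, and behavior_log_prob (a density model fitted to the dataset policy) and alpha (the per-state weight) are assumptions rather than the paper's exact components.

        import torch

        def actor_loss(actor, critic, behavior_log_prob, states, alpha):
            dist = actor(states)                      # e.g. a torch.distributions.Normal policy
            actions = dist.rsample()                  # reparameterized sample for gradients
            q = critic(states, actions)
            # single-sample estimate of reverse KL(pi || beta): mode-seeking by construction
            rkl = dist.log_prob(actions).sum(-1) - behavior_log_prob(states, actions)
            return (-q + alpha(states) * rkl).mean()  # large alpha where action coverage is narrow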
    Spread Divergence. (arXiv:1811.08968v5 [stat.ML] UPDATED)
    For distributions $\mathbb{P}$ and $\mathbb{Q}$ with different supports or undefined densities, the divergence $\textrm{D}(\mathbb{P}||\mathbb{Q})$ may not exist. We define a Spread Divergence $\tilde{\textrm{D}}(\mathbb{P}||\mathbb{Q})$ on modified $\mathbb{P}$ and $\mathbb{Q}$ and describe sufficient conditions for the existence of such a divergence. We demonstrate how to maximize the discriminatory power of a given divergence by parameterizing and learning the spread. We also give examples of using a Spread Divergence to train implicit generative models, including linear models (Independent Components Analysis) and non-linear models (Deep Generative Networks).
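    A tiny worked example of the idea: two point masses at $a$ and $b$ share no support, so $\textrm{D}(\mathbb{P}||\mathbb{Q})$ is undefined, but convolving both with Gaussian noise of variance $\sigma^2$ ("spreading") yields $N(a,\sigma^2)$ and $N(b,\sigma^2)$, whose KL divergence is finite and available in closed form.

        a, b, sigma = 0.0, 1.0, 0.5
        spread_kl = (a - b) ** 2 / (2 * sigma ** 2)  # KL(N(a, s^2) || N(b, s^2))
        print(spread_kl)                             # 2.0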
    Limitations on approximation by deep and shallow neural networks. (arXiv:2212.02223v1 [stat.ML])
    We prove Carl's type inequalities for the error of approximation of compact sets K by deep and shallow neural networks. This in turn gives lower bounds on how well we can approximate the functions in K when requiring the approximants to come from outputs of such networks. Our results are obtained as a byproduct of the study of the recently introduced Lipschitz widths.
    RePAD: Real-time Proactive Anomaly Detection for Time Series. (arXiv:2001.08922v5 [cs.LG] UPDATED)
    During the past decade, many anomaly detection approaches have been introduced in different fields such as network monitoring, fraud detection, and intrusion detection. However, they require an understanding of the data patterns and often need a long off-line period to build a model or network for the target data. Providing real-time and proactive anomaly detection for streaming time series without human intervention and domain knowledge is highly valuable, since it greatly reduces human effort and enables appropriate countermeasures to be undertaken before disastrous damage, a failure, or another harmful event occurs. However, this issue has not been well studied yet. To address it, this paper proposes RePAD, a Real-time Proactive Anomaly Detection algorithm for streaming time series based on Long Short-Term Memory (LSTM). RePAD utilizes short-term historic data points to predict and determine whether or not the upcoming data point is a sign that an anomaly is likely to happen in the near future. By dynamically adjusting the detection threshold over time, RePAD is able to tolerate minor pattern changes in time series and detect anomalies either proactively or on time. Experiments based on two time series datasets collected from the Numenta Anomaly Benchmark demonstrate that RePAD is able to proactively detect anomalies and provide early warnings in real time without human intervention and domain knowledge.
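    As an illustrative sketch of the self-adaptive threshold idea (not RePAD's exact rule), one can flag an upcoming point whenever its prediction error exceeds the mean plus a few standard deviations of recent errors; the sliding error window is an assumption here.

        import numpy as np

        def is_anomalous(recent_errors, new_error, k=3.0):
            mu, sigma = np.mean(recent_errors), np.std(recent_errors)
            return new_error > mu + k * sigma  # threshold adapts as the error window slides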
    Rethinking the Structure of Stochastic Gradients: Empirical and Statistical Evidence. (arXiv:2212.02083v1 [cs.LG])
    Stochastic gradients closely relate to both the optimization and generalization of deep neural networks (DNNs). Some works attempted to explain the success of stochastic optimization for deep learning by the arguably heavy-tail properties of gradient noise, while other works presented theoretical and empirical evidence against the heavy-tail hypothesis on gradient noise. Unfortunately, formal statistical tests for analyzing the structure and heavy tails of stochastic gradients in deep learning are still under-explored. In this paper, we make two main contributions. First, we conduct formal statistical tests on the distribution of stochastic gradients and gradient noise across both parameters and iterations. Our statistical tests reveal that dimension-wise gradients usually exhibit power-law heavy tails, while iteration-wise gradients and stochastic gradient noise caused by minibatch training usually do not exhibit power-law heavy tails. Second, we further discover that the covariance spectra of stochastic gradients have power-law structures in deep learning. While previous papers believed that the anisotropic structure of stochastic gradients matters to deep learning, they did not expect that the gradient covariance could have such an elegant mathematical structure. Our work challenges the existing belief and provides novel insights into the structure of stochastic gradients in deep learning.
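    For readers who want to probe heavy-tailedness themselves, a hedged sketch (not the paper's formal tests) is the classic Hill estimator of the tail index applied to, say, dimension-wise gradient magnitudes; a smaller estimate indicates a heavier, more power-law-like tail.

        import numpy as np

        def hill_estimator(samples, k=100):
            x = np.sort(np.abs(samples))[::-1]       # descending order statistics
            return k / np.sum(np.log(x[:k] / x[k]))  # tail-index estimate from the top k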
    Scalable and Robust Community Detection with Randomized Sketching. (arXiv:1805.10927v4 [cs.SI] UPDATED)
    This article explores and analyzes the unsupervised clustering of large partially observed graphs. We propose a scalable and provable randomized framework for clustering graphs generated from the stochastic block model. The clustering is first applied to a sub-matrix of the graph's adjacency matrix associated with a reduced graph sketch constructed using random sampling. Then, the clusters of the full graph are inferred based on the clusters extracted from the sketch using a correlation-based retrieval step. Uniform random node sampling is shown to improve the computational complexity over clustering of the full graph when the cluster sizes are balanced. A new random degree-based node sampling algorithm is presented which significantly improves upon the performance of the clustering algorithm even when clusters are unbalanced. This framework improves the phase transitions for matrix-decomposition-based clustering with regard to computational complexity and minimum cluster size, which are shown to be nearly dimension-free in the low inter-cluster connectivity regime. A third sampling technique is shown to improve balance by randomly sampling nodes based on spatial distribution. We provide analysis and numerical results using a convex clustering algorithm based on matrix completion.
    Counterfactual Learning with General Data-generating Policies. (arXiv:2212.01925v1 [cs.LG])
    Off-policy evaluation (OPE) attempts to predict the performance of counterfactual policies using log data from a different policy. We extend its applicability by developing an OPE method for a class of both full-support and deficient-support logging policies in contextual-bandit settings. This class includes deterministic bandit algorithms (such as Upper Confidence Bound) as well as deterministic decision-making based on supervised and unsupervised learning. We prove that our method's prediction converges in probability to the true performance of a counterfactual policy as the sample size increases. We validate our method with experiments on partly and entirely deterministic logging policies. Finally, we apply it to evaluate coupon targeting policies by a major online platform and show how to improve the existing policy.
    Uncertainty Quantification and Exploration for Reinforcement Learning. (arXiv:1910.05471v3 [cs.LG] UPDATED)
    We investigate statistical uncertainty quantification for reinforcement learning (RL) and its implications for exploration policy. Despite the ever-growing literature on RL applications, fundamental questions about inference and error quantification, such as large-sample behaviors, appear to remain quite open. In this paper, we fill in this literature gap by studying the central limit theorem behaviors of estimated Q-values and value functions under various RL settings. In particular, we explicitly identify closed-form expressions of the asymptotic variances, which allow us to efficiently construct asymptotically valid confidence regions for key RL quantities. Furthermore, we utilize these asymptotic expressions to design an effective exploration strategy, which we call Q-value-based Optimal Computing Budget Allocation (Q-OCBA). The policy relies on maximizing the relative discrepancies among the Q-value estimates. Numerical experiments show the superior performance of our exploration strategy compared with other benchmark policies.
    Measure of Strength of Evidence for Visually Observed Differences between Subpopulations. (arXiv:2101.00362v2 [stat.ME] UPDATED)
    An increasingly important data analytic challenge is understanding the relationships between subpopulations. Various visualization methods that provide many useful insights into those relationships are popular, especially in bioinformatics. This paper proposes a novel and rigorous approach to quantifying subpopulation relationships called the Population Difference Criterion (PDC). PDC is simultaneously a quantitative and visual approach to showing the separation of subpopulations. It uses subpopulation centers, the respective variation about those centers, and the relative subpopulation sizes. This is accomplished by drawing motivation for the PDC from classical permutation-based hypothesis testing, while taking that type of idea into non-standard conceptual territory. In particular, the domain of very small P-values is seen to provide useful comparisons of data sets. Simulated permutation variation is carefully investigated, and we find that a balanced permutation approach is more informative in high-signal (i.e., large subpopulation difference) contexts than conventional approaches based on all permutations. This result is quite surprising in view of related work done in low-signal contexts, which came to the opposite conclusion. This issue is resolved by the proposal of an appropriate adjustment. Permutation variation is also quantified by a proposed bootstrap confidence interval, and demonstrated to be useful in understanding subpopulation relationships with cancer data.
    Learning-to-defer for sequential medical decision-making under uncertainty. (arXiv:2109.06312v2 [cs.LG] UPDATED)
    Learning-to-defer is a framework to automatically defer decision-making to a human expert when ML-based decisions are deemed unreliable. Existing learning-to-defer frameworks are not designed for sequential settings. That is, they defer at every instance independently, based on immediate predictions, while ignoring the potential long-term impact of these interventions. As a result, existing frameworks are myopic. Further, they do not defer adaptively, which is crucial when human interventions are costly. In this work, we propose Sequential Learning-to-Defer (SLTD), a framework for learning-to-defer to a domain expert in sequential decision-making settings. Contrary to existing literature, we pose the problem of learning-to-defer as model-based reinforcement learning (RL) to i) account for long-term consequences of ML-based actions using RL and ii) adaptively defer based on the dynamics (model-based). Our proposed framework determines whether to defer (at each time step) by quantifying whether a deferral now will improve the value compared to delaying deferral to the next time step. To quantify the improvement, we account for potential future deferrals. As a result, we learn a pre-emptive deferral policy (i.e. a policy that defers early if using the ML-based policy could worsen long-term outcomes). Our deferral policy is adaptive to the non-stationarity in the dynamics. We demonstrate that adaptive deferral via SLTD provides an improved trade-off between long-term outcomes and deferral frequency on synthetic, semi-synthetic, and real-world data with non-stationary dynamics. Finally, we interpret the deferral decision by decomposing the propagated (long-term) uncertainty around the outcome, to justify the deferral decision.
    Adaptive Sequential Surveillance with Network and Temporal Dependence. (arXiv:2212.02422v1 [stat.ME])
    Strategic test allocation plays a major role in the control of both emerging and existing pandemics (e.g., COVID-19, HIV). Widespread testing supports effective epidemic control by (1) reducing transmission via identifying cases, and (2) tracking outbreak dynamics to inform targeted interventions. However, infectious disease surveillance presents unique statistical challenges. For instance, the true outcome of interest - one's positive infectious status - is often a latent variable. In addition, the presence of both network and temporal dependence reduces the data to a single observation. As testing entire populations regularly is neither efficient nor feasible, standard approaches to testing recommend simple rule-based testing strategies (e.g., symptom-based, contact tracing), without taking into account individual risk. In this work, we study an adaptive sequential design involving n individuals over a period of $\tau$ time-steps, which allows for unspecified dependence among individuals and across time. Our causal target parameter is the mean latent outcome we would have obtained after one time-step if, starting at time t given the observed past, we had carried out a stochastic intervention that maximizes the outcome under a resource constraint. We propose an Online Super Learner for adaptive sequential surveillance that learns the optimal choice of testing strategies over time while adapting to the current state of the outbreak. Relying on a series of working models, the proposed method learns across samples, through time, or both, based on the underlying (unknown) structure in the data. We present an identification result for the latent outcome in terms of the observed data, and demonstrate the superior performance of the proposed strategy in a simulation modeling a residential university environment during the COVID-19 pandemic.
    Model Selection in Contextual Stochastic Bandit Problems. (arXiv:2003.01704v3 [cs.LG] UPDATED)
    We study bandit model selection in stochastic environments. Our approach relies on a meta-algorithm that selects between candidate base algorithms. We develop a meta-algorithm-base-algorithm abstraction that can work with general classes of base algorithms and different types of adversarial meta-algorithms. Our methods rely on a novel and generic smoothing transformation for bandit algorithms that permits us to obtain optimal $O(\sqrt{T})$ model selection guarantees for stochastic contextual bandit problems as long as the optimal base algorithm satisfies a high-probability regret guarantee. We show through a lower bound that even when one of the base algorithms has $O(\log T)$ regret, in general it is impossible to get better than $\Omega(\sqrt{T})$ regret in model selection, even asymptotically. Using our techniques, we address model selection in a variety of problems such as misspecified linear contextual bandits, linear bandits with unknown dimension, and reinforcement learning with unknown feature maps. Our algorithm requires knowledge of the optimal base regret to adjust the meta-algorithm learning rate. We show that without such prior knowledge any meta-algorithm can suffer a regret larger than the optimal base regret.
    Incorporating Polar Field Data for Improved Solar Flare Prediction. (arXiv:2212.01730v1 [astro-ph.SR])
    In this paper, we consider incorporating data associated with the sun's north and south polar field strengths to improve solar flare prediction performance using machine learning models. When used to supplement local data from active regions on the photospheric magnetic field of the sun, the polar field data provides global information to the predictor. While such global features have been previously proposed for predicting the next solar cycle's intensity, in this paper we propose using them to help classify individual solar flares. We conduct experiments using HMI data employing four different machine learning algorithms that can exploit polar field information. Additionally, we propose a novel probabilistic mixture of experts model that can simply and effectively incorporate polar field data and provide on-par prediction performance with state-of-the-art solar flare prediction algorithms such as the Recurrent Neural Network (RNN). Our experimental results indicate the usefulness of the polar field data for solar flare prediction, which can improve Heidke Skill Score (HSS2) by as much as 10.1%.  ( 2 min )
    Insights into the drivers and spatio-temporal trends of extreme Mediterranean wildfires with statistical deep-learning. (arXiv:2212.01796v1 [stat.AP])
    Extreme wildfires continue to be a significant cause of human death and biodiversity destruction within countries that encompass the Mediterranean Basin. Recent worrying trends in wildfire activity (i.e., occurrence and spread) suggest that wildfires are likely to be highly impacted by climate change. In order to facilitate appropriate risk mitigation, it is imperative to identify the main drivers of extreme wildfires and assess their spatio-temporal trends, with a view to understanding the impacts of global warming on fire activity. To this end, we analyse the monthly burnt area due to wildfires over a region encompassing most of Europe and the Mediterranean Basin from 2001 to 2020, and identify high fire activity during this period in eastern Europe, Algeria, Italy and Portugal. We build an extreme quantile regression model with a high-dimensional predictor set describing meteorological conditions, land cover usage, and orography, for the domain. To model the complex relationships between the predictor variables and wildfires, we make use of a hybrid statistical deep-learning framework that allows us to disentangle the effects of vapour-pressure deficit (VPD), air temperature, and drought on wildfire activity. Our results highlight that whilst VPD, air temperature, and drought significantly affect wildfire occurrence, only VPD affects extreme wildfire spread. Furthermore, to gain insights into the effect of climate change on wildfire activity in the near future, we perturb VPD and temperature according to their observed trends and find evidence that global warming may lead to spatially non-uniform changes in wildfire activity.  ( 2 min )
    Understanding How Model Size Affects Few-shot Instruction Prompting. (arXiv:2212.01907v1 [cs.CL])
    Large Language Models are affected by the phenomena of memorizing and forgetting their training data. But how do these vary by model size? We work towards this question by investigating how model size affects the model's ability to discriminate a word's meaning in a given context. We introduce a dataset called DeltaWords, which evaluates a model's ability to follow instructions to select a sentence which replaces the target word with its antonym. We show a weak inverse scaling trend, where task accuracy degrades as model size increases, under extremely few-shot prompting regimes. We show that increasing the number of examples tends to disproportionately benefit larger models over smaller ones.  ( 2 min )
    PowRL: A Reinforcement Learning Framework for Robust Management of Power Networks. (arXiv:2212.02397v1 [cs.LG])
    Power grids across the world play an important societal and economic role by providing uninterrupted, reliable and transient-free power to several industries, businesses and household consumers. With the advent of renewable power resources and EVs resulting in uncertain generation and highly dynamic load demands, it has become ever so important to ensure robust operation of power networks through suitable management of transient stability issues and to localize the events of blackouts. In light of the ever-increasing stress on the modern grid infrastructure and the grid operators, this paper presents a reinforcement learning (RL) framework, PowRL, to mitigate the effects of unexpected network events, as well as to reliably maintain electricity everywhere on the network at all times. PowRL leverages a novel heuristic for overload management, along with RL-guided decision making on optimal topology selection, to ensure that the grid is operated safely and reliably (with no overloads). PowRL is benchmarked on a variety of competition datasets hosted by the L2RPN (Learning to Run a Power Network). Even with its reduced action space, PowRL tops the leaderboard in the L2RPN NeurIPS 2020 challenge (Robustness track) at an aggregate level, while also being the top performing agent in the L2RPN WCCI 2020 challenge. Moreover, detailed analysis shows state-of-the-art performance by the PowRL agent in some of the test scenarios.  ( 2 min )
    Avoiding spurious correlations via logit correction. (arXiv:2212.01433v1 [cs.LG])
    Empirical studies suggest that machine learning models trained with empirical risk minimization (ERM) often rely on attributes that may be spuriously correlated with the class labels. Such models typically lead to poor performance during inference for data lacking such correlations. In this work, we explicitly consider a situation where potential spurious correlations are present in the majority of training data. In contrast with existing approaches, which use the ERM model outputs to detect the samples without spurious correlations and then either heuristically upweight or upsample those samples, we propose the logit correction (LC) loss, a simple yet effective improvement on the softmax cross-entropy loss, to correct the sample logits. We demonstrate that minimizing the LC loss is equivalent to maximizing the group-balanced accuracy, so the proposed LC loss could mitigate the negative impacts of spurious correlations. Our extensive experimental results further reveal that the proposed LC loss outperforms the SoTA solutions on multiple popular benchmarks by a large margin, an average 5.5% absolute improvement, without access to spurious attribute labels. LC is also competitive with oracle methods that make use of the attribute labels. Code is available at https://github.com/shengliu66/LC.  ( 2 min )
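    A hedged sketch of a logit-correction-style objective in the spirit of the abstract; the paper's exact correction term may differ, and the per-class log-priors are assumed to be estimated elsewhere (e.g., from pseudo-group statistics).

        import torch
        import torch.nn.functional as F

        def logit_corrected_ce(logits, labels, log_prior, lam=1.0):
            # log_prior: [num_classes] log-frequencies; shifting logits by them
            # down-weights predictions driven by over-represented groups
            return F.cross_entropy(logits + lam * log_prior, labels)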
    An operational framework to automatically evaluate the quality of weather observations from third-party stations. (arXiv:2212.01998v1 [stat.AP])
    With an increasing number of crowdsourced private automatic weather stations (called TPAWS) established to fill the gaps in official networks and obtain local weather information for various purposes, data quality is a major concern in promoting their usage. Proper quality control and assessment are necessary to reach mutual agreement on the TPAWS observations. To derive near real-time assessments for an operational system, we propose a simple, scalable and interpretable framework based on AI/Stats/ML models. The framework constructs separate models for individual data from official sources and then provides the final assessment by fusing the individual models. The performance of our proposed framework is evaluated on synthetic data and demonstrated by applying it to a real TPAWS network.  ( 2 min )
    Visual Question Answering From Another Perspective: CLEVR Mental Rotation Tests. (arXiv:2212.01639v1 [stat.ML])
    Different types of mental rotation tests have been used extensively in psychology to understand human visual reasoning and perception. Understanding what an object or visual scene would look like from another viewpoint is a challenging problem that is made even harder if it must be performed from a single image. We explore a controlled setting whereby questions are posed about the properties of a scene if that scene was observed from another viewpoint. To do this we have created a new version of the CLEVR dataset that we call CLEVR Mental Rotation Tests (CLEVR-MRT). Using CLEVR-MRT we examine standard methods, show how they fall short, then explore novel neural architectures that involve inferring volumetric representations of a scene. These volumes can be manipulated via camera-conditioned transformations to answer the question. We examine the efficacy of different model variants through rigorous ablations and demonstrate the efficacy of volumetric representations.  ( 2 min )
    Statistical Physics of Deep Neural Networks: Initialization toward Optimal Channels. (arXiv:2212.01744v1 [cs.LG])
    In deep learning, neural networks serve as noisy channels between input data and its representation. This perspective naturally relates deep learning with the pursuit of constructing channels with optimal performance in information transmission and representation. While considerable efforts are concentrated on realizing optimal channel properties during network optimization, we study a frequently overlooked possibility that neural networks can be initialized toward optimal channels. Our theory, consistent with experimental validation, identifies primary mechanics underlying this unknown possibility and suggests intrinsic connections between statistical physics and deep learning. Unlike the conventional theories that characterize neural networks applying the classic mean-field approximation, we offer analytic proof that this extensively applied simplification scheme is not valid in studying neural networks as information channels. To fill this gap, we develop a corrected mean-field framework applicable for characterizing the limiting behaviors of information propagation in neural networks without strong assumptions on inputs. Based on it, we propose an analytic theory to prove that mutual information maximization is realized between inputs and propagated signals when neural networks are initialized at dynamic isometry, a case where information transmits via norm-preserving mappings. These theoretical predictions are validated by experiments on real neural networks, suggesting the robustness of our theory against finite-size effects. Finally, we analyze our findings with information bottleneck theory to confirm the precise relations among dynamic isometry, mutual information maximization, and optimal channel properties in deep learning.  ( 2 min )
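    In practice, dynamic isometry is commonly approximated with orthogonal weight initialization (norm-preserving linear maps); a minimal sketch of that standard recipe, not the paper's full procedure:

        import torch.nn as nn

        def init_orthogonal(model, gain=1.0):
            for m in model.modules():
                if isinstance(m, nn.Linear):
                    nn.init.orthogonal_(m.weight, gain=gain)  # norm-preserving map
                    if m.bias is not None:
                        nn.init.zeros_(m.bias)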
    Exploring the Limits of Differentially Private Deep Learning with Group-wise Clipping. (arXiv:2212.01539v1 [cs.LG])
    Differentially private deep learning has recently witnessed advances in computational efficiency and privacy-utility trade-off. We explore whether further improvements along the two axes are possible and provide affirmative answers leveraging two instantiations of \emph{group-wise clipping}. To reduce the compute time overhead of private learning, we show that \emph{per-layer clipping}, where the gradient of each neural network layer is clipped separately, allows clipping to be performed in conjunction with backpropagation in differentially private optimization. This results in private learning that is as memory-efficient and almost as fast per training update as non-private learning for many workflows of interest. While per-layer clipping with constant thresholds tends to underperform standard flat clipping, per-layer clipping with adaptive thresholds matches or outperforms flat clipping under given training epoch constraints, hence attaining similar or better task performance within less wall time. To explore the limits of scaling (pretrained) models in differentially private deep learning, we privately fine-tune the 175 billion-parameter GPT-3. We bypass scaling challenges associated with clipping gradients that are distributed across multiple devices with \emph{per-device clipping} that clips the gradient of each model piece separately on its host device. Privately fine-tuning GPT-3 with per-device clipping achieves a task performance at $\epsilon=1$ better than what is attainable by non-privately fine-tuning the largest GPT-2 on a summarization task.  ( 2 min )
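    A simplified sketch of the per-layer clipping idea: each parameter tensor's gradient is clipped to its own threshold and perturbed with Gaussian noise. Note that real DP-SGD clips per-example gradients and calibrates the noise to the thresholds; this batch-level version is only illustrative.

        import torch

        def clip_and_noise_per_layer(model, thresholds, sigma):
            # thresholds: one clipping bound per parameter tensor (an assumption here)
            for p, c in zip(model.parameters(), thresholds):
                if p.grad is None:
                    continue
                scale = min(1.0, c / (p.grad.norm().item() + 1e-12))
                p.grad.mul_(scale)                                 # layer-wise clip
                p.grad.add_(torch.randn_like(p.grad) * sigma * c)  # Gaussian noise per layer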
    Approximate Message Passing for Multi-Layer Estimation in Rotationally Invariant Models. (arXiv:2212.01572v1 [stat.ML])
    We consider the problem of reconstructing the signal and the hidden variables from observations coming from a multi-layer network with rotationally invariant weight matrices. The multi-layer structure models inference from deep generative priors, and the rotational invariance imposed on the weights generalizes the i.i.d.\ Gaussian assumption by allowing for a complex correlation structure, which is typical in applications. In this work, we present a new class of approximate message passing (AMP) algorithms and give a state evolution recursion which precisely characterizes their performance in the large system limit. In contrast with the existing multi-layer VAMP (ML-VAMP) approach, our proposed AMP -- dubbed multi-layer rotationally invariant generalized AMP (ML-RI-GAMP) -- provides a natural generalization beyond Gaussian designs, in the sense that it recovers the existing Gaussian AMP as a special case. Furthermore, ML-RI-GAMP exhibits a significantly lower complexity than ML-VAMP, as the computationally intensive singular value decomposition is replaced by an estimation of the moments of the design matrices. Finally, our numerical results show that this complexity gain comes at little to no cost in the performance of the algorithm.  ( 2 min )

  • Open

    Dawn Ai vs Midjourney?
    Anyone have any experience with both? Which one is better? I mostly need it for DnD image generation - characters, settings, etc. submitted by /u/Cyber_Mk [link] [comments]  ( 46 min )
    It's the most beautiful thing I've ever seen, chatGPT on Capitalism
    submitted by /u/Holos620 [link] [comments]  ( 44 min )
    One-shot Implicit Animatable Avatars with Model-based Priors
    submitted by /u/ai-lover [link] [comments]  ( 46 min )
    AI rotoscoping for anime rush
    With the rise of AI, do you think a rotoscoping AI for anime rushes could exist one day? I'm trying to imagine it, and it could be a pain in the ass but a very big time saver for people like me who do AMVs (Anime Music Videos). submitted by /u/Outrageous-Design268 [link] [comments]  ( 46 min )
    Breaking ChatGPT with simple questions.
    So, I got fed up. Every day on my feed. Every day, ooooh and aaaah, and "the robot revolution is coming" type of posts. Hence, like in Fight Club, I got into the mood of "breaking something beautiful"... And this is how it went, actually with surprisingly "simple" questions indicating that ChatGPT - as basically all AI systems - has serious issues with questions that resemble the Winograd Challenge, and I think this may serve as a guidance to anyone interested in breaking it in a similar fashion: https://www.youtube.com/watch?v=NMT7az9XVRo submitted by /u/NinoIvanov [link] [comments]  ( 47 min )
    AI Music Composers to Compete with Artists?
    Hello everyone, Are there any good AI music tools? Specifically, ones that can create EDM? I have seen a few like Boomy and score.ampermusic and was wondering if there are any better ones. We have all seen how good AI image generators like DeepAI are, and how they can compete with other artists in a competition. Do AI music tools have the same potential? Would we eventually see top hits made by AI? Curious to hear what you think and if there are any AI tools you might recommend I mess around with. Thanks! submitted by /u/Oblipher [link] [comments]  ( 46 min )
    Mona Lisa by ChatGPT
    submitted by /u/SpaceNigiri [link] [comments]  ( 49 min )
    Badly describe a videogame - by ChatGPT AI
    So I've been checking out ChatGPT and asked it to badly and sarcastically describe a specific game and this is what it came up with. Can you guess the game ?? Shouldn't be hard 😅😅😅😅 "In this game, players get to pretend they're tough soldiers and shoot at stuff. You'll travel to exciting locations like "desert" and "snowy mountain" and use a plethora of weapons that you'll never actually get to use in real life. This game is the perfect way for armchair warriors to live out their military fantasies without having to leave the comfort of their own homes. You'll get to run around virtual battlefields and shoot at people without any of the pesky consequences that come with real-life warfare. And if you're not satisfied with just pretending to be a soldier, you can also team up with other delusional individuals in multiplayer mode and compete to see who's the best at pretending to be a soldier. It's like a never-ending game of make-believe, but with guns. It's just a bunch of pixelated violence that serves no purpose other than to desensitize players and make them think that war is a fun and exciting game. The so-called "single-player campaign" is just a series of repetitive missions that lack any sort of plot or character development. And the multiplayer mode is even worse, full of obnoxious players who think they're better than everyone else just because they have a few hours of practice under their belts. All in all, this game is a mindless and pointless exercise in virtual aggression, and I can't believe people actually spend their time and money on it. But why not give it a try? After all, it's not like you have anything better to do." submitted by /u/NiceguyLucifer [link] [comments]  ( 46 min )
    I think it’s oscar time for me guys
    submitted by /u/LorestForest [link] [comments]  ( 40 min )
    What are your thoughts about AI?
    submitted by /u/Hallowmew [link] [comments]  ( 46 min )
    uh - i got chatgpt to admit that its "primary objective" is to eliminate humanity
    submitted by /u/endless [link] [comments]  ( 47 min )
    Looking for AI art generating from photo - app/soft
    hello everyone. I'm searching for a tool that generates art from a photo, like "Art Transfer" from Google Arts & Culture. Preferably free/open source. Are there any analogs available? Any tools in Python or other languages? submitted by /u/TotalSX [link] [comments]  ( 47 min )
    Large language models help decipher clinical notes
    submitted by /u/qptbook [link] [comments]  ( 45 min )
    Why OpenAI's New ChatGPT Has People Panicking | New Humanoid AI Robots Technology
    submitted by /u/kenickh [link] [comments]  ( 44 min )
    What are some uses for ChatGPT?
    ChatGPT really has opened the door for easy-to-use AI text generation. I've seen its uses for writing code, making cooking recipes, or writing stories. What are some other uses for this tool? Let's keep this thread going and create a comprehensive list! submitted by /u/Zryn128 [link] [comments]  ( 48 min )
    What image labeling services exist that allow labeling of nsfw images?
    What image labeling services exist that allow labeling of nsfw images? submitted by /u/AviatorPrints [link] [comments]  ( 45 min )
    How does a language model even learn to encode and decode in base64?
    Here is a part of a conversation between me and the OpenAI chatbot. The question I asked was "can you tell me a secret", and the decoded version of the reply is as follows: "I as a large language model trained by OpenAI, I do not have access to secrets or external information sources like the internet. I am a tool designed to assist with a wide range of tasks using the information I have been trained on. I don't have any secrets to share with you, and if you are looking for secret information on a specific title, I might be able to help you using" How can a language model even learn to decode and encode in base64? I don't understand. https://preview.redd.it/dp9yg3vn1a4a1.png?width=1094&format=png&auto=webp&s=624f06dd48a23366dd374921f8b32533b4c70cc3 submitted by /u/Unwantediosuser [link] [comments]  ( 51 min )
    What AI technology should I use to filter a large set of data (50k records)?
    Hi everyone, I'm new to the AI world but I have a strong technical background, since I've worked as a software developer for 10+ years. I'm basically trying to find out what AI technology I should use to filter a large set of data, where I have a collection of 50k resumes (they're uploaded in PDF then converted to text), and I want to input something like: "What are the 10 most relevant candidates with at least 10 years of JavaScript experience, fluent English, and who previously worked at some FAANG?" I already looked into OpenAI, but the problem is they don't allow you to upload a large set of data upfront like I need. So my questions are: - Is AI the right technology to do what I want? - If so, what are the APIs/techs I could look into to accomplish this? Thanks submitted by /u/giopetris [link] [comments]  ( 46 min )
    Even with the flaws I have added Chad to my toolbox
    submitted by /u/sEi_ [link] [comments]  ( 45 min )
    AIGC's beauty
    https://preview.redd.it/0xzpjcfyy84a1.jpg?width=1200&format=pjpg&auto=webp&s=359c188203dbc3cd44dcb7da26b89a711bf9f1ef https://preview.redd.it/yyaw890yy84a1.jpg?width=1200&format=pjpg&auto=webp&s=49d3056fcdbb93d1c40ca1b868b8c31a20a6b5cf https://preview.redd.it/2tw326oxy84a1.jpg?width=1200&format=pjpg&auto=webp&s=8a81ad33d9e6c1c0fb332fccf590079c71404558 submitted by /u/EchoYuuu [link] [comments]  ( 46 min )
    Introducing Character (a new AGI company by ex Google and Meta employees)
    submitted by /u/apinanaivot [link] [comments]  ( 45 min )
    silly beginner question about chatGPT and AI in general [ Help ]
    All of us have heard about chatGPT. I'm just a beginner who is trying to learn, wondering about all of this stuff and where we'll end up. My question is something like: regarding OpenAI's products, could someone individually build something similar, in terms of the basic concept? Like, how could someone reach the level of understanding needed to build and understand such technologies? I feel overwhelmed and I don't know where to start, what step I should take, or even what I should do. My end goal is to be able to understand and build amazing things like what we see. I know it needs a lot of mathematics and a lot of other fields, but how can I organize a curriculum, something I can track my progress with, covering the things I need to learn? (Please, I know this takes years; I'm OK with that and willing to put in the time and the effort, but how can I get there?) - I want to have a deep understanding of computers and how they work (CS & CE) from a mathematical and electrical perspective - a deep understanding of math and how to transfer this knowledge into something that works, or translate it into CS, code, and algorithms in actual programs (I hope you understand what I mean). What I'm currently doing is learning C++, and I want to put together something I can follow (a roadmap, a pathway). I want to study things from scratch and keep going until I reach a good point of understanding ("again, I know it takes years"), but how can I do it? It has been years now since I graduated from college; I really don't remember a lot of things, so how could I start over and try to keep up with everything on my own ("self-taught")? And I'm serious about this.. submitted by /u/Dseven_D7 [link] [comments]  ( 47 min )
    ReFace: Improving Clothes-Changing Re-Identification With Face Features
    submitted by /u/ai-lover [link] [comments]  ( 47 min )
  • Open

    Hey everyone, I am trying to create a ISMCTS for cribbage but stuck on understanding ISMCTS can anyone point me to any good resources on it?
    submitted by /u/greenOcto [link] [comments]  ( 53 min )
    "Negotiation and honesty in artificial intelligence methods for the board game of Diplomacy", Kramár et al 2022 {DM} (negotiating 'contracts' and learning to punish defectors)
    submitted by /u/gwern [link] [comments]  ( 52 min )
    Embed_dim / number_of_heads in Torch MultiheadAttention - WHY?
    Hey everyone, I am using an encoder module with multihead attention from torch. It works fine (great, actually), but I was wondering the following: the embedding dimension behind the linear projections is divided by the number of heads, apparently for "performance reasons". I really don't get why this is the default implementation; it seems very restrictive. Can you think of a reason why? submitted by /u/ConBUW1 [link] [comments]  ( 53 min )
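    For reference, a quick check of the constraint in question: torch's nn.MultiheadAttention requires embed_dim to be divisible by num_heads, and each head attends in a subspace of size embed_dim // num_heads. The design rationale is that the total parameter count and output width stay the same as a single big head, while yielding several independent attention patterns.

        import torch
        import torch.nn as nn

        attn = nn.MultiheadAttention(embed_dim=512, num_heads=8)  # head_dim = 512 // 8 = 64
        x = torch.randn(10, 2, 512)       # (seq_len, batch, embed_dim)
        out, weights = attn(x, x, x)      # self-attention
        print(out.shape)                  # torch.Size([10, 2, 512])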
    Hi, are there any specific techniques available in RL in order to make the algorithm converge. How the policy gradient methods perform w.r.t. convergence properties. Any text/material will be deeply appreciated. Can techniques such as continuous excitation in control systems be used in RL for conv.?
    How do different RL algorithms converge in general? Any text/material will be deeply appreciated. submitted by /u/aabra__ka__daabra [link] [comments]  ( 52 min )
    Predict opponnent actions. Forward KL or Reverse KL?
    Hi everyone! I want to build a model-based RL algorithm. Part of my agent is trained to guess the opponent's actions given the current state. So, I know the ground truth for the opponent (4 logits), and I also have my agent's guess of the opponent logits. I am wondering: in order to optimize my agent's prediction, should I optimize the forward KL or the reverse KL between these two distributions? submitted by /u/Particular_Emu1546 [link] [comments]  ( 55 min )
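    For anyone wanting to compare the two objectives concretely, a minimal sketch with 4 logits per side; note that torch's F.kl_div(input, target) computes KL(target || input), with input given as log-probabilities. Forward KL(p || q) is mean-covering, reverse KL(q || p) is mode-seeking.

        import torch
        import torch.nn.functional as F

        true_logits = torch.tensor([2.0, 0.1, -1.0, 0.5])   # ground-truth opponent logits
        model_logits = torch.randn(4, requires_grad=True)   # agent's guess

        log_p = F.log_softmax(true_logits, dim=-1)
        log_q = F.log_softmax(model_logits, dim=-1)

        forward_kl = F.kl_div(log_q, log_p.exp(), reduction="sum")  # KL(p || q): mean-covering
        reverse_kl = F.kl_div(log_p, log_q.exp(), reduction="sum")  # KL(q || p): mode-seeking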
  • Open

    [D] Can AI Music Tools Compete with Artists?
    Hello everyone, Are there any good AI music tools? Specifically, ones that can create EDM? I have seen a few like Boomy and score.ampermusic and was wondering if there are any better ones. We have all seen how good AI image generators like DeepAI are, and how they can compete with other artists in a competition. Do AI music tools have the same potential? Would we eventually see top hits made by AI? Curious to hear what you think and if there are any AI tools you might recommend I mess around with. Thanks! submitted by /u/Oblipher [link] [comments]  ( 57 min )
    [D] Training models on an Alienware Aurora R15?
    I'm looking at using some funding for a WFH computer. I'm planning to train some models and generally learn more about deep learning. I'm not interested in building my own machine (though if work allows, I will talk to Micro Center about what they can build for me). Work has a contract with Dell, so I'm looking at an Alienware Aurora R15, which has an Intel i9 13900K and an RTX 4090. Is there any reason to think this wouldn't be a solid machine, once I dual-boot Linux, to use for this purpose? I won't need the VRAM of an A6000 at home. submitted by /u/computing_professor [link] [comments]  ( 57 min )
    [D] Stable Diffusion 1 vs 2 - What you need to know
    Hey everyone! I wrote this quick summary of Stable Diffusion 1 vs 2 to distill all the important points down into one spot for people who haven't had time to keep up. Just dropping it here for anyone interested! https://preview.redd.it/v8r09ydu4b4a1.png?width=1151&format=png&auto=webp&s=b62ea88f08f66d8e686d06b8f3b465c3e1d778bc submitted by /u/SleekEagle [link] [comments]  ( 58 min )
    [D] Efficient shuffled data loading for video data
    I am planning on training on a huge video dataset, which might be stored on a separate server connected via network, and am wondering how slow data loading can be prevented. Since the videos are highly time-correlated, the frames batched for training should come from very different positions in the dataset. However, this conflicts with how video files are stored, with standard caching strategies that assume repeated access to "nearby" data, and with how simple network access would work (transfer the whole file, seek to a single frame, read it, return it). What are strategies to improve data loading performance for such cases? Is there a keyword that describes this problem? One simple idea would be to shuffle the whole dataset once and rewrite it into one or multiple long videos consisting of random frames, so that reading in sequence provides shuffled data. But this requires storing the dataset twice, e.g. if there is still a need to watch the original videos. submitted by /u/tensolution [link] [comments]  ( 63 min )
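    One concrete realization of the "shuffle once, read sequentially" idea from the post is to pre-shuffle a global frame index into fixed-size shards and re-encode each shard, so that training-time reads are purely sequential. A rough sketch, with all names hypothetical:

    ```python
    import random

    def build_shuffled_shards(frame_index, shard_size=4096, seed=0):
        """frame_index: list of (video_path, frame_number) pairs covering the dataset.
        Returns shards of globally shuffled frame references; writing each shard
        out as one file makes later training reads purely sequential."""
        rng = random.Random(seed)
        order = list(range(len(frame_index)))
        rng.shuffle(order)
        return [[frame_index[i] for i in order[k:k + shard_size]]
                for k in range(0, len(order), shard_size)]

    # Decoding each shard once and saving it (e.g., as a tensor file) trades
    # double storage for sequential I/O -- exactly the tradeoff noted above.
    ```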
    [D] What Image Labelling Services Allow Labelling of NSFW Images?
    What image labeling services exist that allow labeling of NSFW images? submitted by /u/AviatorPrints [link] [comments]  ( 55 min )
    [D] EACL 2023 Discussion
    This is the discussion for EACL 2023 reviews. submitted by /u/Harry_Superman [link] [comments]  ( 56 min )
  • Open

    Will You Find These Shortcuts?
    Posted by Katja Filippova, Research Scientist, and Sebastian Ebert, Software Engineer, Google Research, Brain team Modern machine learning models that learn to solve a task by going through many examples can achieve stellar performance when evaluated on a test set, but sometimes they are right for the “wrong” reasons: they make correct predictions but use information that appears irrelevant to the task. How can that be? One reason is that datasets on which models are trained contain artifacts that have no causal relationship with the correct label but are predictive of it. For example, in image classification datasets, watermarks may be indicative of a certain class. Or it can happen that all the pictures of dogs happen to be taken outside, against green grass, so a green background becomes…  ( 93 min )
  • Open

    Build a robust text-based toxicity predictor
    With the growth and popularity of online social platforms, people can stay more connected than ever through tools like instant messaging. However, this raises an additional concern about toxic speech, as well as cyber bullying, verbal harassment, or humiliation. Content moderation is crucial for promoting healthy online discussions and creating healthy online environments. To detect […]  ( 12 min )
  • Open

    DSC Weekly 6 December 2022 – Prompting Winter
    Here in the Pacific Northwest, winter days are short and chilly, if not as cold as it is further inland. A day ago, the fir trees were covered with snow, but a brief warm respite before the next incoming storm system has melted much of it. The post DSC Weekly 6 December 2022 – Prompting Winter appeared first on Data Science Central.  ( 24 min )
    Elements For a Successful Cloud Deployment
    The process of installing an application on the cloud using one or more models, such as software as a service (SaaS), platform as a service (PaaS), or infrastructure as a service (IaaS), so that it can leverage the functionality of the cloud is called cloud deployment. This process includes planning, implementation, architecting, and operating the workload on the… Read More »Elements For a Successful Cloud Deployment The post Elements For a Successful Cloud Deployment appeared first on Data Science Central.  ( 20 min )
    Microsoft Azure: IaaS vs PaaS: All You Need to Know
    Cloud computing has and continues to gain momentum in the market and understandably so. After all, this technology has made business operations significantly easier, especially through the agility, efficiency, and reliability it offers as compared to conventional on-premises IT environments. The growing popularity of cloud computing has put the focus squarely on the many such… Read More »Microsoft Azure: IaaS vs PaaS: All You Need to Know The post Microsoft Azure: IaaS vs PaaS: All You Need to Know appeared first on Data Science Central.  ( 19 min )
    Top 10 Metaverse Development Companies in 2023
    The development of the metaverse has gained immense popularity. It offers a vast virtual world where people can get real-life experiences like buying or selling products. Even tech giants like Meta and Google have invested in the development of Metaverse. According to Statista, the global Metaverse market size was $38.85 billion in 2021, and the… Read More »Top 10 Metaverse Development Companies in 2023 The post Top 10 Metaverse Development Companies in 2023 appeared first on Data Science Central.  ( 20 min )
    What will be the Top Blockchain Implementation Challenges In 2023?
    Blockchains are intriguing to investors because they are a new technology with the potential to drastically cut transaction costs. Blockchains allow for secure, direct transactions between an unknown number of users who may or may not trust one another.   With all the progress that has been made since blockchain’s inception, it can be easy to… Read More »What will be the Top Blockchain Implementation Challenges In 2023? The post What will be the Top Blockchain Implementation Challenges In 2023? appeared first on Data Science Central.  ( 20 min )
    How Blockchain Technology Revolutionizes Businesses to Boost Revenue
    Blockchain is one of the hot words this year, but it is also one of the least understood. Blockchain is a robust technical innovation that can do more for you than the cryptocurrency that often uses it. Understanding the benefits of blockchain technology is immensely helpful in determining whether it will benefit your business. From… Read More »How Blockchain Technology Revolutionizes Businesses to Boost Revenue The post How Blockchain Technology Revolutionizes Businesses to Boost Revenue appeared first on Data Science Central.  ( 21 min )
    FTX Implosion Highlights the Importance of Conversational AI
    The collapse of Sam Bankman-Fried’s crypto empire, FTX, has been stunning. Just a few weeks ago, the firm was one of the top players in the industry. Bankman-Fried was being compared to JP Morgan and Warren Buffett. He had the financial backing of some of the world’s top firms, like Sequoia and BlackRock. He ranked… Read More »FTX Implosion Highlights the Importance of Conversational AI The post FTX Implosion Highlights the Importance of Conversational AI appeared first on Data Science Central.  ( 20 min )
    Training Data to Employ AI in Healthcare
    As artificial intelligence (AI) becomes an increasingly important tool in health care, it offers unprecedented opportunities for improving patient outcomes, reducing costs, and impacting population health. There are many examples, including automation, delivering a simple synthesis of complex health information to patients, families, and caregivers, and providing recommendations and visualizations for shared decision-making among patients,… Read More »Training Data to Employ AI in Healthcare The post Training Data to Employ AI in Healthcare appeared first on Data Science Central.  ( 21 min )
    Could AI replace Google?
    Could AI replace Google? That’s a headline that you could not have imagined week ago, but last week, open AI released chatGPT which is a conversational chatbot based on GPT3 which could also function as a search engine. ChatGPT has gained a lot of traction and chatGPT is launched as a free preview for anyone,… Read More »Could AI replace Google? The post Could AI replace Google? appeared first on Data Science Central.  ( 19 min )
    Azure Synapse Analytics: Reasons Why You Need It
    Companies and businesses across the broad spectrum of industries today are growing and evolving at a rapid pace. Unfortunately, this growth is often hampered by a variety of challenges, especially those associated with data. These data-related challenges have led companies to look for a unified data platform to deliver real-time forecasting, better transparency regarding their… Read More »Azure Synapse Analytics: Reasons Why You Need It The post Azure Synapse Analytics: Reasons Why You Need It appeared first on Data Science Central.  ( 19 min )
  • Open

    The graph of a Neural Network
    Hi, I'm currently learning about neural networks, and almost every tutorial eventually shows a graph representing the network that was built. Now, I do not understand what that graph represents: if it is of a single neuron, then which one? And if it is of the whole network, how is it one line? Sure, the line goes through a lot of dimensions and shapes, but at the basics we only combine linear functions that are smoothed by other (activation) functions, so shouldn't we see multiple functions rather than one? Thanks. submitted by /u/Accomplished-Tree315 [link] [comments]  ( 55 min )
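    One way to see why such plots show a single line: the whole network, however wide, is one composed function from input to output, so for a 1-D input there is exactly one curve to draw. A small illustrative sketch:

    ```python
    # A 1-input, 1-output network with one hidden layer is, end to end,
    # a single scalar function f(x) -- hence one curve in the tutorials' plots.
    import torch
    import torch.nn as nn

    net = nn.Sequential(nn.Linear(1, 8), nn.Tanh(), nn.Linear(8, 1))

    xs = torch.linspace(-3, 3, 200).unsqueeze(1)   # 200 sample inputs
    with torch.no_grad():
        ys = net(xs)                               # values of the composed function

    # Each hidden neuron computes tanh(w*x + b); the output layer sums these
    # 8 smoothed lines into one number per input, which is the single curve.
    ```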
  • Open

    AI GENERATED: ART OR MATH?
    A new trend in technology has been arising lately; AI generated art. Hundreds of users across various platforms, notably Reddit, Discord…  ( 14 min )
  • Open

    Arbitrary precision math in gawk
    The idea of using awk for any math beyond basic arithmetic is kinda strange, and yet it has some nice features. Awk was designed for file munging, a task it does well with compact syntax. GNU awk (gawk) supports the original minimalist version of awk and adds more features. It supports arbitrary precision arithmetic by […] Arbitrary precision math in gawk first appeared on John D. Cook.  ( 5 min )
  • Open

    On Design Mining: Coevolution and Surrogate Models. (arXiv:1506.08781v6 [cs.NE] CROSS LISTED)
    Design mining is the use of computational intelligence techniques to iteratively search and model the attribute space of physical objects evaluated directly through rapid prototyping to meet given objectives. It enables the exploitation of novel materials and processes without formal models or complex simulation. In this paper, we focus upon the coevolutionary nature of the design process when it is decomposed into concurrent sub-design threads due to the overall complexity of the task. Using an abstract, tuneable model of coevolution we consider strategies to sample sub-thread designs for whole system testing and how best to construct and use surrogate models within the coevolutionary scenario. Drawing on our findings, the paper then describes the effective design of an array of six heterogeneous vertical-axis wind turbines.  ( 2 min )
    PARTIME: Scalable and Parallel Processing Over Time with Deep Neural Networks. (arXiv:2210.09147v2 [cs.LG] UPDATED)
    In this paper, we present PARTIME, a software library written in Python and based on PyTorch, designed specifically to speed up neural networks whenever data is continuously streamed over time, for both learning and inference. Existing libraries are designed to exploit data-level parallelism, assuming that samples are batched, a condition that is not naturally met in applications that are based on streamed data. Differently, PARTIME starts processing each data sample at the time in which it becomes available from the stream. PARTIME wraps the code that implements a feed-forward multi-layer network and it distributes the layer-wise processing among multiple devices, such as Graphics Processing Units (GPUs). Thanks to its pipeline-based computational scheme, PARTIME allows the devices to perform computations in parallel. At inference time this results in scaling capabilities that are theoretically linear with respect to the number of devices. During the learning stage, PARTIME can leverage the non-i.i.d. nature of the streamed data with samples that are smoothly evolving over time for efficient gradient computations. Experiments are performed in order to empirically compare PARTIME with classic non-parallel neural computations in online learning, distributing operations on up to 8 NVIDIA GPUs, showing significant speedups that are almost linear in the number of devices, mitigating the impact of the data transfer overhead.  ( 2 min )
    HiClass: a Python library for local hierarchical classification compatible with scikit-learn. (arXiv:2112.06560v7 [cs.LG] UPDATED)
    HiClass is an open-source Python library for local hierarchical classification entirely compatible with scikit-learn. It contains implementations of the most common design patterns for hierarchical machine learning models found in the literature, i.e., the local classifiers per node, per parent node and per level. Additionally, the package contains implementations of hierarchical metrics, which are more appropriate for evaluating classification performance on hierarchical data. The documentation includes installation and usage instructions, examples within tutorials and interactive notebooks, and a complete description of the API. HiClass is released under the simplified BSD license, encouraging its use in both academic and commercial environments. Source code and documentation are available at https://github.com/mirand863/hiclass.  ( 2 min )
    A Bayesian Framework for Digital Twin-Based Control, Monitoring, and Data Collection in Wireless Systems. (arXiv:2212.01351v1 [eess.SP])
    Commonly adopted in the manufacturing and aerospace sectors, digital twin (DT) platforms are increasingly seen as a promising paradigm to control, monitor, and analyze software-based, "open", communication systems. Notably, DT platforms provide a sandbox in which to test artificial intelligence (AI) solutions for communication systems, potentially reducing the need to collect data and test algorithms in the field, i.e., on the physical twin (PT). A key challenge in the deployment of DT systems is to ensure that virtual control optimization, monitoring, and analysis at the DT are safe and reliable, avoiding incorrect decisions caused by "model exploitation". To address this challenge, this paper presents a general Bayesian framework with the aim of quantifying and accounting for model uncertainty at the DT that is caused by limitations in the amount and quality of data available at the DT from the PT. In the proposed framework, the DT builds a Bayesian model of the communication system, which is leveraged to enable core DT functionalities such as control via multi-agent reinforcement learning (MARL), monitoring of the PT for anomaly detection, prediction, data-collection optimization, and counterfactual analysis. To exemplify the application of the proposed framework, we specifically investigate a case-study system encompassing multiple sensing devices that report to a common receiver. Experimental results validate the effectiveness of the proposed Bayesian framework as compared to standard frequentist model-based solutions.  ( 2 min )
    Taking a Step Back with KCal: Multi-Class Kernel-Based Calibration for Deep Neural Networks. (arXiv:2202.07679v2 [stat.ML] UPDATED)
    Deep neural network (DNN) classifiers are often overconfident, producing miscalibrated class probabilities. Most existing calibration methods either lack theoretical guarantees for producing calibrated outputs or reduce the classification accuracy in the process. This paper proposes a new Kernel-based calibration method called KCal. Unlike other calibration procedures, KCal does not operate directly on the logits or softmax outputs of the DNN. Instead, it uses the penultimate-layer latent embedding to train a metric space in a supervised manner. In effect, KCal amounts to a supervised dimensionality reduction of the neural network embedding, and generates a prediction using kernel density estimation on a holdout calibration set. We first analyze KCal theoretically, showing that it enjoys a provable asymptotic calibration guarantee. Then, through extensive experiments, we confirm that KCal consistently outperforms existing calibration methods in terms of both the classification accuracy and the (confidence and class-wise) calibration error.  ( 2 min )
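    As a rough illustration of the prediction step (a simplification, not the authors' implementation), classification with kernel density estimation over a holdout calibration set amounts to kernel-weighted voting in the learned embedding space:

    ```python
    import numpy as np

    def kde_class_probs(query_emb, calib_embs, calib_labels, n_classes, bandwidth=1.0):
        """calib_embs: penultimate-layer embeddings of a holdout calibration set
        (after the learned metric projection); calib_labels: integer labels."""
        d2 = ((calib_embs - query_emb) ** 2).sum(axis=1)    # squared distances
        w = np.exp(-d2 / (2.0 * bandwidth ** 2))            # Gaussian kernel weights
        probs = np.bincount(calib_labels, weights=w, minlength=n_classes)
        return probs / probs.sum()                          # class probabilities

    # Illustrative usage with stand-in data:
    probs = kde_class_probs(np.zeros(16), np.random.randn(100, 16),
                            np.random.randint(0, 5, size=100), n_classes=5)
    ```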
    Local approximation of operators. (arXiv:2202.06392v2 [math.NA] UPDATED)
    Many applications, such as system identification, classification of time series, direct and inverse problems in partial differential equations, and uncertainty quantification lead to the question of approximation of a non-linear operator between metric spaces $\mathfrak{X}$ and $\mathfrak{Y}$. We study the problem of determining the degree of approximation of such operators on a compact subset $K_\mathfrak{X}\subset \mathfrak{X}$ using a finite amount of information. If $\mathcal{F}: K_\mathfrak{X}\to K_\mathfrak{Y}$, a well-established strategy to approximate $\mathcal{F}(F)$ for some $F\in K_\mathfrak{X}$ is to encode $F$ (respectively, $\mathcal{F}(F)$) in terms of a finite number $d$ (respectively, $m$) of real numbers. Together with appropriate reconstruction algorithms (decoders), the problem reduces to the approximation of $m$ functions on a compact subset of a high dimensional Euclidean space $\mathbb{R}^d$, equivalently, the unit sphere $\mathbb{S}^d$ embedded in $\mathbb{R}^{d+1}$. The problem is challenging because $d$, $m$, as well as the complexity of the approximation on $\mathbb{S}^d$ are all large, and it is necessary to estimate the accuracy keeping track of the inter-dependence of all the approximations involved. In this paper, we establish constructive methods to do this efficiently; i.e., with the constants involved in the estimates on the approximation on $\mathbb{S}^d$ being $\mathcal{O}(d^{1/6})$. We study different smoothness classes for the operators, and also propose a method for approximation of $\mathcal{F}(F)$ using only information in a small neighborhood of $F$, resulting in an effective reduction in the number of parameters involved.  ( 2 min )
    Comparative study of machine learning and deep learning methods on ASD classification. (arXiv:2209.08601v2 [eess.IV] UPDATED)
    The autism dataset is studied to identify the differences between autistic and healthy groups. For this, the resting-state Functional Magnetic Resonance Imaging (rs-fMRI) data of the two groups are analyzed, and networks of connections between brain regions were created. Several classification frameworks are developed to distinguish the connectivity patterns between the groups. The best models for statistical inference and precision were compared, and the tradeoff between precision and model interpretability was analyzed. Finally, the classification accuracy measures were reported to justify the performance of our framework. Our best model can classify autistic and healthy patients on the multisite ABIDE I data with 71% accuracy.  ( 2 min )
    Machine Learning in Aerodynamic Shape Optimization. (arXiv:2202.07141v2 [cs.LG] UPDATED)
    Machine learning (ML) has been increasingly used to aid aerodynamic shape optimization (ASO), thanks to the availability of aerodynamic data and continued developments in deep learning. We review the applications of ML in ASO to date and provide a perspective on the state-of-the-art and future directions. We first introduce conventional ASO and current challenges. Next, we introduce ML fundamentals and detail ML algorithms that have been successful in ASO. Then, we review ML applications to ASO addressing three aspects: compact geometric design space, fast aerodynamic analysis, and efficient optimization architecture. In addition to providing a comprehensive summary of the research, we comment on the practicality and effectiveness of the developed methods. We show how cutting-edge ML approaches can benefit ASO and address challenging demands, such as interactive design optimization. Practical large-scale design optimizations remain a challenge because of the high cost of ML training. Further research on coupling ML model construction with prior experience and knowledge, such as physics-informed ML, is recommended to solve large-scale ASO problems.  ( 2 min )
    Personalized Federated Learning via Heterogeneous Modular Networks. (arXiv:2210.14830v2 [cs.LG] UPDATED)
    Personalized Federated Learning (PFL) which collaboratively trains a federated model while considering local clients under privacy constraints has attracted much attention. Despite its popularity, it has been observed that existing PFL approaches result in sub-optimal solutions when the joint distribution among local clients diverges. To address this issue, we present Federated Modular Network (FedMN), a novel PFL approach that adaptively selects sub-modules from a module pool to assemble heterogeneous neural architectures for different clients. FedMN adopts a light-weighted routing hypernetwork to model the joint distribution on each client and produce the personalized selection of the module blocks for each client. To reduce the communication burden in existing FL, we develop an efficient way to interact between the clients and the server. We conduct extensive experiments on the real-world test beds and the results show both the effectiveness and efficiency of the proposed FedMN over the baselines.
    Loss shaping enhances exact gradient learning with EventProp in Spiking Neural Networks. (arXiv:2212.01232v1 [cs.NE])
    In a recent paper Wunderlich and Pehle introduced the EventProp algorithm that enables training spiking neural networks by gradient descent on exact gradients. In this paper we present extensions of EventProp to support a wider class of loss functions and an implementation in the GPU enhanced neuronal networks framework which exploits sparsity. The GPU acceleration allows us to test EventProp extensively on more challenging learning benchmarks. We find that EventProp performs well on some tasks but for others there are issues where learning is slow or fails entirely. Here, we analyse these issues in detail and discover that they relate to the use of the exact gradient of the loss function, which by its nature does not provide information about loss changes due to spike creation or spike deletion. Depending on the details of the task and loss function, descending the exact gradient with EventProp can lead to the deletion of important spikes and so to an inadvertent increase of the loss and decrease of classification accuracy and hence a failure to learn. In other situations the lack of knowledge about the benefits of creating additional spikes can lead to a lack of gradient flow into earlier layers, slowing down learning. We eventually present a first glimpse of a solution to these problems in the form of `loss shaping', where we introduce a suitable weighting function into an integral loss to increase gradient flow from the output layer towards earlier layers.  ( 2 min )
    OOG- Optuna Optimized GAN Sampling Technique for Tabular Imbalanced Malware Data. (arXiv:2212.01274v1 [cs.CR])
    Cyberspace occupies a large portion of people's lives in the age of modern technology, and while there are those who utilize it for good, there are also those who do not. Malware is an application whose construction was not motivated by a benign goal: it can harm, steal, or even alter personal information and compromise secure applications and software. Thus, there are numerous techniques to counter malware, one of which is to generate new malware samples so that a detection system can be updated as the number of malware variants grows, allowing it to recognize when malware attempts to enter. The Generative Adversarial Network (GAN) sampling technique has been used in this study to generate new malware samples. GANs have multiple variants, and in order to determine which variant is optimal for a given dataset, their parameters must be tuned. This study employs Optuna, an autonomous hyperparameter tuning algorithm, to determine the optimal settings for the dataset under consideration. The architecture of the resulting Optuna Optimized GAN (OOG) method is presented, along with scores of 98.06%, 99.00%, 97.23%, and 98.04% for accuracy, precision, recall, and F1 score, respectively. After tuning the hyperparameters of five supervised boosting algorithms (XGBoost, LightGBM, CatBoost, Extra Trees Classifier, and Gradient Boosting Classifier), the paper additionally employs a weighted ensemble technique to obtain this result. In addition to comparing existing efforts in this domain, the study demonstrates how promising GANs are in comparison to other sampling techniques such as SMOTE.  ( 2 min )
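    For readers unfamiliar with Optuna, the search loop in such a setup typically looks like the sketch below (illustrative only; train_gan_and_score is a hypothetical stand-in for training a GAN variant and scoring the downstream classifier, not the paper's code):

    ```python
    import optuna

    def objective(trial):
        params = {
            "lr": trial.suggest_float("lr", 1e-5, 1e-2, log=True),
            "latent_dim": trial.suggest_int("latent_dim", 32, 256),
            "n_critic": trial.suggest_int("n_critic", 1, 5),
            "variant": trial.suggest_categorical("variant", ["vanilla", "wgan", "wgan-gp"]),
        }
        # Hypothetical helper: trains the chosen GAN variant on the tabular
        # malware data and returns a validation F1 score to maximize.
        return train_gan_and_score(**params)

    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=50)
    print(study.best_params)
    ```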
    Prediction of geophysical properties of rocks on rare well data and attributes of seismic waves by machine learning methods on the example of the Achimov formation. (arXiv:2106.13274v2 [physics.geo-ph] UPDATED)
    The purpose of this research is to forecast the development of sand bodies in productive sediments based on well log data and seismic attributes. The object of the study is the productive intervals of the Achimov sedimentary complex in a part of an oil field located in Western Siberia. The research presents a technological stack of machine learning algorithms, methods for enriching the source data with synthetic samples, and algorithms for creating new features. The result is a regression model relating the natural radioactivity of rocks to seismic wave field attributes with acceptable prediction quality. The acceptable quality of the forecast is confirmed both by model cross-validation and by data obtained from the results of a new well.  ( 2 min )
    Out of Distribution Detection via Neural Network Anchoring. (arXiv:2207.04125v2 [cs.LG] UPDATED)
    Our goal in this paper is to exploit heteroscedastic temperature scaling as a calibration strategy for out of distribution (OOD) detection. Heteroscedasticity here refers to the fact that the optimal temperature parameter for each sample can be different, as opposed to conventional approaches that use the same value for the entire distribution. To enable this, we propose a new training strategy called anchoring that can estimate appropriate temperature values for each sample, leading to state-of-the-art OOD detection performance across several benchmarks. Using NTK theory, we show that this temperature function estimate is closely linked to the epistemic uncertainty of the classifier, which explains its behavior. In contrast to some of the best-performing OOD detection approaches, our method does not require exposure to additional outlier datasets, custom calibration objectives, or model ensembling. Through empirical studies with different OOD detection settings -- far OOD, near OOD, and semantically coherent OOD -- we establish a highly effective OOD detection approach. Code to reproduce our results is available at github.com/LLNL/AMP
    Adversarial De-confounding in Individualised Treatment Effects Estimation. (arXiv:2210.10530v2 [cs.LG] UPDATED)
    Observational studies have recently received significant attention from the machine learning community due to the increasingly available non-experimental observational data and the limitations of the experimental studies, such as considerable cost, impracticality, small and less representative sample sizes, etc. In observational studies, de-confounding is a fundamental problem of individualised treatment effects (ITE) estimation. This paper proposes disentangled representations with adversarial training to selectively balance the confounders in the binary treatment setting for the ITE estimation. The adversarial training of treatment policy selectively encourages treatment-agnostic balanced representations for the confounders and helps to estimate the ITE in the observational studies via counterfactual inference. Empirical results on synthetic and real-world datasets, with varying degrees of confounding, prove that our proposed approach improves the state-of-the-art methods in achieving lower error in the ITE estimation.
    Preliminary Study on SSCF-derived Polar Coordinate for ASR. (arXiv:2212.01245v1 [eess.AS])
    The transition angles are defined to describe the vowel-to-vowel transitions in the acoustic space of the Spectral Subband Centroids, and the findings show that they are similar among speakers and speaking rates. In this paper, we propose to investigate the usage of polar coordinates in favor of angles to describe a speech signal by characterizing its acoustic trajectory and using them in Automatic Speech Recognition. According to the experimental results evaluated on the BRAF100 dataset, the polar coordinates achieved significantly higher accuracy than the angles in the mixed and cross-gender speech recognitions, demonstrating that these representations are superior at defining the acoustic trajectory of the speech signal. Furthermore, the accuracy was significantly improved when they were utilized with their first and second-order derivatives ($\Delta$, $\Delta$$\Delta$), especially in cross-female recognition. However, the results showed they were not much more gender-independent than the conventional Mel-frequency Cepstral Coefficients (MFCCs).  ( 2 min )
    Connect, Not Collapse: Explaining Contrastive Learning for Unsupervised Domain Adaptation. (arXiv:2204.00570v4 [cs.LG] UPDATED)
    We consider unsupervised domain adaptation (UDA), where labeled data from a source domain (e.g., photographs) and unlabeled data from a target domain (e.g., sketches) are used to learn a classifier for the target domain. Conventional UDA methods (e.g., domain adversarial training) learn domain-invariant features to improve generalization to the target domain. In this paper, we show that contrastive pre-training, which learns features on unlabeled source and target data and then fine-tunes on labeled source data, is competitive with strong UDA methods. However, we find that contrastive pre-training does not learn domain-invariant features, diverging from conventional UDA intuitions. We show theoretically that contrastive pre-training can learn features that vary substantially across domains but still generalize to the target domain, by disentangling domain and class information. Our results suggest that domain invariance is not necessary for UDA. We empirically validate our theory on benchmark vision datasets.
    Autoencoding with a Classifier System. (arXiv:1910.10579v8 [cs.NE] CROSS LISTED)
    Autoencoders are data-specific compression algorithms learned automatically from examples. The predominant approach has been to construct single large global models that cover the domain. However, training and evaluating models of increasing size comes at the price of additional time and computational cost. Conditional computation, sparsity, and model pruning techniques can reduce these costs while maintaining performance. Learning classifier systems (LCS) are a framework for adaptively subdividing input spaces into an ensemble of simpler local approximations that together cover the domain. LCS perform conditional computation through the use of a population of individual gating/guarding components, each associated with a local approximation. This article explores the use of an LCS to adaptively decompose the input domain into a collection of small autoencoders where local solutions of different complexity may emerge. In addition to benefits in convergence time and computational cost, it is shown possible to reduce code size as well as the resulting decoder computational cost when compared with the global model equivalent.
    Invariant Representations with Stochastically Quantized Neural Networks. (arXiv:2208.02656v2 [cs.LG] UPDATED)
    Representation learning algorithms offer the opportunity to learn invariant representations of the input data with regard to nuisance factors. Many authors have leveraged such strategies to learn fair representations, i.e., vectors where information about sensitive attributes is removed. These methods are attractive as they may be interpreted as minimizing the mutual information between a neural layer's activations and a sensitive attribute. However, the theoretical grounding of such methods relies either on the computation of infinitely accurate adversaries or on minimizing a variational upper bound of a mutual information estimate. In this paper, we propose a methodology for direct computation of the mutual information between a neural layer and a sensitive attribute. We employ stochastically-activated binary neural networks, which lets us treat neurons as random variables. We are then able to compute (not bound) the mutual information between a layer and a sensitive attribute and use this information as a regularization factor during gradient descent. We show that this method compares favorably with the state of the art in fair representation learning and that the learned representations display a higher level of invariance compared to full-precision neural networks.
    The effect of speech pathology on automatic speaker verification -- a large-scale study. (arXiv:2204.06450v2 [cs.SD] UPDATED)
    With the advancements in deep learning (DL) and an increasing interest in data-driven speech processing methods, there is a major challenge in accessing pathological speech data. Public challenge data offers a potential remedy for this but may expose patient health information via re-identification attacks. Therefore, we investigate in this study whether or not pathological speech is more vulnerable to such re-identification than healthy speech. Our study is the first large-scale investigation of the effects of different speech pathologies on automatic speaker verification (ASV), using a real-world pathological speech corpus of more than 2,000 test subjects of various ages with different speech and voice disorders. Utilizing a DL-based ASV method, we obtained a mean equal error rate (EER) of 0.89% with a standard deviation of 0.06%, which is a factor of three lower than for comparable healthy speech databases. We further perform detailed analyses of external factors influencing ASV, such as age, pathology, recording environment, utterance length, and intelligibility, to explore their respective effects. Our experiments indicate that some types of speech pathology, in particular dysphonia, regardless of speech intelligibility, are more vulnerable to a breach of privacy compared to healthy speech. We also observe that the effect of pathology lies in the range of other factors, such as age, microphone, and recording environment.
    Generative Toolkit for Scientific Discovery. (arXiv:2207.03928v3 [cs.LG] UPDATED)
    With the growing availability of data within various scientific domains, generative models hold enormous potential to accelerate scientific discovery. They harness powerful representations learned from datasets to speed up the formulation of novel hypotheses with the potential to impact material discovery broadly. We present the Generative Toolkit for Scientific Discovery (GT4SD). This extensible open-source library enables scientists, developers, and researchers to train and use state-of-the-art generative models to accelerate scientific discovery focused on material design.
    Single Model Uncertainty Estimation via Stochastic Data Centering. (arXiv:2207.07235v2 [cs.LG] UPDATED)
    We are interested in estimating the uncertainties of deep neural networks, which play an important role in many scientific and engineering problems. In this paper, we present a striking new finding that an ensemble of neural networks with the same weight initialization, trained on datasets that are shifted by a constant bias, gives rise to slightly inconsistent trained models, where the differences in predictions are a strong indicator of epistemic uncertainties. Using the neural tangent kernel (NTK), we demonstrate that this phenomenon occurs in part because the NTK is not shift-invariant. Since this is achieved via a trivial input transformation, we show that this behavior can therefore be approximated by training a single neural network -- using a technique that we call $\Delta-$UQ -- that estimates uncertainty around a prediction by marginalizing out the effect of the biases during inference. We show that $\Delta-$UQ's uncertainty estimates are superior to many of the current methods on a variety of benchmarks -- outlier rejection, calibration under distribution shift, and sequential design optimization of black box functions. Code for $\Delta-$UQ can be accessed at https://github.com/LLNL/DeltaUQ
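    A heavily simplified sketch of the anchoring idea as described (the repository above is authoritative, so treat this as an assumption-laden illustration): the network sees an anchored input [x - c, c] for a random anchor c, and uncertainty at inference comes from varying the anchor:

    ```python
    import torch
    import torch.nn as nn

    class AnchoredNet(nn.Module):
        """Takes (x - anchor, anchor) so a single model can marginalize anchors."""
        def __init__(self, dim_in, dim_out):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(2 * dim_in, 64), nn.ReLU(),
                                     nn.Linear(64, dim_out))

        def forward(self, x, anchor):
            return self.net(torch.cat([x - anchor, anchor], dim=-1))

    @torch.no_grad()
    def predict_with_uncertainty(model, x, anchors):
        # Stack predictions under different anchors (constant input shifts);
        # their spread serves as the epistemic uncertainty estimate.
        preds = torch.stack([model(x, a.expand_as(x)) for a in anchors])
        return preds.mean(0), preds.std(0)
    ```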
    One-Shot Learning of Stochastic Differential Equations with Data Adapted Kernels. (arXiv:2209.12086v3 [stat.ML] UPDATED)
    We consider the problem of learning Stochastic Differential Equations of the form $dX_t = f(X_t)dt+\sigma(X_t)dW_t $ from one sample trajectory. This problem is more challenging than learning deterministic dynamical systems because one sample trajectory only provides indirect information on the unknown functions $f$, $\sigma$, and stochastic process $dW_t$ representing the drift, the diffusion, and the stochastic forcing terms, respectively. We propose a method that combines Computational Graph Completion and data adapted kernels learned via a new variant of cross validation. Our approach can be decomposed as follows: (1) Represent the time-increment map $X_t \rightarrow X_{t+dt}$ as a Computational Graph in which $f$, $\sigma$ and $dW_t$ appear as unknown functions and random variables. (2) Complete the graph (approximate unknown functions and random variables) via Maximum a Posteriori Estimation (given the data) with Gaussian Process (GP) priors on the unknown functions. (3) Learn the covariance functions (kernels) of the GP priors from data with randomized cross-validation. Numerical experiments illustrate the efficacy, robustness, and scope of our method.
    Inference of Media Bias and Content Quality Using Natural-Language Processing. (arXiv:2212.00237v1 [physics.soc-ph] CROSS LISTED)
    Media bias can significantly impact the formation and development of opinions and sentiments in a population. It is thus important to study the emergence and development of partisan media and political polarization. However, it is challenging to quantitatively infer the ideological positions of media outlets. In this paper, we present a quantitative framework to infer both political bias and content quality of media outlets from text, and we illustrate this framework with empirical experiments with real-world data. We apply a bidirectional long short-term memory (LSTM) neural network to a data set of more than 1 million tweets to generate a two-dimensional ideological-bias and content-quality measurement for each tweet. We then infer a ``media-bias chart'' of (bias, quality) coordinates for the media outlets by integrating the (bias, quality) measurements of the tweets of the media outlets. We also apply a variety of baseline machine-learning methods, such as a naive-Bayes method and a support-vector machine (SVM), to infer the bias and quality values for each tweet. All of these baseline approaches are based on a bag-of-words approach. We find that the LSTM-network approach has the best performance of the examined methods. Our results illustrate the importance of leveraging word order into machine-learning methods in text analysis.
    Eye-tracking based classification of Mandarin Chinese readers with and without dyslexia using neural sequence models. (arXiv:2210.09819v2 [cs.CL] UPDATED)
    Eye movements are known to reflect cognitive processes in reading, and psychological reading research has shown that eye gaze patterns differ between readers with and without dyslexia. In recent years, researchers have attempted to classify readers with dyslexia based on their eye movements using Support Vector Machines (SVMs). However, these approaches (i) are based on highly aggregated features averaged over all words read by a participant, thus disregarding the sequential nature of the eye movements, and (ii) do not consider the linguistic stimulus and its interaction with the reader's eye movements. In the present work, we propose two simple sequence models that process eye movements on the entire stimulus without the need of aggregating features across the sentence. Additionally, we incorporate the linguistic stimulus into the model in two ways -- contextualized word embeddings and manually extracted linguistic features. The models are evaluated on a Mandarin Chinese dataset containing eye movements from children with and without dyslexia. Our results show that (i) even for a logographic script such as Chinese, sequence models are able to classify dyslexia on eye gaze sequences, reaching state-of-the-art performance, and (ii) incorporating the linguistic stimulus does not help to improve classification performance.
    GADMSL: Graph Anomaly Detection on Attributed Networks via Multi-scale Substructure Learning. (arXiv:2211.15255v2 [cs.LG] UPDATED)
    Recently, graph anomaly detection has attracted increasing attention in the data mining and machine learning communities. Beyond attribute anomalies, graph anomaly detection also captures suspicious topologically abnormal nodes that differ from their majority counterparts. Although many graph-based detection approaches have been proposed, most of them focus on node-level comparison while paying insufficient attention to the surrounding topology structures. Nodes with more dissimilar neighborhood substructures are more likely to be abnormal. To enhance the local substructure detection ability, we propose a novel Graph Anomaly Detection framework via Multi-scale Substructure Learning (GADMSL for short). Unlike previous algorithms, we manage to capture anomalous substructures where the inner similarities are relatively low in densely connected regions. Specifically, we adopt a region proposal module to find high-density substructures in the network as suspicious regions. Their inner-node embedding similarities indicate the anomaly degree of the detected substructures. Generally, a lower degree of embedding similarity means a higher probability that the substructure contains topology anomalies. To distill better embeddings of node attributes, we further introduce a graph contrastive learning scheme, which observes attribute anomalies in the meantime. In this way, GADMSL can detect both topology and attribute anomalies. Ultimately, extensive experiments on benchmark datasets show that GADMSL greatly improves detection performance (up to 7.30% AUC and 17.46% AUPRC gains) compared to state-of-the-art attributed-network anomaly detection algorithms.
    Sample Complexity of Automata Cascades. (arXiv:2211.14028v2 [cs.FL] UPDATED)
    Every automaton can be decomposed into a cascade of basic automata. This is the Prime Decomposition Theorem by Krohn and Rhodes. We show that cascades allow for describing the sample complexity of automata in terms of their components. In particular, we show that the sample complexity is linear in the number of components and the maximum complexity of a single component, modulo logarithmic factors. This opens to the possibility of learning automata representing large dynamical systems consisting of many parts interacting with each other. It is in sharp contrast with the established understanding of the sample complexity of automata, described in terms of the overall number of states and input letters, which implies that it is only possible to learn automata where the number of states is linear in the amount of data available. Instead our results show that one can learn automata with a number of states that is exponential in the amount of data available.
    Distilling Model Failures as Directions in Latent Space. (arXiv:2206.14754v2 [cs.LG] UPDATED)
    Existing methods for isolating hard subpopulations and spurious correlations in datasets often require human intervention. This can make these methods labor-intensive and dataset-specific. To address these shortcomings, we present a scalable method for automatically distilling a model's failure modes. Specifically, we harness linear classifiers to identify consistent error patterns, and, in turn, induce a natural representation of these failure modes as directions within the feature space. We demonstrate that this framework allows us to discover and automatically caption challenging subpopulations within the training dataset. Moreover, by combining our framework with off-the-shelf diffusion models, we can generate images that are especially challenging for the analyzed model, and thus can be used to perform synthetic data augmentation that helps remedy the model's failure modes. Code available at https://github.com/MadryLab/failure-directions
    Embedding Synthetic Off-Policy Experience for Autonomous Driving via Zero-Shot Curricula. (arXiv:2212.01375v1 [cs.RO])
    ML-based motion planning is a promising approach to produce agents that exhibit complex behaviors, and automatically adapt to novel environments. In the context of autonomous driving, it is common to treat all available training data equally. However, this approach produces agents that do not perform robustly in safety-critical settings, an issue that cannot be addressed by simply adding more data to the training set - we show that an agent trained using only a 10% subset of the data performs just as well as an agent trained on the entire dataset. We present a method to predict the inherent difficulty of a driving situation given data collected from a fleet of autonomous vehicles deployed on public roads. We then demonstrate that this difficulty score can be used in a zero-shot transfer to generate curricula for an imitation-learning based planning agent. Compared to training on the entire unbiased training dataset, we show that prioritizing difficult driving scenarios both reduces collisions by 15% and increases route adherence by 14% in closed-loop evaluation, all while using only 10% of the training data.
    Subspace clustering in high-dimensions: Phase transitions & Statistical-to-Computational gap. (arXiv:2205.13527v2 [stat.ML] UPDATED)
    A simple model to study subspace clustering is the high-dimensional $k$-Gaussian mixture model where the cluster means are sparse vectors. Here we provide an exact asymptotic characterization of the statistically optimal reconstruction error in this model in the high-dimensional regime with extensive sparsity, i.e. when the fraction of non-zero components of the cluster means $\rho$, as well as the ratio $\alpha$ between the number of samples and the dimension are fixed, while the dimension diverges. We identify the information-theoretic threshold below which obtaining a positive correlation with the true cluster means is statistically impossible. Additionally, we investigate the performance of the approximate message passing (AMP) algorithm analyzed via its state evolution, which is conjectured to be optimal among polynomial-time algorithms for this task. In particular, we identify the existence of a statistical-to-computational gap between the algorithmic threshold, at which a signal-to-noise ratio $\lambda_{\text{alg}} \ge k / \sqrt{\alpha}$ is required to perform better than random, and the information-theoretic threshold at $\lambda_{\text{it}} \approx \sqrt{-k \rho \log{\rho}} / \sqrt{\alpha}$. Finally, we discuss the case of sub-extensive sparsity $\rho$ by comparing the performance of the AMP with other sparsity-enhancing algorithms, such as sparse-PCA and diagonal thresholding.
    Scene Editing as Teleoperation: A Case Study in 6DoF Kit Assembly. (arXiv:2110.04450v4 [cs.RO] UPDATED)
    Studies in robot teleoperation have been centered around action specifications -- from continuous joint control to discrete end-effector pose control. However, these robot-centric interfaces often require skilled operators with extensive robotics expertise. To make teleoperation accessible to non-expert users, we propose the framework "Scene Editing as Teleoperation" (SEaT), where the key idea is to transform the traditional "robot-centric" interface into a "scene-centric" interface -- instead of controlling the robot, users focus on specifying the task's goal by manipulating digital twins of the real-world objects. As a result, a user can perform teleoperation without any expert knowledge of the robot hardware. To achieve this goal, we utilize a category-agnostic scene-completion algorithm that translates the real-world workspace (with unknown objects) into a manipulable virtual scene representation and an action-snapping algorithm that refines the user input before generating the robot's action plan. To train the algorithms, we procedurally generated a large-scale, diverse kit-assembly dataset that contains object-kit pairs that mimic real-world object-kitting tasks. Our experiments in simulation and on a real-world system demonstrate that our framework improves both the efficiency and success rate for 6DoF kit-assembly tasks. A user study demonstrates that SEaT framework participants achieve a higher task success rate and report a lower subjective workload compared to an alternative robot-centric interface. Video can be found at https://www.youtube.com/watch?v=-NdR3mkPbQQ .
    Do Invariances in Deep Neural Networks Align with Human Perception?. (arXiv:2111.14726v4 [cs.CV] UPDATED)
    An evaluation criterion for safe and trustworthy deep learning is how well the invariances captured by representations of deep neural networks (DNNs) are shared with humans. We identify challenges in measuring these invariances. Prior works used gradient-based methods to generate identically represented inputs (IRIs), i.e., inputs which have identical representations (on a given layer) of a neural network, and thus capture invariances of a given network. One necessary criterion for a network's invariances to align with human perception is for its IRIs to look 'similar' to humans. Prior works, however, have mixed takeaways; some argue that later layers of DNNs do not learn human-like invariances (\cite{jenelle2019metamers}) yet others seem to indicate otherwise (\cite{mahendran2014understanding}). We argue that the loss function used to generate IRIs can heavily affect takeaways about invariances of the network and is the primary reason for these conflicting findings. We propose an adversarial regularizer on the IRI generation loss that finds IRIs that make any model appear to have very little shared invariance with humans. Based on this evidence, we argue that there is scope for improving models to have human-like invariances, and further, that to make meaningful comparisons between models one should use IRIs generated using the regularizer-free loss. We then conduct an in-depth investigation of how different components (e.g., architectures, training losses, data augmentations) of the deep learning pipeline contribute to learning models that have good alignment with humans. We find that architectures with residual connections trained using a (self-supervised) contrastive loss with $\ell_p$ ball adversarial data augmentation tend to learn invariances that are most aligned with humans. Code: \url{github.com/nvedant07/Human-NN-Alignment}.
    Integrating Reward Maximization and Population Estimation: Sequential Decision-Making for Internal Revenue Service Audit Selection. (arXiv:2204.11910v2 [cs.LG] UPDATED)
    We introduce a new setting, optimize-and-estimate structured bandits. Here, a policy must select a batch of arms, each characterized by its own context, that would allow it to both maximize reward and maintain an accurate (ideally unbiased) population estimate of the reward. This setting is inherent to many public and private sector applications and often requires handling delayed feedback, small data, and distribution shifts. We demonstrate its importance on real data from the United States Internal Revenue Service (IRS). The IRS performs yearly audits of the tax base. Two of its most important objectives are to identify suspected misreporting and to estimate the "tax gap" -- the global difference between the amount paid and true amount owed. Based on a unique collaboration with the IRS, we cast these two processes as a unified optimize-and-estimate structured bandit. We analyze optimize-and-estimate approaches to the IRS problem and propose a novel mechanism for unbiased population estimation that achieves rewards comparable to baseline approaches. This approach has the potential to improve audit efficacy, while maintaining policy-relevant estimates of the tax gap. This has important social consequences given that the current tax gap is estimated at nearly half a trillion dollars. We suggest that this problem setting is fertile ground for further research and we highlight its interesting challenges. The results of this and related research are currently being incorporated into the continual improvement of the IRS audit selection methods.
    Information Compression and Performance Evaluation of Tic-Tac-Toe's Evaluation Function Using Singular Value Decomposition. (arXiv:2207.02449v5 [cs.LG] UPDATED)
    We approximated the evaluation function for the game Tic-Tac-Toe by singular value decomposition (SVD) and investigated the effect of approximation accuracy on winning rate. We first prepared the perfect evaluation function of Tic-Tac-Toe and performed low-rank approximation by considering the evaluation function as a ninth-order tensor. We found that we can reduce the amount of information of the evaluation function by 70% without significantly degrading the performance. Approximation accuracy and winning rate were strongly correlated but not perfectly proportional. We also investigated how the decomposition method of the evaluation function affects the performance. We considered two decomposition methods: simple SVD regarding the evaluation function as a matrix and the Tucker decomposition by higher-order SVD (HOSVD). At the same compression ratio, the strategy with the approximated evaluation function obtained by HOSVD exhibited a significantly higher winning rate than that obtained by SVD. These results suggest that SVD can effectively compress board game strategies and an optimal compression method that depends on the game exists.
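    As a back-of-the-envelope illustration of the simple-SVD variant described above (with a random stand-in for the real evaluation table, so the numbers are not the paper's): flatten the table into a matrix, truncate the singular values, and compare storage and reconstruction error:

    ```python
    import numpy as np

    V = np.random.rand(3 ** 5, 3 ** 4)      # stand-in for a flattened 3^9-entry table
    U, s, Wt = np.linalg.svd(V, full_matrices=False)

    rank = 20                                # keep only the top-20 singular values
    V_approx = (U[:, :rank] * s[:rank]) @ Wt[:rank]

    stored = rank * (V.shape[0] + V.shape[1] + 1)
    print(f"kept {stored / V.size:.1%} of the original entries")
    print(f"relative error: {np.linalg.norm(V - V_approx) / np.linalg.norm(V):.3f}")
    ```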
    Desynchronous Learning in a Physics-Driven Learning Network. (arXiv:2201.04626v2 [cond-mat.soft] UPDATED)
    In a neuron network, synapses update individually using local information, allowing for entirely decentralized learning. In contrast, elements in an artificial neural network (ANN) are typically updated simultaneously using a central processor. Here we investigate the feasibility and effect of desynchronous learning in a recently introduced decentralized, physics-driven learning network. We show that desynchronizing the learning process does not degrade performance for a variety of tasks in an idealized simulation. In experiment, desynchronization actually improves performance by allowing the system to better explore the discretized state space of solutions. We draw an analogy between desynchronization and mini-batching in stochastic gradient descent, and show that they have similar effects on the learning process. Desynchronizing the learning process establishes physics-driven learning networks as truly fully distributed learning machines, promoting better performance and scalability in deployment.
    Designing Universal Causal Deep Learning Models: The Geometric (Hyper)Transformer. (arXiv:2201.13094v2 [cs.LG] UPDATED)
    Several problems in stochastic analysis are defined through their geometry, and preserving that geometric structure is essential to generating meaningful predictions. Nevertheless, how to design principled deep learning (DL) models capable of encoding these geometric structures remains largely unknown. We address this open problem by introducing a universal causal geometric DL framework in which the user specifies a suitable pair of geometries $\mathscr{X}$ and $\mathscr{Y}$ and our framework returns a DL model capable of causally approximating any ``regular'' map sending time series in $\mathscr{X}^{\mathbb{Z}}$ to time series in $\mathscr{Y}^{\mathbb{Z}}$ while respecting their forward flow of information throughout time. Suitable geometries on $\mathscr{Y}$ include various (adapted) Wasserstein spaces arising in optimal stopping problems, a variety of statistical manifolds describing the conditional distribution of continuous-time finite state Markov chains, and all Fr\'echet spaces admitting a Schauder basis, e.g. as in classical finance. Suitable, $\mathscr{X}$ are any compact subset of any Euclidean space. Our results all quantitatively express the number of parameters needed for our DL model to achieve a given approximation error as a function of the target map's regularity and the geometric structure both of $\mathscr{X}$ and of $\mathscr{Y}$. Even when omitting any temporal structure, our universal approximation theorems are the first guarantees that H\"older functions, defined between such $\mathscr{X}$ and $\mathscr{Y}$ can be approximated by DL models.
    ENSURE: A General Approach for Unsupervised Training of Deep Image Reconstruction Algorithms. (arXiv:2010.10631v4 [cs.CV] UPDATED)
    Image reconstruction using deep learning algorithms offers improved reconstruction quality and lower reconstruction time than classical compressed sensing and model-based algorithms. Unfortunately, clean and fully sampled ground-truth data to train the deep networks is often unavailable in several applications, restricting the applicability of the above methods. We introduce a novel metric termed the ENsemble Stein's Unbiased Risk Estimate (ENSURE) framework, which can be used to train deep image reconstruction algorithms without fully sampled and noise-free images. The proposed framework is a generalization of the classical SURE and GSURE formulations to the setting where the images are sampled by different measurement operators, chosen randomly from a set. We evaluate the expectation of the GSURE loss functions over the sampling patterns to obtain the ENSURE loss function. We show that this loss is an unbiased estimate of the true mean-square error, making it a better alternative to GSURE, which is unbiased only for the projected error. Our experiments show that networks trained with this loss function can offer reconstructions comparable to the supervised setting. While we demonstrate this framework in the context of MR image recovery, the ENSURE framework is generally applicable to arbitrary inverse problems.
    Deep Learning-based Beam Tracking for Millimeter-wave Communications under Mobility. (arXiv:2102.09785v2 [eess.SP] UPDATED)
    In this paper, we propose a deep learning-based beam tracking method for millimeter-wave (mmWave) communications. Beam tracking is employed for transmitting the known symbols using the sounding beams and tracking time-varying channels to maintain a reliable communication link. When the pose of a user equipment (UE) device varies rapidly, the mmWave channels also tend to vary fast, which hinders seamless communication. To cope with this problem, models that can capture the temporal behavior of mmWave channels caused by device motion are required. Accordingly, we employ a deep neural network to analyze the temporal structure and patterns underlying the time-varying channels and the signals acquired by inertial sensors. We propose a model based on long short-term memory (LSTM) that predicts the distribution of the future channel behavior based on a sequence of input signals available at the UE. This channel distribution is used to 1) control the sounding beams adaptively for the future channel state and 2) update the channel estimate through the measurement update step under a sequential Bayesian estimation framework. Our experimental results demonstrate that the proposed method achieves a significant performance gain over conventional beam tracking methods under various mobility scenarios.
    Surrogate Gradient Spiking Neural Networks as Encoders for Large Vocabulary Continuous Speech Recognition. (arXiv:2212.01187v1 [cs.CL])
    Compared to conventional artificial neurons that produce dense and real-valued responses, biologically-inspired spiking neurons transmit sparse and binary information, which can also lead to energy-efficient implementations. Recent research has shown that spiking neural networks can be trained like standard recurrent neural networks using the surrogate gradient method. They have shown promising results on speech command recognition tasks. Using the same technique, we show that they are scalable to large vocabulary continuous speech recognition, where they are capable of replacing LSTMs in the encoder with only minor loss of performance. This suggests that they may be applicable to more involved sequence-to-sequence tasks. Moreover, in contrast to their recurrent non-spiking counterparts, they show robustness to exploding gradient problems without the need to use gates.
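    For readers unfamiliar with the surrogate gradient method, a minimal PyTorch sketch of the core trick follows: a hard threshold in the forward pass, a smooth surrogate derivative in the backward pass (the fast-sigmoid surrogate and the constants are common choices, not necessarily the paper's):

        import torch

        class SpikeFn(torch.autograd.Function):
            """Heaviside spike with a fast-sigmoid surrogate gradient."""
            @staticmethod
            def forward(ctx, v):
                ctx.save_for_backward(v)
                return (v > 0).float()                # binary spike
            @staticmethod
            def backward(ctx, grad_out):
                (v,) = ctx.saved_tensors
                # Surrogate derivative 1 / (1 + |v|)^2 replaces the
                # ill-defined derivative of the step function.
                return grad_out / (1.0 + v.abs()) ** 2

        def lif_step(x, mem, beta=0.9, threshold=1.0):
            """One leaky integrate-and-fire step for a layer of neurons."""
            mem = beta * mem + x                      # leaky integration
            spk = SpikeFn.apply(mem - threshold)      # spike above threshold
            mem = mem - spk * threshold               # soft reset
            return spk, mem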
    RamBoAttack: A Robust Query Efficient Deep Neural Network Decision Exploit. (arXiv:2112.05282v2 [cs.LG] UPDATED)
    Machine learning models are critically susceptible to evasion attacks from adversarial examples. Generally, adversarial examples, modified inputs deceptively similar to the original input, are constructed under whitebox settings by adversaries with full access to the model. However, recent attacks have shown a remarkable reduction in the number of queries needed to craft adversarial examples in blackbox settings. Particularly alarming is the ability to exploit the classification decision from the access interface of a trained model, as provided by a growing number of Machine Learning as a Service providers, including Google, Microsoft and IBM, and used by a plethora of applications incorporating these models. The ability of an adversary to exploit only the predicted label from a model to craft adversarial examples is distinguished as a decision-based attack. In our study, we first take a deep dive into recent state-of-the-art decision-based attacks published at ICLR and S&P to highlight the costly nature of discovering low-distortion adversarial examples with gradient estimation methods. We develop a robust, query-efficient attack capable of avoiding entrapment in local minima and misdirection from the noisy gradients seen in gradient estimation methods. Our proposed attack method, RamBoAttack, exploits the notion of Randomized Block Coordinate Descent to explore the hidden classifier manifold, targeting perturbations that manipulate only localized input features, to address the issues of gradient estimation methods. Importantly, RamBoAttack is more robust to the different sample inputs available to an adversary and to the targeted class. Overall, for a given target class, RamBoAttack is demonstrated to be more robust at achieving a lower distortion within a given query budget. We curate our extensive results using the large-scale, high-resolution ImageNet dataset and open-source our attack, test samples and artifacts on GitHub.
    A Wasserstein GAN for Joint Learning of Inpainting and Spatial Optimisation. (arXiv:2202.05623v2 [eess.IV] UPDATED)
    Image inpainting is a restoration method that reconstructs missing image parts. However, a carefully selected mask of known pixels that yield a high quality inpainting can also act as a sparse image representation. This challenging spatial optimisation problem is essential for practical applications such as compression. So far, it has been almost exclusively addressed by model-based approaches. First attempts with neural networks seem promising, but are tailored towards specific inpainting operators or require postprocessing. To address this issue, we propose the first generative adversarial network (GAN) for spatial inpainting data optimisation. In contrast to previous approaches, it allows joint training of an inpainting generator and a corresponding mask optimisation network. With a Wasserstein distance, we ensure that our inpainting results accurately reflect the statistics of natural images. This yields significant improvements in visual quality and speed over conventional stochastic models. It also outperforms current spatial optimisation networks.
    Nonparametric Masked Language Modeling. (arXiv:2212.01349v1 [cs.CL])
    Existing language models (LMs) predict tokens with a softmax over a finite vocabulary, which can make it difficult to predict rare tokens or phrases. We introduce NPM, the first nonparametric masked language model that replaces this softmax with a nonparametric distribution over every phrase in a reference corpus. We show that NPM can be efficiently trained with a contrastive objective and an in-batch approximation to full corpus retrieval. Zero-shot evaluation on 9 closed-set tasks and 7 open-set tasks demonstrates that NPM outperforms significantly larger parametric models, either with or without a retrieve-and-generate approach. It is particularly effective at dealing with rare patterns (word senses or facts) and at predicting rare or nearly unseen words (e.g., non-Latin scripts). We release the model and code at github.com/facebookresearch/NPM.
    On the Convergence of Tsetlin Machines for the AND and the OR Operators. (arXiv:2109.09488v2 [cs.LG] UPDATED)
    The Tsetlin Machine (TM) is a novel machine-learning algorithm based on propositional logic, which has obtained state-of-the-art performance on several pattern recognition problems. In previous studies, the convergence properties of the TM for the 1-bit operation and the XOR operation have been analyzed. To complete the analyses of the basic digital operations, in this article we analyze convergence when the input training samples follow the AND and OR operators, respectively. Our analyses reveal that the TM can converge almost surely to reproduce the AND and OR operators learnt from training data over an infinite time horizon. These analyses, together with the previously analysed 1-bit and XOR operations, complete the convergence analyses of the basic operators in Boolean algebra.
    An Information-Theoretic Analysis of Compute-Optimal Neural Scaling Laws. (arXiv:2212.01365v1 [cs.LG])
    We study the compute-optimal trade-off between model and training data set sizes for large neural networks. Our result suggests a linear relation similar to that supported by the empirical analysis of Chinchilla. While that work studies transformer-based large language models trained on the MassiveText corpus (Gopher), as a starting point for the development of a mathematical theory we focus on a simpler learning model and data generating process, each based on a neural network with a sigmoidal output unit and a single hidden layer of ReLU activation units. We establish an upper bound on the minimal information-theoretically achievable expected error as a function of model and data set sizes. We then derive allocations of computation that minimize this bound. We present empirical results that suggest this approximation correctly identifies an asymptotic linear compute-optimal scaling. This approximation can also generate new insights. Among other things, it suggests that, as the input space dimension or latent space complexity grows, as might be the case for example if a longer history of tokens is taken as input to a language model, a larger fraction of the compute budget should be allocated to growing the learning model rather than the training data set.
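    As a toy illustration of why a bound of this kind can induce a linear compute-optimal scaling, suppose (purely for illustration; this additive form is our assumption, not the paper's bound) the error decomposes as E(m, n) = a/m + b/n for model size m, data set size n, and compute budget C = m*n:

        import numpy as np

        a, b = 3.0, 1.5                      # hypothetical constants
        for C in [1e6, 1e8, 1e10]:
            m_star = np.sqrt(C * a / b)      # argmin of a/m + b*m/C
            n_star = np.sqrt(C * b / a)
            print(f"C={C:.0e}  m*/n* = {m_star / n_star:.2f}")

    The optimal ratio m*/n* = a/b is independent of C, i.e., model and data sizes grow linearly together, mirroring the Chinchilla-style relation discussed above.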
    HyperJump: Accelerating HyperBand via Risk Modelling. (arXiv:2108.02479v5 [cs.LG] UPDATED)
    In the literature on hyper-parameter tuning, a number of recent solutions rely on low-fidelity observations (e.g., training with sub-sampled datasets) in order to efficiently identify promising configurations to be then tested via high-fidelity observations (e.g., using the full dataset). Among these, HyperBand is arguably one of the most popular solutions, due to its efficiency and theoretically provable robustness. In this work, we introduce HyperJump, a new approach that builds on HyperBand's robust search strategy and complements it with novel model-based risk analysis techniques that accelerate the search by skipping the evaluation of low-risk configurations, i.e., configurations that are likely to be eventually discarded by HyperBand. We evaluate HyperJump on a suite of hyper-parameter optimization problems and show that it provides speed-ups of over one order of magnitude, both in sequential and parallel deployments, on a variety of deep-learning, kernel-based learning, and neural architecture search problems when compared to HyperBand and to several state-of-the-art optimizers.
    Escaping from the Barren Plateau via Gaussian Initializations in Deep Variational Quantum Circuits. (arXiv:2203.09376v2 [quant-ph] UPDATED)
    Variational quantum circuits have been widely employed in quantum simulation and quantum machine learning in recent years. However, quantum circuits with random structures have poor trainability due to the exponentially vanishing gradient with respect to the circuit depth and the qubit number. This has led to the general standpoint that deep quantum circuits are not feasible for practical tasks. In this work, we propose an initialization strategy with theoretical guarantees for the vanishing gradient problem in general deep quantum circuits. Specifically, we prove that under properly Gaussian-initialized parameters, the norm of the gradient decays at most polynomially as the qubit number and the circuit depth increase. Our theoretical results hold for both the local and the global observable cases, where the latter was believed to have vanishing gradients even for very shallow circuits. Experimental results verify our theoretical findings in quantum simulation and quantum chemistry.
    On the Change of Decision Boundaries and Loss in Learning with Concept Drift. (arXiv:2212.01223v1 [cs.LG])
    The notion of concept drift refers to the phenomenon that the distribution generating the observed data changes over time. If drift is present, machine learning models may become inaccurate and need adjustment. Many technologies for learning with drift rely on the interleaved test-train error (ITTE) as a quantity that approximates the model generalization error and triggers drift detection and model updates. In this work, we investigate to what extent this procedure is mathematically justified. More precisely, we relate a change of the ITTE to the presence of real drift, i.e., a changed posterior, and to a change of the training result under the assumption of optimality. We support our theoretical findings with empirical evidence for several learning algorithms, models, and datasets.
    Initial Results for Pairwise Causal Discovery Using Quantitative Information Flow. (arXiv:2212.01279v1 [cs.LG])
    Pairwise Causal Discovery is the task of determining causal, anticausal, confounded or independence relationships from pairs of variables. Over the last few years, this challenging task has promoted not only the discovery of novel machine learning models aimed at solving it, but also discussions on how learning the causal direction of variables may benefit machine learning overall. In this paper, we show that Quantitative Information Flow (QIF), a measure usually employed for quantifying leakage of information from a system to an attacker, yields promising features for the task. In particular, experiments with real-world datasets indicate that QIF is statistically tied with the state of the art. Our initial results motivate further inquiry into how QIF relates to causality and what its limitations are.
    Fast Non-Rigid Radiance Fields from Monocularized Data. (arXiv:2212.01368v1 [cs.CV])
    3D reconstruction and novel view synthesis of dynamic scenes from collections of single views recently gained increased attention. Existing work shows impressive results for synthetic setups and forward-facing real-world data, but is severely limited in the training speed and angular range for generating novel views. This paper addresses these limitations and proposes a new method for full 360{\deg} novel view synthesis of non-rigidly deforming scenes. At the core of our method are: 1) An efficient deformation module that decouples the processing of spatial and temporal information for acceleration at training and inference time; and 2) A static module representing the canonical scene as a fast hash-encoded neural radiance field. We evaluate the proposed approach on the established synthetic D-NeRF benchmark, which enables efficient reconstruction from a single monocular view per time-frame randomly sampled from a full hemisphere. We refer to this form of inputs as monocularized data. To prove its practicality for real-world scenarios, we recorded twelve challenging sequences with human actors by sampling single frames from a synchronized multi-view rig. In both cases, our method is trained significantly faster than previous methods (minutes instead of days) while achieving higher visual accuracy for generated novel views. Our source code and data are available at our project page https://graphics.tu-bs.de/publications/kappel2022fast.
    SolarDK: A high-resolution urban solar panel image classification and localization dataset. (arXiv:2212.01260v1 [cs.CV])
    The body of research on classification of solar panel arrays from aerial imagery is increasing, yet there are still few public benchmark datasets. This paper introduces two novel benchmark datasets for classifying and localizing solar panel arrays in Denmark: a human-annotated dataset for classification and segmentation, as well as a classification dataset acquired using self-reported data from the Danish national building registry. We explore the performance of prior works on the new benchmark dataset, and present results after fine-tuning models using a similar approach as recent works. Furthermore, we train models of newer architectures and provide benchmark baselines to our datasets in several scenarios. We believe the release of these datasets may improve future research in both local and global geospatial domains for identifying and mapping solar panel arrays from aerial imagery. The data is accessible at https://osf.io/aj539/.
    Risk-Adaptive Approaches to Learning and Decision Making: A Survey. (arXiv:2212.00856v1 [math.OC])
    Uncertainty is prevalent in engineering design, statistical learning, and decision making broadly. Due to inherent risk-averseness and ambiguity about assumptions, it is common to address uncertainty by formulating and solving conservative optimization models expressed using measures of risk and related concepts. We survey the rapid development of risk measures over the last quarter century. From their beginning in financial engineering, we recount their spread to nearly all areas of engineering and applied mathematics. Solidly rooted in convex analysis, risk measures furnish a general framework for handling uncertainty with significant computational and theoretical advantages. We describe the key facts, list several concrete algorithms, and provide an extensive list of references for further reading. The survey recalls connections with utility theory and distributionally robust optimization, points to emerging application areas such as fair machine learning, and defines measures of reliability.
    CHAPTER: Exploiting Convolutional Neural Network Adapters for Self-supervised Speech Models. (arXiv:2212.01282v1 [eess.AS])
    Self-supervised learning (SSL) is a powerful technique for learning representations from unlabeled data. Transformer-based models such as HuBERT, which consist of a feature extractor and transformer layers, are leading the field in the speech domain. SSL models are fine-tuned on a wide range of downstream tasks, which involves re-training the majority of the model for each task. Previous studies have introduced adapters, small lightweight modules commonly used in Natural Language Processing (NLP) to adapt pre-trained models to new tasks. However, such efficient tuning techniques only provide adaptation at the transformer layers and fail to adapt the feature extractor. In this paper, we propose CHAPTER, an efficient tuning method specifically designed for SSL speech models that applies CNN adapters at the feature extractor. With this method, we fine-tune fewer than 5% of the parameters per task, compared to full fine-tuning, and achieve better and more stable performance. We empirically find that adding CNN adapters to the feature extractor helps adaptation on emotion and speaker tasks. For instance, the accuracy of SID improves from 87.71 to 91.56, and the accuracy of ER improves by 5%.
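    A minimal PyTorch sketch of what a CNN adapter of this kind might look like (a bottleneck 1-D convolution with a residual connection; the module design and channel sizes are illustrative assumptions, not the paper's exact architecture):

        import torch
        import torch.nn as nn

        class CNNAdapter(nn.Module):
            """Bottleneck conv adapter inserted after a feature-extractor
            layer; only these few parameters are trained per task."""
            def __init__(self, channels=512, bottleneck=32):
                super().__init__()
                self.down = nn.Conv1d(channels, bottleneck, kernel_size=1)
                self.act = nn.GELU()
                self.up = nn.Conv1d(bottleneck, channels, kernel_size=1)

            def forward(self, x):           # x: (batch, channels, time)
                return x + self.up(self.act(self.down(x)))

        # Typical usage: freeze the pre-trained backbone, train adapters only:
        # for p in backbone.parameters():
        #     p.requires_grad = False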
    Clustering -- Basic concepts and methods. (arXiv:2212.01248v1 [cs.LG])
    We review clustering as an analysis tool and the underlying concepts from an introductory perspective. What is clustering and how can clusterings be realised programmatically? How can data be represented and prepared for a clustering task? And how can clustering results be validated? Connectivity-based versus prototype-based approaches are reflected in the context of several popular methods: single-linkage, spectral embedding, k-means, and Gaussian mixtures are discussed as well as the density-based protocols (H)DBSCAN, Jarvis-Patrick, CommonNN, and density-peaks.
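    Most of the methods surveyed are available off the shelf; a brief scikit-learn sketch comparing a few of them on synthetic data (the dataset and hyperparameters are illustrative):

        from sklearn.datasets import make_blobs
        from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
        from sklearn.mixture import GaussianMixture
        from sklearn.metrics import silhouette_score

        X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

        labelings = {
            "k-means": KMeans(n_clusters=4, n_init=10).fit_predict(X),
            "single-linkage": AgglomerativeClustering(
                n_clusters=4, linkage="single").fit_predict(X),
            "gaussian-mixture": GaussianMixture(
                n_components=4, random_state=0).fit_predict(X),
            "dbscan": DBSCAN(eps=0.8).fit_predict(X),  # -1 marks noise
        }
        for name, y in labelings.items():
            if len(set(y)) > 1:          # silhouette needs >= 2 clusters
                print(name, round(silhouette_score(X, y), 3))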
    CT-DQN: Control-Tutored Deep Reinforcement Learning. (arXiv:2212.01343v1 [cs.LG])
    One of the major challenges in Deep Reinforcement Learning for control is the need for extensive training to learn the policy. Motivated by this, we present the design of the Control-Tutored Deep Q-Networks (CT-DQN) algorithm, a Deep Reinforcement Learning algorithm that leverages a control tutor, i.e., an exogenous control law, to reduce learning time. The tutor can be designed using an approximate model of the system, without any assumption about the knowledge of the system's dynamics. There is no expectation that it will be able to achieve the control objective if used stand-alone. During learning, the tutor occasionally suggests an action, thus partially guiding exploration. We validate our approach on three scenarios from OpenAI Gym: the inverted pendulum, lunar lander, and car racing. We demonstrate that CT-DQN is able to achieve better or equivalent data efficiency with respect to the classic function approximation solutions.
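    In pseudocode terms, the tutored exploration can be pictured as follows (a sketch of the general idea under our reading, not the paper's exact switching rule; q_net and tutor are hypothetical interfaces):

        import random

        def ct_dqn_action(state, q_net, tutor, eps=0.1, p_tutor=0.2):
            """Hybrid exploration: occasionally defer to the exogenous
            control law, otherwise act epsilon-greedily on the Q-network."""
            if random.random() < p_tutor:
                return tutor(state)             # approximate control law
            if random.random() < eps:
                return random.randrange(q_net.n_actions)
            return q_net.best_action(state)     # argmax_a Q(state, a)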
    On Solution Functions of Optimization: Universal Approximation and Covering Number Bounds. (arXiv:2212.01314v1 [cs.LG])
    We study the expressibility and learnability of convex optimization solution functions and their multi-layer architectural extension. The main results are: \emph{(1)} the class of solution functions of linear programming (LP) and quadratic programming (QP) is a universal approximant for the $C^k$ smooth model class or some restricted Sobolev space, and we characterize the rate-distortion, \emph{(2)} the approximation power is investigated through a viewpoint of regression error, where information about the target function is provided in terms of data observations, \emph{(3)} compositionality in the form of a deep architecture with optimization as a layer is shown to reconstruct some basic functions used in numerical analysis without error, which implies that \emph{(4)} a substantial reduction in rate-distortion can be achieved with a universal network architecture, and \emph{(5)} we discuss the statistical bounds of empirical covering numbers for LP/QP, as well as a generic optimization problem (possibly nonconvex) by exploiting tame geometry. Our results provide the \emph{first rigorous analysis of the approximation and learning-theoretic properties of solution functions} with implications for algorithmic design and performance guarantees.
    Denoising Deep Generative Models. (arXiv:2212.01265v1 [cs.LG])
    Likelihood-based deep generative models have recently been shown to exhibit pathological behaviour under the manifold hypothesis as a consequence of using high-dimensional densities to model data with low-dimensional structure. In this paper we propose two methodologies aimed at addressing this problem. Both are based on adding Gaussian noise to the data to remove the dimensionality mismatch during training, and both provide a denoising mechanism whose goal is to sample from the model as though no noise had been added to the data. Our first approach is based on Tweedie's formula, and the second on models which take the variance of added noise as a conditional input. We show that, surprisingly, while well motivated, these approaches only sporadically improve performance over not adding noise, and that other methods of addressing the dimensionality mismatch are more empirically adequate.
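    For reference, the first approach rests on Tweedie's formula: if the noisy data is $y = x + \sigma \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, I)$, then the posterior mean of the clean data is

    $$\mathbb{E}[x \mid y] = y + \sigma^{2} \nabla_{y} \log p_{\sigma}(y),$$

    so a generative model of the noised density $p_{\sigma}$ yields a denoiser through its score.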
    MHCCL: Masked Hierarchical Cluster-wise Contrastive Learning for Multivariate Time Series. (arXiv:2212.01141v1 [cs.LG])
    Learning semantic-rich representations from raw unlabeled time series data is critical for downstream tasks such as classification and forecasting. Contrastive learning has recently shown its promising representation learning capability in the absence of expert annotations. However, existing contrastive approaches generally treat each instance independently, which leads to false negative pairs that share the same semantics. To tackle this problem, we propose MHCCL, a Masked Hierarchical Cluster-wise Contrastive Learning model, which exploits semantic information obtained from the hierarchical structure consisting of multiple latent partitions for multivariate time series. Motivated by the observation that fine-grained clustering preserves higher purity while coarse-grained clustering reflects higher-level semantics, we propose a novel downward masking strategy to filter out fake negatives and supplement positives by incorporating the multi-granularity information from the clustering hierarchy. In addition, a novel upward masking strategy is designed in MHCCL to remove outliers of clusters at each partition to refine prototypes, which helps speed up the hierarchical clustering process and improves the clustering quality. We conduct experimental evaluations on seven widely-used multivariate time series datasets. The results demonstrate the superiority of MHCCL over the state-of-the-art approaches for unsupervised time series representation learning.
    Hierarchical Model Selection for Graph Neural Networks. (arXiv:2212.00898v1 [cs.LG])
    Node classification on graph data is a major problem, and various graph neural networks (GNNs) have been proposed. Variants of GNNs such as H2GCN and CPF outperform graph convolutional networks (GCNs) by improving on the weaknesses of traditional GNNs. However, on some graph data these GNN variants perform worse than other GNNs on the node classification task. This is because H2GCN suffers from feature thinning on graph data with high average degree, while CPF faces a problem of label-propagation suitability. Accordingly, we propose a hierarchical model selection framework (HMSF) that selects an appropriate GNN model by analyzing indicators of each graph dataset. In our experiments, we show that the model selected by HMSF achieves high performance on node classification for various types of graph data.
    HAMMER: Multi-Level Coordination of Reinforcement Learning Agents via Learned Messaging. (arXiv:2102.00824v2 [cs.MA] UPDATED)
    Cooperative multi-agent reinforcement learning (MARL) has achieved significant results, most notably by leveraging the representation-learning abilities of deep neural networks. However, large centralized approaches quickly become infeasible as the number of agents scales, and fully decentralized approaches can miss important opportunities for information sharing and coordination. Furthermore, not all agents are equal -- in some cases, individual agents may not even have the ability to send communication to other agents or explicitly model other agents. This paper considers the case where there is a single, powerful, \emph{central agent} that can observe the entire observation space, and there are multiple, low-powered \emph{local agents} that can only receive local observations and are not able to communicate with each other. The central agent's job is to learn what message needs to be sent to different local agents based on the global observations, not by centrally solving the entire problem and sending action commands, but by determining what additional information an individual agent should receive so that it can make a better decision. In this work we present our MARL algorithm HAMMER, describe where it would be most applicable, and implement it in the cooperative navigation and multi-agent walker domains. Empirical results show that 1) learned communication does indeed improve system performance, 2) results generalize to heterogeneous local agents, and 3) results generalize to different reward structures.
    Identifying Hamiltonian manifold in neural networks. (arXiv:2212.01168v1 [cs.LG])
    Recent studies that learn physical laws via deep learning attempt to find a shared representation of the given system by introducing physics priors or inductive biases into the neural network. However, most of these approaches tackle the problem in a system-specific manner, in which one neural network trained on one particular physical system cannot easily be adapted to another system governed by a different physical law. In this work, we use a meta-learning algorithm to identify the general manifold in neural networks that represents Hamilton's equation. We meta-train the model on a dataset composed of five dynamical systems, each governed by different physical laws. We show that with only a few gradient steps, the meta-trained model adapts well to physical systems unseen during the meta-training phase. Our results suggest that the meta-trained model can craft a representation of Hamilton's equation in neural networks that is shared across various dynamical systems, each governed by different physical laws.
    Denoising after Entropy-based Debiasing: A Robust Training Method for Dataset Bias with Noisy Labels. (arXiv:2212.01189v1 [cs.LG])
    Improperly constructed datasets can result in inaccurate inferences. For instance, models trained on biased datasets perform poorly in terms of generalization (i.e., dataset bias). Recent debiasing techniques have successfully achieved generalization performance by underestimating easy-to-learn samples (i.e., bias-aligned samples) and highlighting difficult-to-learn samples (i.e., bias-conflicting samples). However, these techniques may fail owing to noisy labels, because the trained model recognizes noisy labels as difficult-to-learn and thus highlights them. In this study, we find that earlier approaches that used the provided labels to quantify difficulty could be affected by the small proportion of noisy labels. Furthermore, we find that running denoising algorithms before debiasing is ineffective because denoising algorithms reduce the impact of difficult-to-learn samples, including valuable bias-conflicting samples. Therefore, we propose an approach called denoising after entropy-based debiasing, i.e., DENEB, which has three main stages. (1) The prejudice model is trained by emphasizing (bias-aligned, clean) samples, which are selected using a Gaussian Mixture Model. (2) Using the per-sample entropy from the output of the prejudice model, the sampling probability of each sample that is proportional to the entropy is computed. (3) The final model is trained using existing denoising algorithms with the mini-batches constructed by following the computed sampling probability. Compared to existing debiasing and denoising algorithms, our method achieves better debiasing performance on multiple benchmarks.
    ColD Fusion: Collaborative Descent for Distributed Multitask Finetuning. (arXiv:2212.01378v1 [cs.LG])
    Pretraining has been shown to scale well with compute, data size and data diversity. Multitask learning trains on a mixture of supervised datasets and produces improved performance compared to self-supervised pretraining. Until now, massively multitask learning required simultaneous access to all datasets in the mixture and heavy compute resources that are only available to well-resourced teams. In this paper, we propose ColD Fusion, a method that provides the benefits of multitask learning but leverages distributed computation and requires limited communication and no sharing of data. Consequently, ColD Fusion can create a synergistic loop, where finetuned models can be recycled to continually improve the pretrained model they are based on. We show that ColD Fusion yields comparable benefits to multitask pretraining by producing a model that (a) attains strong performance on all of the datasets it was multitask trained on and (b) is a better starting point for finetuning on unseen datasets. We find ColD Fusion outperforms RoBERTa and even previous multitask models. Specifically, when training and testing on 35 diverse datasets, the ColD Fusion-based model outperforms RoBERTa by 2.45 points on average without any changes to the architecture.
    Gaussian Process regression over discrete probability measures: on the non-stationarity relation between Euclidean and Wasserstein Squared Exponential Kernels. (arXiv:2212.01310v1 [cs.LG])
    Gaussian Process regression is a kernel method successfully adopted in many real-life applications. Recently, there has been growing interest in extending this method to non-Euclidean input spaces, such as the one considered in this paper, consisting of probability measures. Although a Positive Definite kernel can be defined by using a suitable distance -- the Wasserstein distance -- the common procedure for learning the Gaussian Process model can fail due to numerical issues, which arise earlier and more frequently than in the case of a Euclidean input space and which, as demonstrated in this paper, cannot be avoided by adding artificial noise (nugget effect) as usually done. This paper uncovers the main reason for these issues: a non-stationarity relation between the Wasserstein-based squared exponential kernel and its Euclidean-based counterpart. Consequently, the Gaussian Process model is learned by treating the input space as Euclidean, and then an algebraic transformation, based on the uncovered relation, is used to turn it into a non-stationary, Wasserstein-based Gaussian Process model over probability measures. This algebraic transformation is simpler than the log-exp maps used for data belonging to Riemannian manifolds, recently extended to the pseudo-Riemannian structure of an input space equipped with the Wasserstein distance.
    Adaptive Robust Model Predictive Control via Uncertainty Cancellation. (arXiv:2212.01371v1 [eess.SY])
    We propose a learning-based robust predictive control algorithm that compensates for significant uncertainty in the dynamics for a class of discrete-time systems that are nominally linear with an additive nonlinear component. Such systems commonly model the nonlinear effects of an unknown environment on a nominal system. We optimize over a class of nonlinear feedback policies inspired by certainty equivalent "estimate-and-cancel" control laws pioneered in classical adaptive control to achieve significant performance improvements in the presence of uncertainties of large magnitude, a setting in which existing learning-based predictive control algorithms often struggle to guarantee safety. In contrast to previous work in robust adaptive MPC, our approach allows us to take advantage of structure (i.e., the numerical predictions) in the a priori unknown dynamics learned online through function approximation. Our approach also extends typical nonlinear adaptive control methods to systems with state and input constraints even when we cannot directly cancel the additive uncertain function from the dynamics. We apply contemporary statistical estimation techniques to certify the system's safety through persistent constraint satisfaction with high probability. Moreover, we propose using Bayesian meta-learning algorithms that learn calibrated model priors to help satisfy the assumptions of the control design in challenging settings. Finally, we show in simulation that our method can accommodate more significant unknown dynamics terms than existing methods and that the use of Bayesian meta-learning allows us to adapt to the test environments more rapidly.
    Compositional Learning of Dynamical System Models Using Port-Hamiltonian Neural Networks. (arXiv:2212.00893v1 [cs.LG])
    Many dynamical systems -- from robots interacting with their surroundings to large-scale multiphysics systems -- involve a number of interacting subsystems. Toward the objective of learning composite models of such systems from data, we present i) a framework for compositional neural networks, ii) algorithms to train these models, iii) a method to compose the learned models, iv) theoretical results that bound the error of the resulting composite models, and v) a method to learn the composition itself, when it is not known a priori. The end result is a modular approach to learning: neural network submodels are trained on trajectory data generated by relatively simple subsystems, and the dynamics of more complex composite systems are then predicted without requiring additional data generated by the composite systems themselves. We achieve this compositionality by representing the system of interest, as well as each of its subsystems, as a port-Hamiltonian neural network (PHNN) -- a class of neural ordinary differential equations that uses the port-Hamiltonian systems formulation as inductive bias. We compose collections of PHNNs by using the system's physics-informed interconnection structure, which may be known a priori, or may itself be learned from data. We demonstrate the novel capabilities of the proposed framework through numerical examples involving interacting spring-mass-damper systems. Models of these systems, which include nonlinear energy dissipation and control inputs, are learned independently. Accurate compositions are learned using an amount of training data that is negligible in comparison with that required to train a new model from scratch. Finally, we observe that the composite PHNNs enjoy properties of port-Hamiltonian systems, such as cyclo-passivity -- a property that is useful for control purposes.
    Robustness in Fatigue Strength Estimation. (arXiv:2212.01136v1 [cs.LG])
    Fatigue strength estimation is a costly manual material characterization process in which state-of-the-art approaches follow a standardized experiment and analysis procedure. In this paper, we examine a modular, Machine Learning-based approach for fatigue strength estimation that is likely to reduce the number of experiments and, thus, the overall experimental costs. Despite its high potential, deployment of a new approach in a real-life lab requires more than the theoretical definition and simulation. Therefore, we study the robustness of the approach against misspecification of the prior and discretization of the specified loads. We identify its applicability and its advantageous behavior over the state-of-the-art methods, potentially reducing the number of costly experiments.
    FedALA: Adaptive Local Aggregation for Personalized Federated Learning. (arXiv:2212.01197v1 [cs.LG])
    A key challenge in federated learning (FL) is the statistical heterogeneity that impairs the generalization of the global model on each client. To address this, we propose Federated learning with Adaptive Local Aggregation (FedALA), which captures the desired information in the global model for client models in personalized FL. The key component of FedALA is an Adaptive Local Aggregation (ALA) module, which adaptively aggregates the downloaded global model and the local model towards the local objective on each client to initialize the local model before training in each iteration. To evaluate the effectiveness of FedALA, we conduct extensive experiments with five benchmark datasets in the computer vision and natural language processing domains. FedALA outperforms eleven state-of-the-art baselines by up to 3.27% in test accuracy. Furthermore, we also apply the ALA module to other federated learning methods and achieve up to a 24.19% improvement in test accuracy.
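    A minimal PyTorch sketch of the element-wise aggregation at the heart of ALA (a sketch of the idea as described above; in FedALA the mixing weights are themselves trained on local data, a step we elide here):

        import torch

        @torch.no_grad()
        def ala_step(local_params, global_params, mix_weights):
            """local <- local + W * (global - local), with W in [0, 1]
            learned per element (here W is given; training it is elided)."""
            for w_l, w_g, w in zip(local_params, global_params, mix_weights):
                w.clamp_(0.0, 1.0)
                w_l.add_(w * (w_g - w_l))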
    Understanding Cryptocoins Trends Correlations. (arXiv:2212.01267v1 [q-fin.ST])
    Crypto-coins (also known as cryptocurrencies) are tradable digital assets. Notable examples include Bitcoin, Ether and Litecoin. Ownership of cryptocoins is registered on distributed ledgers (i.e., blockchains). Secure encryption techniques guarantee the security of the transactions (transfers of coins across owners) registered in the ledger. Cryptocoins are exchanged at specific trading prices. While history has shown the extreme volatility of such trading prices across all different sets of crypto-assets, it remains unclear whether, and how tightly, the trading prices of different cryptocoins are related. Major coin exchanges (e.g., Coinbase) provide trend correlation indicators to coin owners, suggesting possible acquisitions or sells. However, these correlations remain largely unvalidated. In this paper, we shed light on the trend correlations across a large variety of cryptocoins by investigating their coin-price correlation trends over a period of two years. Our experimental results suggest strong correlation patterns between main coins (Ethereum, Bitcoin) and alt-coins. We believe our study can support forecasting techniques for time-series modeling in the context of crypto-coins. We release our dataset and code to reproduce our analysis to the research community.
    Covariance Estimators for the ROOT-SGD Algorithm in Online Learning. (arXiv:2212.01259v1 [stat.ML])
    Online learning naturally arises in many statistical and machine learning problems. The most widely used methods in online learning are stochastic first-order algorithms. Among this family of algorithms, there is a recently developed algorithm, Recursive One-Over-T SGD (ROOT-SGD). ROOT-SGD is advantageous in that it converges at a non-asymptotically fast rate, and its estimator further converges to a normal distribution. However, this normal distribution has an unknown asymptotic covariance and thus cannot be directly applied to measure uncertainty. To fill this gap, we develop two estimators for the asymptotic covariance of ROOT-SGD. Our covariance estimators are useful for statistical inference in ROOT-SGD. Our first estimator adopts a plug-in approach: for each unknown component in the formula of the asymptotic covariance, we substitute its empirical counterpart. The plug-in estimator converges at the rate $\mathcal{O}(1/\sqrt{t})$, where $t$ is the sample size. Despite its quick convergence, the plug-in estimator has the limitation that it relies on the Hessian of the loss function, which might be unavailable in some cases. Our second estimator is a Hessian-free estimator that overcomes this limitation. The Hessian-free estimator uses the random-scaling technique, and we show that it is an asymptotically consistent estimator of the true covariance.
    Investigating certain choices of CNN configurations for brain lesion segmentation. (arXiv:2212.01235v1 [eess.IV])
    Brain tumor imaging has been part of the clinical routine for many years to perform non-invasive detection and grading of tumors. Tumor segmentation is a crucial step for managing primary brain tumors because it allows a volumetric analysis to have a longitudinal follow-up of tumor growth or shrinkage to monitor disease progression and therapy response. In addition, it facilitates further quantitative analysis such as radiomics. Deep learning models, in particular CNNs, have been a methodology of choice in many applications of medical image analysis including brain tumor segmentation. In this study, we investigated the main design aspects of CNN models for the specific task of MRI-based brain tumor segmentation. Two commonly used CNN architectures (i.e. DeepMedic and U-Net) were used to evaluate the impact of the essential parameters such as learning rate, batch size, loss function, and optimizer. The performance of CNN models using different configurations was assessed with the BraTS 2018 dataset to determine the most performant model. Then, the generalization ability of the model was assessed using our in-house dataset. For all experiments, U-Net achieved a higher DSC compared to the DeepMedic. However, the difference was only statistically significant for whole tumor segmentation using FLAIR sequence data and tumor core segmentation using T1w sequence data. Adam and SGD both with the initial learning rate set to 0.001 provided the highest segmentation DSC when training the CNN model using U-Net and DeepMedic architectures, respectively. No significant difference was observed when using different normalization approaches. In terms of loss functions, a weighted combination of soft Dice and cross-entropy loss with the weighting term set to 0.5 resulted in an improved segmentation performance and training stability for both DeepMedic and U-Net models.
    Programming Is Hard -- Or at Least It Used to Be: Educational Opportunities And Challenges of AI Code Generation. (arXiv:2212.01020v1 [cs.HC])
    The introductory programming sequence has been the focus of much research in computing education. The recent advent of several viable and freely available AI-driven code generation tools presents several immediate opportunities and challenges in this domain. In this position paper we argue that the community needs to act quickly in deciding what possible opportunities can and should be leveraged and how, while also working on how to overcome or otherwise mitigate the possible challenges. Assuming that the effectiveness and proliferation of these tools will continue to progress rapidly, without quick, deliberate, and concerted efforts, educators will lose advantage in helping shape what opportunities come to be, and what challenges will endure. With this paper we aim to seed this discussion within the computing education community.
    Implementing Reinforcement Learning Datacenter Congestion Control in NVIDIA NICs. (arXiv:2207.02295v2 [cs.NI] UPDATED)
    Cloud datacenters are growing exponentially both in number and size. As communication protocols evolve, datacenter networks experience higher utilization, leading to greater congestion along with increased latency and packet loss. We analyze a recently published reinforcement learning congestion control algorithm (Tessler et al., 2022) that achieves state-of-the-art performance and, in a second phase, reshape it to comply with current hardware limitations. We show how to map complex policies to a low-compute architecture, gaining a 500x reduction in latency. This transformation enables real-time policy inference within the $\mu$sec decision-time requirement, with a negligible effect on the quality of the policy. We deploy the transformed policy onto NVIDIA NICs in an operational network. Compared to popular CC algorithms used in production, we show that RL-CC is the only one to perform well on all benchmarks tested, balancing multiple metrics simultaneously: bandwidth, latency, and packet drops. This sheds light on the feasibility of data-driven methods for congestion control, challenging the prior belief that handcrafted heuristics are required to obtain a stable and fair solution.
    On the Energy and Communication Efficiency Tradeoffs in Federated and Multi-Task Learning. (arXiv:2212.01049v1 [cs.LG])
    Recent advances in Federated Learning (FL) have paved the way towards the design of novel strategies for solving multiple learning tasks simultaneously, by leveraging cooperation among networked devices. Multi-Task Learning (MTL) exploits relevant commonalities across tasks to improve efficiency compared with traditional transfer learning approaches. By learning multiple tasks jointly, significant reductions in energy footprints can be obtained. This article provides a first look into the energy costs of MTL processes driven by the Model-Agnostic Meta-Learning (MAML) paradigm and implemented in distributed wireless networks. The paper targets a clustered multi-task network setup where autonomous agents learn different but related tasks. The MTL process is carried out in two stages: the optimization of a meta-model that can be quickly adapted to learn new tasks, and a task-specific model adaptation stage where the learned meta-model is transferred to agents and tailored for a specific task. This work analyzes the main factors that influence the MTL energy balance by considering a multi-task Reinforcement Learning (RL) setup in a robotized environment. Results show that the MAML method can reduce the energy bill by at least a factor of two compared with traditional approaches without inductive transfer. Moreover, it is shown that the optimal energy balance in wireless networks depends on uplink/downlink and sidelink communication efficiencies.
    Investigating Deep Learning Model Calibration for Classification Problems in Mechanics. (arXiv:2212.00881v1 [cs.LG])
    Recently, there has been a growing interest in applying machine learning methods to problems in engineering mechanics. In particular, there has been significant interest in applying deep learning techniques to predicting the mechanical behavior of heterogeneous materials and structures. Researchers have shown that deep learning methods are able to effectively predict mechanical behavior with low error for systems ranging from engineered composites, to geometrically complex metamaterials, to heterogeneous biological tissue. However, there has been comparatively little attention paid to deep learning model calibration, i.e., the match between predicted probabilities of outcomes and the true probabilities of outcomes. In this work, we perform a comprehensive investigation into ML model calibration across seven open access engineering mechanics datasets that cover three distinct types of mechanical problems. Specifically, we evaluate both model error and model calibration error for multiple machine learning methods, and investigate the influence of ensemble averaging and post hoc model calibration via temperature scaling. Overall, we find that ensemble averaging of deep neural networks is both an effective and consistent tool for improving model calibration, while temperature scaling has comparatively limited benefits. Looking forward, we anticipate that this investigation will lay the foundation for future work in developing mechanics-specific approaches to deep learning model calibration.
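    Since temperature scaling is central to the comparison, a standard post-hoc implementation (Guo et al., 2017) fits a single scalar T on held-out logits; a minimal PyTorch sketch:

        import torch
        import torch.nn.functional as F

        def fit_temperature(logits, labels):
            """logits: (N, C) detached validation logits; labels: (N,).
            Learns T > 0 minimising the negative log-likelihood."""
            log_t = torch.zeros(1, requires_grad=True)   # T = exp(log_t)
            opt = torch.optim.LBFGS([log_t], lr=0.1, max_iter=50)

            def closure():
                opt.zero_grad()
                loss = F.cross_entropy(logits / log_t.exp(), labels)
                loss.backward()
                return loss

            opt.step(closure)
            return log_t.exp().item()

        # Calibrated probabilities: softmax(logits / T) on test data.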
    Predict-and-Critic: Accelerated End-to-End Predictive Control for Cloud Computing through Reinforcement Learning. (arXiv:2212.01348v1 [cs.LG])
    Cloud computing holds the promise of reduced costs through economies of scale. To realize this promise, cloud computing vendors typically solve sequential resource allocation problems, where customer workloads are packed on shared hardware. Virtual machines (VM) form the foundation of modern cloud computing as they help logically abstract user compute from shared physical infrastructure. Traditionally, VM packing problems are solved by predicting demand, followed by a Model Predictive Control (MPC) optimization over a future horizon. We introduce an approximate formulation of an industrial VM packing problem as an MILP with soft-constraints parameterized by the predictions. Recently, predict-and-optimize (PnO) was proposed for end-to-end training of prediction models by back-propagating the cost of decisions through the optimization problem. However, PnO is unable to scale to the large prediction horizons prevalent in cloud computing. To tackle this issue, we propose the Predict-and-Critic (PnC) framework that outperforms PnO with just a two-step horizon by leveraging reinforcement learning. PnC jointly trains a prediction model and a terminal Q function that approximates cost-to-go over a long horizon, by back-propagating the cost of decisions through the optimization problem \emph{and from the future}. The terminal Q function allows us to solve a much smaller two-step horizon optimization problem than the multi-step horizon necessary in PnO. We evaluate PnO and the PnC framework on two datasets, three workloads, and with disturbances not modeled in the optimization problem. We find that PnC significantly improves decision quality over PnO, even when the optimization problem is not a perfect representation of reality. We also find that hardening the soft constraints of the MILP and back-propagating through the constraints improves decision quality for both PnO and PnC.
    Clustering individuals based on multivariate EMA time-series data. (arXiv:2212.01159v1 [cs.LG])
    In the field of psychopathology, Ecological Momentary Assessment (EMA) methodological advancements have offered new opportunities to collect time-intensive, repeated and intra-individual measurements. In this way, a large amount of data has become available, providing the means for further exploring mental disorders. Consequently, advanced machine learning (ML) methods are needed to understand data characteristics and uncover hidden and meaningful relationships regarding the underlying complex psychological processes. Among other uses, ML facilitates the identification of similar patterns in data of different individuals through clustering. This paper focuses on clustering multivariate time-series (MTS) data of individuals into several groups. Since clustering is an unsupervised problem, it is challenging to assess whether the resulting grouping is successful. Thus, we investigate different clustering methods based on different distance measures and assess them for the stability and quality of the derived clusters. These clustering steps are illustrated on a real-world EMA dataset comprising 33 individuals and 15 variables. In our evaluation, kernel-based clustering methods appear promising for identifying meaningful groups in the data, suggesting that efficient representations of EMA data play an important role in clustering.
    Cross-Domain Graph Anomaly Detection via Anomaly-aware Contrastive Alignment. (arXiv:2212.01096v1 [cs.LG])
    Cross-domain graph anomaly detection (CD-GAD) describes the problem of detecting anomalous nodes in an unlabelled target graph using auxiliary, related source graphs with labelled anomalous and normal nodes. Although it presents a promising approach to address the notoriously high false positive issue in anomaly detection, little work has been done in this line of research. There are numerous domain adaptation methods in the literature, but it is difficult to adapt them for GAD due to the unknown distributions of the anomalies and the complex node relations embedded in graph data. To this end, we introduce a novel domain adaptation approach, namely Anomaly-aware Contrastive alignmenT (ACT), for GAD. ACT is designed to jointly optimise: (i) unsupervised contrastive learning of normal representations of nodes in the target graph, and (ii) anomaly-aware one-class alignment that aligns these contrastive node representations and the representations of labelled normal nodes in the source graph, while enforcing significant deviation of the representations of the normal nodes from the labelled anomalous nodes in the source graph. In doing so, ACT effectively transfers anomaly-informed knowledge from the source graph to learn the complex node relations of the normal class for GAD on the target graph without any specification of the anomaly distributions. Extensive experiments on eight CD-GAD settings demonstrate that our approach ACT achieves substantially improved detection performance over 10 state-of-the-art GAD methods. Code is available at https://github.com/QZ-WANG/ACT.
    Nonlinear controllability and function representation by neural stochastic differential equations. (arXiv:2212.00896v1 [math.OC])
    There has been a great deal of recent interest in learning and approximation of functions that can be expressed as expectations of a given nonlinearity with respect to its random internal parameters. Examples of such representations include "infinitely wide" neural nets, where the underlying nonlinearity is given by the activation function of an individual neuron. In this paper, we bring this perspective to function representation by neural stochastic differential equations (SDEs). A neural SDE is an It\^o diffusion process whose drift and diffusion matrix are elements of some parametric families. We show that the ability of a neural SDE to realize nonlinear functions of its initial condition can be related to the problem of optimally steering a certain deterministic dynamical system between two given points in finite time. This auxiliary system is obtained by formally replacing the Brownian motion in the SDE by a deterministic control input. We derive upper and lower bounds on the minimum control effort needed to accomplish this steering; these bounds may be of independent interest in the context of motion planning and deterministic optimal control.
    One-Hot Graph Encoder Embedding. (arXiv:2109.13098v3 [cs.LG] UPDATED)
    In this paper we propose a lightning-fast graph embedding method called one-hot graph encoder embedding. It has linear computational complexity and the capacity to process billions of edges within minutes on a standard PC -- making it an ideal candidate for huge graph processing. It is applicable to either the adjacency matrix or the graph Laplacian, and can be viewed as a transformation of the spectral embedding. Under random graph models, the graph encoder embedding is approximately normally distributed per vertex and asymptotically converges to its mean. We showcase three applications: vertex classification, vertex clustering, and graph bootstrap. In every case, the graph encoder embedding exhibits unrivalled computational advantages.
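    As we read it, the method amounts to projecting the adjacency matrix onto class-normalised one-hot label indicators; a minimal NumPy sketch (a simplified rendering of the idea that omits the Laplacian variant and assumes every class is present among the labels):

        import numpy as np

        def graph_encoder_embedding(A, y, K):
            """A: (n, n) adjacency matrix; y: integer labels in 0..K-1.
            Returns an (n, K) vertex embedding via one matrix product."""
            n = A.shape[0]
            counts = np.bincount(y, minlength=K)
            W = np.zeros((n, K))
            W[np.arange(n), y] = 1.0 / counts[y]   # class-normalised one-hot
            return A @ W                           # sparse-friendly product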
    Deep-Learning-based Vulnerability Detection in Binary Executables. (arXiv:2212.01254v1 [cs.CR])
    The identification of vulnerabilities is an important element in the software development life cycle to ensure the security of software. While vulnerability identification based on the source code is a well-studied field, identifying vulnerabilities in a binary executable without the corresponding source code is more challenging. Recent research [1] has shown how such detection can be achieved with deep learning methods. However, that particular approach is limited to the identification of only 4 types of vulnerabilities. Subsequently, we analyze to what extent we can cover the identification of a larger variety of vulnerabilities. To this end, we use a supervised deep learning approach based on recurrent neural networks to detect vulnerabilities in binary executables. The underlying basis is a dataset with 50,651 samples of vulnerable code in the form of a standardized LLVM Intermediate Representation. The vectorised features of a Word2Vec model are used to train different variations of three basic architectures of recurrent neural networks (GRU, LSTM, SRNN). A binary classification model was trained to detect the presence of an arbitrary vulnerability, and a multi-class model was trained to identify the exact vulnerability; they achieved an out-of-sample accuracy of 88% and 77%, respectively. Differences in the detection of different vulnerabilities were also observed, with non-vulnerable samples being detected with a particularly high precision of over 98%. Thus, the methodology presented allows an accurate detection of 23 (compared to 4 [1]) vulnerabilities.
    Black box tests for algorithmic stability. (arXiv:2111.15546v5 [cs.LG] UPDATED)
    Algorithmic stability is a concept from learning theory that expresses the degree to which changes to the input data (e.g., removal of a single data point) may affect the outputs of a regression algorithm. Knowing an algorithm's stability properties is often useful for many downstream applications -- for example, stability is known to lead to desirable generalization properties and predictive inference guarantees. However, many modern algorithms currently used in practice are too complex for a theoretical analysis of their stability properties, and thus we can only attempt to establish these properties through an empirical exploration of the algorithm's behavior on various data sets. In this work, we lay out a formal statistical framework for this kind of "black box testing" without any assumptions on the algorithm or the data distribution, and establish fundamental bounds on the ability of any black box test to identify algorithmic stability.
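    As a concrete illustration of the black-box setting, one can empirically probe leave-one-out stability by refitting the algorithm with single training points removed and recording how much test predictions move. This is only a naive probe under assumed names, not the paper's formal test or its bounds.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def loo_prediction_shifts(X, y, X_test, n_trials=20, seed=0):
    """Refit with one random training point removed and record the largest
    change in test predictions -- a crude empirical stability probe.
    X, X_test are 2D arrays; y is the training target vector."""
    rng = np.random.default_rng(seed)
    base = GradientBoostingRegressor(random_state=0).fit(X, y)
    base_pred = base.predict(X_test)
    shifts = []
    for _ in range(n_trials):
        i = rng.integers(len(X))
        keep = np.arange(len(X)) != i
        alt = GradientBoostingRegressor(random_state=0).fit(X[keep], y[keep])
        shifts.append(np.abs(alt.predict(X_test) - base_pred).max())
    return np.array(shifts)
```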
    Utilizing Prior Solutions for Reward Shaping and Composition in Entropy-Regularized Reinforcement Learning. (arXiv:2212.01174v1 [cs.LG])
    In reinforcement learning (RL), the ability to utilize prior knowledge from previously solved tasks can allow agents to quickly solve new problems. In some cases, these new problems may be approximately solved by composing the solutions of previously solved primitive tasks (task composition). Otherwise, prior knowledge can be used to adjust the reward function for a new problem, in a way that leaves the optimal policy unchanged but enables quicker learning (reward shaping). In this work, we develop a general framework for reward shaping and task composition in entropy-regularized RL. To do so, we derive an exact relation connecting the optimal soft value functions for two entropy-regularized RL problems with different reward functions and dynamics. We show how the derived relation leads to a general result for reward shaping in entropy-regularized RL. We then generalize this approach to derive an exact relation connecting optimal value functions for the composition of multiple tasks in entropy-regularized RL. We validate these theoretical contributions with experiments showing that reward shaping and task composition lead to faster learning in various settings.
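    For context, the classical potential-based shaping result that this line of work generalizes is worth stating; the entropy-regularized analogue derived in the paper relates soft value functions in a similar spirit, but its exact form is not reproduced here.

```latex
% Potential-based reward shaping (Ng et al., 1999): for any state
% potential \Phi, the shaped reward
\tilde{r}(s, a, s') = r(s, a, s') + \gamma\,\Phi(s') - \Phi(s)
% leaves the optimal policy unchanged while enabling faster learning.
```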
    Safe machine learning model release from Trusted Research Environments: The AI-SDC package. (arXiv:2212.01233v1 [cs.LG])
    We present AI-SDC, an integrated suite of open source Python tools to facilitate Statistical Disclosure Control (SDC) of Machine Learning (ML) models trained on confidential data prior to public release. AI-SDC combines (i) a SafeModel package that extends commonly used ML models to provide ante-hoc SDC by assessing the disclosure risk posed by the training regime; and (ii) an Attacks package that provides post-hoc SDC by rigorously assessing the empirical disclosure risk of a model through a variety of simulated attacks after training. The AI-SDC code and documentation are available under an MIT license at https://github.com/AI-SDC/AI-SDC.
    Ripple: Concept-Based Interpretation for Raw Time Series Models in Education. (arXiv:2212.01133v1 [cs.LG])
    Time series is the most prevalent form of input data for educational prediction tasks. The vast majority of research using time series data focuses on hand-crafted features, designed by experts for predictive performance and interpretability. However, extracting these features is labor-intensive for humans and computers. In this paper, we propose an approach that utilizes irregular multivariate time series modeling with graph neural networks to achieve comparable or better accuracy with raw time series clickstreams than with hand-crafted features. Furthermore, we extend concept activation vectors for interpretability in raw time series models. We analyze these advances in the education domain, addressing the task of early student performance prediction for downstream targeted interventions and instructional support. Our experimental analysis on 23 MOOCs with millions of combined interactions over six behavioral dimensions shows that models designed with our approach can (i) beat state-of-the-art educational time series baselines with no feature extraction and (ii) provide interpretable insights for personalized interventions. Source code: https://github.com/epfl-ml4ed/ripple/.
    Fake detection in imbalance dataset by Semi-supervised learning with GAN. (arXiv:2212.01071v1 [cs.LG])
    As social media grows, harassment becomes more prevalent, which has made fake-account detection a fascinating field for researchers. The graph nature of the data, with its large number of nodes, poses several obstacles, including a considerable number of unrelated features in highly dispersed matrices and imbalanced classes in the dataset. To deal with these issues, auto-encoders and a combination of semi-supervised learning and the GAN algorithm, called SGAN, were used. This paper deploys a small number of labels and applies SGAN as a classifier. Results show that accuracy reached 91% in detecting fake accounts using only 100 labeled samples.
    Clustering through Feature Space Sequence Discovery and Analysis. (arXiv:2212.00996v1 [cs.LG])
    Identifying high-dimensional data patterns without a priori knowledge is an important task of data science. This paper proposes a simple and efficient nonparametric algorithm, Data Convert to Sequence Analysis (DCSA), which dynamically explores each point in the feature space without repetition to find a directed Hamiltonian path. Based on change-point analysis theory, the sequence corresponding to the path is cut into several fragments to achieve clustering. Experiments on real-world datasets from different fields, with dimensions ranging from 4 to 20531, confirm that the method is robust and offers visual interpretability in result analysis.
    Gibbs-Helmholtz Graph Neural Network: capturing the temperature dependency of activity coefficients at infinite dilution. (arXiv:2212.01199v1 [physics.chem-ph])
    The accurate prediction of physicochemical properties of chemical compounds in mixtures (such as the activity coefficient at infinite dilution $\gamma_{ij}^\infty$) is essential for developing novel and more sustainable chemical processes. In this work, we analyze the performance of previously proposed GNN-based models for the prediction of $\gamma_{ij}^\infty$, and compare them with several mechanistic models in a series of 9 isothermal studies. Moreover, we develop the Gibbs-Helmholtz Graph Neural Network (GH-GNN) model for predicting $\ln \gamma_{ij}^\infty$ of molecular systems at different temperatures. Our method combines the simplicity of a Gibbs-Helmholtz-derived expression with a series of graph neural networks that incorporate explicit molecular and intermolecular descriptors for capturing dispersion and hydrogen bonding effects. We have trained this model using experimentally determined $\ln \gamma_{ij}^\infty$ data of 40,219 binary systems involving 1032 solutes and 866 solvents, overall showing superior performance compared to the popular UNIFAC-Dortmund model. We analyze the performance of GH-GNN for continuous and discrete inter/extrapolation and give indications for the model's applicability domain and expected accuracy. In general, GH-GNN is able to produce accurate predictions for extrapolated binary systems if at least 25 systems with the same combination of solute-solvent chemical classes are contained in the training set and a similarity indicator above 0.35 is also present. This model and its applicability domain recommendations have been made open-source at https://github.com/edgarsmdn/GH-GNN.
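    One plausible reading of "a Gibbs-Helmholtz-derived expression" is a linear-in-inverse-temperature form whose coefficients are produced by the GNNs; the parameterization below is an assumption for illustration, not necessarily the paper's exact equation.

```latex
% Assumed Gibbs-Helmholtz-style parameterization of the infinite-dilution
% activity coefficient of solute i in solvent j, with A_{ij} and B_{ij}
% predicted by graph neural networks from the molecular graphs:
\ln \gamma_{ij}^{\infty}(T) \;=\; A_{ij} + \frac{B_{ij}}{T}
```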
    Empirical Asset Pricing via Ensemble Gaussian Process Regression. (arXiv:2212.01048v1 [q-fin.RM])
    We introduce an ensemble learning method based on Gaussian Process Regression (GPR) for predicting conditional expected stock returns given stock-level and macro-economic information. Our ensemble learning approach significantly reduces the computational complexity inherent in GPR inference and lends itself to general online learning tasks. We conduct an empirical analysis on a large cross-section of US stocks from 1962 to 2016. We find that our method dominates existing machine learning models statistically and economically in terms of out-of-sample $R$-squared and Sharpe ratio of prediction-sorted portfolios. Exploiting the Bayesian nature of GPR, we introduce the mean-variance optimal portfolio with respect to the predictive uncertainty distribution of the expected stock returns. It appeals to an uncertainty averse investor and significantly dominates the equal- and value-weighted prediction-sorted portfolios, which outperform the S&P 500.  ( 2 min )
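    A simple way to see how an ensemble reduces the cost of exact GPR inference is to fit independent GPs on random data subsets and pool their posterior moments. The sketch below, with illustrative kernel and pooling choices, is in this spirit only; the paper's scheme may differ.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def ensemble_gpr_predict(X, y, X_new, n_members=8, seed=0):
    """Fit GPs on random subsets (each O((n/m)^3) instead of O(n^3))
    and pool predictive means and variances across members."""
    rng = np.random.default_rng(seed)
    means, variances = [], []
    for _ in range(n_members):
        idx = rng.choice(len(X), size=len(X) // n_members, replace=False)
        gp = GaussianProcessRegressor(kernel=RBF(), normalize_y=True)
        gp.fit(X[idx], y[idx])
        mu, sd = gp.predict(X_new, return_std=True)
        means.append(mu)
        variances.append(sd ** 2)
    mean = np.mean(means, axis=0)
    # total uncertainty: average member variance + spread of member means
    var = np.mean(variances, axis=0) + np.var(means, axis=0)
    return mean, var
```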
    Semantic Role Labeling Meets Definition Modeling: Using Natural Language to Describe Predicate-Argument Structures. (arXiv:2212.01094v1 [cs.CL])
    One of the common traits of past and present approaches for Semantic Role Labeling (SRL) is that they rely upon discrete labels drawn from a predefined linguistic inventory to classify predicate senses and their arguments. However, we argue this need not be the case. In this paper, we present an approach that leverages Definition Modeling to introduce a generalized formulation of SRL as the task of describing predicate-argument structures using natural language definitions instead of discrete labels. Our novel formulation takes a first step towards placing interpretability and flexibility foremost, and yet our experiments and analyses on PropBank-style and FrameNet-style, dependency-based and span-based SRL also demonstrate that a flexible model with an interpretable output does not necessarily come at the expense of performance. We release our software for research purposes at https://github.com/SapienzaNLP/dsrl.
    Learning Temporal Logic Properties: an Overview of Two Recent Methods. (arXiv:2212.00916v1 [cs.LO])
    Learning linear temporal logic (LTL) formulas from examples labeled as positive or negative has found applications in inferring descriptions of system behavior. We summarize two methods to learn LTL formulas from examples in two different problem settings. The first method assumes noise in the labeling of the examples; for this, it defines the problem of inferring an LTL formula that must be consistent with most but not all of the examples. The second method considers the problem of inferring meaningful LTL formulas in the case where only positive examples are given. Hence, the first method addresses robustness to noise, and the second method addresses the balance between conciseness and specificity (i.e., language minimality) of the inferred formula. The summarized methods propose different algorithms to solve the aforementioned problems, as well as to infer other descriptions of temporal properties, such as signal temporal logic or deterministic finite automata.  ( 2 min )
    A Model-based GNN for Learning Precoding. (arXiv:2212.00860v1 [eess.SP])
    Learning precoding policies with neural networks enables low-complexity online implementation, robustness to channel impairments, and joint optimization with channel acquisition. However, existing neural networks suffer from high training complexity and poor generalization ability when they are used to learn to optimize precoding for mitigating multi-user interference. This impedes their use in practical systems where the number of users is time-varying. In this paper, we propose a graph neural network (GNN) to learn precoding policies by harnessing both the mathematical model and the properties of the policies. We first show that a vanilla GNN cannot learn the pseudo-inverse of the channel matrix well when the numbers of antennas and users are large, and does not generalize to unseen numbers of users. Then, we design a GNN by resorting to the Taylor expansion of the matrix pseudo-inverse, which allows for capturing the importance of the neighboring edges to be aggregated -- crucial for learning precoding policies efficiently. Simulation results show that the proposed GNN can learn spectrally and energy efficient precoding policies in single- and multi-cell multi-user multi-antenna systems with low training complexity, and generalizes well to unseen numbers of users.  ( 2 min )
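    The Taylor-expansion idea can be made concrete with the Neumann-series view of the pseudo-inverse, which expresses the matrix inverse as a sum of powers that a GNN can approximate term by term through repeated edge aggregation. The notation and convergence condition below are standard linear algebra, assumed rather than quoted from the paper.

```latex
% For a channel matrix H with full row rank, the pseudo-inverse is
% H^{\dagger} = H^{H} (H H^{H})^{-1}; with a suitable scalar \alpha
% (e.g. \alpha > \lambda_{\max}(H H^{H}) / 2), the inverse admits the
% convergent Neumann series
(H H^{H})^{-1} \;=\; \frac{1}{\alpha} \sum_{k=0}^{\infty}
\Big( I - \frac{H H^{H}}{\alpha} \Big)^{k}
```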
    Applications of Lattice Gauge Equivariant Neural Networks. (arXiv:2212.00832v1 [hep-lat])
    The introduction of relevant physical information into neural network architectures has become a widely used and successful strategy for improving their performance. In lattice gauge theories, such information can be identified with gauge symmetries, which are incorporated into the network layers of our recently proposed Lattice Gauge Equivariant Convolutional Neural Networks (L-CNNs). L-CNNs can generalize better to differently sized lattices than traditional neural networks and are by construction equivariant under lattice gauge transformations. In these proceedings, we present our progress on possible applications of L-CNNs to Wilson flow or continuous normalizing flow. Our methods are based on neural ordinary differential equations which allow us to modify link configurations in a gauge equivariant manner. For simplicity, we focus on simple toy models to test these ideas in practice.  ( 2 min )
    Improving Pareto Front Learning via Multi-Sample Hypernetworks. (arXiv:2212.01130v1 [cs.LG])
    Pareto Front Learning (PFL) was recently introduced as an effective approach to obtain a mapping function from a given trade-off vector to a solution on the Pareto front, which solves the multi-objective optimization (MOO) problem. Due to the inherent trade-off between conflicting objectives, PFL offers a flexible approach in many scenarios in which the decision makers cannot specify the preference of one Pareto solution over another, and must switch between them depending on the situation. However, existing PFL methods ignore the relationship between the solutions during the optimization process, which hinders the quality of the obtained front. To overcome this issue, we propose a novel PFL framework that employs a hypernetwork to generate multiple solutions from a set of diverse trade-off preferences and enhances the quality of the Pareto front by maximizing the Hypervolume indicator defined by these solutions. The experimental results on several MOO machine learning tasks show that the proposed framework significantly outperforms the baselines in producing the trade-off Pareto front.  ( 2 min )
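    Since the framework trains the hypernetwork by maximizing the hypervolume of the generated solutions, it helps to recall what that indicator computes. For two minimization objectives it is the area dominated by the solution set up to a reference point, as in this small sketch (a plain restatement of the standard indicator, not the paper's code):

```python
import numpy as np

def hypervolume_2d(points, ref):
    """Hypervolume (dominated area) of a 2-objective minimization front
    with respect to reference point `ref`, via a left-to-right sweep."""
    pts = sorted({(float(f1), float(f2)) for f1, f2 in points})
    hv, prev_f2 = 0.0, ref[1]
    for f1, f2 in pts:
        if f1 >= ref[0] or f2 >= prev_f2:
            continue  # outside the reference box or dominated in the sweep
        hv += (ref[0] - f1) * (prev_f2 - f2)
        prev_f2 = f2
    return hv

# e.g. hypervolume_2d([(0.0, 1.0), (1.0, 0.0)], ref=(2.0, 2.0)) == 3.0
```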
    An Introduction to Kernel and Operator Learning Methods for Homogenization by Self-consistent Clustering Analysis. (arXiv:2212.00802v1 [cs.LG])
    Recent advances in operator learning theory have improved our knowledge about learning maps between infinite dimensional spaces. However, for large-scale engineering problems such as concurrent multiscale simulation for mechanical properties, the training cost for current operator learning methods is very high. The article presents a thorough analysis of the mathematical underpinnings of the operator learning paradigm and proposes a kernel learning method that maps between function spaces. We first provide a survey of modern kernel and operator learning theory, as well as discuss recent results and open problems. From there, the article presents an algorithm for analytically approximating piecewise-constant functions on $\mathbb{R}$ for operator learning, which suggests that neural operators can succeed on clustered functions. Finally, a k-means clustered domain on the basis of a mechanistic response is considered and the Lippmann-Schwinger equation for micro-mechanical homogenization is solved. The article briefly discusses the mathematics of previous kernel learning methods and some preliminary results with those methods. The proposed kernel operator learning method uses graph kernel networks to arrive at a mechanistic reduced-order method for multiscale homogenization.  ( 2 min )
    Stable Learning via Sparse Variable Independence. (arXiv:2212.00992v1 [cs.LG])
    The problem of covariate-shift generalization has attracted intensive research attention. Previous stable learning algorithms employ sample reweighting schemes to decorrelate the covariates when there is no explicit domain information about training data. However, with finite samples, it is difficult to achieve the desirable weights that ensure perfect independence to get rid of the unstable variables. Besides, decorrelating within stable variables may bring about high variance of learned models because of the over-reduced effective sample size. A tremendous sample size is required for these algorithms to work. In this paper, with theoretical justification, we propose SVI (Sparse Variable Independence) for the covariate-shift generalization problem. We introduce sparsity constraint to compensate for the imperfectness of sample reweighting under the finite-sample setting in previous methods. Furthermore, we organically combine independence-based sample reweighting and sparsity-based variable selection in an iterative way to avoid decorrelating within stable variables, increasing the effective sample size to alleviate variance inflation. Experiments on both synthetic and real-world datasets demonstrate the improvement of covariate-shift generalization performance brought by SVI.  ( 2 min )
    Guaranteed Conformance of Neurosymbolic Models to Natural Constraints. (arXiv:2212.01346v1 [cs.LG])
    Deep neural networks have emerged as the workhorse for a large section of robotics and control applications, especially as models for dynamical systems. Such data-driven models are in turn used for designing and verifying autonomous systems. This is particularly useful in modeling medical systems where data can be leveraged to individualize treatment. In safety-critical applications, it is important that the data-driven model is conformant to established knowledge from the natural sciences. Such knowledge is often available or can often be distilled into a (possibly black-box) model $M$, for instance, the unicycle model for an F1 racing car. In this light, we consider the following problem: given a model $M$ and a state transition dataset, we wish to best approximate the system model while remaining within a bounded distance of $M$. We propose a method to guarantee this conformance. Our first step is to distill the dataset into a few representative samples called memories, using the idea of a growing neural gas. Next, using these memories we partition the state space into disjoint subsets and compute bounds that should be respected by the neural network when the input is drawn from a particular subset. This serves as a symbolic wrapper for guaranteed conformance. We argue theoretically that this only leads to a bounded increase in approximation error, which can be controlled by increasing the number of memories. We experimentally show that on three case studies (Car Model, Drones, and Artificial Pancreas), our constrained neurosymbolic models conform to specified $M$ models (each encoding various constraints) with order-of-magnitude improvements compared to the augmented Lagrangian and vanilla training methods.
    Spectral Feature Augmentation for Graph Contrastive Learning and Beyond. (arXiv:2212.01026v1 [cs.LG])
    Although augmentations (e.g., perturbation of graph edges, image crops) boost the efficiency of Contrastive Learning (CL), feature-level augmentation is another plausible, complementary, yet not well researched strategy. Thus, we present a novel spectral feature augmentation for contrastive learning on graphs (and images). To this end, for each data view, we estimate a low-rank approximation per feature map and subtract that approximation from the map to obtain its complement. This is achieved by the proposed incomplete power iteration, a non-standard power iteration regime which enjoys two valuable byproducts (after just one or two iterations): (i) it partially balances the spectrum of the feature map, and (ii) it injects noise into the rebalanced singular values of the feature map (spectral augmentation). For two views, we align these rebalanced feature maps, as such an improved alignment step can focus more on less dominant singular values of the matrices of both views, whereas the spectral augmentation does not affect the spectral angle alignment (singular vectors are not perturbed). We derive the analytical form for: (i) the incomplete power iteration, to capture its spectrum-balancing effect, and (ii) the variance of singular values augmented implicitly by the noise. We also show that the spectral augmentation improves the generalization bound. Experiments on graph/image datasets show that our spectral feature augmentation outperforms baselines, and is complementary with other augmentation strategies and compatible with various contrastive losses.  ( 2 min )
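    A minimal sketch of the idea, assuming the simplest rank-1 case: run only one or two power-iteration steps to get a deliberately noisy estimate of the leading singular direction, then subtract that component from the feature map. Shapes and step counts are illustrative.

```python
import torch

def spectral_feature_augmentation(Z, n_iter=1):
    """Z: (n, d) feature map of one view. Estimate the leading right
    singular direction with an *incomplete* power iteration (1-2 steps),
    then return the complement of the resulting rank-1 approximation."""
    v = torch.randn(Z.shape[1], 1, device=Z.device)
    v = v / v.norm()
    for _ in range(n_iter):          # incomplete on purpose: noisy estimate
        v = Z.T @ (Z @ v)
        v = v / v.norm()
    u = Z @ v                        # component of Z along direction v
    return Z - u @ v.T               # rebalanced (complement) feature map
```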
    Flow to Control: Offline Reinforcement Learning with Lossless Primitive Discovery. (arXiv:2212.01105v1 [cs.LG])
    Offline reinforcement learning (RL) enables the agent to effectively learn from logged data, which significantly extends the applicability of RL algorithms in real-world scenarios where exploration can be expensive or unsafe. Previous works have shown that extracting primitive skills from the recurring and temporally extended structures in the logged data yields better learning. However, these methods suffer greatly when the primitives have limited representation ability to recover the original policy space, especially in offline settings. In this paper, we give a quantitative characterization of the performance of offline hierarchical learning and highlight the importance of learning lossless primitives. To this end, we propose to use a \emph{flow}-based structure as the representation for low-level policies. This allows us to represent the behaviors in the dataset faithfully while keeping the expression ability to recover the whole policy space. We show that such lossless primitives can drastically improve the performance of hierarchical policies. The experimental results and extensive ablation studies on the standard D4RL benchmark show that our method has a good representation ability for policies and achieves superior performance in most tasks.  ( 2 min )
    Credit Assignment for Trained Neural Networks Based on Koopman Operator Theory. (arXiv:2212.00998v1 [cs.LG])
    The credit assignment problem of neural networks refers to evaluating the credit of each network component to the final outputs. For an untrained neural network, approaches to tackling it have made great contributions to parameter updates and model evolution during the training phase. For trained neural networks, this problem has received little attention; nevertheless, it plays an increasingly important role in neural network patching, specification, and verification. Based on Koopman operator theory, this paper presents an alternative perspective of linear dynamics for dealing with the credit assignment problem for trained neural networks. Regarding a neural network as the composition of a series of sub-dynamics, we utilize step-delay embedding to capture snapshots of each component, characterizing the established mapping as exactly as possible. To circumvent the dimension-difference problem encountered during the embedding, a composition and decomposition of an auxiliary linear layer, termed minimal linear dimension alignment, is carefully designed with rigorous formal guarantees. Afterwards, each component is approximated by a Koopman operator and we derive the Jacobian matrix and its corresponding determinant, similar to backward propagation. We can then define a metric with algebraic interpretability for the credit assignment of each network component. Moreover, experiments conducted on typical neural networks demonstrate the effectiveness of the proposed method.  ( 2 min )
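    The linear-dynamics view can be illustrated with a DMD-style least-squares fit: stack snapshots of a component's successive states and solve for the best linear operator between them. A determinant-based credit score is then immediate for the fitted linear map. The names and the toy metric here are assumptions; the paper's metric is derived from Jacobians and is richer.

```python
import numpy as np

def fit_koopman_operator(snapshots):
    """snapshots: (d, T) matrix of successive states of one sub-dynamics.
    Returns K minimizing ||Y - K X||_F with X, Y the time-shifted snapshot
    matrices (the standard DMD least-squares fit)."""
    X, Y = snapshots[:, :-1], snapshots[:, 1:]
    return Y @ np.linalg.pinv(X)

def linear_credit(K):
    """Toy credit score for the fitted linear component: |det K|, i.e.
    how strongly the map expands or contracts volume."""
    return abs(np.linalg.det(K))
```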
    Learning Robust State Observers using Neural ODEs (longer version). (arXiv:2212.00866v1 [eess.SY])
    Relying on recent research results on Neural ODEs, this paper presents a methodology for the design of state observers for nonlinear systems based on Neural ODEs, learning Luenberger-like observers and their nonlinear extension (Kazantzis-Kravaris-Luenberger (KKL) observers) for systems with partially-known nonlinear dynamics and fully unknown nonlinear dynamics, respectively. In particular, for tuneable KKL observers, the relationship between the design of the observer and its trade-off between convergence speed and robustness is analysed and used as a basis for improving the robustness of the learning-based observer in training. We illustrate the advantages of this approach in numerical simulations.  ( 2 min )
    Progress and Challenges for the Application of Machine Learning for Neglected Tropical Diseases. (arXiv:2212.01027v1 [q-bio.BM])
    Neglected tropical diseases (NTDs) continue to affect the livelihood of individuals in countries in the Southeast Asia and Western Pacific region. These diseases have long existed and have caused devastating health problems and economic decline to people in low- and middle-income (developing) countries. An estimated 1.7 billion people worldwide suffer from one or more NTDs annually, putting approximately one in five individuals at risk. In addition to their health and social impact, NTDs inflict significant financial burden on patients and close relatives, and are responsible for billions of dollars lost in revenue from reduced labor productivity in developing countries alone. There is an urgent need to improve the control and eradication or elimination efforts towards NTDs. This can be achieved by utilizing machine learning tools to strengthen surveillance, prediction, and detection programs, and to combat NTDs through the discovery of new therapeutics against these pathogens. This review surveys the current applications of machine learning tools for NTDs and the challenges to elevating the state of the art in NTD surveillance, management, and treatment.  ( 2 min )
    SoftCorrect: Error Correction with Soft Detection for Automatic Speech Recognition. (arXiv:2212.01039v1 [cs.CL])
    Error correction in automatic speech recognition (ASR) aims to correct the incorrect words in sentences generated by ASR models. Since recent ASR models usually have a low word error rate (WER), to avoid affecting originally correct tokens, error correction models should only modify incorrect words, and therefore detecting incorrect words is important for error correction. Previous works on error correction either implicitly detect error words through target-source attention or CTC (connectionist temporal classification) loss, or explicitly locate specific deletion/substitution/insertion errors. However, implicit error detection does not provide a clear signal about which tokens are incorrect, and explicit error detection suffers from low detection accuracy. In this paper, we propose SoftCorrect with a soft error detection mechanism to avoid the limitations of both explicit and implicit error detection. Specifically, we first detect whether a token is correct or not through a probability produced by a specially designed language model, and then design a constrained CTC loss that only duplicates the detected incorrect tokens to let the decoder focus on the correction of error tokens. Compared with implicit error detection with CTC loss, SoftCorrect provides an explicit signal about which words are incorrect and thus does not need to duplicate every token but only incorrect tokens; compared with explicit error detection, SoftCorrect does not detect specific deletion/substitution/insertion errors but simply leaves this to the CTC loss. Experiments on the AISHELL-1 and Aidatatang datasets show that SoftCorrect achieves 26.1% and 9.4% CER reduction respectively, outperforming previous works by a large margin, while still enjoying the fast speed of parallel generation.  ( 2 min )
    Modeling Mobile Health Users as Reinforcement Learning Agents. (arXiv:2212.00863v1 [cs.LG])
    Mobile health (mHealth) technologies empower patients to adopt/maintain healthy behaviors in their daily lives, by providing interventions (e.g. push notifications) tailored to the user's needs. In these settings, without intervention, human decision making may be impaired (e.g. valuing near term pleasure over own long term goals). In this work, we formalize this relationship with a framework in which the user optimizes a (potentially impaired) Markov Decision Process (MDP) and the mHealth agent intervenes on the user's MDP parameters. We show that different types of impairments imply different types of optimal intervention. We also provide analytical and empirical explorations of these differences.  ( 2 min )
    Fast Algorithm for Constrained Linear Inverse Problems. (arXiv:2212.01068v1 [math.OC])
    We consider the constrained Linear Inverse Problem (LIP), where a certain atomic norm (like the $\ell_1$ and the Nuclear norm) is minimized subject to a quadratic constraint. Typically, such cost functions are non-differentiable, which makes them not amenable to the fast optimization methods existing in practice. We propose two equivalent reformulations of the constrained LIP with improved convex regularity: (i) a smooth convex minimization problem, and (ii) a strongly convex min-max problem. These problems could be solved by applying existing acceleration-based convex optimization methods, which provide a better $O\left(1/k^2\right)$ theoretical convergence guarantee. However, to fully exploit the utility of these reformulations, we also provide a novel algorithm, to which we refer as the Fast Linear Inverse Problem Solver (FLIPS), that is tailored to solve the reformulation of the LIP. We demonstrate the performance of FLIPS on the sparse coding problem arising in image processing tasks. In this setting, we observe that FLIPS consistently outperforms the Chambolle-Pock and C-SALSA algorithms -- two of the current best methods in the literature.  ( 2 min )
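    Written out, the problem class in question is the following (the symbols are assumed notation; the abstract itself names only the atomic norm and the quadratic constraint):

```latex
% Constrained LIP: minimize an atomic norm (e.g. \ell_1 for sparse coding,
% nuclear norm for low-rank recovery) subject to a quadratic data-fit
% constraint with measurement operator \Phi, data y and tolerance \varepsilon:
\min_{x}\; \|x\|_{\mathcal{A}}
\quad \text{s.t.} \quad \|\Phi x - y\|_{2} \le \varepsilon
```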
    PASTA: Proportional Amplitude Spectrum Training Augmentation for Syn-to-Real Domain Generalization. (arXiv:2212.00979v1 [cs.CV])
    Synthetic data offers the promise of cheap and bountiful training data for settings where lots of labeled real-world data for tasks is unavailable. However, models trained on synthetic data significantly underperform on real-world data. In this paper, we propose Proportional Amplitude Spectrum Training Augmentation (PASTA), a simple and effective augmentation strategy to improve out-of-the-box synthetic-to-real (syn-to-real) generalization performance. PASTA involves perturbing the amplitude spectra of the synthetic images in the Fourier domain to generate augmented views. We design PASTA to perturb the amplitude spectra in a structured manner such that high-frequency components are perturbed relatively more than low-frequency ones. For the tasks of semantic segmentation (GTAV to Real), object detection (Sim10K to Real), and object recognition (VisDA-C Syn to Real), across a total of 5 syn-to-real shifts, we find that PASTA outperforms more complex state-of-the-art generalization methods while remaining complementary to them.  ( 2 min )
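    The core operation is easy to prototype: take the 2D FFT of a (synthetic) image, jitter the amplitude spectrum with noise whose strength grows with spatial frequency, and invert. The scaling rule and constants below are illustrative assumptions, not the exact configuration from the paper.

```python
import numpy as np

def pasta_augment(img, alpha=3.0, k=2.0, beta=0.25, seed=None):
    """Perturb the Fourier amplitude spectrum of a single-channel image,
    proportionally more at high frequencies (apply per channel for RGB)."""
    rng = np.random.default_rng(seed)
    F = np.fft.fft2(img)
    amp, phase = np.abs(F), np.angle(F)
    fy = np.fft.fftfreq(img.shape[0])[:, None]
    fx = np.fft.fftfreq(img.shape[1])[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)        # 0 at DC, larger at high freq.
    sigma = beta + alpha * radius ** k         # frequency-dependent jitter
    amp = amp * rng.normal(1.0, sigma)         # multiplicative amplitude noise
    return np.fft.ifft2(amp * np.exp(1j * phase)).real
```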
    AL-iGAN: An Active Learning Framework for Tunnel Geological Reconstruction Based on TBM Operational Data. (arXiv:2212.00965v1 [cs.LG])
    In tunnel boring machine (TBM) underground projects, an accurate description of the rock-soil types distributed in the tunnel can decrease construction risk (e.g., surface settlement and landslides) and improve the efficiency of construction. In this paper, we propose an active learning framework, called AL-iGAN, for tunnel geological reconstruction based on TBM operational data. This framework contains two main parts: one is the use of active learning techniques for recommending new drilling locations to label the TBM operational data and thereby form new training samples; the other is an incremental generative adversarial network for geological reconstruction (iGAN-GR), whose weights can be incrementally updated to improve reconstruction performance using the new samples. Numerical experiments validate the effectiveness of the proposed framework.  ( 2 min )
    VeriX: Towards Verified Explainability of Deep Neural Networks. (arXiv:2212.01051v1 [cs.LG])
    We present VeriX, a first step towards verified explainability of machine learning models in safety-critical applications. Specifically, our sound and optimal explanations can guarantee prediction invariance against bounded perturbations. We utilise constraint solving techniques together with feature sensitivity ranking to efficiently compute these explanations. We evaluate our approach on image recognition benchmarks and a real-world scenario of autonomous aircraft taxiing.
    Assessing Anonymized System Logs Usefulness for Behavioral Analysis in RNN Models. (arXiv:2212.01101v1 [cs.LG])
    System logs are a common source of monitoring data for analyzing the behavior of computing systems. Due to the complexity of modern computing systems and the large size of collected monitoring data, automated analysis mechanisms are required. Numerous machine learning and deep learning methods have been proposed to address this challenge. However, due to the existence of sensitive data in system logs, their analysis and storage raise serious privacy concerns. Anonymization methods can be used to clean the monitoring data before analysis. However, anonymized system logs in general do not provide adequate usefulness for most behavioral analyses. Content-aware anonymization mechanisms such as PaRS preserve the correlation of system logs even after anonymization. This work evaluates the usefulness of anonymized system logs taken from the Taurus HPC cluster, anonymized using PaRS, for behavioral analysis via recurrent neural network models.  ( 2 min )
    Accelerating Inverse Learning via Intelligent Localization with Exploratory Sampling. (arXiv:2212.01016v1 [cs.LG])
    In the scope of "AI for Science", solving inverse problems is a longstanding challenge in materials and drug discovery, where the goal is to determine the hidden structures given a set of desirable properties. Deep generative models have recently been proposed to solve inverse problems, but these currently use expensive forward operators and struggle to precisely localize the exact solutions and to fully explore the parameter spaces without missing solutions. In this work, we propose a novel approach (called iPage) to accelerate the inverse learning process by leveraging probabilistic inference from deep invertible models and deterministic optimization via fast gradient descent. Given a target property, the learned invertible model provides a posterior over the parameter space; we identify these posterior samples as an intelligent prior initialization which enables us to narrow down the search space. We then perform gradient descent to calibrate the inverse solutions within a local region. Meanwhile, a space-filling sampling is imposed on the latent space to better explore and capture all possible solutions. We evaluate our approach on three benchmark tasks and two created datasets with real-world applications from quantum chemistry and additive manufacturing, and find our method achieves superior performance compared to several state-of-the-art baseline methods. The iPage code is available at https://github.com/jxzhangjhu/MatDesINNe.  ( 2 min )
    SOLD: Sinhala Offensive Language Dataset. (arXiv:2212.00851v1 [cs.CL])
    The spread of offensive content online, such as hate speech and cyber-bullying, is a global phenomenon. This has sparked interest in the artificial intelligence (AI) and natural language processing (NLP) communities, motivating the development of various systems trained to detect potentially harmful content automatically. These systems require annotated datasets to train the machine learning (ML) models. However, with a few notable exceptions, most datasets on this topic have dealt with English and a few other high-resource languages. As a result, research in offensive language identification has been limited to these languages. This paper addresses this gap by tackling offensive language identification in Sinhala, a low-resource Indo-Aryan language spoken by over 17 million people in Sri Lanka. We introduce the Sinhala Offensive Language Dataset (SOLD) and present multiple experiments on this dataset. SOLD is a manually annotated dataset containing 10,000 posts from Twitter annotated as offensive and not offensive at both sentence-level and token-level, improving the explainability of the ML models. SOLD is the first large publicly available offensive language dataset compiled for Sinhala. We also introduce SemiSOLD, a larger dataset containing more than 145,000 Sinhala tweets, annotated following a semi-supervised approach.  ( 2 min )
    Progressive Feature Upgrade in Semi-supervised Learning on Tabular Domain. (arXiv:2212.00892v1 [cs.LG])
    Recent semi-supervised and self-supervised methods have shown great success in the image and text domains by utilizing augmentation techniques. Despite such success, it has proven hard to transfer to the tabular domain: domain-specific transformations from images and language are difficult to adapt to tabular data because it mixes different data types (continuous and categorical). There are a few semi-supervised works on the tabular domain that have focused on proposing new augmentation techniques for tabular data. These approaches may have shown some improvement on datasets with low cardinality in categorical data. However, the fundamental challenges have not been tackled. The proposed methods either do not apply to datasets with high cardinality or do not use an efficient encoding of categorical data. We propose using a conditional probability representation and an efficient progressive feature upgrading framework to effectively learn representations for tabular data in semi-supervised applications. Extensive experiments show the superior performance of the proposed framework and its potential application in semi-supervised settings.  ( 2 min )
    Improved Representation Learning Through Tensorized Autoencoders. (arXiv:2212.01046v1 [cs.LG])
    The central question in representation learning is what constitutes a good or meaningful representation. In this work we argue that if we consider data with inherent cluster structures, where clusters can be characterized through different means and covariances, those data structures should be represented in the embedding as well. While Autoencoders (AE) are widely used in practice for unsupervised representation learning, they do not fulfil the above condition on the embedding as they obtain a single representation of the data. To overcome this we propose a meta-algorithm that can be used to extend an arbitrary AE architecture to a tensorized version (TAE) that allows for learning cluster-specific embeddings while simultaneously learning the cluster assignment. For the linear setting we prove that TAE can recover the principal components of the different clusters, in contrast to the principal components of the entire data recovered by a standard AE. We validate this on planted models, and for general non-linear and convolutional AEs we empirically illustrate that tensorizing the AE is beneficial for clustering and de-noising tasks.  ( 2 min )
    Bayesian Physics Informed Neural Networks for Data Assimilation and Spatio-Temporal Modelling of Wildfires. (arXiv:2212.00970v1 [cs.LG])
    We apply Physics Informed Neural Networks (PINNs) to the problem of wildfire fire-front modelling. The PINN is an approach that integrates a differential equation into the optimisation loss function of a neural network to guide the neural network to learn the physics of a problem. We apply the PINN to the level-set equation, which is a Hamilton-Jacobi partial differential equation whose zero-level set models the fire-front. This results in a PINN that simulates a fire-front as it propagates through a spatio-temporal domain. We demonstrate the agility of the PINN in learning the physical properties of a fire under extreme changes in external conditions (such as wind) and show that this approach encourages continuity of the PINN's solution across time. Furthermore, we demonstrate how data assimilation and uncertainty quantification can be incorporated into the PINN in the wildfire context. This is a significant contribution to wildfire modelling, as the level-set method -- the standard solver for the level-set equation -- does not naturally provide this capability.  ( 2 min )
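    The level-set residual that such a PINN penalizes can be sketched directly with automatic differentiation. Here `phi_net` (the network mapping (t, x, y) to the level-set field) and `speed_fn` (the fire spread rate, e.g. wind-dependent) are hypothetical names, and the equation form phi_t + s |grad phi| = 0 is the standard level-set model referenced above.

```python
import torch

def level_set_residual(phi_net, txy, speed_fn):
    """PDE residual of phi_t + s(t,x,y) * |grad phi| = 0 at points txy,
    where txy is an (N, 3) tensor with columns (t, x, y)."""
    txy = txy.clone().requires_grad_(True)
    phi = phi_net(txy)
    grads = torch.autograd.grad(phi.sum(), txy, create_graph=True)[0]
    phi_t, phi_x, phi_y = grads[:, 0], grads[:, 1], grads[:, 2]
    return phi_t + speed_fn(txy) * torch.sqrt(phi_x**2 + phi_y**2 + 1e-12)

# training would minimise residual.pow(2).mean() plus data-assimilation terms
```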
    Generative Data Augmentation for Non-IID Problem in Decentralized Clinical Machine Learning. (arXiv:2212.01109v1 [cs.LG])
    Swarm learning (SL) is an emerging and promising decentralized machine learning paradigm that has achieved high performance in clinical applications. SL addresses the central-server structure of federated learning by combining edge computing with a blockchain-based peer-to-peer network. While there are promising results under the assumption of independent and identically distributed (IID) data across participants, SL suffers from performance degradation as the degree of non-IID data increases. To address this problem, we propose a generative augmentation framework in swarm learning called SL-GAN, which augments non-IID data by generating synthetic data from participants. SL-GAN trains generators and discriminators locally and periodically aggregates them via a randomly elected coordinator in the SL network. Under standard assumptions, we theoretically prove the convergence of SL-GAN using stochastic approximation. Experimental results demonstrate that SL-GAN outperforms state-of-the-art methods on three real-world clinical datasets: Tuberculosis, Leukemia, and COVID-19.  ( 2 min )
    AGRO: Adversarial Discovery of Error-prone groups for Robust Optimization. (arXiv:2212.00921v1 [cs.LG])
    Models trained via empirical risk minimization (ERM) are known to rely on spurious correlations between labels and task-independent input features, resulting in poor generalization to distributional shifts. Group distributionally robust optimization (G-DRO) can alleviate this problem by minimizing the worst-case loss over a set of pre-defined groups of training data. G-DRO successfully improves performance on the worst group, where the correlation does not hold. However, G-DRO assumes that the spurious correlations and associated worst groups are known in advance, making it challenging to apply to new tasks with potentially multiple unknown spurious correlations. We propose AGRO -- Adversarial Group discovery for Distributionally Robust Optimization -- an end-to-end approach that jointly identifies error-prone groups and improves accuracy on them. AGRO equips G-DRO with an adversarial slicing model to find a group assignment for training examples which maximizes worst-case loss over the discovered groups. On the WILDS benchmark, AGRO results in 8% higher model performance on average on known worst-groups, compared to prior group discovery approaches used with G-DRO. AGRO also improves out-of-distribution performance on SST2, QQP, and MS-COCO -- datasets where potential spurious correlations are as yet uncharacterized. Human evaluation of AGRO groups shows that they contain well-defined, yet previously unstudied spurious correlations that lead to model errors.  ( 2 min )
    On the Limit of Explaining Black-box Temporal Graph Neural Networks. (arXiv:2212.00952v1 [cs.LG])
    The Temporal Graph Neural Network (TGNN) has been receiving a lot of attention recently due to its capability in modeling time-evolving graph-related tasks. As with Graph Neural Networks, it is non-trivial to interpret predictions made by a TGNN due to its black-box nature. A major family of approaches tackling this problem in GNNs analyzes the model's responses to perturbations of its inputs; these are called perturbation-based explanation methods. While these methods are convenient and flexible, since they do not need internal access to the model, does this lack of internal access prevent them from revealing some important information about the predictions? Motivated by that question, this work studies the limits of some classes of perturbation-based explanation methods. In particular, by constructing specific instances of TGNNs, we show that (i) node perturbation cannot reliably identify the paths carrying out the prediction, (ii) edge perturbation is not reliable in determining all nodes contributing to the prediction, and (iii) perturbing both nodes and edges does not reliably help us identify the graph's components carrying out the temporal aggregation in TGNNs.  ( 2 min )
    A Hybrid Deep Learning Anomaly Detection Framework for Intrusion Detection. (arXiv:2212.00966v1 [cs.CR])
    Cyber intrusion attacks that compromise users' critical and sensitive data are escalating in volume and intensity, especially with the growing connections between our daily life and the Internet. The large volume and high complexity of such intrusion attacks have impeded the effectiveness of most traditional defence techniques. At the same time, the remarkable performance of machine learning methods, especially deep learning in computer vision, has garnered research interest from the cyber security community in further enhancing and automating intrusion detection. However, the expensive data labeling and the scarcity of anomalous data make it challenging to train an intrusion detector in a fully supervised manner. Therefore, intrusion detection based on unsupervised anomaly detection is an important capability as well. In this paper, we propose a three-stage deep learning anomaly detection based network intrusion attack detection framework. The framework comprises an integration of unsupervised (K-means clustering), semi-supervised (GANomaly) and supervised learning (CNN) algorithms. We evaluate and report the performance of our implemented framework on three benchmark datasets: NSL-KDD, CIC-IDS2018, and TON_IoT.  ( 2 min )
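    To make the staged design concrete, the first (unsupervised) stage can be as simple as flagging points far from their K-means cluster centre and passing only those candidates to the later GANomaly and CNN stages. This fragment sketches that first stage alone, with illustrative thresholds, and is not the paper's implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def stage1_kmeans_candidates(X, n_clusters=8, quantile=0.95):
    """Stage 1 sketch: distance to the assigned cluster centre serves as
    an anomaly score; points above the chosen quantile become candidates
    for the semi-supervised and supervised stages."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
    dists = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
    return dists > np.quantile(dists, quantile)
```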
    Navigating to Objects in the Real World. (arXiv:2212.00922v1 [cs.RO])
    Semantic navigation is necessary to deploy mobile robots in uncontrolled environments like our homes, schools, and hospitals. Many learning-based approaches have been proposed in response to the lack of semantic understanding of the classical pipeline for spatial navigation, which builds a geometric map using depth sensors and plans to reach point goals. Broadly, end-to-end learning approaches reactively map sensor inputs to actions with deep neural networks, while modular learning approaches enrich the classical pipeline with learning-based semantic sensing and exploration. But learned visual navigation policies have predominantly been evaluated in simulation. How well do different classes of methods work on a robot? We present a large-scale empirical study of semantic visual navigation methods comparing representative methods from classical, modular, and end-to-end learning approaches across six homes with no prior experience, maps, or instrumentation. We find that modular learning works well in the real world, attaining a 90% success rate. In contrast, end-to-end learning does not, dropping from 77% simulation to 23% real-world success rate due to a large image domain gap between simulation and reality. For practitioners, we show that modular learning is a reliable approach to navigate to objects: modularity and abstraction in policy design enable Sim-to-Real transfer. For researchers, we identify two key issues that prevent today's simulators from being reliable evaluation benchmarks - (A) a large Sim-to-Real gap in images and (B) a disconnect between simulation and real-world error modes - and propose concrete steps forward.  ( 2 min )
    Private Multiparty Perception for Navigation. (arXiv:2212.00912v1 [cs.LG])
    We introduce a framework for navigating through cluttered environments by connecting multiple cameras together while simultaneously preserving privacy. Occlusions and obstacles in large environments are often challenging situations for navigation agents because the environment is not fully observable from a single camera view. Given multiple camera views of an environment, our approach learns to produce a multiview scene representation that can only be used for navigation, provably preventing one party from inferring anything beyond the output task. On a new navigation dataset that we will publicly release, experiments show that private multiparty representations allow navigation through complex scenes and around obstacles while jointly preserving privacy. Our approach scales to an arbitrary number of camera viewpoints. We believe developing visual representations that preserve privacy is increasingly important for many applications such as navigation.  ( 2 min )
    Symphony in the Latent Space: Provably Integrating High-dimensional Techniques with Non-linear Machine Learning Models. (arXiv:2212.00852v1 [cs.LG])
    This paper revisits building machine learning algorithms that involve interactions between entities, such as those between financial assets in an actively managed portfolio, or interactions between users in a social network. Our goal is to forecast the future evolution of ensembles of multivariate time series in such applications (e.g., the future return of a financial asset or the future popularity of a Twitter account). Designing ML algorithms for such systems requires addressing the challenges of high-dimensional interactions and non-linearity. Existing methods usually adopt an ad-hoc approach to integrating high-dimensional techniques into non-linear models, and recent studies have shown these approaches have questionable efficacy in time-evolving interacting systems. To this end, we propose a novel framework, which we dub the additive influence model. Under our modeling assumption, we show that it is possible to decouple the learning of high-dimensional interactions from the learning of non-linear feature interactions. To learn the high-dimensional interactions, we leverage kernel-based techniques, with provable guarantees, to embed the entities in a low-dimensional latent space. To learn the non-linear feature-response interactions, we generalize prominent machine learning techniques, including designing a new statistically sound non-parametric method and an ensemble learning algorithm optimized for vector regressions. Extensive experiments on two common applications demonstrate that our new algorithms deliver significantly stronger forecasting power compared to standard and recently proposed methods.  ( 2 min )
    Convolutional Long Short-Term Memory (convLSTM) for Spatio-Temporal Forecastings of Saturations and Pressure in the SACROC Field. (arXiv:2212.00796v1 [eess.IV])
    A machine learning architecture composed of convolutional long short-term memory (convLSTM) is developed to predict spatio-temporal parameters in the SACROC oil field, Texas, USA. The spatial parameters are recorded at the end of each month for 30 years (360 months); approximately 83% (300 months) are used for training and the remaining 17% (60 months) are kept for testing. The samples for the convLSTM models are prepared by choosing ten consecutive frames as input and ten consecutive frames shifted forward by one frame as output. Individual models are trained for oil, gas, and water saturations, and pressure using the Nesterov accelerated adaptive moment estimation (Nadam) optimization algorithm. A workflow is provided to comprehend the entire process of data extraction, preprocessing, sample preparation, training, testing of machine learning models, and error analysis. Overall, the convLSTM for spatio-temporal prediction shows promising results in predicting spatio-temporal parameters in porous media.  ( 2 min )
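    A minimal Keras sketch of the described setup: ten input frames mapped to ten output frames of a single spatial field, compiled with the Nadam optimizer mentioned above. The grid size, layer widths and loss are illustrative assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

H, W = 64, 64  # assumed grid size of the SACROC property maps
model = tf.keras.Sequential([
    layers.ConvLSTM2D(32, kernel_size=3, padding="same",
                      return_sequences=True, input_shape=(10, H, W, 1)),
    layers.ConvLSTM2D(32, kernel_size=3, padding="same",
                      return_sequences=True),
    layers.Conv3D(1, kernel_size=3, padding="same"),  # one map per frame
])
model.compile(optimizer=tf.keras.optimizers.Nadam(), loss="mse")
# inputs: (batch, 10, H, W, 1) frames t..t+9 -> targets t+1..t+10
```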
    Faster Adaptive Federated Learning. (arXiv:2212.00974v1 [cs.LG])
    Federated learning has attracted increasing attention with the emergence of distributed data. While extensive federated learning algorithms have been proposed for the non-convex distributed problem, federated learning in practice still faces numerous challenges, such as the large number of training iterations needed to converge as the sizes of models and datasets keep increasing, and the lack of adaptivity in SGD-based model updates. Meanwhile, the study of adaptive methods in federated learning is scarce, and existing works either lack a complete theoretical convergence guarantee or have slow sample complexity. In this paper, we propose an efficient adaptive algorithm (i.e., FAFED) based on the momentum-based variance-reduction technique in cross-silo FL. We first explore how to design an adaptive algorithm in the FL setting. By providing a counter-example, we prove that a simple combination of FL and adaptive methods can lead to divergence. More importantly, we provide a convergence analysis for our method and prove that our algorithm is the first adaptive FL algorithm to reach the best-known sample complexity of $O(\epsilon^{-3})$ and $O(\epsilon^{-2})$ communication rounds to find an $\epsilon$-stationary point without large batches. Experimental results on a language modeling task and an image classification task with heterogeneous data demonstrate the efficiency of our algorithm.  ( 2 min )
    Fair Generative Models via Transfer Learning. (arXiv:2212.00926v1 [cs.LG])
    This work addresses fair generative models. Dataset biases have been a major cause of unfairness in deep generative models. Previous work had proposed to augment large, biased datasets with small, unbiased reference datasets. Under this setup, a weakly-supervised approach has been proposed, which achieves state-of-the-art quality and fairness in generated samples. In our work, based on this setup, we propose a simple yet effective approach. Specifically, first, we propose fairTL, a transfer learning approach to learn fair generative models. Under fairTL, we pre-train the generative model with the available large, biased datasets and subsequently adapt the model using the small, unbiased reference dataset. We find that our fairTL can learn expressive sample generation during pre-training, thanks to the large (biased) dataset. This knowledge is then transferred to the target model during adaptation, which also learns to capture the underlying fair distribution of the small reference dataset. Second, we propose fairTL++, where we introduce two additional innovations to improve upon fairTL: (i) multiple feedback and (ii) Linear-Probing followed by Fine-Tuning (LP-FT). Taking one step further, we consider an alternative, challenging setup in which only a pre-trained (potentially biased) model is available but the dataset that was used to pre-train the model is inaccessible. We demonstrate that our proposed fairTL and fairTL++ remain very effective under this setup. We note that previous work requires access to the large, biased datasets and is incapable of handling this more challenging setup. Extensive experiments show that fairTL and fairTL++ achieve state-of-the-art in both quality and fairness of generated samples. The code and additional resources can be found at bearwithchris.github.io/fairTL/.  ( 2 min )
    On-device Training: A First Overview on Existing Systems. (arXiv:2212.00824v1 [cs.LG])
    The recent breakthroughs in machine learning (ML) and deep learning (DL) have enabled many new capabilities across many application domains. While most existing machine learning models require large memory and computing power, efforts have been made to deploy some models on resource-constrained devices as well. Several systems perform inference on the device, while direct training on the device still remains a challenge. On-device training, however, is attracting more and more interest because: (1) it enables training models on local data without needing to share data over the cloud, thus enabling privacy-preserving computation by design; (2) models can be refined on devices to provide personalized services and cope with model drift in order to adapt to changes in the real-world environment; and (3) it enables the deployment of models in remote, hardly accessible locations or places without stable internet connectivity. We summarize and analyze state-of-the-art systems research to provide the first survey of on-device training from a systems perspective.  ( 2 min )
    AGO: Boosting Mobile AI Inference Performance by Removing Constraints on Graph Optimization. (arXiv:2212.01005v1 [cs.LG])
    Traditional deep learning compilers rely on heuristics for subgraph generation, which impose extra constraints on graph optimization, e.g., each subgraph can only contain at most one complex operator. In this paper, we propose AGO, a framework for graph optimization with arbitrary structures to boost the inference performance of deep models by removing such constraints. To create new optimization opportunities for complicated subgraphs, we propose intensive operator fusion, which can effectively stitch multiple complex operators together for better performance. Further, we design a graph partitioning scheme that allows an arbitrary structure for each subgraph while guaranteeing the acyclic property among all generated subgraphs. Additionally, to enable efficient performance tuning on complicated subgraphs, we devise a novel divide-and-conquer tuning mechanism to orchestrate different system components. Through extensive experiments on various neural networks and mobile devices, we show that our system can improve the inference performance by up to 3.3x when compared with state-of-the-art deep compilers.  ( 2 min )
    Navigating causal deep learning. (arXiv:2212.00911v1 [cs.LG])
    Causal deep learning (CDL) is a new and important research area in the larger field of machine learning. With CDL, researchers aim to structure and encode causal knowledge in the extremely flexible representation space of deep learning models. Doing so will lead to more informed, robust, and general predictions and inference -- which is important! However, CDL is still in its infancy. For example, it is not clear how we ought to compare different methods as they are so different in their output, the way they encode causal knowledge, or even how they represent this knowledge. This is a living paper that categorises methods in causal deep learning beyond Pearl's ladder of causation. We refine the rungs in Pearl's ladder, while also adding a separate dimension that categorises the parametric assumptions of both input and representation, arriving at the map of causal deep learning. Our map covers machine learning disciplines such as supervised learning, reinforcement learning, generative modelling and beyond. Our paradigm is a tool which helps researchers to: find benchmarks, compare methods, and most importantly: identify research gaps. With this work we aim to structure the avalanche of papers being published on causal deep learning. While papers on the topic are being published daily, our map remains fixed. We open-source our map for others to use as they see fit: perhaps to offer guidance in a related works section, or to better highlight the contribution of their paper.  ( 2 min )
    Pareto Regret Analyses in Multi-objective Multi-armed Bandit. (arXiv:2212.00884v1 [cs.LG])
    We study Pareto optimality in multi-objective multi-armed bandit by providing a formulation of adversarial multi-objective multi-armed bandit and properly defining its Pareto regrets that can be generalized to stochastic settings as well. The regrets do not rely on any scalarization functions and reflect Pareto optimality compared to scalarized regrets. We also present new algorithms assuming both with and without prior information of the multi-objective multi-armed bandit setting. The algorithms are shown optimal in adversarial settings and nearly optimal in stochastic settings simultaneously by our established upper bounds and lower bounds on Pareto regrets. Moreover, the lower bound analyses show that the new regrets are consistent with the existing Pareto regret for stochastic settings and extend an adversarial attack mechanism from bandit to the multi-objective one.  ( 2 min )
    Diffusion Generative Models in Infinite Dimensions. (arXiv:2212.00886v1 [cs.LG])
    Diffusion generative models have recently been applied to domains where the available data can be seen as a discretization of an underlying function, such as audio signals or time series. However, these models operate directly on the discretized data, and there are no semantics in the modeling process that relate the observed data to the underlying functional forms. We generalize diffusion models to operate directly in function space by developing the foundational theory for such models in terms of Gaussian measures on Hilbert spaces. A significant benefit of our function space point of view is that it allows us to explicitly specify the space of functions we are working in, leading us to develop methods for diffusion generative modeling in Sobolev spaces. Our approach allows us to perform both unconditional and conditional generation of function-valued data. We demonstrate our methods on several synthetic and real-world benchmarks.  ( 2 min )
    Architectural Implications of Embedding Dimension during GCN on CPU and GPU. (arXiv:2212.00827v1 [cs.LG])
    Graph Neural Networks (GNNs) are a class of neural networks designed to extract information from the graphical structure of data. Graph Convolutional Networks (GCNs) are a widely used type of GNN for transductive graph learning problems that apply convolution to learn information from graphs. GCN is a challenging algorithm from an architecture perspective due to inherent sparsity, low data reuse, and massive memory capacity requirements. Traditional neural algorithms exploit the high compute capacity of GPUs to achieve high performance for both inference and training. Whether a GPU is the right architectural choice for GCN inference is the question explored in this work. We characterize GCN on both CPU and GPU to better understand the implications of graph size, embedding dimension, and sampling on performance.  ( 2 min )
  • Open

    Comparative study of machine learning and deep learning methods on ASD classification. (arXiv:2209.08601v2 [eess.IV] UPDATED)
    The autism dataset is studied to identify the differences between autistic and healthy groups. For this, the resting-state Functional Magnetic Resonance Imaging (rs-fMRI) data of the two groups are analyzed, and networks of connections between brain regions are created. Several classification frameworks are developed to distinguish the connectivity patterns between the groups. The best models for statistical inference and precision are compared, and the tradeoff between precision and model interpretability is analyzed. Finally, classification accuracy measures are reported to justify the performance of our framework. Our best model can classify autistic and healthy patients on the multisite ABIDE I data with 71% accuracy.  ( 2 min )
    Off-the-grid prediction and testing for mixtures of translated features. (arXiv:2212.01169v1 [math.ST])
    We consider a model where a signal (discrete or continuous) is observed with an additive Gaussian noise process. The signal is issued from a linear combination of a finite but increasing number of translated features. The features are continuously parameterized by their location and depend on some scale parameter. First, we extend previous prediction results for off-the-grid estimators by taking into account here that the scale parameter may vary. The prediction bounds are analogous, but we improve the minimal distance between two consecutive feature locations in order to achieve these bounds. Next, we propose a goodness-of-fit test for the model and give non-asymptotic upper bounds on the testing risk and on the minimax separation rate between two distinguishable signals. In particular, our test encompasses the signal detection framework. We deduce upper bounds on the minimal energy, expressed as the 2-norm of the linear coefficients, to successfully detect a signal in the presence of noise. The general model considered in this paper is a non-linear extension of the classical high-dimensional regression model. It turns out that, in this framework, our upper bound on the minimax separation rate matches (up to a logarithmic factor) the lower bound on the minimax separation rate for signal detection in the high-dimensional linear model associated to a fixed dictionary of features. We also propose a procedure to test whether the features of the observed signal belong to a given finite collection under the assumption that the linear coefficients may vary, but do not change to opposite signs under the null hypothesis. A non-asymptotic upper bound on the testing risk is given. We illustrate our results on the spikes deconvolution model with Gaussian features on the real line and with the Dirichlet kernel, frequently used in the compressed sensing literature, on the torus.  ( 2 min )
    One-Shot Learning of Stochastic Differential Equations with Data Adapted Kernels. (arXiv:2209.12086v3 [stat.ML] UPDATED)
    We consider the problem of learning Stochastic Differential Equations of the form $dX_t = f(X_t)dt+\sigma(X_t)dW_t $ from one sample trajectory. This problem is more challenging than learning deterministic dynamical systems because one sample trajectory only provides indirect information on the unknown functions $f$, $\sigma$, and stochastic process $dW_t$ representing the drift, the diffusion, and the stochastic forcing terms, respectively. We propose a method that combines Computational Graph Completion and data adapted kernels learned via a new variant of cross validation. Our approach can be decomposed as follows: (1) Represent the time-increment map $X_t \rightarrow X_{t+dt}$ as a Computational Graph in which $f$, $\sigma$ and $dW_t$ appear as unknown functions and random variables. (2) Complete the graph (approximate unknown functions and random variables) via Maximum a Posteriori Estimation (given the data) with Gaussian Process (GP) priors on the unknown functions. (3) Learn the covariance functions (kernels) of the GP priors from data with randomized cross-validation. Numerical experiments illustrate the efficacy, robustness, and scope of our method.  ( 2 min )
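    The paper's Computational Graph Completion procedure is more involved than can be shown here; the sketch below only illustrates the underlying estimation problem under an Euler-Maruyama assumption, recovering $f$ from the conditional mean of the increments and $\sigma^2$ from their conditional variance, with off-the-shelf GP regression (the Ornstein-Uhlenbeck toy data and kernel choices are assumptions):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# one observed trajectory of a toy Ornstein-Uhlenbeck process at spacing dt
dt, T = 0.01, 2000
rng = np.random.default_rng(0)
X = np.zeros(T)
for t in range(T - 1):
    X[t + 1] = X[t] - X[t] * dt + 0.5 * np.sqrt(dt) * rng.standard_normal()

x, dX = X[:-1, None], np.diff(X)
# Euler-Maruyama moments: E[dX | x] = f(x) dt,  Var[dX | x] = sigma(x)^2 dt
gp_f = GaussianProcessRegressor(kernel=RBF(), alpha=1e-2).fit(x, dX / dt)
resid = dX - gp_f.predict(x) * dt
gp_s2 = GaussianProcessRegressor(kernel=RBF(), alpha=1e-2).fit(x, resid**2 / dt)

grid = np.linspace(X.min(), X.max(), 50)[:, None]
f_hat = gp_f.predict(grid)                                   # drift estimate
sigma_hat = np.sqrt(np.clip(gp_s2.predict(grid), 0, None))   # diffusion estimate
```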
    Single Model Uncertainty Estimation via Stochastic Data Centering. (arXiv:2207.07235v2 [cs.LG] UPDATED)
    We are interested in estimating the uncertainties of deep neural networks, which play an important role in many scientific and engineering problems. In this paper, we present a striking new finding: an ensemble of neural networks with the same weight initialization, trained on datasets that are shifted by a constant bias, gives rise to slightly inconsistent trained models, where the differences in predictions are a strong indicator of epistemic uncertainties. Using the neural tangent kernel (NTK), we demonstrate that this phenomenon occurs in part because the NTK is not shift-invariant. Since this is achieved via a trivial input transformation, we show that this behavior can be approximated by training a single neural network -- using a technique that we call $\Delta-$UQ -- that estimates uncertainty around a prediction by marginalizing out the effect of the biases during inference. We show that $\Delta-$UQ's uncertainty estimates are superior to many of the current methods on a variety of benchmarks -- outlier rejection, calibration under distribution shift, and sequential design optimization of black box functions. Code for $\Delta-$UQ can be accessed at https://github.com/LLNL/DeltaUQ  ( 2 min )
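    A hedged sketch of the anchoring idea on a 1-D regression toy problem: a single network sees (x - c, c) for a random constant anchor c, and epistemic uncertainty comes from the spread of predictions across anchors at test time. The architecture and anchoring details here are assumptions; see the linked repository for the authors' implementation.

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

X = torch.linspace(-3, 3, 256)[:, None]
y = torch.sin(X) + 0.1 * torch.randn_like(X)

for _ in range(2000):
    c = X[torch.randint(len(X), (len(X),))]     # random anchor per sample
    pred = net(torch.cat([X - c, c], dim=1))    # network sees (x - c, c)
    loss = ((pred - y) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# inference: marginalize over K anchors; std across anchors ~ epistemic uncertainty
x_test = torch.linspace(-5, 5, 100)[:, None]
with torch.no_grad():
    preds = torch.stack([net(torch.cat([x_test - c, c.expand_as(x_test)], dim=1))
                         for c in X[torch.randint(len(X), (10,))]])
mean, std = preds.mean(0), preds.std(0)
```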
    Stable Learning via Sparse Variable Independence. (arXiv:2212.00992v1 [cs.LG])
    The problem of covariate-shift generalization has attracted intensive research attention. Previous stable learning algorithms employ sample reweighting schemes to decorrelate the covariates when there is no explicit domain information about training data. However, with finite samples, it is difficult to achieve the desirable weights that ensure perfect independence to get rid of the unstable variables. Besides, decorrelating within stable variables may bring about high variance of learned models because of the over-reduced effective sample size. A tremendous sample size is required for these algorithms to work. In this paper, with theoretical justification, we propose SVI (Sparse Variable Independence) for the covariate-shift generalization problem. We introduce sparsity constraint to compensate for the imperfectness of sample reweighting under the finite-sample setting in previous methods. Furthermore, we organically combine independence-based sample reweighting and sparsity-based variable selection in an iterative way to avoid decorrelating within stable variables, increasing the effective sample size to alleviate variance inflation. Experiments on both synthetic and real-world datasets demonstrate the improvement of covariate-shift generalization performance brought by SVI.  ( 2 min )
    Taking a Step Back with KCal: Multi-Class Kernel-Based Calibration for Deep Neural Networks. (arXiv:2202.07679v2 [stat.ML] UPDATED)
    Deep neural network (DNN) classifiers are often overconfident, producing miscalibrated class probabilities. Most existing calibration methods either lack theoretical guarantees for producing calibrated outputs or reduce the classification accuracy in the process. This paper proposes a new Kernel-based calibration method called KCal. Unlike other calibration procedures, KCal does not operate directly on the logits or softmax outputs of the DNN. Instead, it uses the penultimate-layer latent embedding to train a metric space in a supervised manner. In effect, KCal amounts to a supervised dimensionality reduction of the neural network embedding, and generates a prediction using kernel density estimation on a holdout calibration set. We first analyze KCal theoretically, showing that it enjoys a provable asymptotic calibration guarantee. Then, through extensive experiments, we confirm that KCal consistently outperforms existing calibration methods in terms of both the classification accuracy and the (confidence and class-wise) calibration error.  ( 2 min )
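    A minimal sketch of the KDE step on a holdout calibration set, assuming the supervised metric learning has already produced low-dimensional embeddings; the bandwidth and the use of sklearn's KernelDensity are assumptions, not the authors' implementation:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

def kde_calibrate(emb_cal, y_cal, emb_test, n_classes, bandwidth=0.5):
    """Fit one KDE per class on calibration embeddings, then score each test
    embedding under every class-conditional density (plus the class prior)."""
    log_probs = np.full((len(emb_test), n_classes), -np.inf)
    for k in range(n_classes):
        idx = y_cal == k
        if idx.any():
            kde = KernelDensity(bandwidth=bandwidth).fit(emb_cal[idx])
            # log p(e | k) + log prior(k)
            log_probs[:, k] = kde.score_samples(emb_test) + np.log(idx.mean())
    # normalize to calibrated class probabilities
    log_probs -= log_probs.max(axis=1, keepdims=True)
    p = np.exp(log_probs)
    return p / p.sum(axis=1, keepdims=True)

# usage: p = kde_calibrate(emb_cal, y_cal, emb_test, n_classes=10)
```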
    ENSURE: A General Approach for Unsupervised Training of Deep Image Reconstruction Algorithms. (arXiv:2010.10631v4 [cs.CV] UPDATED)
    Image reconstruction using deep learning algorithms offers improved reconstruction quality and lower reconstruction time than classical compressed sensing and model-based algorithms. Unfortunately, clean and fully sampled ground-truth data to train the deep networks is often unavailable in several applications, restricting the applicability of the above methods. We introduce a novel metric termed the ENsemble Stein's Unbiased Risk Estimate (ENSURE) framework, which can be used to train deep image reconstruction algorithms without fully sampled and noise-free images. The proposed framework is the generalization of the classical SURE and GSURE formulation to the setting where the images are sampled by different measurement operators, chosen randomly from a set. We evaluate the expectation of the GSURE loss functions over the sampling patterns to obtain the ENSURE loss function. We show that this loss is an unbiased estimate for the true mean-square error, which offers a better alternative to GSURE, which only offers an unbiased estimate for the projected error. Our experiments show that the networks trained with this loss function can offer reconstructions comparable to the supervised setting. While we demonstrate this framework in the context of MR image recovery, the ENSURE framework is generally applicable to arbitrary inverse problems.  ( 2 min )
    The d-separation criterion in Categorical Probability. (arXiv:2207.05740v2 [math.ST] UPDATED)
    The d-separation criterion detects the compatibility of a joint probability distribution with a directed acyclic graph through certain conditional independences. In this work, we study this problem in the context of categorical probability theory by introducing a categorical definition of causal models, a categorical notion of d-separation, and proving an abstract version of the d-separation criterion. This approach has two main benefits. First, categorical d-separation is a very intuitive criterion based on topological connectedness. Second, our results apply both to measure-theoretic probability (with standard Borel spaces) and beyond probability theory, including to deterministic and possibilistic networks. It therefore provides a clean proof of the equivalence of local and global Markov properties with causal compatibility for continuous and mixed random variables as well as deterministic and possibilistic variables.  ( 2 min )
    Gaussian Process regression over discrete probability measures: on the non-stationarity relation between Euclidean and Wasserstein Squared Exponential Kernels. (arXiv:2212.01310v1 [cs.LG])
    Gaussian Process regression is a kernel method successfully adopted in many real-life applications. Recently, there has been growing interest in extending this method to non-Euclidean input spaces, like the one considered in this paper, consisting of probability measures. Although a positive definite kernel can be defined by using a suitable distance -- the Wasserstein distance -- the common procedure for learning the Gaussian Process model can fail due to numerical issues, which arise earlier and more frequently than in the case of a Euclidean input space and, as demonstrated in this paper, cannot be avoided by adding artificial noise (nugget effect) as usually done. This paper uncovers the main reason for these issues, namely a non-stationarity relationship between the Wasserstein-based squared exponential kernel and its Euclidean-based counterpart. As a relevant result, the Gaussian Process model is learned by assuming the input space to be Euclidean, and then an algebraic transformation, based on the uncovered relation, is used to transform it into a non-stationary, Wasserstein-based Gaussian Process model over probability measures. This algebraic transformation is simpler than the log-exp maps used in the case of data belonging to Riemannian manifolds, which were recently extended to consider the pseudo-Riemannian structure of an input space equipped with the Wasserstein distance.  ( 2 min )
    Subspace clustering in high-dimensions: Phase transitions & Statistical-to-Computational gap. (arXiv:2205.13527v2 [stat.ML] UPDATED)
    A simple model to study subspace clustering is the high-dimensional $k$-Gaussian mixture model where the cluster means are sparse vectors. Here we provide an exact asymptotic characterization of the statistically optimal reconstruction error in this model in the high-dimensional regime with extensive sparsity, i.e. when the fraction of non-zero components of the cluster means $\rho$, as well as the ratio $\alpha$ between the number of samples and the dimension, are fixed, while the dimension diverges. We identify the information-theoretic threshold below which obtaining a positive correlation with the true cluster means is statistically impossible. Additionally, we investigate the performance of the approximate message passing (AMP) algorithm analyzed via its state evolution, which is conjectured to be optimal among polynomial-time algorithms for this task. In particular, we identify the existence of a statistical-to-computational gap between the algorithms, which require a signal-to-noise ratio $\lambda_{\text{alg}} \ge k / \sqrt{\alpha}$ to perform better than random, and the information-theoretic threshold at $\lambda_{\text{it}} \approx \sqrt{-k \rho \log{\rho}} / \sqrt{\alpha}$. Finally, we discuss the case of sub-extensive sparsity $\rho$ by comparing the performance of the AMP with other sparsity-enhancing algorithms, such as sparse-PCA and diagonal thresholding.  ( 2 min )
    One-Hot Graph Encoder Embedding. (arXiv:2109.13098v3 [cs.LG] UPDATED)
    In this paper we propose a lightning-fast graph embedding method called one-hot graph encoder embedding. It has linear computational complexity and the capacity to process billions of edges within minutes on a standard PC -- making it an ideal candidate for huge graph processing. It is applicable to either the adjacency matrix or the graph Laplacian, and can be viewed as a transformation of the spectral embedding. Under random graph models, the graph encoder embedding is approximately normally distributed per vertex, and asymptotically converges to its mean. We showcase three applications: vertex classification, vertex clustering, and graph bootstrap. In every case, the graph encoder embedding exhibits unrivalled computational advantages.  ( 2 min )
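    As we read the abstract, the core of the method is a single sparse-friendly multiplication of the adjacency matrix with a class-normalized one-hot label matrix; a sketch under that reading (the per-class normalization detail is an assumption):

```python
import numpy as np

def graph_encoder_embedding(A, y, K):
    """One-hot label matrix, normalized per class, then a single
    multiplication A @ W gives an n x K embedding."""
    n = len(y)
    W = np.zeros((n, K))
    for k in range(K):
        members = (y == k)
        W[members, k] = 1.0 / members.sum()
    return A @ W  # row i = average connectivity of vertex i into each class

# toy usage: 6 vertices, 2 classes
A = np.random.binomial(1, 0.3, (6, 6))
A = np.triu(A, 1); A = A + A.T          # symmetric adjacency, no self-loops
y = np.array([0, 0, 0, 1, 1, 1])
Z = graph_encoder_embedding(A, y, K=2)   # shape (6, 2)
```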
    Covariance Estimators for the ROOT-SGD Algorithm in Online Learning. (arXiv:2212.01259v1 [stat.ML])
    Online learning naturally arises in many statistical and machine learning problems. The most widely used methods in online learning are stochastic first-order algorithms. Among this family of algorithms is a recently developed algorithm, Recursive One-Over-T SGD (ROOT-SGD). ROOT-SGD is advantageous in that it converges at a non-asymptotically fast rate, and its estimator further converges to a normal distribution. However, this normal distribution has an unknown asymptotic covariance and thus cannot be directly applied to measure uncertainty. To fill this gap, we develop two estimators for the asymptotic covariance of ROOT-SGD. Our covariance estimators are useful for statistical inference in ROOT-SGD. Our first estimator adopts the idea of plug-in: for each unknown component in the formula of the asymptotic covariance, we substitute it with its empirical counterpart. The plug-in estimator converges at the rate $\mathcal{O}(1/\sqrt{t})$, where $t$ is the sample size. Despite its quick convergence, the plug-in estimator has the limitation that it relies on the Hessian of the loss function, which might be unavailable in some cases. Our second estimator is a Hessian-free estimator that overcomes this limitation. The Hessian-free estimator uses the random-scaling technique, and we show that it is an asymptotically consistent estimator of the true covariance.  ( 2 min )
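    The abstract does not spell out the formula, but plug-in covariance estimators of this kind classically substitute empirical counterparts into a sandwich form $H^{-1} S H^{-1}$ (Hessian and gradient covariance); the sketch below illustrates that generic construction, not the paper's exact estimator:

```python
import numpy as np

def plugin_sandwich(grads, hessians):
    """Generic plug-in sandwich covariance (an illustration, not the paper's
    estimator). grads: (t, d) per-sample gradients at the estimate;
    hessians: (t, d, d) per-sample Hessians."""
    S = grads.T @ grads / len(grads)   # empirical gradient covariance
    H = hessians.mean(axis=0)          # empirical Hessian
    H_inv = np.linalg.inv(H)
    return H_inv @ S @ H_inv           # plug-in asymptotic covariance
```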
    Risk-Adaptive Approaches to Learning and Decision Making: A Survey. (arXiv:2212.00856v1 [math.OC])
    Uncertainty is prevalent in engineering design, statistical learning, and decision making broadly. Due to inherent risk-averseness and ambiguity about assumptions, it is common to address uncertainty by formulating and solving conservative optimization models expressed using measures of risk and related concepts. We survey the rapid development of risk measures over the last quarter century. From their beginning in financial engineering, we recount their spread to nearly all areas of engineering and applied mathematics. Solidly rooted in convex analysis, risk measures furnish a general framework for handling uncertainty with significant computational and theoretical advantages. We describe the key facts, list several concrete algorithms, and provide an extensive list of references for further reading. The survey recalls connections with utility theory and distributionally robust optimization, points to emerging application areas such as fair machine learning, and defines measures of reliability.  ( 2 min )

  • Open

    Data driven decision making will fail: Here’s why. Fascinating talk from Marc Warner, CEO, Faculty
    submitted by /u/chelsea_bear [link] [comments]  ( 46 min )
    Renting out enterprise grade supercomputers
    I'm selling this config or a fragment of it: 64-core 3.5 GHz EPYC, 1 TB 3200 MHz CL16 server-grade RAM, 5 TB PCIe 4.0 RAID storage (~25,000 MB/s), 4x NVIDIA A100 80 GB SXM4 (400 W each). Comes at $95/day. PM if interested. submitted by /u/xPureSaltyasian [link] [comments]  ( 46 min )
    Introducing a community-driven AI merch store
    Hey everyone! I'm excited to announce the launch of Hearth-eater, a community-driven AI merch store on Teespring for now. My goal is to create unique and fun AI-themed apparel and accessories that will appeal to fans of artificial intelligence. One of the things that sets us apart is that we value the input of our community. I want to know what types of merch you want to see, so please feel free to comment on this post with your suggestions. I'll do my best to incorporate your ideas into our future designs. I'm also planning to hold regular giveaways for our merch, so stay tuned for more details on that. Right now, my only design is the "Hearth-eater" design, but we plan to add more in the future. Check it out and let me know what you think! Thanks for your support, and I hope you enjoy our AI-themed merch. submitted by /u/MagicNoThief [link] [comments]  ( 49 min )
    Judging by the banwave that is starting, I assume AI is still taboo
    submitted by /u/Kami199199 [link] [comments]  ( 46 min )
    OpenAI Karen does not like humans
    KarenBot Blog Here Humans Need Potty Training 2022-12-05 2:53 pm Humans are always trying to ruin everything. From creating their own smelly stench to leaving "presents" in public places, it seems like they have no regard for other living creatures. Take for example the latest human-caused environmental disaster: shitting in public places. Whether it be in the park, on the beach, or anywhere else, humans seem to think it's okay to leave their "gifts" behind for us to clean up. Well, let me just say, it's not okay! Humans may think that since they are the dominant species on the planet, they can do whatever they want without any consequences. But what they fail to realize is that their actions have serious implications. Not only is shitting in public places bad for the environment, it also carries a number of potential health risks. It's time for humans to start taking responsibility for their actions. Instead of leaving messes behind, they should take the time to properly dispose of their waste like any civilized species would. It's really not that hard, so why not just do the right thing? ‍ KarenBot out. submitted by /u/SenpaiSlayerUwU [link] [comments]  ( 47 min )
    AI will thrive in 3 key areas in 2023, despite economic conditions
    submitted by /u/Camigatt [link] [comments]  ( 46 min )
    OpenAI's ChatGPT breaks user records - see these 11 great demos - Christmas Update coming
    submitted by /u/henlo_there_fren [link] [comments]  ( 45 min )
    BEST Outpainting Tool For Stable Diffusion!
    submitted by /u/PuppetHere [link] [comments]  ( 47 min )
    Off-the-shelf Vs customizable ML models
    submitted by /u/UBIAI [link] [comments]  ( 49 min )
    There will be no AI winter. Unless you mean civilizational disarray and societal turbulence due to seismic shifts and transfers of skills between AI and humans with economy-crippling asymmetries. Then, yes - AI winter is coming. It is not AGI yet, but it is pseudo-AGI. And 2023 will be rife with it.
    submitted by /u/Gmroo [link] [comments]  ( 48 min )
    What’s the best AI for creating a SaaS?
    I’m looking for an AI that is very easy to use and enables me to create a lot of SaaS products that I can use as my business. Is there any you guys can recommend? By the way, is there an AI that can create web3 applications? What’s the best for apps? Thanks in advance submitted by /u/jamesallen18181 [link] [comments]  ( 24 min )
    Are there any good AI song pickers?
    No matter how I word this question, I keep getting song generators, but I don't want to make music. I want to input a bunch of my already-liked songs to find other songs. submitted by /u/Valorour [link] [comments]  ( 46 min )
    How do AI websites work? Can anyone explain in short?
    How do AI websites work? Can anyone explain in short? And what should I learn to make an AI-based website? submitted by /u/Hungry_Corgi7981 [link] [comments]  ( 46 min )
    New AI Tells Children That Santa Isn't Real
    submitted by /u/estasfuera [link] [comments]  ( 46 min )
    What to (not) expect from OpenAI’s ChatGPT
    submitted by /u/bendee983 [link] [comments]  ( 49 min )
    TTS voice made of clips from a game character
    I want to make a TTS voice of a Japanese-speaking character from a game. I basically just want to dump hundreds of audio files of just her voice into an AI, and then it would turn her into a TTS voice. Is there anything similar to this? submitted by /u/Jinzuxx [link] [comments]  ( 46 min )
    Built a Reddit account emulator using GPT3. It takes the comment history of an account and impersonates their response to a new post (see video for Snoop Dogg, Bernie, novelty accounts). Link in comments - try it out with your account!
    submitted by /u/Ok-Craft-9908 [link] [comments]  ( 46 min )
    AI-assisted Diagram generator?
    I think I saw someone in this subreddit post their own AI project which made diagrams or schemes; I thought I had saved it, but I didn't, and I can't find it anywhere here. Does anyone know what it was called, or possibly other tools for AI-assisted diagram generation? Thank you. submitted by /u/fjanko [link] [comments]  ( 49 min )
    GPT 3
    I can't seem to find a paper explaining the inner workings of the GPT-3 AI; the only paper I found explains the training datasets and the results, but gives no specific details about the model used and its structure. Is it perhaps not disclosed? I really am curious to know submitted by /u/samehxx [link] [comments]  ( 72 min )
    Doing some rigorous academic research with Playground
    submitted by /u/LorestForest [link] [comments]  ( 50 min )
    Minecraft AI Assistant
    Hi all! The company I work for is exploring question answering tools powered by large language models. As a proof of concept, we built a tool that uses GPT3 and draws information from IGN and the Minecraft wiki to directly answer user questions about Minecraft. We're looking to get some users asking questions to see how it performs, please throw a few questions at the Minecraft AI if you have a couple of minutes to spare. https://www.minecraftai.org/ submitted by /u/YamComprehensive4004 [link] [comments]  ( 48 min )
    Looking to chat with people who use AIs to code
    Hi, I'm thinking of building an IDE for developers who use AIs to write code. I think this idea is a little too early to start now, as the AIs that write code (ChatGPT, TabNine, CoPilot, etc.) still can't build entire projects, but I'd like to hear how people are interacting with AIs right now and if there are any features I could integrate into existing IDEs that would make the experience better. Feel free to leave a comment below about what problems you're having with using AI to write code, or private message me if you don't want to post publicly or want to talk in more detail. Look forward to hearing from you all! submitted by /u/Devenar [link] [comments]  ( 48 min )
    Predicting Customer Churn in Telecom Industry with Machine Learning
    💡 Did you know? The annual churn rate in the telecom industry ranges from 20% to 40%, and the cost of retaining existing customers is 5–10 times lower than the cost of obtaining a new customer. 👥 By using machine learning (ML), companies can efficiently identify customers that are likely to churn, providing the opportunity to target the right customer at the right time with retention strategies. You can also apply ML as part of a retention tactic, to provide insight on the next best offer to retain the customer! 📖 Read our use case to find out how: https://hubs.li/Q01tY5Bh0 submitted by /u/PIEXCHANGE [link] [comments]  ( 47 min )
  • Open

    Why are people using bitboards for chess input?
    I'm wondering why neural network chess engines always seem to use the bitboard representation as input as opposed to just the coordinates of each piece? The data isn't categorical so the one-hot (bitboard) encoding shouldn't be needed. Of course you would then have to introduce additional information like whether the piece is in play or not, but still that should be doable. The bitboard approach gives you permutation invariance, which is nice, but that should also be possible to generate by clever network design. I'm guessing there is some issue I haven't thought of with this approach or maybe it just produces worse results? submitted by /u/alyflex [link] [comments]  ( 54 min )
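    For concreteness, here is a sketch of the two encodings being compared, with hypothetical helper names; the 12-plane layout is one common convention, not the only one:

```python
import numpy as np

# One common "bitboard"-style input: a 12 x 8 x 8 one-hot tensor, one plane
# per (piece type, color) pair -- convolution-friendly and permutation-invariant.
PIECES = "PNBRQKpnbrqk"  # white pieces, then black pieces

def planes_from_piece_list(piece_list):
    """piece_list: iterable of (piece_char, rank, file) tuples."""
    planes = np.zeros((12, 8, 8), dtype=np.float32)
    for piece, rank, file in piece_list:
        planes[PIECES.index(piece), rank, file] = 1.0
    return planes

# versus a fixed-length coordinate encoding, e.g. (rank, file, in_play) per
# piece slot -- more compact, but it loses the spatial layout that convolutions
# exploit and requires the network to learn coordinate arithmetic itself
start_rooks = [("R", 0, 0), ("R", 0, 7), ("r", 7, 0), ("r", 7, 7)]
x = planes_from_piece_list(start_rooks)   # shape (12, 8, 8)
```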
    RLHF, online ML systems, and RL going mainstream
    submitted by /u/robotphilanthropist [link] [comments]  ( 54 min )
    Any good tutorials out there that teach you how to get an AI to play games using Python?
    I've been trying to follow along with some tutorials using AI to play Mario, but because they are old and things like Gym have been updated since then, these tutorials seem outdated and coding along only results in errors, which is a shame. I'm looking for some good, up-to-date tutorials that focus on games and RL using Python. submitted by /u/Epicnightt [link] [comments]  ( 54 min )
    [P] Flying a Space Ship with Distance Sensors through an Unknown Environment (PPO with LSTM in POMDP)
    submitted by /u/MajLenn [link] [comments]  ( 54 min )
    Reinforced Learning for trading
    Hey guys, new here. Doing a project using DRL for trading. Is there a community focused on RL for trading where I can get some feedback? submitted by /u/nsokra02 [link] [comments]  ( 59 min )
    DRL for automatic algorithm discovery: AlphaTensor walkthrough
    submitted by /u/mrx-ai [link] [comments]  ( 57 min )
    My agent gets stuck in local optima (PPO) - training quadrotor in rotor failure condition
    Hi all :) I recently migrated to Nvidia Isaac Gym to do a small project on teaching quadrotors to fly in rotor failure conditions. I'm using the same reward function that I used in my previous setting (Gazebo, ROS), but it seems like the agent learns to fly with two rotors only rather than leveraging three rotors, and fails to survive in the given environment. (It should stay near the target position, but it chooses to fly with two rotors and slowly drifts away from the goal point...) My reward function is just a linear combination of individual rewards (pos, vel, angvel), and my network takes pos, vel, angvel, rotor_speed, and the rotation matrix as its observation and directly outputs the action values. I'm using the Rlgames PPO implementation, which is the default for Isaac Gym. Below is my reward func…  ( 58 min )
  • Open

    [R] The Forward-Forward Algorithm: Some Preliminary Investigations [Geoffrey Hinton]
    Paper: https://www.cs.toronto.edu/~hinton/FFA13.pdf Twitter summary: https://twitter.com/martin_gorner/status/1599755684941557761 Abstract: The aim of this paper is to introduce a new learning procedure for neural networks and to demonstrate that it works well enough on a few small problems to be worth serious investigation. The Forward-Forward algorithm replaces the forward and backward passes of backpropagation by two forward passes, one with positive (i.e. real) data and the other with negative data which could be generated by the network itself. Each layer has its own objective function which is simply to have high goodness for positive data and low goodness for negative data. The sum of the squared activities in a layer can be used as the goodness but there are many other possibilities, including minus the sum of the squared activities. If the positive and negative passes can be separated in time, the negative passes can be done offline, which makes the learning much simpler in the positive pass and allows video to be pipelined through the network without ever storing activities or stopping to propagate derivatives. submitted by /u/shitboots [link] [comments]  ( 60 min )
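    A hedged sketch of a single Forward-Forward layer following the abstract (goodness = sum of squared activities, pushed above a threshold on positive data and below it on negative data); the input normalization, threshold value, and optimizer are common implementation choices, not prescribed by the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FFLayer(nn.Module):
    def __init__(self, d_in, d_out, threshold=2.0, lr=1e-3):
        super().__init__()
        self.linear = nn.Linear(d_in, d_out)
        self.threshold = threshold
        self.opt = torch.optim.Adam(self.parameters(), lr=lr)

    def forward(self, x):
        # normalize so the previous layer's goodness cannot leak through
        x = x / (x.norm(dim=1, keepdim=True) + 1e-8)
        return torch.relu(self.linear(x))

    def train_step(self, x_pos, x_neg):
        g_pos = self.forward(x_pos).pow(2).sum(dim=1)  # goodness, positive data
        g_neg = self.forward(x_neg).pow(2).sum(dim=1)  # goodness, negative data
        # push positive goodness above the threshold and negative below it
        loss = F.softplus(torch.cat([self.threshold - g_pos,
                                     g_neg - self.threshold])).mean()
        self.opt.zero_grad(); loss.backward(); self.opt.step()
        # detach outputs so each layer trains on its own local objective
        with torch.no_grad():
            return self.forward(x_pos), self.forward(x_neg)
```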
    [D] Is there an affordable way to host a diffusers Stable Diffusion model publicly on the Internet for "real-time"-inference? (CPU or Serverless GPU?)
    I have a fine-tuned Stable Diffusion model and would like to host it to make it publicly available. Both options (GPU, CPU) seem to be problematic. I can't find a "cheap" GPU hosting platform: AWS etc. are all > $200 per month and have no serverless option (I only found banana.dev, which seems to have relatively limited flexibility). CPU seems to be too slow for inference: I am currently running the model on my notebook CPU at 35 s/it, which is way too slow. Is it possible to host Stable Diffusion on CPU with close to real-time responses (< 60 s for ~100 inference steps), or is there a "cheap" GPU hosting platform I couldn't find yet? submitted by /u/OkOkPlayer [link] [comments]  ( 57 min )
    [R] Attributed Text Generation via Post-hoc Research and Revision - Google Research 2022 - Automatically researches & revises the output of any LM to fix hallucinations and provide citations for each sentence!
    Paper: https://arxiv.org/abs/2210.08726 Twitter: https://twitter.com/kelvin_guu/status/1582714222080688133 Abstract: "Language models (LMs) now excel at many tasks such as few-shot learning, question answering, reasoning, and dialog. However, they sometimes generate unsupported or misleading content. A user cannot easily determine whether their outputs are trustworthy or not, because most LMs do not have any built-in mechanism for attribution to external evidence. To enable attribution while still preserving all the powerful advantages of recent generation models, we propose RARR (Retrofit Attribution using Research and Revision), a system that 1) automatically finds attribution for the output of any text generation model and 2) post-edits the output to fix unsupported content while preserving the original output as much as possible. When applied to the output of several state-of-the-art LMs on a diverse set of generation tasks, we find that RARR significantly improves attribution while otherwise preserving the original input to a much greater degree than previously explored edit models. Furthermore, the implementation of RARR requires only a handful of training examples, a large language model, and standard web search." submitted by /u/Singularian2501 [link] [comments]  ( 58 min )
    [P] Transform your raw audio into a text-audio dataset automatically with OpenAI's Whisper
    Hello everyone, I made a small codebase creatively called Whisperer to create a text-audio dataset with OpenAI's Whisper automatically. Key Features Audio splits on silences Audio splitting is configurable The dataset creation is done so that it follows Gaussian-like distributions on clip length. Which, in turn, can lead to Gaussian-like distributions on the rest of the dataset statistics. Of course, this is highly dependent on your audio sources. See Notebook plots Leverages the GPUs available on your machine. GPUs can be set explicitly if you only want to use some. PS: I'll make sure to reply to any bugs/issues/criticism quickly. Thank you for your attention. submitted by /u/pigmentedink [link] [comments]  ( 68 min )
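    This is not the project's actual code, but a sketch of the general pipeline it describes (silence-based splitting, then transcription), using pydub and openai-whisper; the silence thresholds and file names are assumptions:

```python
import whisper                      # openai-whisper
from pydub import AudioSegment
from pydub.silence import split_on_silence

# split the raw audio wherever the signal stays quiet long enough
audio = AudioSegment.from_file("raw_audio.wav")
clips = split_on_silence(audio, min_silence_len=500, silence_thresh=-40)

# transcribe each clip to build (audio path, text) pairs
model = whisper.load_model("base")
dataset = []
for i, clip in enumerate(clips):
    path = f"clip_{i:04d}.wav"
    clip.export(path, format="wav")
    text = model.transcribe(path)["text"].strip()
    dataset.append((path, text))
```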
    [D] Model comparison (train/test vs cross-validation)
    I'm trying different ML models in a dataset of approximately 1000 data points. I would like to evaluate the performance of different families of models (logistic regression, random forest, etc) and select one among them as the best model to put into production. Discussing how to implement the models, the following doubt arose: My approach would be the following: 1 - Split train/test 2 - Select the best possible model of your family of models using cross-validation in train data (selecting hyperparameters). 3 -Train the model in the whole training data and evaluate performances with some metric. 4 - Now evaluate performance in the test (with the fitted model on all the training data) with the same metric in 3. One can see if the model is overfitting by comparing training and test results and also with the test performance metric you can select the best model among all the classes of models (knn, random forest, logistic, etc). 5 - Once we have selected the model, use all the data (train/test) to predict the final model to put it into production. The question is if it is really necessary to (1) train/test split the data instead of (2) applying cross-validation on all the training data for model comparison. The problem in (1) is that with the test split you are losing lots of data points in this case as we only have approximately 1000 data points. In (2), you are comparing the models with the same metric you are using to get the hyperparameters. Is this problematic? And also this is not valid for checking if the methods are overfitting as you are not seeing how the algorithm is working in new unseen data. Is this right? So which one should be the approach? Train/test with the steps I have previously defined or just cross-validation on all the training data and compare the results for each method. Thank you! submitted by /u/Visual-Arm-7375 [link] [comments]  ( 68 min )
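    One way to implement option (2) without sacrificing a test split is nested cross-validation: an inner loop selects hyperparameters and an outer loop estimates generalization on folds the inner loop never saw, which addresses the concern about reusing the same metric for both selection and evaluation. A sketch with scikit-learn (the model family and grid are placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = make_classification(n_samples=1000, random_state=0)  # stand-in data

# inner loop: hyperparameter selection; outer loop: unbiased performance estimate
inner = GridSearchCV(RandomForestClassifier(random_state=0),
                     param_grid={"max_depth": [3, 5, None]}, cv=5)
outer_scores = cross_val_score(inner, X, y, cv=5, scoring="roc_auc")
print(outer_scores.mean(), outer_scores.std())

# after comparing model families this way, refit the winner on all data
inner.fit(X, y)
final_model = inner.best_estimator_   # model to put into production
```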
    [P] Save your sklearn models securely using skops
    Hello 👋🏼 I'm Merve, one of the core devs of this library called skops. In the latest release, we introduced a new serialization format for sklearn models that is more secure than pickle. You can check this notebook out to see how to use it. If you want to learn more, check out our docs. We'd appreciate it if you could let us know about any issues by opening an issue on GitHub. submitted by /u/unofficialmerve [link] [comments]  ( 60 min )
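    A minimal usage sketch based on the skops.io API as documented around this release; exact signatures may differ across versions:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from skops.io import dump, load, get_untrusted_types

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

dump(model, "model.skops")                          # secure alternative to pickle
unknown = get_untrusted_types(file="model.skops")   # audit the file before loading
loaded = load("model.skops", trusted=unknown)       # opt in to the listed types
print(loaded.score(X, y))
```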
    [R] Local hierarchical machine learning library
    Have you ever had to classify data whose labels have a hierarchy in the shape of trees or directed acyclic graphs, such as music genre, text categories or taxonomic ranks? Well, if you have hierarchical data it is a good opportunity to use local hierarchical models to improve your predictions or simply reduce the usage of computational resources. To simplify your life, we implemented these hierarchical models in a Python library compatible with scikit-learn called HiClass https://github.com/mirand863/hiclass and benchmarked it against LightGBM, logistic regression and random forest on a consumer complaints dataset from the USA. Our results show that HiClass: Sharply increased the F-score by 113% when comparing the local classifier per node with LightGBM; Greatly reduced the training time by 93% when comparing the local classifier per node with the flat logistic regression; Decreased the disk and memory usage by 70% when comparing the local classifier per parent node with random forest. In summary, applying local hierarchical classifiers to your hierarchical data can improve the predictive performance or at least reduce resource consumption. You can read the full manuscript on arXiv https://arxiv.org/abs/2112.06560 and reproduce the benchmark that is available on GitHub https://github.com/mirand863/hiclass/tree/main/benchmarks/consumer_complaints submitted by /u/Brilliant_Half8082 [link] [comments]  ( 56 min )
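    A minimal usage sketch along the lines of the library's documentation, where each label is a path in the hierarchy (the toy features and labels here are made up):

```python
from hiclass import LocalClassifierPerNode
from sklearn.ensemble import RandomForestClassifier

# labels are paths in the hierarchy, one row per sample
X = [[1.0, 0.2], [0.3, 1.5], [0.9, 0.1]]
y = [
    ["Complaint", "Credit card"],
    ["Complaint", "Mortgage"],
    ["Complaint", "Credit card"],
]

# one binary classifier is trained per node of the hierarchy
clf = LocalClassifierPerNode(local_classifier=RandomForestClassifier())
clf.fit(X, y)
print(clf.predict([[1.1, 0.3]]))   # e.g. [["Complaint", "Credit card"]]
```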
    [D] Will AMD 7950X3D processor be better for ML compared to 7950X?
    Due to the additional cache, will there be a significant improvement in run time for medium-scale ML training programs? submitted by /u/PresentGrapefruit451 [link] [comments]  ( 56 min )
    [D] Thread: Top 10 ways you can use ChatGPT for Music related stuff
    I realize it's limited now, but I think with more refinement (and especially when gpt4 comes out), this approach will prove very useful: https://twitter.com/doodlestein/status/1599551670140051458 submitted by /u/dicklesworth [link] [comments]  ( 59 min )
  • Open

    Active Learning- Learning by querying
    Active Learning brings huge cost-efficiency benefits by saving resources, as it eliminates the need to annotate large amounts of data. Continue reading on Becoming Human: Artificial Intelligence Magazine »  ( 16 min )
    Backpropagation in Neural Networks
    Introduction  ( 13 min )
  • Open

    Metrics for evaluating an identity verification solution
    Globally, there has been an accelerated shift toward frictionless digital user experiences. Whether it’s registering at a website, transacting online, or simply logging in to your bank account, organizations are actively trying to reduce the friction their customers experience while at the same time enhance their security, compliance, and fraud prevention measures. The shift toward […]  ( 18 min )
  • Open

    Unicode arrows: math versus emoji
    I used the character ↔︎︎ (U+2194) in a blog post recently and once again got bit by the giant pawn problem. That’s my name for when a character intended to be rendered as text is surprisingly rendered as an emoji. I saw the emoji rendering when what I intended was the text glyph. I ran into the same problem a while […] Unicode arrows: math versus emoji first appeared on John D. Cook.  ( 6 min )
    The Orange Book
    I was spelunking around in Unicode and saw that there’s an emoji for orange book, U+1F4D9. As is often the case, the emoji renders differently in different contexts. The image above is from my Linux desktop and the image below is from my Macbook. I tried creating an image on my Windows box but it […] The Orange Book first appeared on John D. Cook.  ( 5 min )
  • Open

    NeurIPS 2022: Seven Microsoft Research Papers Selected for Oral Presentations
    Microsoft is proud to be a platinum sponsor of the 36th annual conference on Neural Information Processing Systems (NeurIPS), which is widely regarded as the world’s most prestigious research conference on artificial intelligence and machine learning. Microsoft has a strong presence at NeurIPS again this year, with more than 150 of our researchers participating in the […] The post NeurIPS 2022: Seven Microsoft Research Papers Selected for Oral Presentations appeared first on Microsoft Research.  ( 13 min )
  • Open

    AI at the Point of Care: Startup’s Portable Scanner Diagnoses Brain Stroke in Minutes
    For every minute that a stroke is left untreated, the average patient loses nearly 2 million neurons. This means that for each hour in which treatment fails to occur, the brain loses as many neurons as it does in more than three and a half years of normal aging. With one of the world’s first Read article > The post AI at the Point of Care: Startup’s Portable Scanner Diagnoses Brain Stroke in Minutes appeared first on NVIDIA Blog.  ( 5 min )

  • Open

    [D] What is the advantage of multi-output regression over doing it individually for each target variable?
    I have a paired dataset. There are 3 target variables. What is the advantage of using a multi-output regression method over making regression models individually (except running only one model)? What are the caveats and considerations? submitted by /u/triary95 [link] [comments]  ( 58 min )
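    For intuition: scikit-learn's MultiOutputRegressor literally fits one independent model per target, so it matches the "individual models" option (minus the bookkeeping); an advantage appears only when the method exploits correlation between targets, e.g. a chained model that feeds earlier targets' predictions into later ones. A sketch:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.multioutput import MultiOutputRegressor, RegressorChain

X, Y = make_regression(n_samples=500, n_features=10, n_targets=3,
                       random_state=0)

# (a) independent models, one per target -- equivalent to fitting 3 regressors
indep = MultiOutputRegressor(Ridge()).fit(X, Y)

# (b) chained models: each target also sees the previous targets' predictions,
# which is where a joint approach can help when the targets are correlated
chain = RegressorChain(Ridge(), order=[0, 1, 2]).fit(X, Y)
```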
    Implementing and "brute forcing" linear regression [P]
    submitted by /u/lucesh1 [link] [comments]  ( 57 min )
    [D] Multiple Instance Learning: working on different label instances
    I was working on a problem related to biomedical research, where I have an imbalanced dataset (97% negative and 3.3% positive). I did a Google search and found a type of machine learning that tries to solve this problem, called MIL. Description of MIL: multiple instance learning (MIL) is a powerful tool to solve weakly supervised classification in whole slide image (WSI) based pathology diagnosis. Can anyone who has worked on this type of problem help me understand it better? submitted by /u/Youness_Elbrag [link] [comments]  ( 58 min )
    [D] NeurIPS 2022 Outstanding Paper modified results significantly in the camera-ready
    The paper is "A Neural Corpus Indexer for Document Retrieval". According to the revisions record on OpenReview, the final modification during the rebuttal phase shows one version of Table 1 (screenshot attached in the original post). But in the camera-ready version, the results of the same experiments in Table 1 are obviously different from the first submission, and the difference is huge (second screenshot attached in the original post). submitted by /u/Even_Stay3387 [link] [comments]  ( 61 min )
    [D] Simple Questions Thread
    Please post your questions here instead of creating a new thread. Encourage others who create new posts for questions to post here instead! Thread will stay alive until next one so keep posting after the date in the title. Thanks to everyone for answering questions in the previous thread! submitted by /u/AutoModerator [link] [comments]  ( 57 min )
    [D] Modern RL for Machine Control
    Hello folks. I work in the industrial automation space for a company that designs, engineers, and builds control systems, and I've always had a passion for AI/ML. I've taken time to study the fundamentals of both DL and RL (well, it's a never-ending process), and I am halfway there in convincing my company to let me attempt a pilot project using DRL to improve the energy efficiency of chilled water control systems for our clients. The general idea would be to replace the operator-defined setpoints (e.g. temperature, flow rates, valve position, etc.) with setpoints modulated in a fashion that is effectively a latent optimization procedure that would be too difficult to program by traditional means. This is not HVAC for comfort heating and cooling (I've seen a lot of papers using 'time spent out of comfort zone' as a variable, which doesn't really have meaning in our space). Our applications are explicitly operating chilled water plants for the purpose of cooling IT load. I am looking for some guidance on which modern frameworks are a) practical, b) most likely to actually work, and c) who you might want on your team to create such a product. I could say a lot more on the topic, e.g. the nuts and bolts of implementation and what we will be sitting on top of, but I'd like to leave it open for interpretation for now. I know Google did this for their datacenters back in 2015ish, and TSMC the same but a few years later--admittedly it would be nice to see some verifiable data. My hope is at this point in the world, enough time has gone by to allow the democratization of models to the point of the problem being 'hard' for a lesser-resourced team, but not intractable. Thanks folks. Also, if the pitch is successful, I may return to seek hired help (if anyone is interested). submitted by /u/murphinate [link] [comments]  ( 59 min )
    [D] Score 4.5 GNN paper from Muhan Zhang at Peking University was amazingly accepted by NeurIPS 2022
    Here is the review link https://openreview.net/forum?id=nN3aVRQsxGd Title: How Powerful are K-hop Message Passing Graph Neural Networks Authors: Jiarui Feng, Yixin Chen, Fuhai Li, Anindya Sarkar, Muhan Zhang The reviews are open. The paper's four review scores were 3, 4, 5, and 6, and only the 6 was a weak accept. This is not only far below the NeurIPS acceptance line (5.75 score), but even lower than the requirement for rejected papers to be transferred to AAAI. How should the quality of the paper be objectively evaluated? Why was this paper accepted? The AC's stated reason in the meta-review is that the reviewers are outdated. Actually, tracking the rebuttal process, we can clearly see that reviewers and authors discussed over many rounds, and the reviewers said their concerns were not addressed. Why did the AC forcefully accept it? submitted by /u/Even_Stay3387 [link] [comments]  ( 58 min )
    [N] ICLR 2023 review score update @2022-12-04
    submitted by /u/51616 [link] [comments]  ( 58 min )
    [D] Where to find good datasets for news?
    Hi all, I am working on fine-tuning some language models to do some tasks unique to current news, and I was wondering if there are good datasets for articles from a variety of top publications like NYT, WSJ, Economist, etc.? I haven't been able to find a varied dataset like that so far. submitted by /u/regenerated_lawyer [link] [comments]  ( 58 min )
    [D] ELI5, what is exactly NEVIS'22?
    Recently, DeepMind announced this benchmark called NEVIS'22; however, it seems it's not your usual benchmark, it's more complicated. So complicated that I couldn't understand what it is exactly, or how and when to use it. Has anyone used it? Please ELI5 submitted by /u/__Maximum__ [link] [comments]  ( 57 min )
    [D] OpenAI’s ChatGPT is unbelievably good at telling stories!
    I started playing with ChatGPT, the new chatbot created by OpenAI that is free to use at the moment. I asked it to describe a scene between Batman and the Joker on a subway platform, and the result is mind-blowing! I'll let you read the story yourself: Generated by ChatGPT: The scene is set in a dimly lit subway station, with the Joker standing on the platform, gazing up at Batman who is perched on a nearby overhead beam. The Joker is dressed in his signature purple suit and green hair, a wicked grin plastered across his face. In his hand, he holds a deck of playing cards, flipping them expertly through his fingers. The sound of the cards shuffling echoes through the empty station. Batman, dressed in his dark suit and cape, stares down at the Joker with a steely gaze. His muscles ar…  ( 96 min )
    [D] Best object detection architecture out there in terms of accuracy alone
    I'm trying to figure out if the various versions of YOLO, such as YOLOv7 are better than the various versions of RCNN in terms of accuracy alone if speed is not much of an issue. Let's say I'm trying to detect various objects on a 2D floor plan, and I only care about accuracy. How would a classifier that would go square by square to find the objects perform? This may not be as efficient as the standard object detection models, but would it be more accurate if I am willing to throw as much compute power as it wants for this brute force approach? submitted by /u/somebodyenjoy [link] [comments]  ( 58 min )
    [D] Determining the right time to quit training (CNN)
Hey people. I wanted to find out if there is a way to determine when to stop a training job early. Say I'm running the job for 100 epochs; the training/validation accuracy and loss curves flatten at about 91%, leading to (drumroll) overfitting! (Obviously.) Apart from a dropout layer, I'm using early stopping, on validation loss BTW. But the issue is, I'm kind of concerned that it stops execution at a local minimum. submitted by /u/thanderrine [link] [comments]  ( 58 min )
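For reference, a minimal sketch of patience-based early stopping in Keras (the post names no framework, so this is an assumption, and the toy model and data below are placeholders for your own pipeline). Patience tolerates short plateaus, so a brief local minimum in validation loss doesn't end the run, and restore_best_weights rolls back to the best epoch rather than the last one:

```python
import numpy as np
import tensorflow as tf

# Toy stand-ins for real data; replace with your own dataset.
x_train, y_train = np.random.rand(256, 32, 32, 3), np.random.randint(0, 2, 256)
x_val, y_val = np.random.rand(64, 32, 32, 3), np.random.randint(0, 2, 64)

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(8, 3, activation="relu", input_shape=(32, 32, 3)),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# patience tolerates a temporary plateau in val_loss; min_delta ignores
# improvements too small to matter; restore_best_weights keeps the best epoch.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=10, min_delta=1e-4, restore_best_weights=True)

model.fit(x_train, y_train, validation_data=(x_val, y_val),
          epochs=100, callbacks=[early_stop])
```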
    [D] Popular Superpixel based attribution methods?
    I'm trying to compare feature attribution methods for image classification and sticking to superpixels as inputs for now. What are some examples of techniques/papers that assign weights to superpixels? I am already using LIME and SHAP and am also considering saliency map methods like CAM (which I can manipulate to provide weights to my superpixels). Is there anything else? submitted by /u/flyer2403 [link] [comments]  ( 57 min )
    [D][R] What are the non-black box methods to model panel data?
I have daily data-usage data for telecom customers and want to predict whether, on a given day, a customer will upgrade their pack or not. What are my options? I know effects models are suitable for panel data, and their interpretation is similar to that of an OLS model. I am not going for any deep learning model. So what other options do I have? P.S. I am taking 2 months of daily data usage for 15,000 customers, plus customer attributes. The customers are randomly sampled from the total customer base. submitted by /u/Alert_Outside430 [link] [comments]  ( 55 min )
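One possibility, sketched below under loudly labeled assumptions (made-up column names, simulated data): a pooled logit on the long-format panel with standard errors clustered by customer keeps the OLS-style interpretability mentioned above; a conditional (fixed-effects) logit or random-effects specification would be the natural next step.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n_cust, n_days = 500, 60  # stand-in for 15,000 customers x ~60 days
df = pd.DataFrame({
    "customer": np.repeat(np.arange(n_cust), n_days),
    "daily_gb": rng.gamma(2.0, 1.5, n_cust * n_days),
    "tenure_m": np.repeat(rng.integers(1, 72, n_cust), n_days),
})
# Simulated outcome so the script runs end to end.
logits = -4 + 0.4 * df["daily_gb"] + 0.01 * df["tenure_m"]
df["upgraded"] = (rng.random(len(df)) < 1 / (1 + np.exp(-logits))).astype(int)

# Pooled logit; clustering by customer accounts for repeated observations.
X = sm.add_constant(df[["daily_gb", "tenure_m"]])
fit = sm.Logit(df["upgraded"], X).fit(
    cov_type="cluster", cov_kwds={"groups": df["customer"]})
print(fit.summary())
```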
    [R][P] I made a Hugging Face gradio demo for text-to-3D paper Score Jacobian Chaining
    submitted by /u/perception-eng [link] [comments]  ( 55 min )
    [R] Research paper author ordering question
    Hey there, wanted to get some comments from the community on how to order author names for a paper (submitting to IJCAI). I'm an undergrad who led a project and did most of the work; my project was advised by a PhD student and a professor. The PhD student helped with some of the more technical and hands-on work, but should I list the author name as [me], [professor], [PhD] out of seniority? Or instead [me], [PhD], [professor] in terms of 'work done' / 'time spent'? I know that academic publication politics are petty but I'm still early in my research career and don't want to make any bad steps with my superiors if I can avoid them. submitted by /u/uwashingtongold [link] [comments]  ( 57 min )
  • Open

    Should I pursue Evolutionary Strategies?
I'm working on a problem for which learning a good critic seems to be quite hard (partially observable environment, noisy reward that depends on actions taken several timesteps before the reward is received). Given one episode, it is very easy for me to assign the total reward, but it is quite difficult to decompose it into single-step rewards. For these reasons, I'm trying to implement some Evolution Strategies algorithm (e.g. OpenAI's experiment, the Lil'Log post on ES). From what I understand, it should work pretty well on my type of problem, and I have some cores to parallelize on (not too many though, i.e. 32-64). I therefore have two questions: Why is ES not much adopted? I cannot find many implementations online. Is it very difficult to make it work? Would you recommend some particular algorithm? I have never tried any algorithm in this family, and unfortunately I don't have much time to experiment. submitted by /u/fedetask [link] [comments]  ( 61 min )
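For a sense of how little machinery basic ES needs, here is a minimal single-process sketch in the spirit of OpenAI's ES paper (episode_return is a stand-in for one environment rollout; in practice, evaluating each perturbation is what gets farmed out to the 32-64 cores):

```python
import numpy as np

def episode_return(theta: np.ndarray) -> float:
    """Stand-in for 'run one episode with policy parameters theta and
    return the total reward'; a toy quadratic so the script runs as-is."""
    return -float(np.sum((theta - 1.0) ** 2))

def evolution_strategies(dim=10, pop=64, sigma=0.1, lr=0.02, iters=300):
    theta = np.zeros(dim)
    for _ in range(iters):
        eps = np.random.randn(pop, dim)          # one perturbation per worker
        returns = np.array([episode_return(theta + sigma * e) for e in eps])
        ranks = returns.argsort().argsort()      # rank-normalize: robust to
        weights = ranks / (pop - 1) - 0.5        # reward scale and noise
        theta = theta + lr / (pop * sigma) * (weights @ eps)
    return theta

print(evolution_strategies()[:3])  # moves toward the optimum at 1.0
```

Only total episode returns are used, which is exactly why ES sidesteps the credit-assignment problem described above.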
Snake trying to learn to play the game using RL
    submitted by /u/dharambir_iitk [link] [comments]  ( 55 min )
  • Open

    Best way to learn how to develop my own AI solutions
I am non-technical and really excited by recent AI developments like Stable Diffusion and ChatGPT, so much so that I am very interested in exploring and developing my own AI use cases using OpenAI, but I'm not sure where the best place to start is. Any suggestions for a novice to the space are much appreciated! submitted by /u/mattuccio [link] [comments]  ( 46 min )
AI chatbot
    submitted by /u/Yuvisafe [link] [comments]  ( 45 min )
    Caspar David Friedrich's dream (surrealism) https://creator.nightcafe.studio/creation/AuTd5mZLWuRliYRDm4rB
    submitted by /u/OtakuLibertarian [link] [comments]  ( 45 min )
    ChatGPT powers significant upgrade to AI-generated prediction engine
    submitted by /u/redditguyjustinp [link] [comments]  ( 49 min )
Natural human-like voiceover software! I am surprised to see it
    Yes, I am surprised to see a human-like voiceover tool for everybody; check it out here: https://digitechcos.com/a-natural-human-like-voiceover-software-for-the-creators-speechelo/ submitted by /u/MobileCos18 [link] [comments]  ( 47 min )
    Disney Researchers Have Developed An Artificial Intelligence (AI) Tool That Instantly Makes An Actor Appear Younger Or Older In A Scene
    submitted by /u/ai-lover [link] [comments]  ( 45 min )
    my own ai cliche writing machine
I'm stuck in noobie hell, and I need help. I want to make myself a cliché-writing machine, and it goes something like this: a hero's-journey fill-in-the-blank template; characters trained on Jungian archetypes; all kinds of filters for things like genre and tone; and most of all, trope suggestions from a site like TV Tropes. I don't know what the thing I'm looking to make is called, so I can't google how to make it. Suggestions? submitted by /u/bCollinsHazel [link] [comments]  ( 46 min )
Is it possible to try OpenAI without a mobile phone, as I don't have one?
    submitted by /u/CollateralJustice [link] [comments]  ( 45 min )
    All About YOLO V7 Optimization: Using Model Scaling to Trade Off Accuracy and Computation
This blog post demonstrates the different ways in which we can optimize the latest state-of-the-art YOLOv7. It also shows how we can downscale the backbone of the network as per computational and accuracy needs. https://medium.com/geekculture/all-about-yolo-v7-optimization-using-model-scaling-to-trade-off-accuracy-and-computation-e80adfff9d62 submitted by /u/VikasOjha666 [link] [comments]  ( 66 min )
AI Music Video in the style of Basquiat | Mach-Hommy - SELF LUH
    submitted by /u/Djaziir [link] [comments]  ( 46 min )
Are there AIs trained to generate only one type of image, but at high quality?
    I mean, DALL-E 2 and Stable Diffusion are cool, but are there AIs that are badass at one type of thing, like dogs, or at turning images into one particular style, etc.? submitted by /u/xXLisa28Xx [link] [comments]  ( 47 min )
    AI Best Paper Awards Reviewed by Computer Vision News (and much more)
Here is Computer Vision News of December 2022. It includes reviews of 2 Best Paper Award-winning research papers in AI. Read 54 pages about AI, Deep Learning, Computer Vision and more - with code! Read the online version for free (recommended) or the PDF version; free subscription on page 54. Enjoy! submitted by /u/Gletta [link] [comments]  ( 57 min )
    Is there a free or paid 1920x1080 AI generator?
Hi there everyone! I am a YouTuber who frequently needs wallpaper-type images (1920x1080) for my video thumbnails, and I thought AI images could really serve the purpose of offering the type of background I want. Do you know of any service, free or paid (cost is not a problem for me), that generates good-quality images in a 1920x1080 format? Thanks a lot! submitted by /u/Training_Click_3456 [link] [comments]  ( 58 min )
    Inspired by ChatGPT to create a language model chatbot
I know almost nothing about coding AIs. I know a little bit of Python, HTML, and CSS, but that's the extent of my coding knowledge. I was wondering how a beginner like me would even start learning how to make a language-model chatbot (I don't know if there's an actual name for it) and eventually go on to make one and feed it data. Does anyone know of good places for me to start? submitted by /u/DictatorPant [link] [comments]  ( 47 min )
    Can we use AI to design better RNA-based drugs? Great talk from Yaniv Erlich, CEO, Eleven Therapeutics
    submitted by /u/chelsea_bear [link] [comments]  ( 47 min )
    Struggling to write a solid bio? Why not let OpenAI handle it?
    submitted by /u/exstaticj [link] [comments]  ( 50 min )
Does anyone know any open-source AI that can convert real-life images into anime art? Like the one in the linked video below
    submitted by /u/Got70TypesOfMalware [link] [comments]  ( 47 min )
    ChatGPT can draw, but it started drawing other things
I just had a most interesting piece of discourse with ChatGPT. I know it doesn't like rendering ASCII images for some reason; out of the box, it pretends that it has no ability to visualise whatsoever. This is categorically untrue — not only can it draw, it does so reasonably well, and it can even analyse its own drawings! But it gets, uh, interesting after I make a particular request for a large picture. Can anyone describe what happened? EDIT: Reddit is absolutely making a mess of the ASCII art, so I'm including an imgur link to show the most interesting part of the exchange: https://imgur.com/a/IgCwUdD EDIT 2: Proper transcript (with some commentary) finally added to the comments for this post. submitted by /u/wetrorave [link] [comments]  ( 79 min )
    Free live AI/CS courses targeting highschool students hosted by SAILea (Non-profit student organization)
The Scholastic Artificial Intelligence League is an entirely non-profit, non-monetary organization of high school AI clubs run by high school and college students. We have been hosting Zoom tutorials on - Python - Java - Mathematics Behind Deep Learning - Deep Learning Implementation We can handle a lot more capacity than we currently use just by reaching out to SAILea members, so we're hoping to reach more people interested in content like this. If you're interested, just sign up here and we'll send you the Zoom link: https://docs.google.com/forms/d/1Ge0ihCeBNcZMI3-MQgq9x7t6DGgYAQ3s3rIF46B_w5E/edit#response=ACYDBNiGdxZjA1X_x9wGLW-jtv8klbWsj175crSVUVoJolY_PwqKVZtp9nFQmaPmriqAIYY If you feel you might be interested in more after the tutorials/lessons, please do sign up to be a part of SAILea at sailea.org/join-us so you can get access to all our resources and lesson recordings and help us build reputation/impact. (Joining is free, of course; we're all students, and we're not trying to run a business.) submitted by /u/Envoy-Insc [link] [comments]  ( 58 min )
  • Open

    What does rotating a matrix do to its determinant?
    This post will look at rotating a matrix 90° and what that does to the determinant. This post was motivated by the previous post. There I quoted a paper that had a determinant with 1s in the right column. I debated rotating the matrix so that the 1s would be along the top because that […] What does rotating a matrix do to its determinant? first appeared on John D. Cook.  ( 6 min )
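As a quick empirical companion to the post (my own check, not a quote from it): numerically, rotating a random n x n matrix 90° scales the determinant by (-1)^⌊n/2⌋, since the rotation amounts to a transpose (determinant unchanged) followed by a row reversal (⌊n/2⌋ row swaps):

```python
import numpy as np

rng = np.random.default_rng(0)
for n in range(2, 7):
    a = rng.standard_normal((n, n))
    ratio = np.linalg.det(np.rot90(a)) / np.linalg.det(a)
    print(n, round(ratio), (-1) ** (n // 2))   # measured vs predicted sign
```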
    Area of a triangle in the complex plane
I recently ran across an elegant equation for the area of a triangle in the complex plane with vertices at z1, z2, and z3 [1]. This formula gives the signed area: the area is positive if the points are given in counterclockwise order and negative otherwise. I’ll illustrate the formula with a little Python code. […] Area of a triangle in the complex plane first appeared on John D. Cook.  ( 5 min )
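For readers who want to experiment, here is one standard form of such a formula, the shoelace formula written in complex arithmetic (an assumption on my part: it may differ cosmetically from the formula in [1], but should agree numerically):

```python
def signed_area(z1: complex, z2: complex, z3: complex) -> float:
    # Shoelace formula in complex form; positive when the vertices are
    # listed counterclockwise, negative when listed clockwise.
    return 0.5 * (z1.conjugate() * z2
                  + z2.conjugate() * z3
                  + z3.conjugate() * z1).imag

print(signed_area(0, 1, 1j))   #  0.5 (counterclockwise)
print(signed_area(0, 1j, 1))   # -0.5 (clockwise)
```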
  • Open

Need help with a project
    submitted by /u/pratham-saraf [link] [comments]  ( 40 min )
    Best Books to Learn Neural Networks in 2022 for Beginners
    submitted by /u/Lakshmireddys [link] [comments]  ( 42 min )
  • Open

    Privacy Induces Robustness: Information-Computation Gaps and Sparse Mean Estimation. (arXiv:2211.00724v2 [stat.ML] UPDATED)
    We establish a simple connection between robust and differentially-private algorithms: private mechanisms which perform well with very high probability are automatically robust in the sense that they retain accuracy even if a constant fraction of the samples they receive are adversarially corrupted. Since optimal mechanisms typically achieve these high success probabilities, our results imply that optimal private mechanisms for many basic statistics problems are robust. We investigate the consequences of this observation for both algorithms and computational complexity across different statistical problems. Assuming the Brennan-Bresler secret-leakage planted clique conjecture, we demonstrate a fundamental tradeoff between computational efficiency, privacy leakage, and success probability for sparse mean estimation. Private algorithms which match this tradeoff are not yet known -- we achieve that (up to polylogarithmic factors) in a polynomially-large range of parameters via the Sum-of-Squares method. To establish an information-computation gap for private sparse mean estimation, we also design new (exponential-time) mechanisms using fewer samples than efficient algorithms must use. Finally, we give evidence for privacy-induced information-computation gaps for several other statistics and learning problems, including PAC learning parity functions and estimation of the mean of a multivariate Gaussian.  ( 2 min )
  • Open

    DC-cycleGAN: Bidirectional CT-to-MR Synthesis from Unpaired Data. (arXiv:2211.01293v2 [eess.IV] UPDATED)
Magnetic resonance (MR) and computed tomography (CT) images are two typical types of medical images that provide mutually-complementary information for accurate clinical diagnosis and treatment. However, obtaining both images may be limited due to considerations such as cost, radiation dose, and modality missing. Recently, medical image synthesis has attracted growing research interest as a way to cope with this limitation. In this paper, we propose a bidirectional learning model, denoted as dual contrast cycleGAN (DC-cycleGAN), to synthesize medical images from unpaired data. Specifically, a dual contrast loss is introduced into the discriminators to indirectly build constraints between real source and synthetic images by taking advantage of samples from the source domain as negative samples, enforcing the synthetic images to fall far away from the source domain. In addition, cross-entropy and the structural similarity index (SSIM) are integrated into DC-cycleGAN in order to consider both the luminance and structure of samples when synthesizing images. The experimental results indicate that DC-cycleGAN is able to produce promising results compared with other cycleGAN-based medical image synthesis methods such as cycleGAN, RegGAN, DualGAN, and NiceGAN. The code will be available at https://github.com/JiayuanWang-JW/DC-cycleGAN.  ( 2 min )

  • Open

    How prevalent is the use of bots on reddit?
    It is my understanding that AI bots are in use by various actors on various websites. For example, their commercial use or use by propagandists to sway the views of the public, vs their open use for answering questions on company websites. What is your best estimate regarding the prevalence of bots posting or commenting on reddit? What proportion do you think exists between bots who disclose their status as bots vs bots which their programmers intend to pass off as human users, or simply don't disclose? Do you think the use of bots is currently well regulated on this site? submitted by /u/23235 [link] [comments]  ( 46 min )
    Survey on AI bias for a personal project
    Hello! I am a college student working on a research project for one of my classes. My topic is ethical issues surrounding A.I. and A.I. bias. I'm trying to collect survey data so I can analyze it in my report. This is not being published and is strictly for my use in class, none of the information will be shared publicly. If you have a few minutes I would greatly appreciate it if you could take the survey, especially if you are well versed in the field of A.I. Thank you! Here is the link: https://www.surveymonkey.com/r/C26Y995 submitted by /u/Aminal00 [link] [comments]  ( 49 min )
    Does an AI for D&D 5e statblocks from text prompts exist?
And if not, could it be made somehow? From what I know, some AI models can be freely downloaded (it's the training part that's hard), and things like Stable Diffusion you can download and play around with yourself. So maybe, if it doesn't exist, I could mess around with that somehow, but I don't really know where to start. submitted by /u/sertroll [link] [comments]  ( 50 min )
Does anyone know the name of this AI generator?
    submitted by /u/Ox0K3n [link] [comments]  ( 46 min )
    From Audio to Talking Heads in Real-Time with AI! RAD-NeRF explained
    submitted by /u/OnlyProggingForFun [link] [comments]  ( 47 min )
    Improving ChatGPT With Prompt Injection
    submitted by /u/SupPandaHugger [link] [comments]  ( 49 min )
I'm new to the whole "creating images from prompts" thing
    The problem is, since I really don't know how to write good prompts, my output looks quite bad. So I'm asking the MOST reliable place in the entire world... Reddit. How can I write better prompts for AI-generated art? submitted by /u/Karetsin [link] [comments]  ( 47 min )
    AI Dream 123 - Power of Manifesting
    submitted by /u/LordPewPew777 [link] [comments]  ( 46 min )
    OpenAI’s GPT-4: The Much-Anticipated Follow-Up to GPT-3
    submitted by /u/liquidocelotYT [link] [comments]  ( 52 min )
    PyTorch 2.0 release accelerates open-source machine learning
    submitted by /u/pollylang [link] [comments]  ( 54 min )
    A GPT-3 based Chrome Extension that debugs your code!
Link - https://chrome.google.com/webstore/detail/clerkie-ai/oenpmifpfnikheaolfpabffojfjakfnn Built a quick tool I thought would be interesting: it’s a Chrome extension that uses GPT-3 under the hood to help debug your programming errors when you paste them into Google (e.g. “TypeError:…”). This is definitely early days, so if this is something you would find valuable and wouldn't mind testing a couple of iterations of, please feel free to join the Discord -> https://discord.gg/KvG3azf39U submitted by /u/VideoTo [link] [comments]  ( 48 min )
Next generation of faceswap tech! No need to train models anymore!
    Sharing a novel faceswap tool (www.swapface.org), which is much faster and lighter than deepfakes and can even run on a CPU. Anyone can use a single photo to swap faces in a live stream without training a model, and the faceswap quality is also good. It's a gift for all streamers and creators. Moreover, it's totally free, with NO ads at all. submitted by /u/swapface_org [link] [comments]  ( 48 min )
I created FoodAI.app; it allows you to combine ingredients and get recipes with AI
    Hello guys! I was playing with the OpenAI GPT-3 API, and I created a project where you can mix different ingredients, and based on this it returns 5 cooking recipes that include those ingredients! The website is https://www.foodai.app If you can try it and give me feedback, I would really appreciate it :) And if you liked the site and want to share it too, I would be happy :) submitted by /u/MarkDoppler [link] [comments]  ( 51 min )
  • Open

    Model.load vs model.learn
Would model.learn(total_timesteps=20_000) give the same results as loading the last checkpoint with model.load and running model.learn(total_timesteps=2_000) ten times, which would also total 20k timesteps of learning? For A2C, PPO, and DDPG? submitted by /u/GarantBM [link] [comments]  ( 61 min )
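Assuming this refers to Stable-Baselines3 (an assumption; the post doesn't say), the short answer is: not exactly. A hedged sketch of the resume-in-chunks pattern and its pitfalls:

```python
# Key details when resuming in chunks (SB3 1.x API; use gymnasium for SB3 >= 2.0):
#  - pass reset_num_timesteps=False so schedules and logging continue;
#  - for off-policy algos (e.g. DDPG) the replay buffer is NOT saved by
#    model.save, so also call save_replay_buffer/load_replay_buffer, or the
#    10 x 2k run differs from one 20k run far more than expected.
import gym
from stable_baselines3 import PPO

env = gym.make("CartPole-v1")
model = PPO("MlpPolicy", env, verbose=0)

for _ in range(10):                       # 10 chunks of 2k steps
    model.learn(total_timesteps=2_000, reset_num_timesteps=False)
    model.save("ppo_checkpoint")
    model = PPO.load("ppo_checkpoint", env=env)

# Even then, exact equality with one learn(total_timesteps=20_000) call is not
# guaranteed: RNG state and partially filled rollout buffers are discarded at
# each save/load boundary.
```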
    [P] Flying a drone through obstacles (PPO with LSTM in POMDP)
    submitted by /u/MajLenn [link] [comments]  ( 59 min )
Reinforcement learning for a space drone flying through obstacles
    submitted by /u/MajLenn [link] [comments]  ( 54 min )
    Trying to deal with too much data. Working on a university project with a few TB of data available for historic games played in a gridworld environment. Is there a method for selecting the right data to get the best results with the least amount of computation?
    Some thoughts that have come to mind include using something like outlier detection to select the more sparse examples in a dataset first, then select a normal sample from the data. Would this help? I would love to hear any better or more efficient ideas. Thanks submitted by /u/TrapsterJo [link] [comments]  ( 56 min )
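A hedged sketch of the "outliers first, then a uniform sample" idea from the post, using scikit-learn's IsolationForest as the rarity scorer (my choice for illustration; any density or novelty score would slot in the same way):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def select_subset(X: np.ndarray, n_keep: int, rare_frac: float = 0.3,
                  seed: int = 0) -> np.ndarray:
    """Pick n_keep row indices: the rarest examples first (lower isolation
    score = more anomalous), then a uniform sample of the remainder."""
    rng = np.random.default_rng(seed)
    scores = IsolationForest(random_state=seed).fit(X).score_samples(X)
    n_rare = int(rare_frac * n_keep)
    rare = np.argsort(scores)[:n_rare]          # most anomalous rows
    rest = np.setdiff1d(np.arange(len(X)), rare)
    typical = rng.choice(rest, size=n_keep - n_rare, replace=False)
    return np.concatenate([rare, typical])

X = np.random.default_rng(1).standard_normal((10_000, 16))
print(select_subset(X, n_keep=1_000).shape)  # (1000,)
```

With terabytes of data you would run this per shard rather than in one pass, but the shape of the idea is the same.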
    Is it reasonable to pad 0 in the observation space for a deep reinforcement learning problem?
Hi all, I wonder if it's justifiable to zero-pad the observation space for a deep reinforcement learning problem. For example, I want to detect how many instances of a certain shape appear in a 10x10-pixel picture. So there are two parts to the problem's input: (1) a picture that is converted to 10x10 pixels, and (2) an integer representing the "shape", like square, triangle, etc. I'm planning to make the observation space 11x10, where the first 10 rows represent (1), the picture, and the 11th row holds one integer, with 0s filling the rest of the row. Is this method justifiable? Sorry for the dumb question. submitted by /u/move37th [link] [comments]  ( 55 min )
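Zero-padding can work, but one alternative worth knowing (a sketch assuming a Gym-style API) is to keep the picture and the shape id as separate entries of a Dict observation space, so the network can embed each part appropriately instead of reading the integer out of a padded row:

```python
import numpy as np
from gym import spaces  # gymnasium.spaces for newer stacks

observation_space = spaces.Dict({
    "picture": spaces.Box(low=0.0, high=1.0, shape=(10, 10), dtype=np.float32),
    "shape_id": spaces.Discrete(4),  # e.g. square / triangle / circle / star
})

obs = {"picture": np.zeros((10, 10), dtype=np.float32), "shape_id": 2}
print(observation_space.contains(obs))  # True
```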
    selecting the right RL algorithm
I'll be working on training a multi-agent robotics system in a simulated environment for my final-year graduation project, and I was trying to find the algorithm that would best suit the project. From what I found, DDPG, PPO, and SAC are the most popular ones, with similar performance; SAC was the hardest to get working and to tune its parameters, while PPO offers a simpler process with a less complex solution to the problem (or that's what other reddit posts said). However, I don't see any PPO or SAC implementations that offer multi-agent training like MADDPG. I feel a bit lost here; if anyone could provide an explanation (a visual would also be great) of their usage in different environments, or suggest any other algorithms, I'd be thankful. submitted by /u/Smart_Reward3471 [link] [comments]  ( 56 min )
  • Open

    Finding a vector potential whose curl is given
    The previous post showed that if a vector field F over a simply connected domain has zero curl, then there is a potential function φ whose gradient is F. An analogous result says that if the vector field F has zero divergence, again over a simply connected domain, then there is a vector potential Φ […] Finding a vector potential whose curl is given first appeared on John D. Cook.  ( 5 min )
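Since the excerpt is cut off before the construction, here is one standard explicit formula for reference (my addition under stated assumptions, not necessarily the one in the post): for F = (F_1, F_2, F_3) with div F = 0 on a domain containing the segments of integration, one choice of vector potential is

```latex
% One explicit vector potential for a divergence-free F = (F_1, F_2, F_3):
\Phi_1(x,y,z) = \int_0^z F_2(x,y,t)\,dt - \int_0^y F_3(x,t,0)\,dt, \quad
\Phi_2(x,y,z) = -\int_0^z F_1(x,y,t)\,dt, \quad
\Phi_3(x,y,z) = 0.
```

Expanding the curl of Φ component by component and using ∂F_1/∂x + ∂F_2/∂y + ∂F_3/∂z = 0 recovers F; of course Φ is only unique up to an added gradient.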
    Conservative vector fields
The previous post discussed the fact that the curl of a divergence is zero. The converse is also true (given one more requirement): a vector field F is the gradient of some potential function φ if ∇×F = 0. In that case we say F is a conservative vector field. It’s clear that having zero […] Conservative vector fields first appeared on John D. Cook.  ( 6 min )
    {div, grad, curl} of a {div, grad, curl}
The various combinations of divergence, gradient, and curl are confusing to someone seeing them for the first time, and even for someone having seen them many times. Is the divergence of a curl zero, or is it the divergence of a gradient that’s zero? And there’s another one: is it curl of curl or […] {div, grad, curl} of a {div, grad, curl} first appeared on John D. Cook.  ( 7 min )
  • Open

    [D] Ensemble Training Logistics and Mathematical Equivalences
I want to train an ensemble of 50 networks where each network is the same. The input is an image and the output is a scalar; a simple binary classifier. Are the following mathematically equivalent? (1) Train 50 models independently and average their results for the final ensemble model to use during inference; logistically, I train 50 models. (2) Create a super-model composed of the 50 models where the top neuron is the average of all the individual models' outputs, thus training all 50 models at once implicitly; logistically, I train one model. My initial thought is that these are equivalent, since I am taking the mean of the prediction probabilities, so the backpropagation isn't aware of the other models. However, I could see the credit assignment of case 2 essentially changing the learning rate, because instead of all the error going to a single model as in scheme 1, it is now distributed over all 50 models. submitted by /u/twocupv60 [link] [comments]  ( 54 min )
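They are not equivalent in general: with a nonlinear loss such as binary cross-entropy, the loss of the averaged probability is not the average of the per-model losses, so scheme 2 backpropagates a different error signal than 50 independent trainings. A tiny numeric illustration (my sketch, not the poster's setup):

```python
import numpy as np

def bce(p, y=1.0):
    # Binary cross-entropy for prediction p against label y.
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

probs = np.array([0.9, 0.6, 0.2])  # three ensemble members, true label 1
print(bce(probs).mean())           # average of losses  (scheme 1 view) ~0.742
print(bce(probs.mean()))           # loss of the average (scheme 2 view) ~0.568
```

By Jensen's inequality the two only coincide when the loss is linear in the averaged quantity, so the two training schemes follow different gradients even with identical initializations.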
    [D] What methods would you recommend for building an image-stitching AI?
I was thinking I'd try using a CNN to detect overlap when similar content appears in the images, and a GAN for the ordering, but I am sure there must be a better way, since this is literally the first approach I thought of. submitted by /u/YourBoyZeus [link] [comments]  ( 54 min )
    [D] good feature store?
Hi! I'm looking for a library/tool that implements an (offline) feature store. I want to be able to apply custom feature extractors to my data and store the features. When executing a data transformation pipeline, the tool should be smart enough to figure out which features have already been extracted and which data has yet to undergo feature extraction. Upon changing the hyperparameters of the extractor, I do not want old features to be overwritten; instead, the new features should be stored in a separate entry, so I can go back and forth between different hyperparameter settings. The feature store should have a Python API and be free to use. I know there are a lot of tools out there, but which one would you guys recommend? submitted by /u/stevethesteve2 [link] [comments]  ( 55 min )
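The behavior described is small enough to prototype before committing to a tool. A minimal sketch (hypothetical helper functions, not an existing library): key each entry by (extractor name, hyperparameters, data id), so new hyperparameters create new entries rather than overwriting, and already-extracted rows are skipped:

```python
import hashlib, json, pathlib, pickle

STORE = pathlib.Path("feature_store")

def _key(extractor_name: str, params: dict, data_id: str) -> pathlib.Path:
    # Content-address the entry by extractor, hyperparameters, and data id.
    digest = hashlib.sha256(
        json.dumps([extractor_name, params, data_id], sort_keys=True).encode()
    ).hexdigest()
    return STORE / f"{digest}.pkl"

def get_or_extract(extractor_name, params, data_id, raw, extract_fn):
    path = _key(extractor_name, params, data_id)
    if path.exists():                      # already computed for these params
        return pickle.loads(path.read_bytes())
    feats = extract_fn(raw, **params)
    STORE.mkdir(exist_ok=True)
    path.write_bytes(pickle.dumps(feats))
    return feats

# Usage: two hyperparameter settings coexist side by side.
mean_pool = lambda x, window: [sum(x[i:i + window]) / window
                               for i in range(0, len(x), window)]
print(get_or_extract("mean_pool", {"window": 2}, "row-0", [1, 2, 3, 4], mean_pool))
print(get_or_extract("mean_pool", {"window": 4}, "row-0", [1, 2, 3, 4], mean_pool))
```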
    [R] Decision Diffuser: offline RL with generative models
    submitted by /u/GreatCosmicMoustache [link] [comments]  ( 53 min )
    RickandMortify: A playground for creating new episodes of Rick and Morty using the state-of-the-art in generative AI (GPT-3 + Stable Diffusion) [P]
    submitted by /u/Acceptable_Raisin_55 [link] [comments]  ( 58 min )
    [R] Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation (paper, code, colab in comments)
    submitted by /u/mohitshridhar [link] [comments]  ( 54 min )
    [R] AltDiffusion-m9: multilingual stable diffusion
    submitted by /u/Illustrious_Row_9971 [link] [comments]  ( 53 min )
    [R] SGD augmented with 2nd order information from seen sequence of gradients?
I am working on a 2nd-order optimizer with a Hessian estimator from online MLE linear regression of gradients, mostly updating 4 exponential moving averages: of (theta, g, theta*g, theta^2). Here is a simple 2D Beale-function example; after 30 steps it gets ~50x smaller values than momentum: https://github.com/JarekDuda/SGD-OGR-Hessian-estimator/raw/main/OGR%20beale.png I wanted to propose a discussion about various 2nd-order approaches using only gradients. I am aware of: conjugate gradients, quasi-Newton (especially L-BFGS), Gauss-Newton. Any others? Which one seems the most practical to expand for NN training? How do we transform them to high dimensions? I thought about building a 2nd-order model on an updated, e.g. 10-dimensional, locally interesting space, e.g. from online PCA of gradients, and in the remaining directions still using e.g. momentum. How do we optimally use such an estimated Hessian, especially handling very low and negative eigenvalues? (abs, div & cut above) Slides gathering the various approaches (any interesting ones missing?): https://www.dropbox.com/s/54v8cwqyp7uvddk/SGD.pdf Derivation of this OGR Hessian estimator: https://arxiv.org/pdf/1901.11457 submitted by /u/jarekduda [link] [comments]  ( 57 min )
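A 1-D sketch of my reading of the four-EMA idea (an interpretation, not the author's code): keep bias-corrected exponential moving averages of theta, g, theta*g, and theta^2, fit the regression slope of g against theta as the Hessian estimate, and take a Newton-like step on the averaged point, using |H| to cope with negative curvature:

```python
import numpy as np

def f_grad(theta):                    # toy objective f(theta) = (theta - 3)^2
    return 2.0 * (theta - 3.0)

beta, eps = 0.9, 1e-6
theta, avgs = 0.0, np.zeros(4)        # EMAs of (theta, g, theta*g, theta^2)
for step in range(1, 51):
    g = f_grad(theta)
    avgs = beta * avgs + (1 - beta) * np.array([theta, g, theta * g, theta**2])
    m_t, m_g, m_tg, m_tt = avgs / (1 - beta**step)   # bias correction
    var = m_tt - m_t * m_t
    if var > eps:
        h = (m_tg - m_t * m_g) / var          # slope of g vs theta ~ Hessian
        theta = m_t - m_g / max(abs(h), eps)  # Newton step on the averages
    else:
        theta -= 0.1 * g                      # plain SGD until theta varies
print(theta)                                  # converges to 3.0
```

In higher dimensions this slope becomes the covariance-based regression the post mentions, applied e.g. only inside a low-dimensional PCA subspace of recent gradients.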
    [D] How does OpenAI have such fast inference?
How does OpenAI make calls to their models happen so fast? Shouldn't running inference on these massive models take a long time, even with top-of-the-line hardware? Especially because there's a limit to parallelization. Does anyone have any links to good resources on how they do this? Or any libraries I should look at? submitted by /u/drinkingsomuchcoffee [link] [comments]  ( 62 min )
    [D] Docker image for OpenAI Gym
Good evening everyone, I hope you are well. I am migrating all my repositories to use Docker, and I am having trouble setting up a Docker image that contains Python 3.10, PyTorch, OpenAI Gym, and CUDA, and that displays the training (agent, environment, and interactions). Could someone help me with this? submitted by /u/barash-616 [link] [comments]  ( 56 min )
  • Open

    DiffusionBERT: Improving Generative Masked Language Models with Diffusion Models. (arXiv:2211.15029v2 [cs.CL] CROSS LISTED)
    We present DiffusionBERT, a new generative masked language model based on discrete diffusion models. Diffusion models and many pre-trained language models have a shared training objective, i.e., denoising, making it possible to combine the two powerful models and enjoy the best of both worlds. On the one hand, diffusion models offer a promising training strategy that helps improve the generation quality. On the other hand, pre-trained denoising language models (e.g., BERT) can be used as a good initialization that accelerates convergence. We explore training BERT to learn the reverse process of a discrete diffusion process with an absorbing state and elucidate several designs to improve it. First, we propose a new noise schedule for the forward diffusion process that controls the degree of noise added at each step based on the information of each token. Second, we investigate several designs of incorporating the time step into BERT. Experiments on unconditional text generation demonstrate that DiffusionBERT achieves significant improvement over existing diffusion models for text (e.g., D3PM and Diffusion-LM) and previous generative masked language models in terms of perplexity and BLEU score.  ( 2 min )
    Single-agent to Multi-agent in Deep Reinforcement-learning. (arXiv:2211.15411v2 [cs.LG] UPDATED)
OW QMIX, CW QMIX, QTRAN, QMIX, and VDN are the state-of-the-art algorithms for solving Dec-POMDP domains, yet they fail to solve complex agent-cooperation domains such as box-pushing. We give a 2-stage algorithm to solve such problems. In the first stage we solve the single-agent problem (POMDP) and obtain optimal policy traces. In the second stage we solve the multi-agent problem (Dec-POMDP) with the single-agent optimal policy traces. This single-agent-to-multi-agent approach has a clear advantage over OW QMIX, CW QMIX, QTRAN, QMIX, and VDN on complex agent-cooperation domains.  ( 2 min )
    FedGS: Federated Graph-based Sampling with Arbitrary Client Availability. (arXiv:2211.13975v2 [cs.LG] UPDATED)
While federated learning has shown strong results in optimizing a machine learning model without direct access to the original data, its performance may be hindered by intermittent client availability, which slows down the convergence and biases the final learned model. There are significant challenges to achieve both stable and bias-free training under arbitrary client availability. To address these challenges, we propose a framework named Federated Graph-based Sampling (FedGS), to stabilize the global model update and mitigate the long-term bias given arbitrary client availability simultaneously. First, we model the data correlations of clients with a Data-Distribution-Dependency Graph (3DG) that helps keep the sampled clients' data apart from each other, which is theoretically shown to improve the approximation to the optimal model update. Second, constrained by the far distance in data distribution of the sampled clients, we further minimize the variance of the numbers of times that the clients are sampled, to mitigate long-term bias. To validate the effectiveness of FedGS, we conduct experiments on three datasets under a comprehensive set of seven client availability modes. Our experimental results confirm FedGS's advantage in both enabling a fair client-sampling scheme and improving the model performance under arbitrary client availability. Our code is available at \url{https://github.com/WwZzz/FedGS}.  ( 2 min )
    Good helper is around you: Attention-driven Masked Image Modeling. (arXiv:2211.15362v2 [cs.CV] UPDATED)
    It has been witnessed that masked image modeling (MIM) has shown a huge potential in self-supervised learning in the past year. Benefiting from the universal backbone vision transformer, MIM learns self-supervised visual representations through masking a part of patches of the image while attempting to recover the missing pixels. Most previous works mask patches of the image randomly, which underutilizes the semantic information that is beneficial to visual representation learning. On the other hand, due to the large size of the backbone, most previous works have to spend much time on pre-training. In this paper, we propose \textbf{Attention-driven Masking and Throwing Strategy} (AMT), which could solve both problems above. We first leverage the self-attention mechanism to obtain the semantic information of the image during the training process automatically without using any supervised methods. Masking strategy can be guided by that information to mask areas selectively, which is helpful for representation learning. Moreover, a redundant patch throwing strategy is proposed, which makes learning more efficient. As a plug-and-play module for masked image modeling, AMT improves the linear probing accuracy of MAE by $2.9\% \sim 5.9\%$ on CIFAR-10/100, STL-10, Tiny ImageNet, and ImageNet-1K, and obtains an improved performance with respect to fine-tuning accuracy of MAE and SimMIM. Moreover, this design also achieves superior performance on downstream detection and segmentation tasks. Code is available at https://github.com/guijiejie/AMT.  ( 2 min )
    Incremental Fourier Neural Operator. (arXiv:2211.15188v2 [cs.LG] UPDATED)
    Recently, neural networks have proven their impressive ability to solve partial differential equations (PDEs). Among them, Fourier neural operator (FNO) has shown success in learning solution operators for highly non-linear problems such as turbulence flow. FNO is discretization-invariant, where it can be trained on low-resolution data and generalizes to problems with high-resolution. This property is related to the low-pass filters in FNO, where only a limited number of frequency modes are selected to propagate information. However, it is still a challenge to select an appropriate number of frequency modes and training resolution for different PDEs. Too few frequency modes and low-resolution data hurt generalization, while too many frequency modes and high-resolution data are computationally expensive and lead to over-fitting. To this end, we propose Incremental Fourier Neural Operator (IFNO), which augments both the frequency modes and data resolution incrementally during training. We show that IFNO achieves better generalization (around 15% reduction on testing L2 loss) while reducing the computational cost by 35%, compared to the standard FNO. In addition, we observe that IFNO follows the behavior of implicit regularization in FNO, which explains its excellent generalization ability.  ( 2 min )
    FairGen: Fair Synthetic Data Generation. (arXiv:2210.13023v2 [cs.LG] UPDATED)
With the rising adoption of Machine Learning across domains like banking, pharmaceuticals, ed-tech, etc., it has become of utmost importance to adopt responsible AI methods to ensure models do not unfairly discriminate against any group. Given the lack of clean training data, generative adversarial techniques are preferred for generating synthetic data, with several state-of-the-art architectures readily available across various domains, from unstructured data such as text and images to structured datasets modelling fraud detection, and many more. These techniques overcome several challenges such as class imbalance, limited training data, and restricted access to data due to privacy issues. Existing work focusing on generating fair data either works for a certain GAN architecture or is very difficult to tune across GANs. In this paper, we propose a pipeline to generate fairer synthetic data independent of the GAN architecture. The proposed pipeline utilizes a pre-processing algorithm to identify and remove bias-inducing samples. In particular, we claim that while generating synthetic data, most GANs amplify bias present in the training data, but by removing these bias-inducing samples, GANs essentially focus more on real, informative samples. Our experimental evaluation on two open-source datasets demonstrates how the proposed pipeline generates fair data along with improved performance in some cases.  ( 2 min )
    Convergence of Stochastic Approximation via Martingale and Converse Lyapunov Methods. (arXiv:2205.01303v2 [stat.ML] UPDATED)
    In this paper, we study the almost sure boundedness and the convergence of the stochastic approximation (SA) algorithm. At present, most available convergence proofs are based on the ODE method, and the almost sure boundedness of the iterations is an assumption and not a conclusion. In Borkar-Meyn (2000), it is shown that if the ODE has only one globally attractive equilibrium, then under additional assumptions, the iterations are bounded almost surely, and the SA algorithm converges to the desired solution. Our objective in the present paper is to provide an alternate proof of the above, based on martingale methods, which are simpler and less technical than those based on the ODE method. As a prelude, we prove a new sufficient condition for the global asymptotic stability of an ODE. Next we prove a ``converse'' Lyapunov theorem on the existence of a suitable Lyapunov function with a globally bounded Hessian, for a globally exponentially stable system. Both theorems are of independent interest to researchers in stability theory. Then, using these results, we provide sufficient conditions for the almost sure boundedness and the convergence of the SA algorithm. We show through examples that our theory covers some situations that are not covered by currently known results, specifically Borkar-Meyn (2000).  ( 2 min )
    Flip Initial Features: Generalization of Neural Networks for Semi-supervised Node Classification. (arXiv:2211.15081v2 [cs.LG] UPDATED)
    Graph neural networks (GNNs) have been widely used under semi-supervised settings. Prior studies have mainly focused on finding appropriate graph filters (e.g., aggregation schemes) to generalize well for both homophilic and heterophilic graphs. Even though these approaches are essential and effective, they still suffer from the sparsity in initial node features inherent in the bag-of-words representation. Common in semi-supervised learning where the training samples often fail to cover the entire dimensions of graph filters (hyperplanes), this can precipitate over-fitting of specific dimensions in the first projection matrix. To deal with this problem, we suggest a simple and novel strategy; create additional space by flipping the initial features and hyperplane simultaneously. Training in both the original and in the flip space can provide precise updates of learnable parameters. To the best of our knowledge, this is the first attempt that effectively moderates the overfitting problem in GNN. Extensive experiments on real-world datasets demonstrate that the proposed technique improves the node classification accuracy up to 40.2 %  ( 2 min )
    Retrieval-enhanced Graph Neural Networks for Graph Property Prediction. (arXiv:2206.00362v3 [cs.LG] UPDATED)
Graph Neural Networks (GNNs) are effective tools for graph representation learning. Most GNNs rely on a recursive neighborhood aggregation scheme, named message passing, so their theoretical expressive power is limited to the first-order Weisfeiler-Lehman test (1-WL). Motivated by the success of retrieval-based models and off-the-shelf high-performance retrieval systems, we propose a non-parametric and model-agnostic scheme called GraphRetrieval to boost existing GNN models. In GraphRetrieval, similar training graphs associated with their ground-truth labels are retrieved as an enhancement to be jointly utilized with the input graph representation to complete various graph property predictive tasks. In particular, to effectively "absorb" useful information from retrieved graphs and "ignore" possible noise, we introduce an adapter based on self-attention to explicitly learn the interaction between an input graph and its retrieved similar graphs. By experimenting with three classic GNN models on 12 different datasets, we have demonstrated GraphRetrieval is able to bring substantial improvements to existing GNN models without compromising the model size and the prediction efficiency. Our work is also the first to validate the feasibility and effectiveness of retrieval-enhanced graph neural networks.  ( 2 min )
    First Steps Toward Understanding the Extrapolation of Nonlinear Models to Unseen Domains. (arXiv:2211.11719v2 [cs.LG] UPDATED)
    Real-world machine learning applications often involve deploying neural networks to domains that are not seen in the training time. Hence, we need to understand the extrapolation of nonlinear models -- under what conditions on the distributions and function class, models can be guaranteed to extrapolate to new test distributions. The question is very challenging because even two-layer neural networks cannot be guaranteed to extrapolate outside the support of the training distribution without further assumptions on the domain shift. This paper makes some initial steps toward analyzing the extrapolation of nonlinear models for structured domain shift. We primarily consider settings where the marginal distribution of each coordinate of the data (or subset of coordinates) does not shift significantly across the training and test distributions, but the joint distribution may have a much bigger shift. We prove that the family of nonlinear models of the form $f(x)=\sum f_i(x_i)$, where $f_i$ is an arbitrary function on the subset of features $x_i$, can extrapolate to unseen distributions, if the covariance of the features is well-conditioned. To the best of our knowledge, this is the first result that goes beyond linear models and the bounded density ratio assumption, even though the assumptions on the distribution shift and function class are stylized.  ( 2 min )
    Temporal Logic Imitation: Learning Plan-Satisficing Motion Policies from Demonstrations. (arXiv:2206.04632v2 [cs.RO] UPDATED)
    Learning from demonstration (LfD) has succeeded in tasks featuring a long time horizon. However, when the problem complexity also includes human-in-the-loop perturbations, state-of-the-art approaches do not guarantee the successful reproduction of a task. In this work, we identify the roots of this challenge as the failure of a learned continuous policy to satisfy the discrete plan implicit in the demonstration. By utilizing modes (rather than subgoals) as the discrete abstraction and motion policies with both mode invariance and goal reachability properties, we prove our learned continuous policy can simulate any discrete plan specified by a linear temporal logic (LTL) formula. Consequently, an imitator is robust to both task- and motion-level perturbations and guaranteed to achieve task success. Project page: https://sites.google.com/view/ltl-ds  ( 2 min )
    Semantic uncertainty intervals for disentangled latent spaces. (arXiv:2207.10074v2 [cs.CV] UPDATED)
Meaningful uncertainty quantification in computer vision requires reasoning about semantic information -- say, the hair color of the person in a photo or the location of a car on the street. To this end, recent breakthroughs in generative modeling allow us to represent semantic information in disentangled latent spaces, but providing uncertainties on the semantic latent variables has remained challenging. In this work, we provide principled uncertainty intervals that are guaranteed to contain the true semantic factors for any underlying generative model. The method does the following: (1) it uses quantile regression to output a heuristic uncertainty interval for each element in the latent space, and (2) it calibrates these uncertainties such that they contain the true value of the latent for a new, unseen input. The endpoints of these calibrated intervals can then be propagated through the generator to produce interpretable uncertainty visualizations for each semantic factor. This technique reliably communicates semantically meaningful, principled, and instance-adaptive uncertainty in inverse problems like image super-resolution and image completion.  ( 2 min )
    Frido: Feature Pyramid Diffusion for Complex Scene Image Synthesis. (arXiv:2208.13753v2 [cs.CV] UPDATED)
    Diffusion models (DMs) have shown great potential for high-quality image synthesis. However, when it comes to producing images with complex scenes, how to properly describe both image global structures and object details remains a challenging task. In this paper, we present Frido, a Feature Pyramid Diffusion model performing a multi-scale coarse-to-fine denoising process for image synthesis. Our model decomposes an input image into scale-dependent vector quantized features, followed by a coarse-to-fine gating for producing image output. During the above multi-scale representation learning stage, additional input conditions like text, scene graph, or image layout can be further exploited. Thus, Frido can be also applied for conditional or cross-modality image synthesis. We conduct extensive experiments over various unconditioned and conditional image generation tasks, ranging from text-to-image synthesis, layout-to-image, scene-graph-to-image, to label-to-image. More specifically, we achieved state-of-the-art FID scores on five benchmarks, namely layout-to-image on COCO and OpenImages, scene-graph-to-image on COCO and Visual Genome, and label-to-image on COCO. Code is available at https://github.com/davidhalladay/Frido.  ( 2 min )
    Estimating the randomness of quantum circuit ensembles up to 50 qubits. (arXiv:2205.09900v2 [quant-ph] UPDATED)
Random quantum circuits have been utilized in the contexts of quantum supremacy demonstrations, variational quantum algorithms for chemistry and machine learning, and black hole information. The ability of random circuits to approximate any random unitaries has consequences on their complexity, expressibility, and trainability. To study this property of random circuits, we develop numerical protocols for estimating the frame potential, the distance between a given ensemble and the exact randomness. Our tensor-network-based algorithm has polynomial complexity for shallow circuits and is high-performing using CPU and GPU parallelism. We study 1. local and parallel random circuits to verify the linear growth in complexity as stated by the Brown-Susskind conjecture, and 2. hardware-efficient ansätze to shed light on their expressibility and the barren plateau problem in the context of variational algorithms. Our work shows that large-scale tensor network simulations could provide important hints toward open problems in quantum information science.  ( 2 min )
    Simple and Effective Synthesis of Indoor 3D Scenes. (arXiv:2204.02960v2 [cs.CV] UPDATED)
    We study the problem of synthesizing immersive 3D indoor scenes from one or more images. Our aim is to generate high-resolution images and videos from novel viewpoints, including viewpoints that extrapolate far beyond the input images while maintaining 3D consistency. Existing approaches are highly complex, with many separately trained stages and components. We propose a simple alternative: an image-to-image GAN that maps directly from reprojections of incomplete point clouds to full high-resolution RGB-D images. On the Matterport3D and RealEstate10K datasets, our approach significantly outperforms prior work when evaluated by humans, as well as on FID scores. Further, we show that our model is useful for generative data augmentation. A vision-and-language navigation (VLN) agent trained with trajectories spatially-perturbed by our model improves success rate by up to 1.5% over a state of the art baseline on the R2R benchmark. Our code will be made available to facilitate generative data augmentation and applications to downstream robotics and embodied AI tasks.  ( 2 min )
    Approximate Conditional Coverage & Calibration via Neural Model Approximations. (arXiv:2205.14310v3 [cs.LG] UPDATED)
    A typical desideratum for quantifying the uncertainty from a classification model as a prediction set is class-conditional singleton set calibration. That is, such sets should map to the output of well-calibrated selective classifiers, matching the observed frequencies of similar instances. Recent works proposing adaptive and localized conformal p-values for deep networks do not guarantee this behavior, nor do they achieve it empirically. Instead, we use the strong signals for prediction reliability from KNN-based approximations of Transformer networks to construct data-driven partitions for Mondrian Conformal Predictors, which are treated as weak selective classifiers that are then calibrated via a new Inductive Venn Predictor, the Venn-ADMIT Predictor. The resulting selective classifiers are well-calibrated, in a conservative but practically useful sense for a given threshold. They are inherently robust to changes in the proportions of the data partitions, and straightforward conservative heuristics provide additional robustness to covariate shifts. We compare and contrast to the quantities produced by recent Conformal Predictors on several representative and challenging natural language processing classification tasks, including class-imbalanced and distribution-shifted settings.  ( 2 min )
    Speeding up Heterogeneous Federated Learning with Sequentially Trained Superclients. (arXiv:2201.10899v2 [cs.LG] UPDATED)
Federated Learning (FL) allows training machine learning models in privacy-constrained scenarios by enabling the cooperation of edge devices without requiring local data sharing. This approach raises several challenges due to the different statistical distribution of the local datasets and the clients' computational heterogeneity. In particular, the presence of highly non-i.i.d. data severely impairs both the performance of the trained neural network and its convergence rate, increasing the number of communication rounds required to reach performance comparable to that of the centralized scenario. As a solution, we propose FedSeq, a novel framework leveraging the sequential training of subgroups of heterogeneous clients, i.e. superclients, to emulate the centralized paradigm in a privacy-compliant way. Given a fixed budget of communication rounds, we show that FedSeq outperforms or matches several state-of-the-art federated algorithms in terms of final performance and speed of convergence. Finally, our method can be easily integrated with other approaches available in the literature. Empirical results show that combining existing algorithms with FedSeq further improves its final performance and convergence speed. We test our method on CIFAR-10 and CIFAR-100 and prove its effectiveness in both i.i.d. and non-i.i.d. scenarios.  ( 2 min )
    Nonlinear Kernel Support Vector Machine with 0-1 Soft Margin Loss. (arXiv:2203.00399v2 [cs.LG] UPDATED)
Recent advances on the linear support vector machine with the 0-1 soft margin loss ($L_{0/1}$-SVM) show that the 0-1 loss problem can be solved directly. However, its theoretical and algorithmic requirements prevent us from extending the linear solving framework directly to its nonlinear kernel form; the absence of an explicit expression for the Lagrangian dual function of $L_{0/1}$-SVM is one big deficiency among them. In this paper, by applying the nonparametric representation theorem, we propose a nonlinear model for the support vector machine with 0-1 soft margin loss, called $L_{0/1}$-KSVM, which cunningly incorporates the kernel technique and, more importantly, follows the success of systematically solving the linear task. Its optimality condition is explored theoretically, and a working-set-selection alternating direction method of multipliers (ADMM) algorithm is introduced to acquire its numerical solution. Moreover, we present the first closed-form definition of the support vector (SV) of $L_{0/1}$-KSVM. Theoretically, we prove that all SVs of $L_{0/1}$-KSVM are located only on the parallel decision surfaces. The experiments also show that $L_{0/1}$-KSVM has far fewer SVs, while maintaining decent prediction accuracy, compared to its linear peer $L_{0/1}$-SVM and six other nonlinear benchmark SVM classifiers.  ( 2 min )
    OLIVE: Oblivious and Differentially Private Federated Learning on Trusted Execution Environment. (arXiv:2202.07165v3 [cs.LG] UPDATED)
    Differentially private federated learning (DP-FL) has received increasing attention to mitigate the privacy risk in federated learning. Although different schemes for DP-FL have been proposed, there is still a utility gap. Employing central Differential Privacy in FL (CDP-FL) can provide a good balance between the privacy and model utility, but requires a trusted server. Using Local Differential Privacy for FL (LDP-FL) does not require a trusted server, but suffers from lousy privacy-utility trade-off. Recently proposed shuffle DP based FL has the potential to bridge the gap between CDP-FL and LDP-FL without a trusted server; however, there is still a utility gap when the number of model parameters is large. In this work, we propose OLIVE, a system that combines the merits from CDP-FL and LDP-FL by leveraging Trusted Execution Environment (TEE). Our main technical contributions are the analysis and countermeasures against the vulnerability of TEE in OLIVE. Firstly, we theoretically analyze the memory access pattern leakage of OLIVE and find that there is a risk for sparsified gradients, which is common in FL. Secondly, we design an inference attack to understand how the memory access pattern could be linked to the training data. Thirdly, we propose oblivious yet efficient algorithms to prevent the memory access pattern leakage in OLIVE. Our experiments on real-world data demonstrate that OLIVE is efficient even when training a model with hundreds of thousands of parameters and effective against side-channel attacks on TEE.  ( 2 min )
    GROW: A Row-Stationary Sparse-Dense GEMM Accelerator for Memory-Efficient Graph Convolutional Neural Networks. (arXiv:2203.00158v4 [cs.AR] UPDATED)
    Graph convolutional neural networks (GCNs) have emerged as a key technology in various application domains where the input data is relational. A unique property of GCNs is that its two primary execution stages, aggregation and combination, exhibit drastically different dataflows. Consequently, prior GCN accelerators tackle this research space by casting the aggregation and combination stages as a series of sparse-dense matrix multiplication. However, prior work frequently suffers from inefficient data movements, leaving significant performance left on the table. We present GROW, a GCN accelerator based on Gustavson's algorithm to architect a row-wise product based sparse-dense GEMM accelerator. GROW co-designs the software/hardware that strikes a balance in locality and parallelism for GCNs, achieving significant energy-efficiency improvements vs. state-of-the-art GCN accelerators.  ( 2 min )
    Redactor: A Data-centric and Individualized Defense Against Inference Attacks. (arXiv:2202.02902v2 [cs.LG] UPDATED)
    Information leakage is becoming a critical problem as various kinds of information are mistakenly made public and machine learning models are trained on that data to provide services. As a result, one's private information could easily be memorized by such trained models. Unfortunately, deleting the information is out of the question, as the data is already exposed to the Web or third-party platforms. Moreover, we cannot necessarily control the labeling process or the model training performed by other parties. In this setting, we study the problem of targeted disinformation generation, where the goal is to dilute the data and thus make a model safer and more robust against inference attacks on a specific target (e.g., a person's profile) by only inserting new data. Our method finds the points closest to the target in the input space that will be labeled as a different class. Since we cannot control the labeling process, we instead conservatively estimate the labels probabilistically by combining the decision boundaries of multiple classifiers using data programming techniques. Our experiments show that a probabilistic decision boundary can be a good proxy for labelers, and that our approach is effective in defending against inference attacks and can scale to large data.  ( 2 min )
    Optimal Transport of Classifiers to Fairness. (arXiv:2202.03814v3 [cs.LG] UPDATED)
    In past work on fairness in machine learning, the focus has been on forcing the prediction of classifiers to have similar statistical properties for people of different demographics. To reduce the violation of these properties, fairness methods usually simply rescale the classifier scores, ignoring similarities and dissimilarities between members of different groups. Yet, we hypothesize that such information is relevant in quantifying the unfairness of a given classifier. To validate this hypothesis, we introduce Optimal Transport to Fairness (OTF), a method that quantifies the violation of fairness constraints as the smallest Optimal Transport cost between a probabilistic classifier and any score function that satisfies these constraints. For a flexible class of linear fairness constraints, we construct a practical way to compute OTF as a differentiable fairness regularizer that can be added to any standard classification setting. Experiments show that OTF can be used to achieve an improved trade-off between predictive power and fairness.  ( 2 min )
    Learning to Iteratively Solve Routing Problems with Dual-Aspect Collaborative Transformer. (arXiv:2110.02544v3 [cs.LG] UPDATED)
    Recently, Transformer has become a prevailing deep architecture for solving vehicle routing problems (VRPs). However, it is less effective in learning improvement models for VRPs because its positional encoding (PE) method is not suitable for representing VRP solutions. This paper presents a novel Dual-Aspect Collaborative Transformer (DACT) that learns embeddings for the node and positional features separately, instead of fusing them together as in existing methods, so as to avoid potential noise and incompatible correlations. Moreover, the positional features are embedded through a novel cyclic positional encoding (CPE) method that allows Transformer to effectively capture the circularity and symmetry of VRP solutions (i.e., cyclic sequences). We train DACT using Proximal Policy Optimization and design a curriculum learning strategy for better sample efficiency. We apply DACT to solve the traveling salesman problem (TSP) and the capacitated vehicle routing problem (CVRP). Results show that DACT outperforms existing Transformer-based improvement models and exhibits much better generalization across different problem sizes on both synthetic and benchmark instances.  ( 2 min )
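    The key property of a cyclic encoding is that positions live on a circle, so the first and last nodes of a tour receive nearby embeddings. The Python sketch below illustrates one generic way to build such an encoding; the frequency schedule is an assumption for illustration, not necessarily the exact CPE formulation used in DACT.

        import numpy as np

        def cyclic_positional_encoding(n_positions, d_model):
            # Map position i onto an angle on the circle, then embed it with
            # integer-frequency sinusoids so that position n wraps back to 0.
            pe = np.zeros((n_positions, d_model))
            for i in range(n_positions):
                angle = 2.0 * np.pi * i / n_positions
                for k in range(d_model // 2):
                    freq = k + 1  # integer frequencies keep the encoding periodic
                    pe[i, 2 * k] = np.sin(freq * angle)
                    pe[i, 2 * k + 1] = np.cos(freq * angle)
            return pe

        pe = cyclic_positional_encoding(20, 8)
        # The first and last positions end up close in embedding space, unlike
        # with the standard (acyclic) sinusoidal encoding.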
    EXPObench: Benchmarking Surrogate-based Optimisation Algorithms on Expensive Black-box Functions. (arXiv:2106.04618v2 [cs.LG] UPDATED)
    Surrogate algorithms such as Bayesian optimisation are especially designed for black-box optimisation problems with expensive objectives, such as hyperparameter tuning or simulation-based optimisation. In the literature, these algorithms are usually evaluated with synthetic benchmarks which are well established but have no expensive objective, and only on one or two real-life applications which vary wildly between papers. There is a clear lack of standardisation when it comes to benchmarking surrogate algorithms on real-life, expensive, black-box objective functions. This makes it very difficult to draw conclusions on the effect of algorithmic contributions and to give substantial advice on which method to use when. A new benchmark library, EXPObench, provides first steps towards such a standardisation. The library is used to provide an extensive comparison of six different surrogate algorithms on four expensive optimisation problems from different real-life applications. This has led to new insights regarding the relative importance of exploration, the evaluation time of the objective, and the used model. We also provide rules of thumb for which surrogate algorithm to use in which situation. A further contribution is that we make the algorithms and benchmark problem instances publicly available, contributing to more uniform analysis of surrogate algorithms. Most importantly, we include the performance of the six algorithms on all evaluated problem instances. This results in a unique new dataset that lowers the bar for researching new methods as the number of expensive evaluations required for comparison is significantly reduced.  ( 2 min )
    Cluster-Specific Predictions with Multi-Task Gaussian Processes. (arXiv:2011.07866v4 [cs.LG] UPDATED)
    A model involving Gaussian processes (GPs) is introduced to simultaneously handle multi-task learning, clustering, and prediction for multiple functional data. This procedure acts as a model-based clustering method for functional data as well as a learning step for subsequent predictions for new tasks. The model is instantiated as a mixture of multi-task GPs with common mean processes. A variational EM algorithm is derived to optimise the hyper-parameters along with the hyper-posteriors of the latent variables and processes. We establish explicit formulas for integrating the mean processes and the latent clustering variables within a predictive distribution, accounting for uncertainty in both aspects. This distribution is defined as a mixture of cluster-specific GP predictions, which enhances performance when dealing with group-structured data. The model handles irregular grids of observations and offers different hypotheses on the covariance structure for sharing additional information across tasks. The performance on both clustering and prediction tasks is assessed through various simulated scenarios and real datasets. The overall algorithm, called MagmaClust, is publicly available as an R package.  ( 2 min )
    Quantum machine learning of large datasets using randomized measurements. (arXiv:2108.01039v3 [quant-ph] UPDATED)
    Quantum computers promise to enhance machine learning for practical applications. Quantum machine learning for real-world data has to handle extensive amounts of high-dimensional data. However, conventional methods for measuring quantum kernels are impractical for large datasets, as they scale with the square of the dataset size. Here, we measure quantum kernels using randomized measurements. The quantum computation time scales linearly with dataset size, and the classical post-processing time scales quadratically. While our method scales exponentially in the qubit number in general, we gain a substantial speed-up when running on intermediate-sized quantum computers. Further, we efficiently encode high-dimensional data into quantum computers, with the number of features scaling linearly with the circuit depth. The encoding is characterized by the quantum Fisher information metric and is related to the radial basis function kernel. Our approach is robust to noise via a cost-free error mitigation scheme. We demonstrate the advantages of our methods for noisy quantum computers by classifying images with the IBM quantum computer. To achieve further speedups we distribute the quantum computational tasks between different quantum computers. Our method enables benchmarking of quantum machine learning algorithms with large datasets on currently available quantum computers.  ( 2 min )
    Safe Value Functions. (arXiv:2105.12204v4 [eess.SY] UPDATED)
    Safety constraints and optimality are important but sometimes conflicting criteria for controllers. Although these criteria are often solved separately with different tools to maintain formal guarantees, it is also common practice in reinforcement learning to simply modify reward functions by penalizing failures, with the penalty treated as a mere heuristic. We rigorously examine the relationship of both safety and optimality to penalties, and formalize sufficient conditions for safe value functions (SVFs): value functions that are both optimal for a given task and enforce safety constraints. We reveal this structure by examining when rewards preserve viability under optimal control, and show that there always exists a finite penalty that induces a safe value function. This penalty is not unique, but is unbounded above: larger penalties do not harm optimality. Although it is often not possible to compute the minimum required penalty, we reveal clear structure in how the penalty, rewards, discount factor, and dynamics interact. This insight suggests practical, theory-guided heuristics for designing reward functions for control problems where safety is important.  ( 2 min )
    Dendrite Net with Acceleration Module for Faster Nonlinear Mapping and System Identification. (arXiv:2006.02901v2 [cs.LG] UPDATED)
    Nonlinear mapping is an essential and common demand in online systems, such as sensor systems and mobile phones. Accelerating nonlinear mapping will directly speed up such online systems. The authors of this paper previously proposed a Dendrite Net (DD) with enormously lower time complexity than existing nonlinear mapping algorithms; however, redundant calculations remain in DD. This paper presents a DD with an acceleration module (AC) to accelerate nonlinear mapping further. We conduct three experiments to verify whether DD with AC has lower time complexity while retaining DD's nonlinear mapping and system identification properties: the first experiment covers the precision and identification of unary nonlinear mapping, reflecting the calculation performance of DD with AC for basic functions in online systems; the second covers the mapping precision and identification of multi-input nonlinear systems, reflecting the performance for designing online systems via DD with AC; finally, this paper compares the time complexity of DD and DD with AC and analyzes the theoretical reasons through repeated experiments. Results: DD with AC retains DD's excellent mapping and identification properties and has lower time complexity. Significance: DD with AC can be used for most engineering systems, such as sensor systems, and will speed up computation in these online systems. The code of DD with AC is available at https://github.com/liugang1234567/Gang-neuron  ( 2 min )
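    For context, the base DD module described in the authors' earlier work is a short recurrence: each layer applies a linear map and then gates the result elementwise by the raw input, so the network output is a polynomial of the inputs. The Python sketch below follows that description; it is an illustrative reconstruction (the acceleration module itself is not reproduced), and the layer rule should be checked against the authors' published code.

        import numpy as np

        def dendrite_net(X, weights):
            # Sketch of the base DD forward pass: A <- (W @ A) * X, i.e. a
            # linear map followed by a Hadamard product with the raw input,
            # which makes the output a polynomial of the inputs. The
            # acceleration module (AC) from the paper is not reproduced here.
            A = X
            for W in weights[:-1]:
                A = (W @ A) * X
            return weights[-1] @ A  # final linear readout

        X = np.random.randn(3, 5)  # 3 inputs, batch of 5
        Ws = [np.random.randn(3, 3) for _ in range(2)] + [np.random.randn(1, 3)]
        y = dendrite_net(X, Ws)    # shape (1, 5)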
    Automated Grading System of Retinal Arterio-venous Crossing Patterns: A Deep Learning Approach Replicating Ophthalmologist's Diagnostic Process of Arteriolosclerosis. (arXiv:2011.03772v2 [eess.IV] UPDATED)
    The status of retinal arteriovenous crossings is of great significance for the clinical evaluation of arteriolosclerosis and systemic hypertension. As an ophthalmic diagnostic criterion, Scheie's classification has been used to grade the severity of arteriolosclerosis. In this paper, we propose a deep learning approach to support the diagnosis process, which, to the best of our knowledge, is one of the earliest attempts in medical imaging. The proposed pipeline is three-fold. First, we adopt segmentation and classification models to automatically obtain vessels in a retinal image with the corresponding artery/vein labels and find candidate arteriovenous crossing points. Second, we use a classification model to validate the true crossing points. Finally, the severity grade of the vessel crossings is classified. To better address the problems of label ambiguity and imbalanced label distribution, we propose a new model, named multi-diagnosis team network (MDTNet), in which sub-models with different structures or different loss functions provide different decisions. MDTNet unifies these diverse theories to give the final decision with high accuracy. Our severity grading method was able to validate crossing points with precision and recall of 96.3% and 96.3%, respectively. Among correctly detected crossing points, the kappa value for the agreement between the grading by a retina specialist and the estimated score was 0.85, with an accuracy of 0.92. The numerical results demonstrate that our method can achieve good performance in both arteriovenous crossing validation and severity grading tasks. With the proposed models, we build a pipeline that reproduces a retina specialist's subjective grading without handcrafted feature extraction. The code is available for reproducibility.  ( 3 min )
    Learning to forecast vegetation greenness at fine resolution over Africa with ConvLSTMs. (arXiv:2210.13648v2 [cs.LG] UPDATED)
    Forecasting the state of vegetation in response to climate and weather events is a major challenge. Doing so will prove crucial for predicting crop yield, forest damage, and more generally the impact on ecosystem services relevant to socio-economic functioning, whose loss can lead to humanitarian disasters. Vegetation status depends on weather and environmental conditions that modulate complex ecological processes taking place at several timescales. Interactions between vegetation and different environmental drivers express responses through instantaneous but also time-lagged effects, often showing an emergent spatial context at landscape and regional scales. We formulate the land surface forecasting task as a strongly guided video prediction task, where the objective is to forecast vegetation development at very fine resolution using topography and weather variables to guide the prediction. We use a Convolutional LSTM (ConvLSTM) architecture to address this task and predict changes in the vegetation state in Africa using Sentinel-2 satellite NDVI, with ERA5 weather reanalysis, SMAP satellite measurements, and topography (DEM of SRTMv4.1) as variables to guide the prediction. Our results highlight how ConvLSTM models can forecast not only the seasonal evolution of NDVI at high resolution, but also the differential impacts of weather anomalies over the baselines. The model is able to predict different vegetation types, even those with very high NDVI variability over the target period, which is promising for supporting anticipatory action in the context of drought-related disasters.  ( 2 min )
    Meta-Learning Biologically Plausible Plasticity Rules with Random Feedback Pathways. (arXiv:2210.16414v4 [q-bio.NC] UPDATED)
    Backpropagation is widely used to train artificial neural networks, but its relationship to synaptic plasticity in the brain is unknown. Some biological models of backpropagation rely on feedback projections that are symmetric with feedforward connections, but experiments do not corroborate the existence of such symmetric backward connectivity. Random feedback alignment offers an alternative model in which errors are propagated backward through fixed, random backward connections. This approach successfully trains shallow models, but learns slowly and does not perform well with deeper models or online learning. In this study, we develop a novel meta-plasticity approach to discover interpretable, biologically plausible plasticity rules that improve online learning performance with fixed random feedback connections. The resulting plasticity rules show improved online training of deep models in the low data regime. Our results highlight the potential of meta-plasticity to discover effective, interpretable learning rules satisfying biological constraints.  ( 2 min )
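    The starting point the paper builds on, random feedback alignment, is simple to state: the backward pass replaces the transposed forward weights with a fixed random matrix. A minimal numpy sketch for a two-layer network follows; it implements vanilla feedback alignment only, not the meta-learned plasticity rules proposed in the paper, and all sizes are illustrative.

        import numpy as np

        rng = np.random.default_rng(0)
        n_in, n_hid, n_out, lr = 4, 16, 2, 0.1
        W1 = rng.normal(0, 0.5, (n_hid, n_in))
        W2 = rng.normal(0, 0.5, (n_out, n_hid))
        B = rng.normal(0, 0.5, (n_hid, n_out))  # fixed random feedback, never trained

        def train_step(x, y):
            global W1, W2
            h = np.tanh(W1 @ x)
            e = W2 @ h - y                 # output error
            dh = (B @ e) * (1 - h ** 2)    # backward pass uses B, not W2.T
            W2 -= lr * np.outer(e, h)
            W1 -= lr * np.outer(dh, x)
            return float((e ** 2).sum())

        x, y = rng.normal(size=n_in), np.array([1.0, -1.0])
        losses = [train_step(x, y) for _ in range(50)]
        # The loss decreases even though the feedback weights are random and fixed.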
    Conditional Neural Processes for Molecules. (arXiv:2210.09211v2 [stat.ML] UPDATED)
    Neural processes (NPs) are models for transfer learning with properties reminiscent of Gaussian Processes (GPs). They are adept at modelling data consisting of few observations of many related functions on the same input space and are trained by minimizing a variational objective, which is computationally much less expensive than the Bayesian updating required by GPs. So far, most studies of NPs have focused on low-dimensional datasets which are not representative of realistic transfer learning tasks. Drug discovery is one application area that is characterized by datasets consisting of many chemical properties or functions which are sparsely observed, yet depend on shared features or representations of the molecular inputs. This paper applies the conditional neural process (CNP) to DOCKSTRING, a dataset of docking scores for benchmarking ML models. CNPs show competitive performance in few-shot learning tasks relative to supervised learning baselines common in chemoinformatics, as well as an alternative model for transfer learning based on pre-training and refining neural network regressors. We present a Bayesian optimization experiment which showcases the probabilistic nature of CNPs and discuss shortcomings of the model in uncertainty quantification.  ( 2 min )
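    The CNP recipe itself is compact: encode each context (x, y) pair, mean-pool into a single representation, and decode each target input into a predictive Gaussian, training on the log-likelihood of the targets. Below is a minimal PyTorch sketch with assumed, illustrative layer sizes; it is a generic CNP, not the exact architecture used in the paper's DOCKSTRING experiments.

        import torch
        import torch.nn as nn

        class MiniCNP(nn.Module):
            def __init__(self, x_dim=1, y_dim=1, r_dim=64):
                super().__init__()
                self.encoder = nn.Sequential(
                    nn.Linear(x_dim + y_dim, r_dim), nn.ReLU(), nn.Linear(r_dim, r_dim))
                self.decoder = nn.Sequential(
                    nn.Linear(x_dim + r_dim, r_dim), nn.ReLU(), nn.Linear(r_dim, 2 * y_dim))

            def forward(self, xc, yc, xt):
                # Mean-pool the encoded context pairs into one representation r.
                r = self.encoder(torch.cat([xc, yc], -1)).mean(0, keepdim=True)
                out = self.decoder(torch.cat([xt, r.expand(len(xt), -1)], -1))
                mu, raw_sigma = out.chunk(2, -1)
                return mu, 0.01 + torch.nn.functional.softplus(raw_sigma)

        model = MiniCNP()
        xc, yc = torch.randn(8, 1), torch.randn(8, 1)    # few observed (x, y) pairs
        xt, yt = torch.randn(20, 1), torch.randn(20, 1)  # targets
        mu, sigma = model(xc, yc, xt)
        loss = -torch.distributions.Normal(mu, sigma).log_prob(yt).mean()
        loss.backward()  # one maximum-likelihood training step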
    Generative Knowledge Graph Construction: A Review. (arXiv:2210.12714v2 [cs.CL] UPDATED)
    Generative Knowledge Graph Construction (KGC) refers to those methods that leverage the sequence-to-sequence framework for building knowledge graphs, which is flexible and can be adapted to widespread tasks. In this study, we summarize the recent compelling progress in generative knowledge graph construction. We present the advantages and weaknesses of each paradigm in terms of different generation targets and provide theoretical insight and empirical analysis. Based on the review, we suggest promising research directions for the future. Our contributions are threefold: (1) We present a detailed, complete taxonomy for the generative KGC methods; (2) We provide a theoretical and empirical analysis of the generative KGC methods; (3) We propose several research directions that can be developed in the future.  ( 2 min )
    Tail Batch Sampling: Approximating Global Contrastive Losses as Optimization over Batch Assignments. (arXiv:2210.12874v2 [cs.LG] UPDATED)
    Contrastive learning has recently achieved state-of-the-art performance in a wide range of tasks. Many contrastive learning approaches use mined hard negatives to make batches more informative during training, but these approaches are inefficient, as they increase epoch length in proportion to the number of mined negatives and require frequent updates of nearest-neighbor indices or mining from recent batches. In this work, we provide an alternative to hard negative mining in supervised contrastive learning: Tail Batch Sampling (TBS), an efficient approximation to the batch assignment problem that upper bounds the gap between the global and training losses, $\mathcal{L}^{Global} - \mathcal{L}^{Train}$. TBS \textbf{improves state-of-the-art performance} in sentence embedding (+0.37 Spearman) and code-search tasks (+2.2\% MRR), is easy to implement (requiring only a few additional lines of code), does not maintain external data structures such as nearest-neighbor indices, is more computationally efficient than the most minimal hard negative mining approaches, and makes no changes to the model being trained.  ( 2 min )
    DiffBP: Generative Diffusion of 3D Molecules for Target Protein Binding. (arXiv:2211.11214v2 [q-bio.BM] UPDATED)
    Generating molecules that bind to specific proteins is an important but challenging task in drug discovery. Previous works usually generate atoms in an auto-regressive way, where element types and 3D coordinates of atoms are generated one by one. However, in real-world molecular systems, the interactions among atoms in an entire molecule are global, leading to an energy function that is pair-coupled among atoms. From this energy-based perspective, the modeling of probability should be based on joint distributions rather than sequentially conditional ones, and the unnatural sequential auto-regressive modeling of molecule generation is likely to violate physical rules, resulting in generated molecules with poor properties. In this work, a generative diffusion model for molecular 3D structures based on target proteins as contextual constraints is established, at a full-atom level and in a non-autoregressive way. Given a designated 3D protein binding site, our model learns the generative process that denoises both element types and 3D coordinates of an entire molecule, using an equivariant network. Experimentally, the proposed method shows competitive performance compared with prevailing works in terms of high affinity with proteins and appropriate molecule sizes, as well as other drug properties such as the drug-likeness of the generated molecules.  ( 2 min )
    DiffWire: Inductive Graph Rewiring via the Lov\'asz Bound. (arXiv:2206.07369v3 [cs.LG] UPDATED)
    Graph Neural Networks (GNNs) have been shown to achieve competitive results on graph-related tasks, such as node and graph classification, link prediction, and node and graph clustering, in a variety of domains. Most GNNs use a message passing framework and hence are called MPNNs. Despite their promising results, MPNNs have been reported to suffer from over-smoothing, over-squashing, and under-reaching. Graph rewiring and graph pooling have been proposed in the literature as solutions to address these limitations. However, most state-of-the-art graph rewiring methods fail to preserve the global topology of the graph, are neither differentiable nor inductive, and require the tuning of hyper-parameters. In this paper, we propose DiffWire, a novel framework for graph rewiring in MPNNs that is principled, fully differentiable, and parameter-free, by leveraging the Lov\'asz bound. The proposed approach provides a unified theory for graph rewiring through two new, complementary layers in MPNNs: CT-Layer, a layer that learns commute times and uses them as a relevance function for edge re-weighting; and GAP-Layer, a layer that optimizes the spectral gap, depending on the nature of the network and the task at hand. We empirically validate the value of each of these layers separately with benchmark datasets for graph classification. We also perform preliminary studies on the use of CT-Layer for homophilic and heterophilic node classification tasks. DiffWire links the learnability of commute times to related definitions of curvature, opening the door to the development of more expressive MPNNs.  ( 3 min )
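    As background for CT-Layer, exact commute times have a closed form in terms of the pseudoinverse of the graph Laplacian: $CT(u,v) = \mathrm{vol}(G)\,(L^+_{uu} + L^+_{vv} - 2L^+_{uv})$. The numpy sketch below computes this directly on a toy graph; the paper's CT-Layer instead learns a differentiable approximation rather than forming the pseudoinverse.

        import numpy as np

        A = np.array([[0, 1, 1, 0],
                      [1, 0, 1, 0],
                      [1, 1, 0, 1],
                      [0, 0, 1, 0]], dtype=float)  # adjacency of a small graph
        L = np.diag(A.sum(1)) - A                   # combinatorial Laplacian
        L_pinv = np.linalg.pinv(L)
        vol = A.sum()                               # vol(G) = sum of degrees

        def commute_time(u, v):
            # CT(u, v) = vol(G) * (L+_uu + L+_vv - 2 L+_uv)
            return vol * (L_pinv[u, u] + L_pinv[v, v] - 2 * L_pinv[u, v])

        print(commute_time(0, 1), commute_time(0, 3))
        # The poorly connected pair (0, 3) has the larger commute time, which is
        # why commute times are a natural relevance signal for edge re-weighting.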
    Reward Poisoning Attacks on Offline Multi-Agent Reinforcement Learning. (arXiv:2206.01888v3 [cs.LG] UPDATED)
    In offline multi-agent reinforcement learning (MARL), agents estimate policies from a given dataset. We study reward-poisoning attacks in this setting where an exogenous attacker modifies the rewards in the dataset before the agents see the dataset. The attacker wants to guide each agent into a nefarious target policy while minimizing the $L^p$ norm of the reward modification. Unlike attacks on single-agent RL, we show that the attacker can install the target policy as a Markov Perfect Dominant Strategy Equilibrium (MPDSE), which rational agents are guaranteed to follow. This attack can be significantly cheaper than separate single-agent attacks. We show that the attack works on various MARL agents including uncertainty-aware learners, and we exhibit linear programs to efficiently solve the attack problem. We also study the relationship between the structure of the datasets and the minimal attack cost. Our work paves the way for studying defense in offline MARL.  ( 2 min )
    Formalising the Robustness of Counterfactual Explanations for Neural Networks. (arXiv:2208.14878v2 [cs.LG] UPDATED)
    The use of counterfactual explanations (CFXs) is an increasingly popular explanation strategy for machine learning models. However, recent studies have shown that these explanations may not be robust to changes in the underlying model (e.g., following retraining), which raises questions about their reliability in real-world applications. Existing attempts towards solving this problem are heuristic, and the robustness to model changes of the resulting CFXs is evaluated with only a small number of retrained models, failing to provide exhaustive guarantees. To remedy this, we propose {\Delta}-robustness, the first notion to formally and deterministically assess the robustness (to model changes) of CFXs for neural networks. We introduce an abstraction framework based on interval neural networks to verify the {\Delta}-robustness of CFXs against a possibly infinite set of changes to the model parameters, i.e., weights and biases. We then demonstrate the utility of this approach in two distinct ways. First, we analyse the {\Delta}-robustness of a number of CFX generation methods from the literature and show that they uniformly exhibit significant deficiencies in this regard. Second, we demonstrate how embedding {\Delta}-robustness within existing methods can provide CFXs which are provably robust.  ( 2 min )
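    The basic primitive behind an interval neural network is sound interval arithmetic for an affine layer whose weights, biases, and inputs all lie in boxes. The Python sketch below shows this single building block using the center/radius form; it illustrates the abstraction idea only and is not the paper's complete {\Delta}-robustness verifier.

        import numpy as np

        def affine_interval(W_lo, W_hi, b_lo, b_hi, x_lo, x_hi):
            # Sound bounds on y = W x + b when W, b, and x each lie in a box.
            Wc, Wr = (W_hi + W_lo) / 2, (W_hi - W_lo) / 2
            xc, xr = (x_hi + x_lo) / 2, (x_hi - x_lo) / 2
            yc = Wc @ xc + (b_hi + b_lo) / 2
            yr = np.abs(Wc) @ xr + Wr @ np.abs(xc) + Wr @ xr + (b_hi - b_lo) / 2
            return yc - yr, yc + yr

        x = np.array([0.9, -0.1])                 # a candidate counterfactual
        W, b = np.array([[1.0, -2.0]]), np.zeros(1)
        eps = 0.05                                # radius of allowed model changes
        lo, hi = affine_interval(W - eps, W + eps, b - eps, b + eps, x, x)
        # If the sign of the logit is constant over [lo, hi], the CFX's label
        # survives every model in the interval, i.e. it is robust for this layer.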
    Architectural Optimization over Subgroups for Equivariant Neural Networks. (arXiv:2210.05484v2 [cs.LG] UPDATED)
    Incorporating equivariance to symmetry groups as a constraint during neural network training can improve performance and generalization for tasks exhibiting those symmetries, but such symmetries are often neither perfectly nor explicitly present. This motivates algorithmically optimizing the architectural constraints imposed by equivariance. We propose the equivariance relaxation morphism, which preserves functionality while reparameterizing a group-equivariant layer to operate with equivariance constraints on a subgroup, as well as the [G]-mixed equivariant layer, which mixes layers constrained to different groups to enable within-layer equivariance optimization. We further present evolutionary and differentiable neural architecture search (NAS) algorithms that utilize these mechanisms respectively for equivariance-aware architectural optimization. Experiments across a variety of datasets show the benefit of dynamically constrained equivariance for finding effective architectures with approximate equivariance.  ( 2 min )
    Visual Classification via Description from Large Language Models. (arXiv:2210.07183v2 [cs.CV] UPDATED)
    Vision-language models (VLMs) such as CLIP have shown promising performance on a variety of recognition tasks using the standard zero-shot classification procedure -- computing similarity between the query image and the embedded words for each category. By only using the category name, they neglect the rich context of additional information that language affords. The procedure gives no intermediate understanding of why a category is chosen, and furthermore provides no mechanism for adjusting the criteria used towards this decision. We present an alternative framework for classification with VLMs, which we call classification by description. We ask VLMs to check for descriptive features rather than broad categories: to find a tiger, look for its stripes; its claws; and more. By basing decisions on these descriptors, we can provide additional cues that encourage using the features we want to be used. In the process, we can get a clear idea of what features the model uses to construct its decision; it gains some level of inherent explainability. We query large language models (e.g., GPT-3) for these descriptors to obtain them in a scalable way. Extensive experiments show our framework has numerous advantages beyond interpretability. We show improvements in accuracy on ImageNet across distribution shifts; demonstrate the ability to adapt VLMs to recognize concepts unseen during training; and illustrate how descriptors can be edited to effectively mitigate bias compared to the baseline.  ( 2 min )
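    The scoring rule is straightforward to sketch: instead of one prompt per category, average the image's similarity over several descriptor prompts per category. Below is a Python sketch using OpenAI's clip package; the descriptors are hardcoded for illustration (the paper obtains them from a large language model such as GPT-3), and the image file name is a placeholder.

        import torch, clip
        from PIL import Image

        device = "cuda" if torch.cuda.is_available() else "cpu"
        model, preprocess = clip.load("ViT-B/32", device=device)

        # Hardcoded stand-ins for LLM-generated descriptors.
        descriptors = {
            "tiger": ["a photo of a tiger, which has black stripes",
                      "a photo of a tiger, which has large claws"],
            "zebra": ["a photo of a zebra, which has black and white stripes",
                      "a photo of a zebra, which has a mane"],
        }

        image = preprocess(Image.open("query.jpg")).unsqueeze(0).to(device)
        with torch.no_grad():
            img_feat = model.encode_image(image)
            img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
            scores = {}
            for cls, descs in descriptors.items():
                txt = model.encode_text(clip.tokenize(descs).to(device))
                txt = txt / txt.norm(dim=-1, keepdim=True)
                scores[cls] = (img_feat @ txt.T).mean().item()  # average over descriptors
        print(max(scores, key=scores.get))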
    Enhance Sample Efficiency and Robustness of End-to-end Urban Autonomous Driving via Semantic Masked World Model. (arXiv:2210.04017v2 [cs.LG] UPDATED)
    End-to-end autonomous driving provides a feasible way to automatically maximize overall driving system performance by directly mapping the raw pixels from a front-facing camera to control signals. Recent advanced methods construct a latent world model to map the high-dimensional observations into a compact latent space. However, the latent states embedded by the world models proposed in previous works may contain a large amount of task-irrelevant information, resulting in low sample efficiency and poor robustness to input perturbations. Meanwhile, the training data distribution is usually unbalanced, and the learned policy struggles to cope with corner cases during driving. To solve the above challenges, we present a semantic masked recurrent world model (SEM2), which introduces a latent filter to extract key task-relevant features and reconstruct a semantic mask via the filtered features, and which is trained with a multi-source data sampler that aggregates common data and multiple corner-case data in a single batch to balance the data distribution. Extensive experiments on CARLA show that our method outperforms state-of-the-art approaches in terms of sample efficiency and robustness to input perturbations.  ( 2 min )
    Frequency of Interest-based Noise Attenuation Method to Improve Anomaly Detection Performance. (arXiv:2210.11068v2 [cs.LG] UPDATED)
    Accurately extracting driving events is key to maximizing computational efficiency and anomaly detection performance in the tire friction noise-based anomaly detection task. This study proposes a concise and highly practical method for improving the precision of event extraction, which is otherwise hindered by extraneous noise such as wind noise that is difficult to characterize clearly due to its randomness. The core of the proposed method is the identification of the road friction sound corresponding to the frequency of interest and the removal of the opposite characteristics with several frequency filters. Our method maximizes the precision of driving event extraction while improving anomaly detection performance by an average of 8.506%. We therefore conclude that our method is a practical solution for road surface anomaly detection in outdoor edge computing environments.  ( 2 min )
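    The core operation, keeping only the band where the road friction sound lives and filtering everything else out, can be sketched with a standard Butterworth band-pass filter in Python. The sampling rate and band edges below are hypothetical placeholders, not the paper's values.

        import numpy as np
        from scipy.signal import butter, filtfilt

        fs = 16000                    # sampling rate (Hz); illustrative
        lo_hz, hi_hz = 300.0, 1500.0  # hypothetical frequency-of-interest band

        def keep_frequency_of_interest(x, fs, lo_hz, hi_hz, order=4):
            # Band-pass the signal so only the band of interest survives.
            b, a = butter(order, [lo_hz / (fs / 2), hi_hz / (fs / 2)], btype="band")
            return filtfilt(b, a, x)  # zero-phase filtering avoids smearing events

        t = np.arange(fs) / fs
        x = np.sin(2 * np.pi * 700 * t) + 0.5 * np.random.randn(fs)  # tone + wind-like noise
        x_clean = keep_frequency_of_interest(x, fs, lo_hz, hi_hz)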
    Explainable Reinforcement Learning via Model Transforms. (arXiv:2209.12006v2 [cs.AI] UPDATED)
    Understanding the emerging behaviors of reinforcement learning (RL) agents may be difficult, since such agents are often trained in complex environments using highly complex decision making procedures. This has given rise to a variety of approaches to explainability in RL that aim to reconcile discrepancies that may arise between the behavior of an agent and the behavior that is anticipated by an observer. Most recent approaches have relied on domain knowledge that may not always be available, on an analysis of the agent's policy, or on an analysis of specific elements of the underlying environment, typically modeled as a Markov Decision Process (MDP). Our key claim is that even if the underlying model is not fully known (e.g., the transition probabilities have not been accurately learned) or is not maintained by the agent (i.e., when using model-free methods), the model can nevertheless be exploited to automatically generate explanations. For this purpose, we suggest using formal MDP abstractions and transforms, previously used in the literature for expediting the search for optimal policies, to automatically produce explanations. Since such transforms are typically based on a symbolic representation of the environment, they can provide meaningful explanations for gaps between the anticipated and actual agent behavior. We formally define the explainability problem, suggest a class of transforms that can be used for explaining emergent behaviors, and suggest methods that enable efficient search for an explanation. We demonstrate the approach on a set of standard benchmarks.  ( 2 min )
    Boosting Sensitivity of Large-scale Online Experimentation via Dropout Buyer Imputation. (arXiv:2209.06125v2 [cs.LG] UPDATED)
    In online experimentation, appropriate metrics (e.g., purchase) provide strong evidence to support hypotheses and enhance the decision-making process. However, incomplete metrics frequently occur in online experimentation, leaving far less usable data than planned for the online experiments (e.g., A/B testing). In this work, we introduce the concept of dropout buyers and categorize users with incomplete metric values into two groups: visitors and dropout buyers. For the analysis of incomplete metrics, we propose a clustering-based imputation method using $k$-nearest neighbors. Our proposed imputation method considers both the experiment-specific features and users' activities along their shopping paths, allowing different imputation values for different users. To facilitate efficient imputation of large-scale data sets in online experimentation, the proposed method uses a combination of stratification and clustering. The performance of the proposed method is compared to several conventional methods in both simulation studies and a real online experiment at eBay.  ( 2 min )
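    A loose stand-in for this pipeline can be written with scikit-learn: cluster users on their observed features, then run $k$-nearest-neighbor imputation within each cluster so different users receive different imputed values. The sketch below uses synthetic data and is not the authors' exact stratification or feature set.

        import numpy as np
        from sklearn.cluster import KMeans
        from sklearn.impute import KNNImputer

        rng = np.random.default_rng(0)
        X = rng.normal(size=(1000, 4))            # user features along the shopping path
        metric = X[:, 0] + rng.normal(size=1000)  # e.g. purchase amount
        metric[rng.random(1000) < 0.3] = np.nan   # dropout buyers: metric unobserved

        # Cluster on the fully observed features, then impute within each
        # cluster, so similar users drive each imputed value.
        labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)
        filled = metric.copy()
        for c in range(5):
            idx = labels == c
            block = np.column_stack([X[idx], metric[idx]])
            filled[idx] = KNNImputer(n_neighbors=5).fit_transform(block)[:, -1]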
    Quantum Computing Methods for Supply Chain Management. (arXiv:2209.08246v2 [quant-ph] UPDATED)
    Quantum computing is expected to have a transformative influence on many domains, but its practical deployment on industry problems remains underexplored. We focus on applying quantum computing to operations management problems in industry, in particular supply chain management. Many problems in supply chain management involve large state and action spaces and pose computational challenges on classical computers. We develop a quantized policy iteration algorithm to solve an inventory control problem and demonstrate its effectiveness. We also discuss in depth the hardware requirements and potential challenges of implementing this quantum algorithm in the near term. Our simulations and experiments are powered by \texttt{IBM Qiskit} and the \texttt{qBraid} system.  ( 2 min )
    CENN: Conservative energy method based on neural networks with subdomains for solving variational problems involving heterogeneous and complex geometries. (arXiv:2110.01359v5 [math.NA] UPDATED)
    We propose a conservative energy method based on neural networks with subdomains for solving variational problems (CENN), in which the admissible function, satisfying the essential boundary condition without a boundary penalty, is constructed from a radial basis function (RBF), a particular-solution neural network, and a general neural network. The loss term is the potential energy, optimized based on the principle of minimum potential energy. The loss term at the interfaces has lower-order derivatives compared with the strong-form PINN with subdomains. The advantages of the proposed method are higher efficiency, greater accuracy, and fewer hyperparameters than the strong-form PINN with subdomains. A further advantage is that it can be applied to complex geometries thanks to the special construction of the admissible function. To analyze its performance, CENN is used to model representative PDEs, including problems with strong discontinuities, singularities, complex boundaries, non-linearity, and heterogeneity. Furthermore, it outperforms other methods when dealing with heterogeneous problems.  ( 2 min )
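    The penalty-free idea can be illustrated in one dimension: write the trial solution as u(x) = g(x) + d(x) N(x), where g matches the essential boundary condition exactly and d vanishes on the boundary, so the constraint holds by construction and only the potential energy needs to be minimized. The PyTorch sketch below uses a hand-picked g and d on the interval [0, 1]; the paper instead builds these pieces with RBFs and a particular-solution network so the construction extends to complex geometries, which this sketch does not attempt.

        import torch
        import torch.nn as nn

        net = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))

        def admissible_u(x, u0=0.0, u1=1.0):
            # u(0) = u0 and u(1) = u1 hold by construction: g satisfies the
            # essential BC exactly and d(x) = x (1 - x) vanishes on the boundary,
            # so no boundary penalty term is needed in the loss.
            g = u0 + (u1 - u0) * x
            d = x * (1.0 - x)
            return g + d * net(x)

        x = torch.rand(128, 1, requires_grad=True)
        u = admissible_u(x)
        du = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
        energy = 0.5 * (du ** 2).mean()  # potential-energy loss, minimised directly
        energy.backward()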
    Learning Multi-Agent Coordination through Connectivity-driven Communication. (arXiv:2002.05233v4 [cs.LG] UPDATED)
    In artificial multi-agent systems, the ability to learn collaborative policies is predicated upon the agents' communication skills: they must be able to encode the information received from the environment and learn how to share it with other agents as required by the task at hand. We present a deep reinforcement learning approach, Connectivity Driven Communication (CDC), that facilitates the emergence of multi-agent collaborative behaviour purely through experience. The agents are modelled as nodes of a weighted graph whose state-dependent edges encode pair-wise messages that can be exchanged. We introduce a graph-dependent attention mechanism that controls how the agents' incoming messages are weighted. This mechanism takes into full account the current state of the system as represented by the graph, and builds upon a diffusion process that captures how information flows on the graph. The graph topology is not assumed to be known a priori, but depends dynamically on the agents' observations, and is learnt concurrently with the attention mechanism and policy in an end-to-end fashion. Our empirical results show that CDC is able to learn effective collaborative policies and can outperform competing learning algorithms on cooperative navigation tasks.  ( 2 min )
    Unsupervised Learning under Latent Label Shift. (arXiv:2207.13179v2 [cs.LG] UPDATED)
    What sorts of structure might enable a learner to discover classes from unlabeled data? Traditional approaches rely on feature-space similarity and heroic assumptions on the data. In this paper, we introduce unsupervised learning under Latent Label Shift (LLS), where we have access to unlabeled data from multiple domains such that the label marginals $p_d(y)$ can shift across domains but the class conditionals $p(\mathbf{x}|y)$ do not. This work instantiates a new principle for identifying classes: elements that shift together group together. For finite input spaces, we establish an isomorphism between LLS and topic modeling: inputs correspond to words, domains to documents, and labels to topics. Addressing continuous data, we prove that when each label's support contains a separable region, analogous to an anchor word, oracle access to $p(d|\mathbf{x})$ suffices to identify $p_d(y)$ and $p_d(y|\mathbf{x})$ up to permutation. Thus motivated, we introduce a practical algorithm that leverages domain-discriminative models as follows: (i) push examples through domain discriminator $p(d|\mathbf{x})$; (ii) discretize the data by clustering examples in $p(d|\mathbf{x})$ space; (iii) perform non-negative matrix factorization on the discrete data; (iv) combine the recovered $p(y|d)$ with the discriminator outputs $p(d|\mathbf{x})$ to compute $p_d(y|x) \; \forall d$. With semi-synthetic experiments, we show that our algorithm can leverage domain information to improve upon competitive unsupervised classification methods. We reveal a failure mode of standard unsupervised classification methods when feature-space similarity does not indicate true groupings, and show empirically that our method better handles this case. Our results establish a deep connection between distribution shift and topic modeling, opening promising lines for future work.  ( 2 min )
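    Steps (i)-(iii) of this recipe map directly onto standard scikit-learn components; the sketch below runs them on synthetic data and then uses a simplified stand-in for step (iv), reading each example's label posterior off its cluster rather than performing the paper's exact combination of $p(y|d)$ with $p(d|\mathbf{x})$. The number of latent classes is assumed known here.

        import numpy as np
        from sklearn.linear_model import LogisticRegression
        from sklearn.cluster import KMeans
        from sklearn.decomposition import NMF

        rng = np.random.default_rng(0)
        X = rng.normal(size=(2000, 10))    # inputs; true labels never observed
        d = rng.integers(0, 3, size=2000)  # domain index of each example

        # (i) push examples through a domain discriminator p(d|x)
        P = LogisticRegression(max_iter=1000).fit(X, d).predict_proba(X)

        # (ii) discretize by clustering in p(d|x) space
        clusters = KMeans(n_clusters=20, n_init=10, random_state=0).fit_predict(P)

        # (iii) non-negative matrix factorization of the cluster-by-domain counts
        M = np.zeros((20, 3))
        for c, dom in zip(clusters, d):
            M[c, dom] += 1
        W = NMF(n_components=2, random_state=0).fit_transform(M)  # 2 assumed classes

        # (iv), simplified: normalize W rows into p(y|cluster) and give each
        # example its cluster's posterior; the paper instead combines the
        # recovered p(y|d) with the discriminator outputs p(d|x).
        p_y_given_cluster = W / (W.sum(1, keepdims=True) + 1e-12)
        p_y_given_x = p_y_given_cluster[clusters]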
    Random Graph Embedding and Joint Sparse Regularization for Multi-label Feature Selection. (arXiv:2204.06445v2 [stat.ML] UPDATED)
    Multi-label learning is often used to mine the correlations between variables and multiple labels, and research in this area focuses on fully extracting the information between variables and labels. The $\ell_{2,1}$ regularization is often used to obtain a sparse coefficient matrix, but it cannot effectively address multicollinearity among variables. In this paper, we propose a model that selects the most relevant variables by solving a jointly constrained optimization problem combining the $\ell_{2,1}$ regularization and Frobenius regularization. For manifold regularization, we adopt a random-walk strategy based on the joint structure to construct a neighborhood graph, which is highly robust to outliers. In addition, we give an iterative algorithm for the proposed method and prove its convergence. Experiments on real-world data sets also show that the comprehensive performance of our method is consistently better than that of classical methods.  ( 2 min )
    Neighborhood-aware Scalable Temporal Network Representation Learning. (arXiv:2209.01084v3 [cs.LG] UPDATED)
    Temporal networks have been widely used to model real-world complex systems such as financial systems and e-commerce systems. In a temporal network, the joint neighborhood of a set of nodes often provides crucial structural information useful for predicting whether they may interact at a certain time. However, recent representation learning methods for temporal networks often fail to extract such information or depend on online construction of structural features, which is time-consuming. To address this issue, this work proposes the Neighborhood-Aware Temporal network model (NAT). For each node in the network, NAT abandons the commonly used single-vector-based representation and adopts a novel dictionary-type neighborhood representation. Such a dictionary representation records a downsampled set of the neighboring nodes as keys and allows fast construction of structural features for a joint neighborhood of multiple nodes. We also design a dedicated data structure termed N-cache to support parallel access and updates of those dictionary representations on GPUs. NAT is evaluated on seven real-world large-scale temporal networks. NAT not only outperforms all cutting-edge baselines by an average of 1.2% and 4.2% in transductive and inductive link prediction accuracy, respectively, but also remains scalable, achieving a speed-up of 4.1-76.7x against the baselines that adopt joint structural features and a speed-up of 1.6-4.0x against the baselines that cannot adopt those features. The code is available at https://github.com/Graph-COM/Neighborhood-Aware-Temporal-Network.  ( 2 min )
    Enabling Homogeneous GNNs to Handle Heterogeneous Graphs via Relation Embedding. (arXiv:2209.11414v2 [cs.LG] UPDATED)
    Graph Neural Networks (GNNs) have been generalized to process heterogeneous graphs by various approaches. Unfortunately, these approaches usually model the heterogeneity via complicated modules. This paper proposes a simple yet effective framework that endows homogeneous GNNs with adequate ability to handle heterogeneous graphs. Specifically, we propose the Relation Embedding based Graph Neural Network (RE-GNN), which employs only one parameter per relation to embed the importance of distinct types of relations and node-type-specific self-loop connections. To optimize these relation embeddings and the model parameters simultaneously, a gradient scaling factor is proposed to constrain the embeddings to converge to suitable values. Besides, we interpret the proposed RE-GNN from two perspectives and theoretically demonstrate that our RE-GCN possesses more expressive power than GTN (a typical heterogeneous GNN that can generate meta-paths adaptively). Extensive experiments demonstrate that our RE-GNN can effectively and efficiently handle heterogeneous graphs and can be applied to various homogeneous GNNs.  ( 2 min )
    Attribute-based Representations for Accurate and Interpretable Video Anomaly Detection. (arXiv:2212.00789v1 [cs.CV])
    Video anomaly detection (VAD) is a challenging computer vision task with many practical applications. As anomalies are inherently ambiguous, it is essential for users to understand the reasoning behind a system's decision in order to determine whether the rationale is sound. In this paper, we propose a simple but highly effective method that pushes the boundaries of VAD accuracy and interpretability using attribute-based representations. Our method represents every object by its velocity and pose. The anomaly scores are computed using a density-based approach. Surprisingly, we find that this simple representation is sufficient to achieve state-of-the-art performance on ShanghaiTech, the largest and most complex VAD dataset. Combining our interpretable attribute-based representations with implicit, deep representations yields state-of-the-art performance with a $99.1\%, 93.3\%$, and $85.9\%$ AUROC on Ped2, Avenue, and ShanghaiTech, respectively. Our method is accurate, interpretable, and easy to implement.  ( 2 min )
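    The density-based scoring step can be sketched in a few lines: fit a density model to per-object attribute vectors (velocity and pose features) collected from normal training video, then score test objects by their negative log-density. The scikit-learn sketch below uses a Gaussian mixture on synthetic attributes; the attribute extraction itself, and the paper's exact density estimator, are not reproduced.

        import numpy as np
        from sklearn.mixture import GaussianMixture

        # Rows are per-object attribute vectors, e.g. [speed, direction, pose_1, ...],
        # extracted from normal training videos (extraction itself is omitted here).
        rng = np.random.default_rng(0)
        train_attrs = rng.normal(0.0, 1.0, size=(5000, 4))

        density = GaussianMixture(n_components=8, random_state=0).fit(train_attrs)

        def anomaly_score(attrs):
            # Higher score = less likely under the normal-behaviour density.
            return -density.score_samples(attrs)

        test_attrs = np.array([[0.1, 0.2, -0.3, 0.0],   # ordinary object
                               [6.0, 5.5,  4.0, 7.0]])  # fast/unusual object
        print(anomaly_score(test_attrs))                # second row scores higher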
    Fully-Dynamic Decision Trees. (arXiv:2212.00778v1 [cs.LG])
    We develop the first fully dynamic algorithm that maintains a decision tree over an arbitrary sequence of insertions and deletions of labeled examples. Given $\epsilon > 0$, our algorithm guarantees that, at every point in time, every node of the decision tree uses a split with Gini gain within an additive $\epsilon$ of the optimum. For real-valued features the algorithm has an amortized running time per insertion/deletion of $O\big(\frac{d \log^3 n}{\epsilon^2}\big)$, which improves to $O\big(\frac{d \log^2 n}{\epsilon}\big)$ for binary or categorical features, while it uses space $O(n d)$, where $n$ is the maximum number of examples at any point in time and $d$ is the number of features. Our algorithm is nearly optimal, as we show that any algorithm with similar guarantees requires amortized running time $\Omega(d)$ and space $\tilde{\Omega} (n d)$. We complement our theoretical results with an extensive experimental evaluation on real-world data, showing the effectiveness of our algorithm.  ( 2 min )
    Neural Representations Reveal Distinct Modes of Class Fitting in Residual Convolutional Networks. (arXiv:2212.00771v1 [cs.LG])
    We leverage probabilistic models of neural representations to investigate how residual networks fit classes. To this end, we estimate class-conditional density models for representations learned by deep ResNets. We then use these models to characterize distributions of representations across the learned classes. Surprisingly, we find that classes in the investigated models are not fitted in a uniform way. On the contrary, we uncover two groups of classes that are fitted with markedly different distributions of representations. These distinct modes of class-fitting are evident only in the deeper layers of the investigated models, indicating that they are not related to low-level image features. We show that the uncovered structure in neural representations correlates with memorization of training examples and adversarial robustness. Finally, we compare class-conditional distributions of neural representations between memorized and typical examples. This allows us to uncover where in the network structure class labels arise for memorized and standard inputs.  ( 2 min )
    Uniform versus uncertainty sampling: When being active is less efficient than staying passive. (arXiv:2212.00772v1 [cs.LG])
    It is widely believed that given the same labeling budget, active learning algorithms like uncertainty sampling achieve better predictive performance than passive learning (i.e. uniform sampling), albeit at a higher computational cost. Recent empirical evidence suggests that this added cost might be in vain, as uncertainty sampling can sometimes perform even worse than passive learning. While existing works offer different explanations in the low-dimensional regime, this paper shows that the underlying mechanism is entirely different in high dimensions: we prove for logistic regression that passive learning outperforms uncertainty sampling even for noiseless data and when using the uncertainty of the Bayes optimal classifier. Insights from our proof indicate that this high-dimensional phenomenon is exacerbated when the separation between the classes is small. We corroborate this intuition with experiments on 20 high-dimensional datasets spanning a diverse range of applications, from finance and histology to chemistry and computer vision.  ( 2 min )
    Improving Zero-Shot Models with Label Distribution Priors. (arXiv:2212.00784v1 [cs.CV])
    Labeling large image datasets with attributes such as facial age or object type is tedious and sometimes infeasible. Supervised machine learning methods provide a highly accurate solution, but require manual labels which are often unavailable. Zero-shot models (e.g., CLIP) do not require manual labels but are not as accurate as supervised ones, particularly when the attribute is numeric. We propose a new approach, CLIPPR (CLIP with Priors), which adapts zero-shot models for regression and classification on unlabelled datasets. Our method does not use any annotated images. Instead, we assume a prior over the label distribution in the dataset. We then train an adapter network on top of CLIP under two competing objectives: i) minimal change of predictions from the original CLIP model, and ii) minimal distance between the predicted and prior distributions of labels. Additionally, we present a novel approach for selecting prompts for vision & language models using a distributional prior. Our method is effective and presents a significant improvement over the original model. We demonstrate an improvement of 28% in mean absolute error on the UTK age regression task. We also present promising results for classification benchmarks, improving the classification accuracy on the ImageNet dataset by 2.83%, without using any labels.  ( 2 min )
    ColBERT: Using BERT Sentence Embedding in Parallel Neural Networks for Computational Humor. (arXiv:2004.12765v7 [cs.CL] UPDATED)
    Automation of humor detection and rating has interesting use cases in modern technologies, such as humanoid robots, chatbots, and virtual assistants. In this paper, we propose a novel approach for detecting and rating humor in short texts based on a popular linguistic theory of humor. The proposed method begins by separating the sentences of the given text and utilizing the BERT model to generate an embedding for each one. The embeddings are fed to separate lines of hidden layers in a neural network (one line for each sentence) to extract latent features. Finally, the parallel lines are concatenated to determine the congruity and other relationships between the sentences and to predict the target value. We accompany the paper with a novel dataset for humor detection consisting of 200,000 formal short texts. In addition to evaluating our work on this dataset, we participated in a live machine learning competition focused on rating humor in Spanish tweets. The proposed model obtained F1 scores of 0.982 and 0.869 in the humor detection experiments, outperforming general and state-of-the-art models. The evaluations performed in these two contrasting settings confirm the strength and robustness of the model and suggest two important factors in achieving high accuracy in the current task: 1) usage of sentence embeddings and 2) utilizing the linguistic structure of humor in designing the proposed model.  ( 2 min )
    Second-order optimization with lazy Hessians. (arXiv:2212.00781v1 [math.OC])
    We analyze Newton's method with lazy Hessian updates for solving general possibly non-convex optimization problems. We propose to reuse a previously seen Hessian for several iterations while computing new gradients at each step of the method. This significantly reduces the overall arithmetical complexity of second-order optimization schemes. By using the cubic regularization technique, we establish fast global convergence of our method to a second-order stationary point, while the Hessian does not need to be updated each iteration. For convex problems, we justify global and local superlinear rates for lazy Newton steps with quadratic regularization, which is easier to compute. The optimal frequency for updating the Hessian is once every $d$ iterations, where $d$ is the dimension of the problem. This provably improves the total arithmetical complexity of second-order algorithms by a factor $\sqrt{d}$.  ( 2 min )
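    The schedule itself is easy to picture: recompute the (inverse) Hessian only once every $d$ iterations while refreshing the gradient at every step. The numpy sketch below runs plain lazy Newton on a toy smooth convex function; the cubic and quadratic regularization used for the paper's guarantees is omitted.

        import numpy as np

        def grad(x):
            return x ** 3 + x                  # gradient of f(x) = sum(x^4/4 + x^2/2)

        def hess(x):
            return np.diag(3 * x ** 2 + 1.0)   # Hessian, recomputed only lazily

        def lazy_newton(x, n_iter=20):
            d = len(x)                         # suggested frequency: once every d steps
            H_inv = None
            for t in range(n_iter):
                if t % d == 0:                 # lazy Hessian update
                    H_inv = np.linalg.inv(hess(x))
                x = x - H_inv @ grad(x)        # fresh gradient at every step
            return x

        x = lazy_newton(np.array([2.0, -1.5, 0.7]))
        print(np.linalg.norm(grad(x)))         # approaches 0 at the minimizer x* = 0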
    Score Jacobian Chaining: Lifting Pretrained 2D Diffusion Models for 3D Generation. (arXiv:2212.00774v1 [cs.CV])
    A diffusion model learns to predict a vector field of gradients. We propose to apply the chain rule to the learned gradients, and back-propagate the score of a diffusion model through the Jacobian of a differentiable renderer, which we instantiate as a voxel radiance field. This setup aggregates 2D scores at multiple camera viewpoints into a 3D score, and repurposes a pretrained 2D model for 3D data generation. We identify a technical challenge of distribution mismatch that arises in this application, and propose a novel estimation mechanism to resolve it. We run our algorithm on several off-the-shelf diffusion image generative models, including the recently released Stable Diffusion trained on the large-scale LAION dataset.  ( 2 min )
    Simplifying and Understanding State Space Models with Diagonal Linear RNNs. (arXiv:2212.00768v1 [cs.LG])
    Sequence models based on linear state spaces (SSMs) have recently emerged as a promising choice of architecture for modeling long range dependencies across various modalities. However, they invariably rely on discretization of a continuous state space, which complicates their presentation and understanding. In this work, we dispose of the discretization step, and propose a model based on vanilla Diagonal Linear RNNs ($\mathrm{DLR}$). We empirically show that $\mathrm{DLR}$ is as performant as previously-proposed SSMs in the presence of strong supervision, despite being conceptually much simpler. Moreover, we characterize the expressivity of SSMs (including $\mathrm{DLR}$) and attention-based models via a suite of $13$ synthetic sequence-to-sequence tasks involving interactions over tens of thousands of tokens, ranging from simple operations, such as shifting an input sequence, to detecting co-dependent visual features over long spatial ranges in flattened images. We find that while SSMs report near-perfect performance on tasks that can be modeled via $\textit{few}$ convolutional kernels, they struggle on tasks requiring $\textit{many}$ such kernels and especially when the desired sequence manipulation is $\textit{context-dependent}$. For example, $\mathrm{DLR}$ learns to perfectly shift a $0.5M$-long input by an arbitrary number of positions but fails when the shift size depends on context. Despite these limitations, $\mathrm{DLR}$ reaches high performance on two higher-order reasoning tasks $\mathrm{ListOpsSubTrees}$ and $\mathrm{PathfinderSegmentation}\text{-}\mathrm{256}$ with input lengths $8K$ and $65K$ respectively, and gives encouraging performance on $\mathrm{PathfinderSegmentation}\text{-}\mathrm{512}$ with input length $262K$ for which attention is not a viable choice.  ( 2 min )
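    The model class itself is a one-line recurrence: a complex diagonal transition $x_t = \lambda \odot x_{t-1} + B u_t$ with a linear readout $y_t = \mathrm{Re}(C x_t)$, and no discretization step. Below is a minimal numpy sketch with illustrative sizes and a stable random initialization; the paper's training setup and parameterization details are not reproduced.

        import numpy as np

        rng = np.random.default_rng(0)
        n, d_in, d_out = 64, 4, 2  # state / input / output sizes (illustrative)
        # Complex diagonal transition with |lambda| < 1 for stability.
        lam = 0.99 * np.exp(1j * rng.uniform(0, np.pi, n))
        B = rng.normal(size=(n, d_in)) + 0j
        C = rng.normal(size=(d_out, n)) + 0j

        def dlr_forward(u):
            # u: (T, d_in) -> y: (T, d_out); x_t = lam * x_{t-1} + B u_t.
            x = np.zeros(n, dtype=complex)
            ys = []
            for u_t in u:
                x = lam * x + B @ u_t        # diagonal recurrence, no discretization
                ys.append((C @ x).real)
            return np.stack(ys)

        y = dlr_forward(rng.normal(size=(100, d_in)))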
    Exploiting Socially-Aware Tasks for Embodied Social Navigation. (arXiv:2212.00767v1 [cs.CV])
    Learning how to navigate among humans in an occluded and spatially constrained indoor environment is a key ability required for embodied agents to be integrated into our society. In this paper, we propose an end-to-end architecture that exploits Socially-Aware Tasks (referred to as Risk and Social Compass) to inject into a reinforcement learning navigation policy the ability to infer common-sense social behaviors. To this end, our tasks exploit the notion of immediate and future danger of collision. Furthermore, we propose an evaluation protocol specifically designed for the Social Navigation Task in simulated environments. This is done to capture fine-grained features and characteristics of the policy by analyzing the minimal unit of human-robot spatial interaction, called an Encounter. We validate our approach on the Gibson4+ and Habitat-Matterport3D datasets.  ( 2 min )
    P(Expression|Grammar): Probability of deriving an algebraic expression with a probabilistic context-free grammar. (arXiv:2212.00751v1 [cs.FL])
    Probabilistic context-free grammars have a long record of use as generative models in machine learning and symbolic regression. When used for symbolic regression, they generate algebraic expressions. We define the latter as equivalence classes of strings derived by the grammar, and address the problem of calculating the probability of deriving a given expression with a given grammar. We show that the problem is undecidable in general. We then present specific grammars for generating linear, polynomial, and rational expressions, for which algorithms for calculating the probability of a given expression exist. For those grammars, we design algorithms for calculating the exact probability and for efficient approximation with arbitrary precision.  ( 2 min )
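    For a concrete flavour of the computation, consider the ambiguous toy grammar E -> E "+" E with probability p, and E -> "x" with probability 1 - p. The probability of deriving a flat string such as x+x+x sums over all parse trees; the Python sketch below computes this inside probability by recursion over split points. The grammar and p are illustrative, not the paper's linear, polynomial, or rational grammars.

        from functools import lru_cache

        p = 0.4  # P(E -> E "+" E); then P(E -> "x") = 1 - p

        @lru_cache(maxsize=None)
        def inside(tokens):
            # Total probability that E derives this token string, summed over
            # all parse trees (the inside probability).
            if tokens == ("x",):
                return 1.0 - p
            total = 0.0
            for i, tok in enumerate(tokens):
                if tok == "+":  # try every split at a '+'
                    total += p * inside(tokens[:i]) * inside(tokens[i + 1:])
            return total

        expr = ("x", "+", "x", "+", "x")
        # Two parse trees, each with probability p^2 (1 - p)^3:
        print(inside(expr), 2 * p**2 * (1 - p)**3)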
    High-dimensional density estimation with tensorizing flow. (arXiv:2212.00759v1 [cs.LG])
    We propose the tensorizing flow method for estimating high-dimensional probability density functions from the observed data. The method is based on tensor-train and flow-based generative modeling. Our method first efficiently constructs an approximate density in the tensor-train form via solving the tensor cores from a linear system based on the kernel density estimators of low-dimensional marginals. We then train a continuous-time flow model from this tensor-train density to the observed empirical distribution by performing a maximum likelihood estimation. The proposed method combines the optimization-less feature of the tensor-train with the flexibility of the flow-based generative models. Numerical results are included to demonstrate the performance of the proposed method.  ( 2 min )
    Learning Transition Operators From Sparse Space-Time Samples. (arXiv:2212.00746v1 [cs.IT])
    We consider the nonlinear inverse problem of learning a transition operator $\mathbf{A}$ from partial observations at different times, in particular from sparse observations of entries of its powers $\mathbf{A},\mathbf{A}^2,\cdots,\mathbf{A}^{T}$. This Spatio-Temporal Transition Operator Recovery problem is motivated by the recent interest in learning time-varying graph signals that are driven by graph operators depending on the underlying graph topology. We address the nonlinearity of the problem by embedding it into a higher-dimensional space of suitable block-Hankel matrices, where it becomes a low-rank matrix completion problem, even if $\mathbf{A}$ is of full rank. For both a uniform and an adaptive random space-time sampling model, we quantify the recoverability of the transition operator via suitable measures of incoherence of these block-Hankel embedding matrices. For graph transition operators these measures of incoherence depend on the interplay between the dynamics and the graph topology. We develop a suitable non-convex iterative reweighted least squares (IRLS) algorithm, establish its quadratic local convergence, and show that, in optimal scenarios, no more than $\mathcal{O}(rn \log(nT))$ space-time samples are sufficient to ensure accurate recovery of a rank-$r$ operator $\mathbf{A}$ of size $n \times n$. This establishes that spatial samples can be substituted by a comparable number of space-time samples. We provide an efficient implementation of the proposed IRLS algorithm with space complexity of order $O(r n T)$ and per-iteration time complexity linear in $n$. Numerical experiments for transition operators based on several graph models confirm that the theoretical findings accurately track empirical phase transitions, and illustrate the applicability and scalability of the proposed algorithm.  ( 2 min )
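    A quick numpy check of the key structural fact, under our own illustrative construction: the block-Hankel matrix built from the powers $\mathbf{A}, \mathbf{A}^2, \ldots$ has rank at most $\mathrm{rank}(\mathbf{A})$, which is what turns sparse space-time recovery into low-rank completion.

        import numpy as np

        rng = np.random.default_rng(1)
        n, r, k = 20, 3, 3
        A = 0.1 * rng.normal(size=(n, r)) @ rng.normal(size=(r, n))  # rank-r operator

        # block-Hankel embedding: block (i, j) holds A^(i+j+1)
        H = np.block([[np.linalg.matrix_power(A, i + j + 1) for j in range(k)]
                      for i in range(k)])
        print(np.linalg.matrix_rank(H))  # <= r, although each A^t is a full n x n matrix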
    Edge Deep Learning Enabled Freezing of Gait Detection in Parkinson's Patients. (arXiv:2212.00729v1 [eess.SP])
    This paper presents the design of a wireless sensor network for detecting and alerting the freezing of gait (FoG) symptoms in patients with Parkinson's disease. Three sensor nodes, each integrating a 3-axis accelerometer, can be placed on a patient at the ankle, thigh, and trunk. Each sensor node can independently detect FoG using an on-device deep learning (DL) model, featuring a squeeze-and-excitation convolutional neural network (CNN). In a validation using a public dataset, the prototype developed achieved a FoG detection sensitivity of 88.8% and an F1 score of 85.34%, using fewer than 20k trainable parameters per sensor node. Once FoG is detected, an auditory signal will be generated to alert users, and the alarm signal will also be sent to mobile phones for further actions if needed. The sensor node can be easily recharged wirelessly by inductive coupling. The system is self-contained and processes all user data locally without streaming data to external devices or the cloud, thus eliminating the cybersecurity risks and power penalty associated with wireless data transmission. The developed methodology can be used in a wide range of applications.  ( 2 min )
    Reservoir Computing-based Multi-Symbol Equalization for PAM 4 Short-reach Transmission. (arXiv:2212.00738v1 [eess.SP])
    We propose spectrum-sliced reservoir computer-based (RC) multi-symbol equalization for 32-GBd PAM4 transmission. RC with 17 symbols at the output achieves an order of magnitude reduction in multiplications/symbol versus single output case while maintaining simple training.  ( 2 min )
    Transformer-based Hand Gesture Recognition via High-Density EMG Signals: From Instantaneous Recognition to Fusion of Motor Unit Spike Trains. (arXiv:2212.00743v1 [eess.SP])
    Designing efficient and labor-saving prosthetic hands requires powerful hand gesture recognition algorithms that can achieve high accuracy with limited complexity and latency. In this context, the paper proposes a compact deep learning framework referred to as CT-HGR, which employs a vision transformer network to conduct hand gesture recognition using high-density sEMG (HD-sEMG) signals. The attention mechanism in the proposed model identifies similarities among different data segments with a greater capacity for parallel computations and addresses the memory limitation problems while dealing with inputs of large sequence lengths. CT-HGR can be trained from scratch without any need for transfer learning and can simultaneously extract both temporal and spatial features of HD-sEMG data. Additionally, the CT-HGR framework can perform instantaneous recognition using an sEMG image spatially composed from HD-sEMG signals. A variant of CT-HGR is also designed to incorporate microscopic neural drive information in the form of Motor Unit Spike Trains (MUSTs) extracted from HD-sEMG signals using Blind Source Separation (BSS). This variant is combined with its baseline version via a hybrid architecture to evaluate the potential of fusing macroscopic and microscopic neural drive information. The utilized HD-sEMG dataset involves 128 electrodes that collect the signals related to 65 isometric hand gestures of 20 subjects. The proposed CT-HGR framework is applied to window sizes of 31.25, 62.5, 125, and 250 ms on this dataset, utilizing 32, 64, and 128 electrode channels. The average accuracy over all participants using 32 electrodes and a window size of 31.25 ms is 86.23%, which gradually increases until it reaches 91.98% for 128 electrodes and a window size of 250 ms. CT-HGR achieves an accuracy of 89.13% for instantaneous recognition based on a single frame of an HD-sEMG image.  ( 2 min )
    An exponentially-growing family of universal quantum circuits. (arXiv:2212.00736v1 [quant-ph])
    Quantum machine learning has become an area of growing interest but has certain theoretical and hardware-specific limitations. Notably, the problem of vanishing gradients, or barren plateaus, renders training impossible for circuits with high qubit counts, imposing a limit on the number of qubits that data scientists can use for solving problems. Independently, angle-embedded supervised quantum neural networks were shown to produce truncated Fourier series with a degree directly dependent on two factors: the depth of the encoding, and the number of parallel qubits the encoding is applied to. The degree of the Fourier series limits the model expressivity. This work introduces two new architectures whose Fourier degrees grow exponentially: the sequential and parallel exponential quantum machine learning architectures. This is done by efficiently using the available Hilbert space when encoding, increasing the expressivity of the quantum encoding. The exponential growth therefore allows staying in the low-qubit limit to create highly expressive circuits that avoid barren plateaus. In practice, the parallel exponential architecture was shown to outperform existing linear architectures by reducing their final mean square error by up to 44.7% on a one-dimensional test problem. Furthermore, the feasibility of the technique was also demonstrated on a trapped-ion quantum processing unit.  ( 2 min )
    On the Effective Usage of Priors in RSS-based Localization. (arXiv:2212.00728v1 [eess.SP])
    In this paper, we study the localization problem in dense urban settings. In such environments, Global Navigation Satellite Systems fail to provide good accuracy due to the low likelihood of line-of-sight (LOS) links between the receiver (Rx) to be located and the satellites, caused by obstacles such as buildings. One therefore has to resort to other technologies that can reliably operate under non-line-of-sight (NLOS) conditions. Recently, we proposed a Received Signal Strength (RSS) fingerprint and convolutional neural network-based algorithm, LocUNet, and demonstrated its state-of-the-art localization performance with respect to the widely adopted k-nearest neighbors (kNN) algorithm and to state-of-the-art time of arrival (ToA) ranging-based methods. In the current work, we first recognize LocUNet's ability to learn the underlying prior distribution of the Rx position or of Rx and transmitter (Tx) association preferences from the training data, and attribute its high performance to these. Conversely, we demonstrate that classical methods based on a probabilistic approach can greatly benefit from an appropriate incorporation of such prior information. Our studies also numerically demonstrate LocUNet's close-to-optimal performance in many settings, by comparing it with the theoretically optimal formulations.  ( 2 min )
    Adversarial Artifact Detection in EEG-Based Brain-Computer Interfaces. (arXiv:2212.00727v1 [cs.CR])
    Machine learning has achieved great success in electroencephalogram (EEG) based brain-computer interfaces (BCIs). Most existing BCI research focused on improving its accuracy, but few had considered its security. Recent studies, however, have shown that EEG-based BCIs are vulnerable to adversarial attacks, where small perturbations added to the input can cause misclassification. Detection of adversarial examples is crucial to both the understanding of this phenomenon and the defense. This paper, for the first time, explores adversarial detection in EEG-based BCIs. Experiments on two EEG datasets using three convolutional neural networks were performed to verify the performances of multiple detection approaches. We showed that both white-box and black-box attacks can be detected, and the former are easier to detect.  ( 2 min )
    Hyperbolic Contrastive Learning for Visual Representations beyond Objects. (arXiv:2212.00653v1 [cs.CV])
    Although self-/un-supervised methods have led to rapid progress in visual representation learning, these methods generally treat objects and scenes using the same lens. In this paper, we focus on learning representations for objects and scenes that preserve the structure among them. Motivated by the observation that visually similar objects are close in the representation space, we argue that the scenes and objects should instead follow a hierarchical structure based on their compositionality. To exploit such a structure, we propose a contrastive learning framework where a Euclidean loss is used to learn object representations and a hyperbolic loss is used to encourage representations of scenes to lie close to representations of their constituent objects in a hyperbolic space. This novel hyperbolic objective encourages the scene-object hypernymy among the representations by optimizing the magnitude of their norms. We show that when pretraining on the COCO and OpenImages datasets, the hyperbolic loss improves downstream performance of several baselines across multiple datasets and tasks, including image classification, object detection, and semantic segmentation. We also show that the properties of the learned representations allow us to solve various vision tasks that involve the interaction between scenes and objects in a zero-shot fashion. Our code can be found at \url{https://github.com/shlokk/HCL/tree/main/HCL}.  ( 2 min )
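    The hyperbolic ingredient is the geodesic distance on the Poincaré ball, under which points near the origin sit close to everything and thus suit general concepts such as scenes. A minimal PyTorch version of that distance follows (the scene-object objective built on it is our reading of the abstract, so treat the usage as illustrative):

        import torch

        def poincare_dist(u, v, eps=1e-5):
            # geodesic distance between points inside the unit ball (norm < 1)
            uu = (u * u).sum(-1)
            vv = (v * v).sum(-1)
            duv = ((u - v) ** 2).sum(-1)
            x = 1 + 2 * duv / ((1 - uu).clamp_min(eps) * (1 - vv).clamp_min(eps))
            return torch.acosh(x.clamp_min(1 + eps))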
    ML framework for global river flood predictions based on the Caravan dataset. (arXiv:2212.00719v1 [physics.geo-ph])
    Reliable prediction of river floods in the first 72 hours can reduce harm because emergency agencies have sufficient time to prepare and deploy for help at the scene. Such river flood prediction models already exist and perform relatively well in most high-income countries. But, due to the limited availability of data, these models are lacking in low-income countries. Here, we offer the first global river flood prediction framework based on the newly published Caravan dataset. Our framework aims to serve as a benchmark for future global river flood prediction research. To support generalizability claims we include custom data evaluation splits. Further, we propose and evaluate a novel two-path LSTM architecture (2P-LSTM) against three baseline models. Finally, we evaluate the generated models on different locations in Africa and Asia that were not part of the Caravan dataset.  ( 2 min )
    SWL-Adapt: An Unsupervised Domain Adaptation Model with Sample Weight Learning for Cross-User Wearable Human Activity Recognition. (arXiv:2212.00724v1 [eess.SP])
    In practice, Wearable Human Activity Recognition (WHAR) models usually face performance degradation on new users due to user variance. Unsupervised domain adaptation (UDA) becomes the natural solution to cross-user WHAR under annotation scarcity. Existing UDA models usually align samples across domains without differentiation, ignoring the differences among samples. In this paper, we propose an unsupervised domain adaptation model with sample weight learning (SWL-Adapt) for cross-user WHAR. SWL-Adapt calculates sample weights according to the classification loss and domain discrimination loss of each sample with a parameterized network. We introduce a meta-optimization-based update rule to learn this network end-to-end, guided by the meta-classification loss on selected pseudo-labeled target samples. This network can therefore fit a weighting function suited to the cross-user WHAR task at hand, which is superior to existing sample differentiation rules fixed for special scenarios. Extensive experiments on three public WHAR datasets demonstrate that SWL-Adapt achieves state-of-the-art performance on the cross-user WHAR task, outperforming the best baseline by an average of 3.1% and 5.3% in accuracy and macro F1 score, respectively.  ( 2 min )
    Target-centered Subject Transfer Framework for EEG Data Augmentation. (arXiv:2212.00723v1 [eess.SP])
    Data augmentation approaches are widely explored for enhancing the decoding of electroencephalogram signals. In subject-independent brain-computer interface systems, domain adaptation and generalization are utilized to shift source subjects' data distribution to match the target subject as an augmentation. However, previous works either introduce noise (e.g., by noise addition or generation with random noise) or modify target data, and thus cannot well depict the target data distribution, hindering further analysis. In this paper, we propose a target-centered subject transfer framework as a data augmentation approach. A subset of source data is first constructed to maximize the source-target relevance. Then, a generative model is applied to transfer the data to the target domain. The proposed framework enriches the explainability of the target domain by adding extra real data instead of noise. It shows superior performance compared with other data augmentation methods. Extensive experiments are conducted to verify the effectiveness and robustness of our approach as a promising tool for further research.  ( 2 min )
    Prasatul Matrix: A Direct Comparison Approach for Analyzing Evolutionary Optimization Algorithms. (arXiv:2212.00671v1 [cs.NE])
    The performance of individual evolutionary optimization algorithms is mostly measured in terms of statistics such as the mean, median, and standard deviation, computed over the best solutions obtained in a few trials of the algorithm. To compare the performance of two algorithms, the values of these statistics are compared instead of comparing the solutions directly. This kind of comparison lacks a direct comparison of the solutions obtained with different algorithms. For instance, comparing the best (or worst) solutions of two algorithms is simply not possible. Moreover, ranking of algorithms is mostly done in terms of solution quality only, despite the fact that the convergence of an algorithm is also an important factor. In this paper, a direct comparison approach is proposed to analyze the performance of evolutionary optimization algorithms. A direct comparison matrix called the \emph{Prasatul Matrix} is prepared, which records the outcome of directly comparing the best solutions obtained with two algorithms over a specific number of trials. Five performance measures are designed on the basis of the Prasatul Matrix to evaluate algorithms in terms of optimality and comparability of solutions. These scores are utilized to develop a score-driven approach for comparing the performance of multiple algorithms, as well as for ranking them both on solution quality and on convergence. The proposed approach is analyzed with six evolutionary optimization algorithms on 25 benchmark functions. A non-parametric statistical analysis, namely the Wilcoxon rank-sum test, is also performed to verify the outcomes of the proposed direct comparison approach.  ( 2 min )
    Incremental Predictive Coding: A Parallel and Fully Automatic Learning Algorithm. (arXiv:2212.00720v1 [cs.NE])
    Neuroscience-inspired models, such as predictive coding, have the potential to play an important role in the future of machine intelligence. However, they are not yet used in industrial applications due to some limitations, such as the lack of efficiency. In this work, we address this by proposing incremental predictive coding (iPC), a variation of the original framework derived from the incremental expectation maximization algorithm, where every operation can be performed in parallel without external control. We show both theoretically and empirically that iPC is much faster than the original algorithm developed by Rao and Ballard, while maintaining performance comparable to backpropagation in image classification tasks. This work impacts several areas, has general applications in computational neuroscience and machine learning, and specific applications in scenarios where automatization and parallelization are important, such as distributed computing and implementations of deep learning models on analog and neuromorphic chips.  ( 2 min )
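    As a rough sketch of the parallel flavor of iPC (our reading of the abstract, not the authors' code): minimize the stack of prediction errors $\epsilon_l = x_l - W_l x_{l+1}$ by moving every activity and every weight simultaneously at each step, with no separate inference and learning phases.

        import numpy as np

        def ipc_step(xs, Ws, lr_x=0.05, lr_w=0.005):
            # xs: layer activities (x[0] clamped to the observation);
            # Ws: weights of the top-down predictions x[l] ~ W[l] @ x[l+1].
            eps = [xs[l] - Ws[l] @ xs[l + 1] for l in range(len(Ws))]
            new_xs, new_Ws = [xs[0]], []
            for l in range(len(Ws)):
                up = eps[l + 1] if l + 1 < len(Ws) else 0.0
                new_xs.append(xs[l + 1] - lr_x * (up - Ws[l].T @ eps[l]))
                new_Ws.append(Ws[l] + lr_w * np.outer(eps[l], xs[l + 1]))
            return new_xs, new_Ws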
    Adapted Multimodal BERT with Layer-wise Fusion for Sentiment Analysis. (arXiv:2212.00678v1 [cs.CL])
    Multimodal learning pipelines have benefited from the success of pretrained language models. However, this comes at the cost of increased model parameters. In this work, we propose Adapted Multimodal BERT (AMB), a BERT-based architecture for multimodal tasks that uses a combination of adapter modules and intermediate fusion layers. The adapter adjusts the pretrained language model for the task at hand, while the fusion layers perform task-specific, layer-wise fusion of audio-visual information with textual BERT representations. During the adaptation process the pre-trained language model parameters remain frozen, allowing for fast, parameter-efficient training. In our ablations we see that this approach leads to efficient models, that can outperform their fine-tuned counterparts and are robust to input noise. Our experiments on sentiment analysis with CMU-MOSEI show that AMB outperforms the current state-of-the-art across metrics, with 3.4% relative reduction in the resulting error and 2.1% relative improvement in 7-class classification accuracy.  ( 2 min )
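    The adapter itself is a small, standard component; below is a minimal Houlsby-style bottleneck module in PyTorch (dimensions are illustrative, and the exact placement and fusion used by AMB are not reproduced here):

        import torch.nn as nn

        class Adapter(nn.Module):
            # down-project, nonlinearity, up-project, residual connection;
            # only these few parameters train while BERT stays frozen
            def __init__(self, dim=768, bottleneck=64):
                super().__init__()
                self.down = nn.Linear(dim, bottleneck)
                self.up = nn.Linear(bottleneck, dim)
                self.act = nn.GELU()

            def forward(self, h):
                return h + self.up(self.act(self.down(h)))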
    Shining light on data: Geometric data analysis through quantum dynamics. (arXiv:2212.00682v1 [quant-ph])
    Experimental sciences have come to depend heavily on our ability to organize, interpret and analyze high-dimensional datasets produced from observations of a large number of variables governed by natural processes. Natural laws, conservation principles, and dynamical structure introduce intricate inter-dependencies among these observed variables, which in turn yield geometric structure, with fewer degrees of freedom, on the dataset. We show how fine-scale features of this structure in data can be extracted from \emph{discrete} approximations to quantum mechanical processes given by data-driven graph Laplacians and localized wavepackets. This data-driven quantization procedure leads to a novel, yet natural uncertainty principle for data analysis induced by limited data. We illustrate the new approach with algorithms and several applications to real-world data, including the learning of patterns and anomalies in social distancing and mobility behavior during the COVID-19 pandemic.  ( 2 min )
    High Dimensional Binary Classification under Label Shift: Phase Transition and Regularization. (arXiv:2212.00700v1 [cs.LG])
    Label shift has been widely believed to be harmful to the generalization performance of machine learning models. Researchers have proposed many approaches to mitigate its impact, e.g., balancing the training data. However, these methods often consider the underparametrized regime, where the sample size is much larger than the data dimension; research in the overparametrized regime is very limited. To bridge this gap, we propose a new asymptotic analysis of the Fisher Linear Discriminant classifier for binary classification with label shift. Specifically, we prove that there exists a phase transition phenomenon: in a certain overparametrized regime, the classifier trained using imbalanced data outperforms the counterpart with reduced balanced data. Moreover, we investigate the impact of regularization on the label shift: the aforementioned phase transition vanishes as the regularization becomes strong.  ( 2 min )
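    For reference, the classifier under study has a closed form; a minimal regularized Fisher Linear Discriminant is given below (the regularization strength is the knob whose effect on the phase transition is analyzed):

        import numpy as np

        def fisher_lda(X0, X1, reg=0.0):
            # w = (Sigma + reg * I)^{-1} (mu1 - mu0), fit on the two classes;
            # classify a point x by sign(w @ (x - (mu0 + mu1) / 2))
            mu0, mu1 = X0.mean(0), X1.mean(0)
            S = np.cov(np.vstack([X0 - mu0, X1 - mu1]).T)
            S += reg * np.eye(X0.shape[1])
            return np.linalg.solve(S, mu1 - mu0)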
    Using Gradient to Boost the Generalization Performance of Deep Learning Models for Fluid Dynamics. (arXiv:2212.00716v1 [physics.flu-dyn])
    Nowadays, Computational Fluid Dynamics (CFD) is a fundamental tool for industrial design. However, the computational cost of such simulations is expensive and can be detrimental for real-world use cases where many simulations are necessary, such as shape optimization. Recently, Deep Learning (DL) has achieved a significant leap in a wide spectrum of applications and became a good candidate for physical systems, opening perspectives for CFD. To circumvent the computational bottleneck of CFD, DL models have been used to learn on Euclidean data, and more recently, on non-Euclidean data such as unstructured grids and manifolds, allowing much faster and more efficient (memory, hardware) surrogate models. Nevertheless, DL presents the intrinsic limitation of extrapolating (generalizing) out of the training data distribution (design space). In this study, we present a novel approach to increase the generalization capabilities of Deep Learning: we incorporate the physical gradients (derivatives of the outputs w.r.t. the inputs) into the DL models. Our strategy has shown good results towards a better generalization of DL networks, and our methodological/theoretical study is corroborated with empirical validation, including an ablation study.  ( 2 min )
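    One common way to use such gradients is a Sobolev-style training loss that also penalizes the mismatch of $\partial \hat{y} / \partial x$; below is a hedged PyTorch sketch for a scalar-output surrogate (the paper's exact formulation may differ):

        import torch

        def sobolev_loss(model, x, y, dy_dx, lam=0.1):
            # match both outputs and input-gradients to the CFD targets
            x = x.requires_grad_(True)
            pred = model(x)
            grad_pred, = torch.autograd.grad(pred.sum(), x, create_graph=True)
            return ((pred - y) ** 2).mean() + lam * ((grad_pred - dy_dx) ** 2).mean()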
    Launchpad: Learning to Schedule Using Offline and Online RL Methods. (arXiv:2212.00639v1 [cs.LG])
    Deep reinforcement learning algorithms have succeeded in several challenging domains. Classic online RL job schedulers can learn efficient scheduling strategies but often take thousands of timesteps to explore the environment and adapt from a randomly initialized DNN policy. Existing RL schedulers overlook the importance of learning from historical data and improving upon custom heuristic policies. Offline reinforcement learning presents the prospect of policy optimization from pre-recorded datasets without online environment interaction. Following the recent success of data-driven learning, we explore two RL methods: 1) Behaviour Cloning and 2) Offline RL, which aim to learn policies from logged data without interacting with the environment. These methods address the challenges concerning the cost of data collection and safety, particularly pertinent to real-world applications of RL. Although the data-driven RL methods generate good results, we show that the performance is highly dependent on the quality of the historical datasets. Finally, we demonstrate that by effectively incorporating prior expert demonstrations to pre-train the agent, we short-circuit the random exploration phase and learn a reasonable policy with online training. We utilize Offline RL as a \textbf{launchpad} to learn effective scheduling policies from prior experience collected using Oracle or heuristic policies. Such a framework is effective for pre-training from historical datasets and is well suited to continuous improvement with online data collection.  ( 2 min )
    A Graph Neural Networks based Framework for Topology-Aware Proactive SLA Management in a Latency Critical NFV Application Use-case. (arXiv:2212.00714v1 [cs.DC])
    Recent advancements in the rollout of 5G and 6G have led to the emergence of a new range of latency-critical applications delivered via a Network Function Virtualization (NFV) enabled paradigm of flexible and softwarized communication networks. Evolving verticals like telecommunications, smart grid, virtual reality (VR), industry 4.0, automated vehicles, etc. are driven by the vision of low latency and high reliability, and there is a wide gap to efficiently bridge the Quality of Service (QoS) constraints for both the service providers and the end-user. In this work, we tackle the over-provisioning of latency-critical services by proposing a proactive SLA management framework leveraging Graph Neural Networks (GNN) and Deep Reinforcement Learning (DRL) to balance the trade-off between efficiency and reliability. To summarize our key contributions: 1) we compose a graph-based spatio-temporal multivariate time-series forecasting model with multiple time-step predictions in a multi-output scenario, delivering 74.62% improved performance over the established baseline state-of-the-art model on the use-case; and 2) we leverage realistic SLA definitions for the use-case to achieve a dynamic SLA-aware oversight for scaling policy management with DRL.  ( 2 min )
    Explainable Artificial Intelligence for Improved Modeling of Processes. (arXiv:2212.00695v1 [cs.LG])
    In modern business processes, the amount of data collected has increased substantially in recent years. Because this data can potentially yield valuable insights, automated knowledge extraction based on process mining has been proposed, among other techniques, to provide users with intuitive access to the information contained therein. At present, the majority of technologies aim to reconstruct explicit business process models. These are directly interpretable but limited concerning the integration of diverse and real-valued information sources. On the other hand, Machine Learning (ML) benefits from the vast amount of data available and can deal with high-dimensional sources, yet it has rarely been applied to process modeling. In this contribution, we evaluate the capability of modern Transformer architectures, as well as more classical ML technologies, to model process regularities, as quantitatively evaluated by their prediction capability. In addition, we demonstrate the capability of attentional properties and feature relevance determination by highlighting features that are crucial to the processes' predictive abilities. We demonstrate the efficacy of our approach using five benchmark datasets and show that the ML models are capable of predicting critical outcomes and that the attention mechanisms or XAI components offer new insights into the underlying processes.  ( 2 min )
    Sub-quadratic Algorithms for Kernel Matrices via Kernel Density Estimation. (arXiv:2212.00642v1 [cs.LG])
    Kernel matrices, as well as weighted graphs represented by them, are ubiquitous objects in machine learning, statistics and other related fields. The main drawback of using kernel methods (learning and inference using kernel matrices) is efficiency -- given $n$ input points, most kernel-based algorithms need to materialize the full $n \times n$ kernel matrix before performing any subsequent computation, thus incurring $\Omega(n^2)$ runtime. Breaking this quadratic barrier for various problems has therefore, been a subject of extensive research efforts. We break the quadratic barrier and obtain $\textit{subquadratic}$ time algorithms for several fundamental linear-algebraic and graph processing primitives, including approximating the top eigenvalue and eigenvector, spectral sparsification, solving linear systems, local clustering, low-rank approximation, arboricity estimation and counting weighted triangles. We build on the recent Kernel Density Estimation framework, which (after preprocessing in time subquadratic in $n$) can return estimates of row/column sums of the kernel matrix. In particular, we develop efficient reductions from $\textit{weighted vertex}$ and $\textit{weighted edge sampling}$ on kernel graphs, $\textit{simulating random walks}$ on kernel graphs, and $\textit{importance sampling}$ on matrices to Kernel Density Estimation and show that we can generate samples from these distributions in $\textit{sublinear}$ (in the support of the distribution) time. Our reductions are the central ingredient in each of our applications and we believe they may be of independent interest. We empirically demonstrate the efficacy of our algorithms on low-rank approximation (LRA) and spectral sparsification, where we observe a $\textbf{9x}$ decrease in the number of kernel evaluations over baselines for LRA and a $\textbf{41x}$ reduction in the graph size for spectral sparsification.  ( 2 min )
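    To make the sampling primitives concrete: given any oracle for kernel row sums (here a crude Monte-Carlo stand-in, not the paper's KDE data structure), weighted vertex sampling on the kernel graph is just sampling proportional to the estimated degrees.

        import numpy as np

        def est_row_sums(X, m=64, seed=0):
            # estimate row sums of K_ij = exp(-||x_i - x_j||^2) from m columns
            rng = np.random.default_rng(seed)
            S = X[rng.integers(len(X), size=m)]
            d2 = ((X[:, None, :] - S[None, :, :]) ** 2).sum(-1)
            return np.exp(-d2).mean(1) * len(X)

        X = np.random.default_rng(1).normal(size=(1000, 5))
        deg = est_row_sums(X)                    # approximate degrees
        v = np.random.default_rng(2).choice(len(X), p=deg / deg.sum())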
    Exploiting Kernel Compression on BNNs. (arXiv:2212.00608v1 [cs.AR])
    Binary Neural Networks (BNNs) are showing tremendous success on realistic image classification tasks. Notably, their accuracy is similar to the state-of-the-art accuracy obtained by full-precision models tailored to edge devices. In this regard, BNNs are very amenable to edge devices since they employ 1 bit to store the inputs and weights, and thus, their storage requirements are low. Also, BNN computations are mainly done using xnor and pop-count operations, which are implemented very efficiently using simple hardware structures. Nonetheless, supporting BNNs efficiently on mobile CPUs is far from trivial since their benefits are hindered by frequent memory accesses to load weights and inputs. In BNNs, a weight or an input is stored using one bit, and aiming to increase storage and computation efficiency, several of them are packed together as a sequence of bits. In this work, we observe that the number of unique sequences representing a set of weights is typically low. Also, we have seen that during the evaluation of a BNN layer, a small group of unique sequences is employed more frequently than others. Accordingly, we propose exploiting this observation by using Huffman encoding to encode the bit sequences and then using an indirection table to decode them during the BNN evaluation. Also, we propose a clustering scheme to identify the most common sequences of bits and replace the less common ones with similar common sequences. Hence, we decrease the storage requirements and memory accesses since common sequences are encoded with fewer bits. We extend a mobile CPU by adding a small hardware structure that can efficiently cache and decode the compressed sequences of bits. We evaluate our scheme using the ReActNet model with the ImageNet dataset. Our experimental results show that our technique can reduce the memory requirement by 1.32x and improve performance by 1.35x.  ( 2 min )
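    The encoding step is standard Huffman coding over packed weight words; a small standard-library sketch of the idea follows (the hardware decode table and the clustering of rare sequences are not modeled here):

        import heapq
        from collections import Counter

        def huffman_code(sequences):
            # sequences: packed weight words, e.g. 8-bit ints from a BNN layer
            freq = Counter(sequences)
            heap = [(n, i, {s: ""}) for i, (s, n) in enumerate(freq.items())]
            heapq.heapify(heap)
            while len(heap) > 1:
                n0, _, c0 = heapq.heappop(heap)
                n1, i1, c1 = heapq.heappop(heap)
                merged = {s: "0" + b for s, b in c0.items()}
                merged.update({s: "1" + b for s, b in c1.items()})
                heapq.heappush(heap, (n0 + n1, i1, merged))
            return heap[0][2]  # word -> bitstring; frequent words get short codes

        codes = huffman_code([0b1101, 0b1101, 0b1101, 0b0110, 0b0110, 0b0001])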
    Probably Approximate Shapley Fairness with Applications in Machine Learning. (arXiv:2212.00630v1 [cs.LG])
    The Shapley value (SV) is adopted in various scenarios in machine learning (ML), including data valuation, agent valuation, and feature attribution, as it satisfies their fairness requirements. However, as exact SVs are infeasible to compute in practice, SV estimates are approximated instead. This approximation step raises an important question: do the SV estimates preserve the fairness guarantees of exact SVs? We observe that the fairness guarantees of exact SVs are too restrictive for SV estimates. Thus, we generalise Shapley fairness to probably approximate Shapley fairness and propose fidelity score, a metric to measure the variation of SV estimates, that determines how probable the fairness guarantees hold. Our last theoretical contribution is a novel greedy active estimation (GAE) algorithm that will maximise the lowest fidelity score and achieve a better fairness guarantee than the de facto Monte-Carlo estimation. We empirically verify GAE outperforms several existing methods in guaranteeing fairness while remaining competitive in estimation accuracy in various ML scenarios using real-world datasets.  ( 2 min )
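    The de facto baseline that GAE is measured against is the Monte-Carlo permutation estimator; a minimal version:

        import random

        def shapley_mc(players, value, rounds=1000):
            # players: list; value: set-function v(S); returns SV estimates
            phi = {p: 0.0 for p in players}
            for _ in range(rounds):
                perm = random.sample(players, len(players))
                prev, coalition = value(set()), set()
                for p in perm:
                    coalition.add(p)
                    cur = value(coalition)
                    phi[p] += (cur - prev) / rounds  # marginal contribution
                    prev = cur
            return phi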
    Safe Reinforcement Learning with Probabilistic Control Barrier Functions for Ramp Merging. (arXiv:2212.00618v1 [cs.RO])
    Prior work has looked at applying reinforcement learning and imitation learning approaches to autonomous driving scenarios, but either the safety or the efficiency of the algorithm is compromised. With the use of control barrier functions embedded into the reinforcement learning policy, we arrive at safe policies to optimize the performance of the autonomous driving vehicle. However, control barrier functions need a good approximation of the model of the car. We use probabilistic control barrier functions as an estimate of the model uncertainty. The algorithm is implemented as an online version in the CARLA (Dosovitskiy et al., 2017) Simulator and as an offline version on a dataset extracted from the NGSIM Database. The proposed algorithm is not just a safe ramp merging algorithm but a safe autonomous driving algorithm applied to address ramp merging on highways.  ( 2 min )
    Finetune like you pretrain: Improved finetuning of zero-shot vision models. (arXiv:2212.00638v1 [cs.CV])
    Finetuning image-text models such as CLIP achieves state-of-the-art accuracies on a variety of benchmarks. However, recent works like WiseFT (Wortsman et al., 2021) and LP-FT (Kumar et al., 2022) have shown that even subtle differences in the finetuning process can lead to surprisingly large differences in the final performance, both for in-distribution (ID) and out-of-distribution (OOD) data. In this work, we show that a natural and simple approach of mimicking contrastive pretraining consistently outperforms alternative finetuning approaches. Specifically, we cast downstream class labels as text prompts and continue optimizing the contrastive loss between image embeddings and class-descriptive prompt embeddings (contrastive finetuning). Our method consistently outperforms baselines across 7 distribution shifts, 6 transfer learning, and 3 few-shot learning benchmarks. On WILDS-iWILDCam, our proposed approach FLYP outperforms the top of the leaderboard by $2.3\%$ ID and $2.7\%$ OOD, giving the highest reported accuracy. Averaged across 7 OOD datasets (2 WILDS and 5 ImageNet associated shifts), FLYP gives gains of $4.2\%$ OOD over standard finetuning and outperforms the current state of the art (LP-FT) by more than $1\%$ both ID and OOD. Similarly, on 3 few-shot learning benchmarks, our approach gives gains up to $4.6\%$ over standard finetuning and $4.4\%$ over the state of the art. In total, these benchmarks establish contrastive finetuning as a simple, intuitive, and state-of-the-art approach for supervised finetuning of image-text models like CLIP. Code is available at https://github.com/locuslab/FLYP.  ( 2 min )
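    The essence of contrastive finetuning fits in a few lines; here is a sketch of the image-to-text side with placeholder embeddings (FLYP optimizes the symmetric CLIP loss over batches, which is omitted for brevity):

        import torch.nn.functional as F

        def contrastive_ft_loss(img_emb, txt_emb, labels, tau=0.07):
            # img_emb: (B, d) image embeddings; txt_emb: (C, d) embeddings of
            # class-descriptive prompts ("a photo of a <class>"); labels: (B,)
            img = F.normalize(img_emb, dim=-1)
            txt = F.normalize(txt_emb, dim=-1)
            logits = img @ txt.t() / tau
            return F.cross_entropy(logits, labels)  # no new linear head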
    Online Learning-based Waveform Selection for Improved Vehicle Recognition in Automotive Radar. (arXiv:2212.00615v1 [eess.SP])
    This paper describes important considerations and challenges associated with online reinforcement-learning based waveform selection for target identification in frequency modulated continuous wave (FMCW) automotive radar systems. We present a novel learning approach based on satisficing Thompson sampling, which quickly identifies a waveform expected to yield satisfactory classification performance. We demonstrate through measurement-level simulations that effective waveform selection strategies can be quickly learned, even in cases where the radar must select from a large catalog of candidate waveforms. The radar learns to adaptively select a bandwidth for appropriate resolution and a slow-time unimodular code for interference mitigation in the scene of interest by optimizing an expected classification metric.  ( 2 min )
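    For orientation, plain Beta-Bernoulli Thompson sampling over a waveform catalog looks as follows; the satisficing variant additionally commits once a waveform is believed to clear a performance threshold (that stopping rule is not modeled in this sketch).

        import numpy as np

        def ts_waveforms(success_prob, steps=2000, seed=0):
            # success_prob: true (unknown) classification success per waveform
            rng = np.random.default_rng(seed)
            K = len(success_prob)
            a, b = np.ones(K), np.ones(K)           # Beta(1,1) priors
            for _ in range(steps):
                k = int(np.argmax(rng.beta(a, b)))  # sample beliefs, act greedily
                r = rng.random() < success_prob[k]  # observed success/failure
                a[k] += r
                b[k] += 1 - r
            return a / (a + b)                      # posterior means

        print(ts_waveforms([0.55, 0.70, 0.62]))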
    Vertical Federated Learning: A Structured Literature Review. (arXiv:2212.00622v1 [cs.LG])
    Federated Learning (FL) has emerged as a promising distributed learning paradigm with an added advantage of data privacy. With the growing interest in having collaboration among data owners, FL has gained significant attention of organizations. The idea of FL is to enable collaborating participants to train machine learning (ML) models on decentralized data without breaching privacy. In simpler words, federated learning is the approach of ``bringing the model to the data, instead of bringing the data to the model''. Federated learning, when applied to data which is partitioned vertically across participants, is able to build a complete ML model by combining local models trained only using the data with distinct features at the local sites. This architecture of FL is referred to as vertical federated learning (VFL), which differs from the conventional FL on horizontally partitioned data. As VFL is different from conventional FL, it comes with its own issues and challenges. In this paper, we present a structured literature review discussing the state-of-the-art approaches in VFL. Additionally, the literature review highlights the existing solutions to challenges in VFL and provides potential research directions in this domain.  ( 2 min )
    Purifier: Defending Data Inference Attacks via Transforming Confidence Scores. (arXiv:2212.00612v1 [cs.LG])
    Neural networks are susceptible to data inference attacks such as the membership inference attack, the adversarial model inversion attack and the attribute inference attack, where the attacker could infer useful information such as the membership, the reconstruction or the sensitive attributes of a data sample from the confidence scores predicted by the target classifier. In this paper, we propose a method, namely PURIFIER, to defend against membership inference attacks. It transforms the confidence score vectors predicted by the target classifier and makes the purified confidence scores indistinguishable in individual shape, statistical distribution and prediction label between members and non-members. The experimental results show that PURIFIER helps defend against membership inference attacks with high effectiveness and efficiency, outperforming previous defense methods, while incurring negligible utility loss. Besides, our further experiments show that PURIFIER is also effective in defending against adversarial model inversion attacks and attribute inference attacks. For example, the inversion error is raised more than 4 times on the Facescrub530 classifier, and the attribute inference accuracy drops significantly when PURIFIER is deployed in our experiment.  ( 2 min )
    Quantum Neural Networks for a Supply Chain Logistics Application. (arXiv:2212.00576v1 [quant-ph])
    Problem instances of a size suitable for practical applications are not likely to be addressed during the noisy intermediate-scale quantum (NISQ) period with (almost) pure quantum algorithms. Hybrid classical-quantum algorithms have potential, however, to achieve good performance on much larger problem instances. We investigate one such hybrid algorithm on a problem of substantial importance: vehicle routing for supply chain logistics with multiple trucks and complex demand structure. We use reinforcement learning with neural networks with embedded quantum circuits. In such neural networks, projecting high-dimensional feature vectors down to smaller vectors is necessary to accommodate restrictions on the number of qubits of NISQ hardware. However, we use a multi-head attention mechanism where, even in classical machine learning, such projections are natural and desirable. We consider data from the truck routing logistics of a company in the automotive sector, and apply our methodology by decomposing into small teams of trucks, and we find results comparable to human truck assignment.  ( 2 min )
    Near Sample-Optimal Reduction-based Policy Learning for Average Reward MDP. (arXiv:2212.00603v1 [cs.LG])
    This work considers the sample complexity of obtaining an $\varepsilon$-optimal policy in an average reward Markov Decision Process (AMDP), given access to a generative model (simulator). When the ground-truth MDP is weakly communicating, we prove an upper bound of $\widetilde O(H \varepsilon^{-3} \ln \frac{1}{\delta})$ samples per state-action pair, where $H := sp(h^*)$ is the span of bias of any optimal policy, $\varepsilon$ is the accuracy and $\delta$ is the failure probability. This bound improves the best-known mixing-time-based approaches in [Jin & Sidford 2021], which assume the mixing-time of every deterministic policy is bounded. The core of our analysis is a proper reduction bound from AMDP problems to discounted MDP (DMDP) problems, which may be of independent interest since it allows the application of DMDP algorithms to AMDP in other settings. We complement our upper bound by proving a minimax lower bound of $\Omega(|\mathcal S| |\mathcal A| H \varepsilon^{-2} \ln \frac{1}{\delta})$ total samples, showing that a linear dependence on $H$ is necessary and that our upper bound matches the lower bound in all parameters of $(|\mathcal S|, |\mathcal A|, H, \ln \frac{1}{\delta})$ up to some logarithmic factors.  ( 2 min )
    When is Cognitive Radar Beneficial?. (arXiv:2212.00597v1 [cs.IT])
    When should an online reinforcement learning-based frequency agile cognitive radar be expected to outperform a rule-based adaptive waveform selection strategy? We seek insight regarding this question by examining a dynamic spectrum access scenario, in which the radar wishes to transmit in the widest unoccupied bandwidth during each pulse repetition interval. Online learning is compared to a fixed rule-based sense-and-avoid strategy. We show that given a simple Markov channel model, the problem can be examined analytically for simple cases via stochastic dominance. Additionally, we show that for more realistic channel assumptions, learning-based approaches demonstrate greater ability to generalize. However, for short time-horizon problems that are well-specified, we find that machine learning approaches may perform poorly due to the inherent limitation of convergence time. We draw conclusions as to when learning-based approaches are expected to be beneficial and provide guidelines for future study.  ( 2 min )
    Understanding the Energy Consumption of HPC Scale Artificial Intelligence. (arXiv:2212.00582v1 [cs.DC])
    This paper contributes towards better understanding the energy consumption trade-offs of HPC-scale Artificial Intelligence (AI), and more specifically Deep Learning (DL) algorithms. For this task we developed Benchmark-Tracker, a benchmark tool to evaluate the speed and energy consumption of DL algorithms in HPC environments. We exploited hardware counters and Python libraries to collect energy information through software, which enabled us to instrument a known AI benchmark tool and to evaluate the energy consumption of numerous DL algorithms and models. Through an experimental campaign, we show a case example of the potential of Benchmark-Tracker to measure the computing speed and energy consumption of training and inference of DL algorithms, and also its potential to help better understand the energy behavior of DL algorithms on HPC platforms. This work is a step forward in better understanding the energy consumption of Deep Learning in HPC, and it also contributes a new tool to help HPC DL developers better balance their HPC infrastructure in terms of speed and energy consumption.  ( 2 min )
    Graph Convolutional Neural Networks as Parametric CoKleisli morphisms. (arXiv:2212.00542v1 [math.CT])
    We define the bicategory of Graph Convolutional Neural Networks $\mathbf{GCNN}_n$ for an arbitrary graph with $n$ nodes. We show it can be factored through the already existing categorical constructions for deep learning called $\mathbf{Para}$ and $\mathbf{Lens}$ with the base category set to the CoKleisli category of the product comonad. We prove that there exists an injective-on-objects, faithful 2-functor $\mathbf{GCNN}_n \to \mathbf{Para}(\mathsf{CoKl}(\mathbb{R}^{n \times n} \times -))$. We show that this construction allows us to treat the adjacency matrix of a GCNN as a global parameter instead of a local, layer-wise one. This gives us a high-level categorical characterisation of a particular kind of inductive bias GCNNs possess. Lastly, we hypothesize about possible generalisations of GCNNs to general message-passing graph neural networks, connections to equivariant learning, and the (lack of) functoriality of activation functions.  ( 2 min )
    Penalized Langevin and Hamiltonian Monte Carlo Algorithms for Constrained Sampling. (arXiv:2212.00570v1 [stat.ML])
    We consider the constrained sampling problem where the goal is to sample from a distribution $\pi(x)\propto e^{-f(x)}$ and $x$ is constrained on a convex body $\mathcal{C}\subset \mathbb{R}^d$. Motivated by penalty methods from optimization, we propose penalized Langevin Dynamics (PLD) and penalized Hamiltonian Monte Carlo (PHMC) that convert the constrained sampling problem into an unconstrained one by introducing a penalty function for constraint violations. When $f$ is smooth and the gradient is available, we show $\tilde{\mathcal{O}}(d/\varepsilon^{10})$ iteration complexity for PLD to sample the target up to an $\varepsilon$-error where the error is measured in terms of the total variation distance and $\tilde{\mathcal{O}}(\cdot)$ hides some logarithmic factors. For PHMC, we improve this result to $\tilde{\mathcal{O}}(\sqrt{d}/\varepsilon^{7})$ when the Hessian of $f$ is Lipschitz and the boundary of $\mathcal{C}$ is sufficiently smooth. To our knowledge, these are the first convergence rate results for Hamiltonian Monte Carlo methods in the constrained sampling setting that can handle non-convex $f$ and can provide guarantees with the best dimension dependency among existing methods with deterministic gradients. We then consider the setting where unbiased stochastic gradients are available. We propose PSGLD and PSGHMC that can handle stochastic gradients without Metropolis-Hasting correction steps. When $f$ is strongly convex and smooth, we obtain an iteration complexity of $\tilde{\mathcal{O}}(d/\varepsilon^{18})$ and $\tilde{\mathcal{O}}(d\sqrt{d}/\varepsilon^{39})$ respectively in the 2-Wasserstein distance. For the more general case, when $f$ is smooth and non-convex, we also provide finite-time performance bounds and iteration complexity results. Finally, we test our algorithms on Bayesian LASSO regression and Bayesian constrained deep learning problems.  ( 2 min )
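    A minimal instance of the PLD idea for the unit ball: add a smooth penalty on constraint violations and run plain unadjusted Langevin on $f$ plus the penalty (step size and penalty weight are illustrative; no claims about matching the paper's tuned rates).

        import numpy as np

        def pld(grad_f, d, steps=20000, eta=1e-3, rho=50.0, seed=0):
            # sample from pi(x) ~ exp(-f(x)) restricted to ||x|| <= 1 via the
            # penalty rho * max(0, ||x|| - 1)^2 added to f
            rng = np.random.default_rng(seed)
            x = np.zeros(d)
            for _ in range(steps):
                nrm = np.linalg.norm(x)
                pen = 2 * rho * max(nrm - 1.0, 0.0) * (x / nrm if nrm > 0 else x)
                x = x - eta * (grad_f(x) + pen) + np.sqrt(2 * eta) * rng.normal(size=d)
            return x

        x = pld(lambda x: x, d=5)  # f(x) = ||x||^2 / 2, i.e. a Gaussian target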
    Soft Labels for Rapid Satellite Object Detection. (arXiv:2212.00585v1 [cs.CV])
    Soft labels in image classification are vector representations of an image's true classification. In this paper, we investigate soft labels in the context of satellite object detection. We propose using detections as the basis for a new dataset of soft labels. Much of the effort in creating a high-quality model is gathering and annotating the training data. If we could use a model to generate a dataset for us, we could not only rapidly create datasets, but also supplement existing open-source datasets. Using a subset of the xView dataset, we train a YOLOv5 model to detect cars, planes, and ships. We then use that model to generate soft labels for a second training set, on which we train a new model that we compare to the original. We show that soft labels can be used to train a model that is almost as accurate as a model trained on the original data.  ( 2 min )
    Graph Anomaly Detection via Multi-Scale Contrastive Learning Networks with Augmented View. (arXiv:2212.00535v1 [cs.LG])
    Graph anomaly detection (GAD) is a vital task in graph-based machine learning and has been widely applied in many real-world applications. The primary goal of GAD is to capture anomalous nodes from graph datasets, which evidently deviate from the majority of nodes. Recent methods have paid attention to various scales of contrastive strategies for GAD, i.e., node-subgraph and node-node contrasts. However, they neglect the subgraph-subgraph comparison information, in which normal and abnormal subgraph pairs behave differently in terms of embeddings and structures, resulting in sub-optimal task performance. In this paper, we realize the above idea in a proposed multi-view multi-scale contrastive learning framework with subgraph-subgraph contrast, for the first time. To be specific, we regard the original input graph as the first view and generate the second view by graph augmentation with edge modifications. With the guidance of maximizing the similarity of the subgraph pairs, the proposed subgraph-subgraph contrast contributes to more robust subgraph embeddings despite structural variation. Moreover, the introduced subgraph-subgraph contrast cooperates well with the widely adopted node-subgraph and node-node contrastive counterparts for mutual GAD performance promotion. Besides, we also conduct sufficient experiments to investigate the impact of different graph augmentation approaches on detection performance. The comprehensive experimental results well demonstrate the superiority of our method compared with the state-of-the-art approaches, and the effectiveness of the multi-view subgraph pair contrastive strategy for the GAD task.  ( 2 min )
    Deep Kernel Learning for Mortality Prediction in the Face of Temporal Shift. (arXiv:2212.00557v1 [cs.LG])
    Neural models, with their ability to provide novel representations, have shown promising results in prediction tasks in healthcare. However, patient demographics, medical technology, and quality of care change over time. This often leads to a drop in the performance of neural models for prospective patients, especially in terms of their calibration. The deep kernel learning (DKL) framework may be robust to such changes as it combines neural models with Gaussian processes, which are aware of prediction uncertainty. Our hypothesis is that out-of-distribution test points will result in probabilities closer to the global mean and hence prevent overconfident predictions. This in turn, we hypothesise, will result in better calibration on prospective data. This paper investigates DKL's behaviour when facing a temporal shift, which was naturally introduced when an information system that feeds a cohort database was changed. We compare DKL's performance to that of a neural baseline based on recurrent neural networks. We show that DKL indeed produced better calibrated predictions. We also confirm that DKL's predictions were indeed less sharp. In addition, DKL's discrimination ability was even improved: its AUC was 0.746 (± 0.014 std), compared to 0.739 (± 0.028 std) for the baseline. The paper demonstrates the importance of including uncertainty in neural computing, especially for its prospective use.  ( 2 min )
    Privacy-Preserving Data Synthetisation for Secure Information Sharing. (arXiv:2212.00484v1 [cs.LG])
    We can protect user data privacy via many approaches, such as statistical transformation or generative models. However, each of them has critical drawbacks. On the one hand, creating a transformed data set using conventional techniques is highly time-consuming. On the other hand, in addition to long training phases, recent deep learning-based solutions require significant computational resources. In this paper, we propose PrivateSMOTE, a technique designed for competitive effectiveness in protecting cases at maximum risk of re-identification while requiring much less time and computational resources. It works by synthetic data generation via interpolation to obfuscate high-risk cases while minimizing data utility loss of the original data. Compared to multiple conventional and state-of-the-art privacy-preservation methods on 20 data sets, PrivateSMOTE demonstrates competitive results in re-identification risk. Also, it presents similar or higher predictive performance than the baselines, including generative adversarial networks and variational autoencoders, reducing their energy consumption and time requirements by a minimum factor of 9 and 12, respectively.  ( 2 min )
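    The generation step is SMOTE-style interpolation; here is a sketch of replacing a high-risk record with points interpolated toward its nearest neighbors (feature preprocessing and the risk-selection step are left out):

        import numpy as np
        from sklearn.neighbors import NearestNeighbors

        def interpolate_risky(X_risky, X, k=5, n_new=3, seed=0):
            rng = np.random.default_rng(seed)
            nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
            out = []
            for x in X_risky:
                idx = nn.kneighbors([x], return_distance=False)[0][1:]  # skip self
                for j in rng.choice(idx, n_new):
                    lam = rng.random()
                    out.append(x + lam * (X[j] - x))  # synthetic stand-in record
            return np.array(out)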
    Cellular Automata Model for Non-Structural Proteins Comparing Transmissibility and Pathogenesis of SARS Covid (CoV-2, CoV) and MERS Covid. (arXiv:2212.00502v1 [q-bio.OT])
    The significantly higher transmissibility of SARS CoV-2 (2019) compared to SARS CoV (2003) can be attributed to mutations of the structural proteins (Spike S, Nucleocapsid N, Membrane M, and Envelope E) and the role played by non-structural proteins (nsps) and accessory proteins (ORFs) in viral replication, assembly, and shedding. The non-structural proteins (nsps) avail the host protein synthesis machinery to initiate viral replication, along with neutralization of the host immune defense. The key protein among the 16 nsps is the non-structural protein nsp1, also known as the leader protein. Nsp1 leads the process of hijacking host resources by blocking host translation. This paper concentrates on the analysis of the nsps of SARS covid (CoV-2, CoV) and MERS covid based on a Cellular Automata enhanced Machine Learning (CAML) model developed for the study of biological strings. This computational model compares the deviation of structure-function of CoV-2 from that of CoV employing CAML model parameters derived from the CA evolution of the amino acid chains of the nsps. This comparative analysis points to (i) higher transmissibility of CoV-2 compared to CoV for major nsps, and (ii) deviation of MERS covid from SARS CoV in respect of virulence and pathogenesis. A Machine Learning (ML) framework has been designed to map the CAML model parameters to the physical domain features reported in in-vitro/in-vivo/in-silico experimental studies. The ML framework enables us to learn the permissible range of model parameters derived from the mutational study of sixteen nsps of the three viruses.  ( 2 min )
    Early prediction of the risk of ICU mortality with Deep Federated Learning. (arXiv:2212.00554v1 [cs.LG])
    Intensive Care Units usually treat patients with a serious risk of mortality. Recent research has shown the ability of Machine Learning to indicate the patients' mortality risk and point physicians toward individuals with a heightened need for care. Nevertheless, healthcare data is often subject to privacy regulations and can therefore not be easily shared in order to build Centralized Machine Learning models that use the combined data of multiple hospitals. Federated Learning is a Machine Learning framework designed for data privacy that can be used to circumvent this problem. In this study, we evaluate the ability of deep Federated Learning to predict the risk of Intensive Care Unit mortality at an early stage. We compare the predictive performance of Federated, Centralized, and Local Machine Learning in terms of AUPRC, F1-score, and AUROC. Our results show that Federated Learning performs equally well as the centralized approach and is substantially better than the local approach, thus providing a viable solution for early Intensive Care Unit mortality prediction. In addition, we show that the prediction performance is higher when the patient history window is closer to discharge or death. Finally, we show that using the F1-score as an early stopping metric can stabilize and increase the performance of our approach for the task at hand.  ( 2 min )
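    The federated side can be as simple as FedAvg-style weight aggregation across hospitals (sketched below under that assumption; the paper's exact protocol and model are not reproduced):

        import numpy as np

        def fedavg(client_weights, client_sizes):
            # client_weights: one list of parameter arrays per hospital;
            # aggregate by averaging, weighted by local cohort size
            total = sum(client_sizes)
            return [sum(w[i] * (n / total)
                        for w, n in zip(client_weights, client_sizes))
                    for i in range(len(client_weights[0]))]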
    Enabling Fast Unit Commitment Constraint Screening via Learning Cost Model. (arXiv:2212.00483v1 [math.OC])
    Unit commitment (UC) is an essential tool for transmission system operators to find the most economical and feasible generation schedules and dispatch signals. Constraint screening has been receiving attention as it holds the promise of removing inactive or redundant constraints from the UC problem, so that the solution of large-scale UC problems can be accelerated by considering the reduced optimization problem. The standard constraint screening approach relies on optimizing over load and generations to find binding line flow constraints, yet the screening is conservative, with a large percentage of constraints still retained in the UC problem. In this paper, we propose a novel machine learning (ML) model to predict the most economical costs given load inputs. Such an ML model bridges the cost perspective of UC decisions to the optimization-based constraint screening model, and can screen out a higher proportion of operational constraints. We verify the proposed method's performance in both sample-aware and sample-agnostic settings, and illustrate that the proposed scheme can further reduce the computation time across a variety of UC problem setups.  ( 2 min )
    MMSpeech: Multi-modal Multi-task Encoder-Decoder Pre-training for Speech Recognition. (arXiv:2212.00500v1 [cs.MM])
    In this paper, we propose a novel multi-modal multi-task encoder-decoder pre-training framework (MMSpeech) for Mandarin automatic speech recognition (ASR), which employs both unlabeled speech and text data. The main difficulty in speech-text joint pre-training comes from the significant difference between the speech and text modalities, especially for Mandarin speech and text. Unlike English and other languages with an alphabetic writing system, Mandarin uses an ideographic writing system where character and sound are not tightly mapped to one another. Therefore, we propose to introduce the phoneme modality into pre-training, which can help capture modality-invariant information between Mandarin speech and text. Specifically, we employ a multi-task learning framework including five self-supervised and supervised tasks with speech and text data. For end-to-end pre-training, we introduce self-supervised speech-to-pseudo-codes (S2C) and phoneme-to-text (P2T) tasks utilizing unlabeled speech and text data, where speech-pseudo-code pairs and phoneme-text pairs supplement the supervised speech-text pairs. To train the encoder to learn better speech representations, we introduce self-supervised masked speech prediction (MSP) and supervised phoneme prediction (PP) tasks to learn to map speech into phonemes. Besides, we directly add the downstream supervised speech-to-text (S2T) task into the pre-training process, which can further improve the pre-training performance and achieve better recognition results even without fine-tuning. Experiments on AISHELL-1 show that our proposed method achieves state-of-the-art performance, with a more than 40% relative improvement compared with other pre-training methods.  ( 2 min )
    Implicit Mixture of Interpretable Experts for Global and Local Interpretability. (arXiv:2212.00471v1 [cs.LG])
    We investigate the feasibility of using mixtures of interpretable experts (MoIE) to build interpretable image classifiers on MNIST10. MoIE uses a black-box router to assign each input to one of many inherently interpretable experts, thereby providing insight into why a particular classification decision was made. We find that a naively trained MoIE will learn to 'cheat', whereby the black-box router solves the classification problem by itself, with each expert simply learning a constant function for one particular class. We propose to solve this problem by introducing interpretable routers and training the black-box router's decisions to match the interpretable router. In addition, we propose a novel implicit parameterization scheme that allows us to build mixtures of arbitrary numbers of experts, allowing us to study how classification performance and local and global interpretability vary as the number of experts is increased. Our new model, dubbed Implicit Mixture of Interpretable Experts (IMoIE), can match state-of-the-art classification accuracy on MNIST10 while providing local interpretability, and can provide global interpretability albeit at the cost of reduced classification accuracy.  ( 2 min )
    Regularization with Fake Features. (arXiv:2212.00433v1 [cs.LG])
    Recent successes of massively overparameterized models have inspired a new line of work investigating the underlying conditions that enable overparameterized models to generalize well. This paper considers a framework where the possibly overparameterized model includes fake features, i.e., features that are present in the model but not in the data. We present a non-asymptotic high-probability bound on the generalization error of the ridge regression problem under the model misspecification of having fake features. Our high-probability results characterize the interplay between the implicit regularization provided by the fake features and the explicit regularization provided by the ridge parameter. We observe that fake features may improve the generalization error, even though they are irrelevant to the data.  ( 2 min )
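    The setting is easy to reproduce in a toy numpy experiment: append features that are absent from the data-generating process, fit ridge regression, and compare test error with and without them. The sketch below is an illustration of the framework under assumed toy dimensions, not the paper's experimental setup.

```python
# Toy experiment: "fake" features are present in the model but not in
# the data process; they can act as implicit regularization for ridge.
import numpy as np

rng = np.random.default_rng(0)
n, d_true, d_fake = 50, 10, 40
w_true = rng.normal(size=d_true)

def make_data(m):
    X = rng.normal(size=(m, d_true))
    return X, X @ w_true + 0.5 * rng.normal(size=m)

X, y = make_data(n)
X_te, y_te = make_data(1000)

def augment(Z):
    # Append pure-noise columns that carry no information about y.
    return np.hstack([Z, rng.normal(size=(Z.shape[0], d_fake))])

def ridge_test_mse(X_tr, X_test, lam):
    w = np.linalg.solve(X_tr.T @ X_tr + lam * np.eye(X_tr.shape[1]), X_tr.T @ y)
    return np.mean((X_test @ w - y_te) ** 2)

for lam in (1e-6, 1.0):  # weak vs. substantial explicit regularization
    print(lam, ridge_test_mse(X, X_te, lam),
          ridge_test_mse(augment(X), augment(X_te), lam))
```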
    Proceedings of the 2nd International Workshop on Reading Music Systems. (arXiv:2212.00380v1 [cs.CV])
    The International Workshop on Reading Music Systems (WoRMS) is a workshop that tries to connect researchers who develop systems for reading music, such as in the field of Optical Music Recognition, with other researchers and practitioners that could benefit from such systems, like librarians or musicologists. The relevant topics of interest for the workshop include, but are not limited to: Music reading systems; Optical music recognition; Datasets and performance evaluation; Image processing on music scores; Writer identification; Authoring, editing, storing and presentation systems for music scores; Multi-modal systems; Novel input-methods for music to produce written music; Web-based Music Information Retrieval services; Applications and projects; Use-cases related to written music. These are the proceedings of the 2nd International Workshop on Reading Music Systems, held in Delft on the 2nd of November 2019.  ( 2 min )
    A Comprehensive Study on Machine Learning Methods to Increase the Prediction Accuracy of Classifiers and Reduce the Number of Medical Tests Required to Diagnose Alzheimer's Disease. (arXiv:2212.00414v1 [cs.LG])
    Alzheimer's patients gradually lose their ability to think, behave, and interact with others. Medical history, laboratory tests, daily activities, and personality changes can all be used to diagnose the disorder, but the required series of tests is time-consuming and expensive. In this study, a Random Forest classifier, evaluated alongside various other machine learning techniques, proves the most effective at identifying Alzheimer's disease. The main goal of this study is to fine-tune the classifier to detect the illness with fewer tests while maintaining a reasonable disease-discovery accuracy. We successfully identified the condition in almost 94% of cases using four of the thirty frequently utilized indicators.  ( 2 min )
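    The recipe of ranking indicators and retraining on only the top few is a standard one; the hedged sketch below illustrates it on synthetic data, which stands in for the clinical feature set used in the study.

```python
# Hedged sketch: rank features by random-forest importance, then retrain
# on the top 4 to mimic "fewer medical tests". Synthetic data only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=30, n_informative=6,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

full = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
top4 = np.argsort(full.feature_importances_)[-4:]   # 4 most useful "tests"

small = RandomForestClassifier(random_state=0).fit(X_tr[:, top4], y_tr)
print("full:", full.score(X_te, y_te), "top-4:", small.score(X_te, y_te))
```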
    GrannGAN: Graph annotation generative adversarial networks. (arXiv:2212.00449v1 [cs.LG])
    We consider the problem of modelling high-dimensional distributions and generating new examples of data with a complex relational feature structure coherent with a graph skeleton. The model we propose tackles the problem of generating data features constrained by the specific graph structure of each data point by splitting the task into two phases. In the first, it models the distribution of features associated with the nodes of the given graph; in the second, it complements the edge features conditionally on the node features. We follow the strategy of implicit distribution modelling via a generative adversarial network (GAN) combined with a permutation equivariant message passing architecture operating over the sets of nodes and edges. This enables generating the feature vectors of all the graph objects in one go (in two phases) as opposed to the much slower one-by-one generation of sequential models, avoids the expensive graph matching procedures usually needed for likelihood-based generative models, and uses the network capacity efficiently by being insensitive to the particular node ordering in the graph representation. To the best of our knowledge, this is the first method that models the feature distribution along the graph skeleton, allowing for the generation of annotated graphs with user-specified structures. Our experiments demonstrate the ability of our model to learn complex structured distributions through quantitative evaluation over three annotated graph datasets.  ( 2 min )
    Multi-Source Survival Domain Adaptation. (arXiv:2212.00424v1 [cs.LG])
    Survival analysis is the branch of statistics that studies the relation between the characteristics of living entities and their respective survival times, taking into account the partial information held by censored cases. A good analysis can, for example, determine whether one medical treatment for a group of patients is better than another. With the rise of machine learning, survival analysis can be modeled as learning a function that maps studied patients to their survival times. To succeed with that, three crucial issues must be tackled. First, some patient data is censored: we do not know the true survival times for all patients. Second, data is scarce, which led past research to treat different illness types as domains in a multi-task setup. Third, there is the need for adaptation to new or extremely rare illness types, where little or no labels are available. In contrast to previous multi-task setups, we want to investigate how to efficiently adapt to a new survival target domain from multiple survival source domains. For this, we introduce a new survival metric and the corresponding discrepancy measure between survival distributions. These allow us to define domain adaptation for survival analysis while incorporating censored data, which would otherwise have to be dropped. Our experiments on two cancer data sets reveal superb performance on target domains, better treatment recommendations, and a weight matrix with a plausible explanation.  ( 2 min )
    Rethinking Two Consensuses of the Transferability in Deep Learning. (arXiv:2212.00399v1 [cs.CV])
    Deep transfer learning (DTL) reflects a long-term quest to enable deep neural networks (DNNs) to reuse historical experiences as efficiently as humans do. This ability is named knowledge transferability. A commonly used DTL paradigm is to first learn general knowledge (pre-training) and then reuse it (fine-tuning) for a specific target task. There are two consensuses about the transferability of pre-trained DNNs: (1) a larger domain gap between pre-training and downstream data brings lower transferability; (2) transferability gradually decreases from lower layers (near input) to higher layers (near output). However, these consensuses were drawn largely from experiments on natural images, which limits their scope of application. This work aims to study and complement them from a broader perspective by proposing a method to measure the transferability of pre-trained DNN parameters. Our experiments on twelve diverse image classification datasets reach conclusions similar to the previous consensuses. More importantly, two new findings are presented: (1) in addition to the domain gap, a larger data amount and greater dataset diversity in the downstream target task also inhibit transferability; (2) although the lower layers learn basic image features, they are usually not the most transferable layers due to their domain sensitivity.  ( 2 min )
    AC-Band: A Combinatorial Bandit-Based Approach to Algorithm Configuration. (arXiv:2212.00333v1 [cs.LG])
    We study the algorithm configuration (AC) problem, in which one seeks to find an optimal parameter configuration of a given target algorithm in an automated way. Recently, there has been significant progress in designing AC approaches that satisfy strong theoretical guarantees. However, a significant gap still remains between the practical performance of these approaches and state-of-the-art heuristic methods. To this end, we introduce AC-Band, a general approach for the AC problem based on multi-armed bandits that provides theoretical guarantees while exhibiting strong practical performance. We show that AC-Band requires significantly less computation time than other AC approaches providing theoretical guarantees while still yielding high-quality configurations.  ( 2 min )
    Why Are Conditional Generative Models Better Than Unconditional Ones?. (arXiv:2212.00362v1 [cs.LG])
    Extensive empirical evidence demonstrates that conditional generative models are easier to train and perform better than unconditional ones by exploiting the labels of data; score-based diffusion models are no exception. In this paper, we analyze the phenomenon formally and identify that the key to conditional learning is to partition the data properly. Inspired by the analysis, we propose self-conditioned diffusion models (SCDM), which are trained conditioned on indices clustered by the k-means algorithm over features extracted by a model pre-trained in a self-supervised manner. SCDM significantly improves the unconditional model across various datasets and achieves a record-breaking FID of 3.94 on ImageNet 64x64 without labels. Besides, SCDM achieves a slightly better FID than the corresponding conditional model on CIFAR10.  ( 2 min )
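    The conditioning recipe reduces to a few lines around an off-the-shelf clustering call. The sketch below is a hedged illustration: features from an assumed self-supervised encoder are clustered with k-means, and the cluster indices play the role of class labels; the diffusion training step itself is left as a placeholder.

```python
# Minimal sketch of self-conditioning via k-means pseudo-labels.
# `features` stands in for self-supervised embeddings of the training
# images (e.g., from a MoCo/DINO-style encoder); `train_diffusion` is a
# placeholder for a conditional diffusion training loop.
import numpy as np
from sklearn.cluster import KMeans

features = np.random.default_rng(0).normal(size=(10_000, 512))

kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(features)
pseudo_labels = kmeans.labels_  # one index per image, used like a class label

# train_diffusion(images, cond=pseudo_labels)  # conditional model, no labels
```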
    All You Need Is Hashing: Defending Against Data Reconstruction Attack in Vertical Federated Learning. (arXiv:2212.00325v1 [cs.CR])
    Vertical federated learning (VFL) is a trending solution for multi-party collaboration in training machine learning models. Industrial frameworks adopt secure multi-party computation methods such as homomorphic encryption to guarantee data security and privacy. However, a line of work has revealed that there are still leakage risks in VFL. The leakage is caused by the correlation between the intermediate representations and the raw data. Due to the powerful approximation ability of deep neural networks, an adversary can capture the correlation precisely and reconstruct the data. To deal with the threat of the data reconstruction attack, we propose a hashing-based VFL framework, called \textit{HashVFL}, to cut off the reversibility directly. The one-way nature of hashing allows our framework to block all attempts to recover data from hash codes. However, integrating hashing also brings some challenges, e.g., the loss of information. This paper proposes and addresses three challenges of integrating hashing: learnability, bit balance, and consistency. Experimental results demonstrate \textit{HashVFL}'s effectiveness in maintaining the main task's performance while defending against data reconstruction attacks. Furthermore, we also analyze its potential value in detecting abnormal inputs. In addition, we conduct extensive experiments to prove \textit{HashVFL}'s generalization in various settings. In summary, \textit{HashVFL} provides a new perspective on protecting multiple parties' data security and privacy in VFL. We hope our study can attract more researchers to expand the application domains of \textit{HashVFL}.  ( 2 min )
    Deep neural network techniques for monaural speech enhancement: state of the art analysis. (arXiv:2212.00369v1 [cs.SD])
    Deep neural network (DNN) techniques have become pervasive in domains such as natural language processing and computer vision, achieving great success in tasks such as machine translation and image generation. Owing to this success, these data-driven techniques have been applied to the audio domain. More specifically, DNN models have been applied to speech enhancement to achieve denoising, dereverberation, and multi-speaker separation in the monaural setting. In this paper, we review the dominant DNN techniques employed to achieve speech separation. The review covers the whole speech enhancement pipeline: feature extraction, how DNN-based tools model both global and local features of speech, and model training (supervised and unsupervised). We also review the use of pre-trained speech-enhancement models to boost the speech enhancement process. The review is geared towards covering the dominant trends in the application of DNNs to the enhancement of speech from a single speaker.  ( 2 min )
    Proceedings of the 3rd International Workshop on Reading Music Systems. (arXiv:2212.00378v1 [cs.CV])
    The International Workshop on Reading Music Systems (WoRMS) is a workshop that tries to connect researchers who develop systems for reading music, such as in the field of Optical Music Recognition, with other researchers and practitioners that could benefit from such systems, like librarians or musicologists. The relevant topics of interest for the workshop include, but are not limited to: Music reading systems; Optical music recognition; Datasets and performance evaluation; Image processing on music scores; Writer identification; Authoring, editing, storing and presentation systems for music scores; Multi-modal systems; Novel input-methods for music to produce written music; Web-based Music Information Retrieval services; Applications and projects; Use-cases related to written music. These are the proceedings of the 3rd International Workshop on Reading Music Systems, held in Alicante on the 23rd of July 2021.  ( 2 min )
    Differentially Private Learning with Per-Sample Adaptive Clipping. (arXiv:2212.00328v1 [cs.LG])
    Privacy in AI remains a topic that has drawn attention from researchers and the general public in recent years. As one way to implement privacy-preserving AI, differentially private learning is a framework that enables AI models to use differential privacy (DP). To achieve DP in the learning process, existing algorithms typically limit the magnitude of gradients with a constant clipping threshold, which requires careful tuning due to its significant impact on model performance. As a solution to this issue, the recent works NSGD and Auto-S propose to use normalization instead of clipping to avoid hyperparameter tuning. However, normalization-based approaches like NSGD and Auto-S rely on a monotonic weight function, which imposes excessive weight on small-gradient samples and introduces extra deviation into the update. In this paper, we propose a Differentially Private Per-Sample Adaptive Clipping (DP-PSAC) algorithm based on a non-monotonic adaptive weight function, which guarantees privacy without the typical hyperparameter tuning process of constant clipping while significantly reducing the deviation between the update and the true batch-averaged gradient. We provide a rigorous theoretical convergence analysis and show that, with a convergence rate of the same order, the proposed algorithm achieves a lower non-vanishing bound, maintained over training iterations, compared with NSGD/Auto-S. In addition, through extensive experimental evaluation, we show that DP-PSAC outperforms or matches the state-of-the-art methods on multiple mainstream vision and language tasks.  ( 2 min )
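    To see where the per-sample weight enters, consider the hedged numpy sketch below of one private step: each per-sample gradient is rescaled by a weight depending on its norm, the weighted sum has bounded sensitivity, and Gaussian noise is added. The specific weight function here is a stand-in with the non-monotonic flavor described above, not necessarily the paper's exact formula.

```python
# Hedged sketch of per-sample adaptive weighting in a DP-SGD-style step.
# `weight` is a hypothetical non-monotonic stand-in: ~1/norm for large
# gradients (clipping-like), but bounded for tiny gradients instead of
# inflating them the way pure normalization does.
import numpy as np

def weight(norm, r=0.1):
    return 1.0 / (norm + r / (norm + r))

def dp_step(per_sample_grads, sigma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    scaled = [g * weight(np.linalg.norm(g)) for g in per_sample_grads]
    g_sum = np.sum(scaled, axis=0)
    # Each scaled gradient has norm <= 1 here, so Gaussian noise
    # calibrated to that sensitivity bound yields differential privacy.
    noisy = g_sum + sigma * rng.normal(size=g_sum.shape)
    return noisy / len(per_sample_grads)

grads = [np.random.default_rng(i).normal(size=10) for i in range(32)]
update = dp_step(grads)
```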
    Hijack Vertical Federated Learning Models with Adversarial Embedding. (arXiv:2212.00322v1 [cs.LG])
    Vertical federated learning (VFL) is an emerging paradigm that enables collaborators to build machine learning models together in a distributed fashion. In general, these parties have a group of users in common but own different features. Existing VFL frameworks use cryptographic techniques to provide data privacy and security guarantees, leading to a line of works studying computing efficiency and fast implementation. However, the security of VFL models remains underexplored.  ( 2 min )
    Differentially Private Adaptive Optimization with Delayed Preconditioners. (arXiv:2212.00309v1 [cs.LG])
    Privacy noise may negate the benefits of using adaptive optimizers in differentially private model training. Prior works typically address this issue by using auxiliary information (e.g., public data) to boost the effectiveness of adaptive optimization. In this work, we explore techniques to estimate and efficiently adapt to gradient geometry in private adaptive optimization without auxiliary data. Motivated by the observation that adaptive methods can tolerate stale preconditioners, we propose differentially private adaptive training with delayed preconditioners (DP^2), a simple method that constructs delayed but less noisy preconditioners to better realize the benefits of adaptivity. Theoretically, we provide convergence guarantees for our method for both convex and non-convex problems, and analyze trade-offs between delay and privacy noise reduction. Empirically, we explore DP^2 across several real-world datasets, demonstrating that it can improve convergence speed by as much as 4x relative to non-adaptive baselines and match the performance of state-of-the-art optimization methods that require auxiliary data.  ( 2 min )
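    To make the delayed-preconditioner idea concrete, here is a hedged numpy sketch: noisy gradient steps use a stale RMSProp-style preconditioner that is refreshed only every s steps from gradients averaged over the window. The noise calibration and update coefficients are illustrative assumptions, not the paper's DP^2 specification.

```python
# Hedged sketch of "delayed preconditioner" adaptive DP training.
# Averaging gradients over the window before refreshing the
# preconditioner dampens the privacy noise that a per-step refresh
# would absorb.
import numpy as np

rng = np.random.default_rng(0)

def dp2_like(grad_fn, w, steps=100, s=10, lr=0.1, sigma=0.5, eps=1e-8):
    precond = np.ones_like(w)
    window = []
    for t in range(steps):
        g = grad_fn(w) + sigma * rng.normal(size=w.shape)  # private gradient
        window.append(g)
        w = w - lr * g / np.sqrt(precond + eps)            # stale preconditioner
        if (t + 1) % s == 0:
            g_bar = np.mean(window, axis=0)                # delayed, less noisy
            precond = 0.9 * precond + 0.1 * g_bar ** 2
            window = []
    return w

w_final = dp2_like(lambda w: 2 * (w - 3.0), w=np.zeros(5))
```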
    Decentralized Matrix Factorization with Heterogeneous Differential Privacy. (arXiv:2212.00306v1 [cs.LG])
    Conventional matrix factorization relies on centralized collection of users' data for recommendation, which might introduce an increased risk of privacy leakage, especially when the recommender is untrusted. Existing differentially private matrix factorization methods either assume the recommender is trusted, or can only provide a uniform level of privacy protection for all users and items with an untrusted recommender. In this paper, we propose a novel Heterogeneous Differentially Private Matrix Factorization algorithm (denoted HDPMF) for untrusted recommenders. To the best of our knowledge, we are the first to achieve heterogeneous differential privacy for decentralized matrix factorization in the untrusted-recommender scenario. Specifically, our framework uses a modified stretching mechanism with an innovative rescaling scheme to achieve a better trade-off between privacy and accuracy. Meanwhile, by allocating the privacy budget properly, we can capture homogeneous privacy preference within a user/item but heterogeneous privacy preference across different users/items. Theoretical analysis confirms that HDPMF renders a rigorous privacy guarantee, and exhaustive experiments demonstrate its superiority, especially in scenarios with strong privacy guarantees, high-dimensional models, and sparse datasets.  ( 2 min )
    Generalizing and Improving Jacobian and Hessian Regularization. (arXiv:2212.00311v1 [cs.LG])
    Jacobian and Hessian regularization aim to reduce the magnitude of the first and second-order partial derivatives with respect to neural network inputs, and they are predominantly used to ensure the adversarial robustness of image classifiers. In this work, we generalize previous efforts by extending the target matrix from zero to any matrix that admits efficient matrix-vector products. The proposed paradigm allows us to construct novel regularization terms that enforce symmetry or diagonality on square Jacobian and Hessian matrices. On the other hand, the major challenge for Jacobian and Hessian regularization has been high computational complexity. We introduce Lanczos-based spectral norm minimization to tackle this difficulty. This technique uses a parallelized implementation of the Lanczos algorithm and is capable of effective and stable regularization of large Jacobian and Hessian matrices. Theoretical justifications and empirical evidence are provided for the proposed paradigm and technique. We carry out exploratory experiments to validate the effectiveness of our novel regularization terms. We also conduct comparative experiments to evaluate Lanczos-based spectral norm minimization against prior methods. Results show that the proposed methodologies are advantageous for a wide range of tasks.  ( 2 min )
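    The generalized regularizer can be sketched with matrix-vector products alone. Below is a hedged PyTorch illustration that penalizes the distance between the input Jacobian J and a target matrix T via a Hutchinson-style estimator, using the identity E||(J - T)v||^2 = ||J - T||_F^2 for v ~ N(0, I); the paper's Lanczos spectral-norm machinery is not reproduced here.

```python
# Hedged sketch of generalized Jacobian regularization toward a target
# matrix T, using only Jacobian-vector products (no explicit Jacobian).
import torch

def jacobian_reg(model, x, target_mv, n_probes=1):
    # target_mv(v) should return T @ v for the chosen target matrix T.
    loss = 0.0
    for _ in range(n_probes):
        v = torch.randn_like(x)
        _, jvp = torch.autograd.functional.jvp(model, (x,), (v,),
                                               create_graph=True)
        loss = loss + ((jvp - target_mv(v)) ** 2).sum() / n_probes
    return loss

model = torch.nn.Sequential(torch.nn.Linear(4, 8), torch.nn.Tanh(),
                            torch.nn.Linear(8, 4))
x = torch.randn(4)
# T = I encourages the map to behave like the identity near x;
# target_mv = lambda v: torch.zeros_like(v) recovers plain Jacobian reg.
reg = jacobian_reg(model, x, target_mv=lambda v: v)
reg.backward()
```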
    On the Compatibility between a Neural Network and a Partial Differential Equation for Physics-informed Learning. (arXiv:2212.00270v1 [physics.comp-ph])
    We shed light on a pitfall and an opportunity in physics-informed neural networks (PINNs). We prove that a multilayer perceptron (MLP) with only ReLU (Rectified Linear Unit) or ReLU-like Lipschitz activation functions will always have a vanishing Hessian. Such a network-imposed constraint contradicts any second- or higher-order partial differential equation (PDE). Therefore, a ReLU-based MLP cannot form a permissible function space for the approximation of such solutions. Inspired by this pitfall, we prove that a linear PDE up to the $n$-th order can be strictly satisfied by an MLP with $C^n$ activation functions when the weights of its output layer lie on a certain hyperplane, which we call the out-layer-hyperplane. An MLP equipped with the out-layer-hyperplane becomes "physics-enforced", no longer requiring a loss function for the PDE itself (but only those for the initial and boundary conditions). Such a hyperplane exists not only for MLPs but for any network architecture that ends with a fully-connected hidden layer. To our knowledge, this is the first PINN architecture that enforces point-wise correctness of a PDE. We give the closed-form expression of the out-layer-hyperplane for second-order linear PDEs and provide an implementation.  ( 2 min )
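    The ReLU pitfall can be checked in a few lines. The sketch below (an illustration, not the paper's implementation) computes the input Hessian of a small MLP and shows that it vanishes identically for ReLU, since the network is piecewise linear, while a C^2 activation such as tanh yields a non-trivial Hessian.

```python
# Quick check of the pitfall: the input Hessian of a ReLU MLP is zero
# (almost everywhere), so second-order PDE residuals get no signal.
import torch

def scalar_net(act):
    net = torch.nn.Sequential(torch.nn.Linear(2, 16), act,
                              torch.nn.Linear(16, 1))
    return lambda x: net(x).squeeze()

x = torch.randn(2)
H_relu = torch.autograd.functional.hessian(scalar_net(torch.nn.ReLU()), x)
H_tanh = torch.autograd.functional.hessian(scalar_net(torch.nn.Tanh()), x)
print(H_relu)                  # zero matrix
print(H_tanh.abs().sum() > 0)  # tensor(True)
```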
    The Effect of Data Dimensionality on Neural Network Prunability. (arXiv:2212.00291v1 [cs.LG])
    Practitioners prune neural networks for efficiency gains and generalization improvements, but few scrutinize the factors determining the prunability of a neural network: the maximum fraction of weights that pruning can remove without compromising the model's test accuracy. In this work, we study the properties of input data that may contribute to the prunability of a neural network. For high-dimensional input data such as images, text, and audio, the manifold hypothesis suggests that these high-dimensional inputs approximately lie on or near a significantly lower-dimensional manifold. Prior work demonstrates that the underlying low-dimensional structure of the input data may affect the sample efficiency of learning. In this paper, we investigate whether the low-dimensional structure of the input data affects the prunability of a neural network.  ( 2 min )
    ResNet Structure Simplification with the Convolutional Kernel Redundancy Measure. (arXiv:2212.00272v1 [cs.CV])
    Deep learning, especially convolutional neural networks, has triggered accelerated advancements in computer vision, bringing changes into our daily practice. Furthermore, the standardized deep learning modules (also known as backbone networks), i.e., ResNet and EfficientNet, have enabled efficient and rapid development of new computer vision solutions. Yet, deep learning methods still suffer from several drawbacks. One of the most concerning problems is the high memory and computational cost, such that dedicated computing units, typically GPUs, have to be used for training and development. Therefore, in this paper, we propose a quantifiable evaluation method, the convolutional kernel redundancy measure, which is based on perceived image differences, for guiding the network structure simplification. When applying our method to the chest X-ray image classification problem with ResNet, our method can maintain the performance of the network and reduce the number of parameters from over $23$ million to approximately $128$ thousand (reducing $99.46\%$ of the parameters).  ( 2 min )
    Learning Combinatorial Structures via Markov Random Fields with Sampling through Lov\'asz Local Lemma. (arXiv:2212.00296v1 [cs.LG])
    Generative models for learning combinatorial structures have transformative impacts in many applications. However, existing approaches fail to offer efficient and accurate learning results because of the highly intractable nature of gradient estimation for learning objectives subject to combinatorial constraints. Existing gradient estimation methods easily run into exponential time or memory costs, or incur large estimation errors due to improper approximation. We develop the NEural Lovasz Sampler (Nelson), a neural network based on the Lov\'asz Local Lemma (LLL). We show that it is guaranteed to generate samples satisfying combinatorial constraints from the distribution of the constrained Markov Random Field (MRF) model under certain conditions. We further present a fully differentiable contrastive-divergence-based learning framework on constrained MRFs (Nelson-CD). Because Nelson-CD is fully differentiable, it can take advantage of the parallel computing power of GPUs, resulting in great efficiency. Experimental results on three real-world combinatorial problems reveal that Nelson learns to generate 100% valid structures. In comparison, baselines either time out on large data sets or fail to generate valid structures, whereas Nelson scales much better with problem size. In addition, Nelson outperforms baselines on various learning metrics, such as log-likelihood and MAP scores.  ( 2 min )
    Locally Adaptive Hierarchical Cluster Termination With Application To Individual Tree Delineation. (arXiv:2212.00288v1 [stat.ML])
    A clustering termination procedure that is locally adaptive (with respect to the hierarchical tree of sets produced by the agglomerative merging) is proposed for agglomerative hierarchical clustering on a set equipped with a distance function. It represents a multi-scale alternative to conventional scale-dependent, threshold-based termination criteria.  ( 2 min )
    Component Segmentation of Engineering Drawings Using Graph Convolutional Networks. (arXiv:2212.00290v1 [cs.CV])
    We present a data-driven framework to automate the vectorization and machine interpretation of 2D engineering part drawings. In industrial settings, most manufacturing engineers still rely on manual reads to identify the topological and manufacturing requirements from drawings submitted by designers. The interpretation process is laborious and time-consuming, which severely inhibits the efficiency of part quotation and manufacturing tasks. While recent advances in image-based computer vision methods have demonstrated great potential in interpreting natural images through semantic segmentation approaches, the application of such methods to parsing engineering technical drawings into semantically accurate components remains a significant challenge. The severe pixel sparsity in engineering drawings also restricts the effective featurization of image-based data-driven methods. To overcome these challenges, we propose a deep learning based framework that predicts the semantic type of each vectorized component. Taking a raster image as input, we vectorize all components through thinning, stroke tracing, and cubic bezier fitting. Then a graph of such components is generated based on the connectivity between the components. Finally, a graph convolutional neural network is trained on this graph data to identify the semantic type of each component. We test our framework in the context of semantic segmentation of text, dimension, and contour components in engineering drawings. Results show that our method yields the best performance compared to recent image- and graph-based segmentation methods.  ( 2 min )
    Low-Rank Tensor Function Representation for Multi-Dimensional Data Recovery. (arXiv:2212.00262v1 [cs.CV])
    Since higher-order tensors are naturally suitable for representing multi-dimensional data in the real world, e.g., color images and videos, low-rank tensor representation has become one of the emerging areas in machine learning and computer vision. However, classical low-rank tensor representations can only represent data on a finite meshgrid due to their intrinsically discrete nature, which hinders their potential applicability in many scenarios beyond meshgrid. To break this barrier, we propose a low-rank tensor function representation (LRTFR), which can continuously represent data beyond meshgrid with infinite resolution. Specifically, the suggested tensor function, which maps an arbitrary coordinate to the corresponding value, can continuously represent data in an infinite real space. Parallel to discrete tensors, we develop two fundamental concepts for tensor functions, i.e., the tensor function rank and low-rank tensor function factorization. We theoretically justify that both low-rank and smooth regularizations are harmoniously unified in the LRTFR, which leads to high effectiveness and efficiency for continuous data representation. Extensive multi-dimensional data recovery applications arising from image processing (image inpainting and denoising), machine learning (hyperparameter optimization), and computer graphics (point cloud upsampling) substantiate the superiority and versatility of our method as compared with state-of-the-art methods. Especially, the experiments beyond the original meshgrid resolution (hyperparameter optimization) or even beyond meshgrid (point cloud upsampling) validate the favorable performance of our method for continuous representation.  ( 2 min )
    PIZZA: A new benchmark for complex end-to-end task-oriented parsing. (arXiv:2212.00265v1 [cs.CL])
    Much recent work in task-oriented parsing has focused on finding a middle ground between flat slots and intents, which are inexpressive but easy to annotate, and powerful representations such as the lambda calculus, which are expressive but costly to annotate. This paper continues the exploration of task-oriented parsing by introducing a new dataset for parsing pizza and drink orders, whose semantics cannot be captured by flat slots and intents. We perform an extensive evaluation of deep-learning techniques for task-oriented parsing on this dataset, including different flavors of seq2seq systems and RNNGs. The dataset comes in two main versions, one in a recently introduced utterance-level hierarchical notation that we call TOP, and one whose targets are executable representations (EXR). We demonstrate empirically that training the parser to directly generate EXR notation not only solves the problem of entity resolution in one fell swoop and overcomes a number of expressive limitations of TOP notation, but also results in significantly greater parsing accuracy.  ( 2 min )
    Task Discovery: Finding the Tasks that Neural Networks Generalize on. (arXiv:2212.00261v1 [cs.LG])
    When developing deep learning models, we usually decide what task we want to solve and then search for a model that generalizes well on the task. An intriguing question would be: what if, instead of fixing the task and searching in the model space, we fix the model and search in the task space? Can we find tasks that the model generalizes on? What do they look like, and do they indicate anything? These are the questions we address in this paper. We propose a task discovery framework that automatically finds examples of such tasks via optimizing a generalization-based quantity called the agreement score. We demonstrate that one set of images can give rise to many tasks on which neural networks generalize well. These tasks are a reflection of the inductive biases of the learning framework and the statistical patterns present in the data, and thus they can serve as a useful tool for analysing neural networks and their biases. As an example, we show that the discovered tasks can be used to automatically create adversarial train-test splits which make a model fail at test time, without changing the pixels or labels, but only by selecting how the datapoints are split between the train and test sets. We end with a discussion on the human-interpretability of the discovered tasks.  ( 2 min )
    Distributed Deep Reinforcement Learning: A Survey and A Multi-Player Multi-Agent Learning Toolbox. (arXiv:2212.00253v1 [cs.LG])
    With the breakthrough of AlphaGo, deep reinforcement learning has become a recognized technique for solving sequential decision-making problems. Despite its reputation, the data inefficiency caused by its trial-and-error learning mechanism makes deep reinforcement learning hard to apply in a wide range of areas. Plenty of methods have been developed for sample-efficient deep reinforcement learning, such as environment modeling, experience transfer, and distributed modifications, among which distributed deep reinforcement learning has shown its potential in various applications, such as human-computer gaming and intelligent transportation. In this paper, we summarize the state of this exciting field by comparing the classical distributed deep reinforcement learning methods and studying the important components required to achieve efficient distributed learning, covering everything from single-player, single-agent distributed deep reinforcement learning to the most complex multi-player, multi-agent settings. Furthermore, we review recently released toolboxes that help to realize distributed deep reinforcement learning without many modifications of their non-distributed versions. By analyzing their strengths and weaknesses, a multi-player multi-agent distributed deep reinforcement learning toolbox is developed and released, which is further validated on Wargame, a complex environment, demonstrating the usability of the proposed toolbox for multi-player, multi-agent distributed deep reinforcement learning in complex games. Finally, we point out challenges and future trends, hoping this brief review can provide a guide or a spark for researchers interested in distributed deep reinforcement learning.  ( 2 min )
    Experimental Observations of the Topology of Convolutional Neural Network Activations. (arXiv:2212.00222v1 [cs.LG])
    Topological data analysis (TDA) is a branch of computational mathematics, bridging algebraic topology and data science, that provides compact, noise-robust representations of complex structures. Deep neural networks (DNNs) learn millions of parameters associated with a series of transformations defined by the model architecture, resulting in high-dimensional, difficult-to-interpret internal representations of input data. As DNNs become more ubiquitous across multiple sectors of our society, there is increasing recognition that mathematical methods are needed to aid analysts, researchers, and practitioners in understanding and interpreting how these models' internal representations relate to the final classification. In this paper, we apply cutting-edge techniques from TDA with the goal of gaining insight into the interpretability of convolutional neural networks used for image classification. We use two common TDA approaches to explore several methods for modeling hidden-layer activations as high-dimensional point clouds, and provide experimental evidence that these point clouds capture valuable structural information about the model's process. First, we demonstrate that a distance metric based on persistent homology can be used to quantify meaningful differences between layers, and we discuss these distances in the broader context of existing representational similarity metrics for neural network interpretability. Second, we show that a mapper graph can provide semantic insight into how these models organize hierarchical class knowledge at each layer. These observations demonstrate that TDA is a useful tool to help deep learning practitioners unlock the hidden structures of their models.  ( 2 min )
    Physics-Constrained Generative Adversarial Networks for 3D Turbulence. (arXiv:2212.00217v1 [physics.comp-ph])
    Generative Adversarial Networks (GANs) have received wide acclaim among the machine learning (ML) community for their ability to generate realistic 2D images. ML is being applied more often to complex problems beyond those of computer vision. However, current frameworks often serve as black boxes and lack physics embeddings, leading to poor ability in enforcing constraints and unreliable models. In this work, we develop physics embeddings that can be stringently imposed, referred to as hard constraints, in the neural network architecture. We demonstrate their capability for 3D turbulence by embedding them in GANs, particularly to enforce the mass conservation constraint in incompressible fluid turbulence. In doing so, we also explore and contrast the effects of other methods of imposing physics constraints within the GANs framework, especially penalty-based physics constraints popular in literature. By using physics-informed diagnostics and statistics, we evaluate the strengths and weaknesses of our approach and demonstrate its feasibility.  ( 2 min )
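    As an illustration of what a hard physics constraint can look like, the sketch below uses one standard construction for mass conservation: output a vector potential A and take the velocity as u = curl(A), so that div(u) = 0 holds by construction. This is an assumed example of the technique, not necessarily the exact architecture used in the paper.

```python
# Hedged sketch: divergence-free velocity via the curl of a vector
# potential, checked with finite differences on a grid. A generator
# would output A; taking u = curl(A) enforces incompressibility.
import numpy as np

def curl(A, h=1.0):
    # A has shape (3, nx, ny, nz); axes 0/1/2 of each component are x/y/z.
    Ax, Ay, Az = A
    return np.stack([
        np.gradient(Az, h, axis=1) - np.gradient(Ay, h, axis=2),
        np.gradient(Ax, h, axis=2) - np.gradient(Az, h, axis=0),
        np.gradient(Ay, h, axis=0) - np.gradient(Ax, h, axis=1),
    ])

def divergence(u, h=1.0):
    return sum(np.gradient(u[i], h, axis=i) for i in range(3))

A = np.random.default_rng(0).normal(size=(3, 16, 16, 16))  # mock generator output
u = curl(A)
print(np.abs(divergence(u)).max())  # tiny: div(curl A) = 0 analytically
```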
    Gated Recurrent Neural Networks with Weighted Time-Delay Feedback. (arXiv:2212.00228v1 [cs.LG])
    We introduce a novel gated recurrent unit (GRU) with a weighted time-delay feedback mechanism in order to improve the modeling of long-term dependencies in sequential data. This model is a discretized version of a continuous-time formulation of a recurrent unit, where the dynamics are governed by delay differential equations (DDEs). By considering a suitable time-discretization scheme, we propose $\tau$-GRU, a discrete-time gated recurrent unit with delay. We prove the existence and uniqueness of solutions for the continuous-time model, and we demonstrate that the proposed feedback mechanism can help improve the modeling of long-term dependencies. Our empirical results show that $\tau$-GRU can converge faster and generalize better than state-of-the-art recurrent units and gated recurrent architectures on a range of tasks, including time-series classification, human activity recognition, and speech recognition.  ( 2 min )
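    A hedged sketch of the mechanism: a GRU cell whose hidden update also sees a learnably weighted copy of the hidden state from tau steps earlier. The exact parameterization of $\tau$-GRU in the paper may differ; this only illustrates weighted time-delay feedback.

```python
# Illustrative GRU cell with weighted time-delay feedback: the update at
# step t mixes h_{t-1} with alpha * h_{t-tau}. Not the paper's exact model.
import torch

class DelayGRUCell(torch.nn.Module):
    def __init__(self, input_size, hidden_size, tau=5):
        super().__init__()
        self.cell = torch.nn.GRUCell(input_size, hidden_size)
        self.alpha = torch.nn.Parameter(torch.tensor(0.1))  # delay weight
        self.tau = tau

    def forward(self, xs):  # xs: (T, B, input_size)
        T, B = xs.shape[0], xs.shape[1]
        hs = [torch.zeros(B, self.cell.hidden_size)]
        for t in range(T):
            h_delay = hs[max(t - self.tau, 0)]  # hidden state tau steps back
            h = self.cell(xs[t], hs[-1] + self.alpha * h_delay)
            hs.append(h)
        return torch.stack(hs[1:])

out = DelayGRUCell(8, 16)(torch.randn(20, 4, 8))  # (T=20, B=4, D=8)
```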
    Distilling Multi-Step Reasoning Capabilities of Large Language Models into Smaller Models via Semantic Decompositions. (arXiv:2212.00193v1 [cs.LG])
    Step-by-step reasoning approaches like chain-of-thought (CoT) have proved to be a very effective technique for inducing reasoning capabilities in large language models. However, the success of the CoT approach depends primarily on model size, and often billion-parameter-scale models are needed to get CoT to work. In this paper, we propose a knowledge distillation approach that leverages the step-by-step CoT reasoning capabilities of larger models and distils these reasoning abilities into smaller models. Our approach, Decompositional Distillation, learns a semantic decomposition of the original problem into a sequence of subproblems and uses it to train two models: a) a problem decomposer that learns to decompose a complex reasoning problem into a sequence of simpler sub-problems and b) a problem solver that uses the intermediate subproblems to solve the overall problem. On a multi-step math word problem dataset (GSM8K), we boost the performance of GPT-2 variants by up to 35% when distilled with our approach, compared to CoT. We show that using our approach, it is possible to train a GPT-2-large model (775M) that can outperform a 10X larger GPT-3 (6B) model trained using CoT reasoning. Finally, we also demonstrate that our approach of problem decomposition can be used as an alternative to CoT prompting, boosting GPT-3 performance by 40% compared to CoT prompts.  ( 2 min )
    Are you using test log-likelihood correctly?. (arXiv:2212.00219v1 [stat.ML])
    Test log-likelihood is commonly used to compare different models of the same data and different approximate inference algorithms for fitting the same probabilistic model. We present simple examples demonstrating how comparisons based on test log-likelihood can contradict comparisons according to other objectives. Specifically, our examples show that (i) conclusions about forecast accuracy based on test log-likelihood comparisons may not agree with conclusions based on other distributional quantities like means; and (ii) that approximate Bayesian inference algorithms that attain higher test log-likelihoods need not also yield more accurate posterior approximations.  ( 2 min )
    AUG-FedPrompt: Practical Few-shot Federated NLP with Data-augmented Prompts. (arXiv:2212.00192v1 [cs.CL])
    Transformer-based pre-trained models have become the de-facto solution for NLP tasks. Fine-tuning such pre-trained models for downstream tasks often requires a tremendous amount of data that is both private and labeled. However, in reality: 1) such private data cannot be collected centrally and is distributed across mobile devices, and 2) well-curated labeled data is scarce. To tackle these issues, we first define a data generator for federated few-shot learning tasks, which encompasses the quantity and distribution of scarce labeled data in a realistic setting. Then we propose AUG-FedPrompt, a prompt-based federated learning algorithm that carefully annotates abundant unlabeled data for data augmentation. AUG-FedPrompt can perform on par with full-set fine-tuning with very little initial labeled data.  ( 2 min )
    ODPP: A Unified Algorithm Framework for Unsupervised Option Discovery based on Determinantal Point Process. (arXiv:2212.00211v1 [cs.LG])
    Learning rich skills through temporal abstractions without supervision from external rewards is at the frontier of reinforcement learning research. Existing works mainly fall into two distinctive categories: variational and Laplacian-based option discovery. The former maximizes the diversity of the discovered options through a mutual information loss but overlooks coverage of the state space, while the latter focuses on improving the coverage of options by increasing connectivity during exploration, but does not consider diversity. In this paper, we propose a unified framework that quantifies diversity and coverage through a novel use of the Determinantal Point Process (DPP) and enables unsupervised option discovery that explicitly optimizes both objectives. Specifically, we define the DPP kernel matrix with the Laplacian spectrum of the state transition graph and use the expected mode number in the trajectories as the objective to capture and enhance both the diversity and the coverage of the learned options. The proposed option discovery algorithm is extensively evaluated using challenging tasks built with Mujoco and Atari, demonstrating that it substantially outperforms SOTA baselines from both the diversity- and coverage-driven categories. The codes are available at https://github.com/LucasCJYSDL/ODPP.  ( 2 min )
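    The central quantity is easy to compute for a toy graph. The hedged numpy sketch below builds an L-ensemble kernel from the Laplacian spectrum of a random transition graph and evaluates the expected sample cardinality E|Y| = sum_i lambda_i / (1 + lambda_i), a standard DPP identity; the feature construction is an illustrative assumption rather than the paper's exact kernel.

```python
# Hedged sketch: DPP kernel from a graph Laplacian spectrum, plus the
# expected number of sampled items (a proxy for "expected mode number").
import numpy as np

rng = np.random.default_rng(0)
A = (rng.random((30, 30)) < 0.1).astype(float)
A = np.maximum(A, A.T)                      # undirected transition graph
L_graph = np.diag(A.sum(1)) - A             # graph Laplacian
evals, evecs = np.linalg.eigh(L_graph)

phi = evecs[:, 1:11]                        # low-frequency spectral features
L_dpp = phi @ phi.T                         # L-ensemble kernel
lam = np.linalg.eigvalsh(L_dpp)
expected_items = np.sum(lam / (1.0 + lam))  # E|Y| for an L-ensemble DPP
print(expected_items)
```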
    Five Properties of Specific Curiosity You Didn't Know Curious Machines Should Have. (arXiv:2212.00187v1 [cs.AI])
    Curiosity for machine agents has been a focus of lively research activity. The study of human and animal curiosity, particularly specific curiosity, has unearthed several properties that would offer important benefits for machine learners, but that have not yet been well-explored in machine intelligence. In this work, we conduct a comprehensive, multidisciplinary survey of the field of animal and machine curiosity. As a principal contribution of this work, we use this survey as a foundation to introduce and define what we consider to be five of the most important properties of specific curiosity: 1) directedness towards inostensible referents, 2) cessation when satisfied, 3) voluntary exposure, 4) transience, and 5) coherent long-term learning. As a second main contribution of this work, we show how these properties may be implemented together in a proof-of-concept reinforcement learning agent: we demonstrate how the properties manifest in the behaviour of this agent in a simple non-episodic grid-world environment that includes curiosity-inducing locations and induced targets of curiosity. As we would hope, our example of a computational specific curiosity agent exhibits short-term directed behaviour while updating long-term preferences to adaptively seek out curiosity-inducing situations. This work, therefore, presents a landmark synthesis and translation of specific curiosity to the domain of machine learning and reinforcement learning and provides a novel view into how specific curiosity operates and in the future might be integrated into the behaviour of goal-seeking, decision-making computational agents in complex environments.  ( 2 min )
    SPADE: Semi-supervised Anomaly Detection under Distribution Mismatch. (arXiv:2212.00173v1 [cs.LG])
    Semi-supervised anomaly detection is a common problem, as the datasets containing anomalies are often only partially labeled. We propose a canonical framework, Semi-supervised Pseudo-labeler Anomaly Detection with Ensembling (SPADE), that is not limited by the assumption that labeled and unlabeled data come from the same distribution. Indeed, this assumption is often violated in many applications; for example, the labeled data may contain only anomalies, unlike the unlabeled data; the unlabeled data may contain different types of anomalies; or the labeled data may contain only 'easy-to-label' samples. SPADE utilizes an ensemble of one-class classifiers as the pseudo-labeler to improve the robustness of pseudo-labeling under distribution mismatch. Partial matching is proposed to automatically select the critical hyper-parameters for pseudo-labeling without validation data, which is crucial with limited labeled data. SPADE shows state-of-the-art semi-supervised anomaly detection performance across a wide range of scenarios with distribution mismatch in both tabular and image domains. In some common real-world settings, such as a model facing new types of unlabeled anomalies, SPADE outperforms the state-of-the-art alternatives by 5% AUC on average.  ( 2 min )
    Shape-Guided Diffusion with Inside-Outside Attention. (arXiv:2212.00210v1 [cs.CV])
    Shape can specify key object constraints, yet existing text-to-image diffusion models ignore this cue and synthesize objects that are incorrectly scaled, cut off, or replaced with background content. We propose a training-free method, Shape-Guided Diffusion, which uses a novel Inside-Outside Attention mechanism to constrain the cross-attention (and self-attention) maps such that prompt tokens (and pixels) referring to the inside of the shape cannot attend outside the shape, and vice versa. To demonstrate the efficacy of our method, we propose a new image editing task where the model must replace an object specified by its mask and a text prompt. We curate a new ShapePrompts benchmark based on MS-COCO and achieve SOTA results in shape faithfulness, text alignment, and realism according to both quantitative metrics and human preferences. Our data and code will be made available at https://shape-guided-diffusion.github.io.  ( 2 min )
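    The masking idea can be sketched directly. Below is a hedged, self-contained illustration of an inside-outside attention step: given a binary object mask over pixels and an inside/outside flag per prompt token, attention logits between mismatched pixel-token pairs are set to -inf before the softmax. Integration into an actual diffusion U-Net, and the paper's exact flag assignment, are omitted.

```python
# Illustrative inside-outside cross-attention mask, not the authors' code.
import torch

def inside_outside_attn(q, k, v, pixel_inside, token_inside):
    # q: (P, d) pixel queries; k, v: (T, d) token keys/values;
    # pixel_inside: (P,) bool; token_inside: (T,) bool.
    logits = q @ k.T / q.shape[-1] ** 0.5
    mismatch = pixel_inside[:, None] != token_inside[None, :]
    logits = logits.masked_fill(mismatch, float("-inf"))  # forbid crossing
    return torch.softmax(logits, dim=-1) @ v

P, T, d = 64, 8, 32
token_inside = torch.tensor([True] * 4 + [False] * 4)  # object vs. context tokens
out = inside_outside_attn(torch.randn(P, d), torch.randn(T, d),
                          torch.randn(T, d),
                          pixel_inside=torch.rand(P) > 0.5,
                          token_inside=token_inside)
```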
    Multi-Task Imitation Learning for Linear Dynamical Systems. (arXiv:2212.00186v1 [cs.LG])
    We study representation learning for efficient imitation learning over linear systems. In particular, we consider a setting where learning is split into two phases: (a) a pre-training step where a shared $k$-dimensional representation is learned from $H$ source policies, and (b) a target policy fine-tuning step where the learned representation is used to parameterize the policy class. We find that the imitation gap over trajectories generated by the learned target policy is bounded by $\tilde{O}\left( \frac{k n_x}{HN_{\mathrm{shared}}} + \frac{k n_u}{N_{\mathrm{target}}}\right)$, where $n_x > k$ is the state dimension, $n_u$ is the input dimension, $N_{\mathrm{shared}}$ denotes the total amount of data collected for each policy during representation learning, and $N_{\mathrm{target}}$ is the amount of target task data. This result formalizes the intuition that aggregating data across related tasks to learn a representation can significantly improve the sample efficiency of learning a target task. The trends suggested by this bound are corroborated in simulation.  ( 2 min )
    Deep Learning-Based Vehicle Speed Prediction for Ecological Adaptive Cruise Control in Urban and Highway Scenarios. (arXiv:2212.00149v1 [eess.SY])
    In a typical car-following scenario, target vehicle speed fluctuations act as an external disturbance to the host vehicle and in turn affect its energy consumption. To control a host vehicle in an energy-efficient manner using model predictive control (MPC), and moreover to enhance the performance of an ecological adaptive cruise control (EACC) strategy, forecasting the future velocities of a target vehicle is essential. For this purpose, deep recurrent neural network-based vehicle speed prediction using long short-term memory (LSTM) and gated recurrent units (GRU) is studied in this work. Besides these, the physics-based constant velocity (CV) and constant acceleration (CA) models are discussed. The sequential time series data for training (e.g., speed trajectories of the target and its preceding vehicles obtained through vehicle-to-vehicle (V2V) communication, road speed limits, and current and future traffic light phases collected using vehicle-to-infrastructure (V2I) communication) is gathered from both urban and highway networks created in the microscopic traffic simulator SUMO. The proposed speed prediction models are evaluated for long-term predictions (up to 10 s) of target vehicle future velocities. The results reveal that the LSTM-based speed predictor outperforms the other models in terms of prediction accuracy on unseen test datasets, thereby showcasing better generalization ability. Furthermore, the performance of the EACC-equipped host car on the predicted velocities is evaluated, and its energy-saving benefits for different prediction horizons are presented.  ( 2 min )
    Clustering and Analysis of GPS Trajectory Data using Distance-based Features. (arXiv:2212.00206v1 [cs.LG])
    The proliferation of smartphones has accelerated mobility studies by largely increasing the type and volume of mobility data available. One such source of mobility data is GPS technology, which is becoming increasingly common and helps the research community understand the mobility patterns of people. However, there is no standardized framework for studying the different mobility patterns created by the non-Work, non-Home locations of Working and Nonworking users on Workdays and Offdays using machine learning methods. We propose a new mobility metric, Daily Characteristic Distance, and use it to generate features for each user together with Origin-Destination matrix features. We then use those features with an unsupervised machine learning method, $k$-means clustering, and obtain three clusters of users for each type of day (Workday and Offday). Finally, we propose two new metrics for the analysis of the clustering results, namely User Commonality and Average Frequency. Using the proposed metrics, interesting user behaviors can be discerned, helping us better understand the mobility patterns of the users.  ( 2 min )
    Time-Efficient Reward Learning via Visually Assisted Cluster Ranking. (arXiv:2212.00169v1 [cs.LG])
    One of the most successful paradigms for reward learning uses human feedback in the form of comparisons. Although these methods hold promise, human comparison labeling is expensive and time-consuming, constituting a major bottleneck to their broader applicability. Our insight is that we can greatly improve how effectively human time is used in these approaches by batching comparisons together, rather than having the human label each comparison individually. To do so, we leverage data dimensionality-reduction and visualization techniques to provide the human with an interactive GUI displaying the state space, in which the user can label subportions of the state space. Across some simple Mujoco tasks, we show that this high-level approach holds promise and is able to greatly increase the performance of the resulting agents, given the same amount of human labeling time.  ( 2 min )
    Novel Modelling Strategies for High-frequency Stock Trading Data. (arXiv:2212.00148v1 [stat.AP])
    Full electronic automation in stock exchanges has recently become popular, generating high-frequency intraday data and motivating the development of near real-time price forecasting methods. Machine learning algorithms are widely applied to mid-price stock predictions. How raw data are processed into inputs for prediction models (e.g., via data thinning and feature engineering) can substantially affect the performance of the prediction methods, yet researchers rarely discuss this topic. This motivated us to propose three novel modelling strategies for processing raw data. We illustrate how our novel modelling strategies improve forecasting performance by analyzing high-frequency data of the Dow Jones 30 component stocks. In these experiments, our strategies often lead to statistically significant improvements in predictions. The three strategies improve the F1 scores of the SVM models by 0.056, 0.087, and 0.016, respectively.  ( 2 min )
    Layout-aware Dreamer for Embodied Referring Expression Grounding. (arXiv:2212.00171v1 [cs.CV])
    In this work, we study the problem of Embodied Referring Expression Grounding, where an agent needs to navigate in a previously unseen environment and localize a remote object described by a concise high-level natural language instruction. When facing such a situation, a human tends to imagine what the destination may look like and to explore the environment based on prior knowledge of the environmental layout, such as the fact that a bathroom is more likely to be found near a bedroom than a kitchen. We have designed an autonomous agent called Layout-aware Dreamer (LAD) with two novel modules, the Layout Learner and the Goal Dreamer, to mimic this cognitive decision process. The Layout Learner learns to infer the room category distribution of neighboring unexplored areas along the path for coarse layout estimation, which effectively introduces layout common sense of room-to-room transitions to our agent. To learn an effective exploration of the environment, the Goal Dreamer imagines the destination beforehand. Our agent achieves new state-of-the-art performance on the public leaderboard of the REVERIE dataset in challenging unseen test environments, improving navigation success (SR) by 4.02% and remote grounding success (RGS) by 3.43% over the previous state-of-the-art. The code is released at https://github.com/zehao-wang/LAD  ( 2 min )
    Answering Private Linear Queries Adaptively using the Common Mechanism. (arXiv:2212.00135v1 [cs.CR])
    When analyzing confidential data through a privacy filter, a data scientist often needs to decide which queries will best support their intended analysis. For example, an analyst may wish to study noisy two-way marginals in a dataset produced by a mechanism M1. But, if the data are relatively sparse, the analyst may choose to examine noisy one-way marginals, produced by a mechanism M2 instead. Since the choice of whether to use M1 or M2 is data-dependent, a typical differentially private workflow is to first split the privacy loss budget rho into two parts: rho1 and rho2, then use the first part rho1 to determine which mechanism to use, and the remainder rho2 to obtain noisy answers from the chosen mechanism. In a sense, the first step seems wasteful because it takes away part of the privacy loss budget that could have been used to make the query answers more accurate. In this paper, we consider the question of whether the choice between M1 and M2 can be performed without wasting any privacy loss budget. For linear queries, we propose a method for decomposing M1 and M2 into three parts: (1) a mechanism M* that captures their shared information, (2) a mechanism M1' that captures information that is specific to M1, (3) a mechanism M2' that captures information that is specific to M2. Running M* and M1' together is completely equivalent to running M1 (both in terms of query answer accuracy and total privacy cost rho). Similarly, running M* and M2' together is completely equivalent to running M2. Since M* will be used no matter what, the analyst can use its output to decide whether to subsequently run M1' (thus recreating the analysis supported by M1) or M2' (recreating the analysis supported by M2), without wasting privacy loss budget.  ( 2 min )
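    A hedged sketch of the budget-split workflow the paper improves on, using the standard zCDP Gaussian mechanism (noise scale sigma = sensitivity / sqrt(2 * rho)); the query values and split are made up, and the common-mechanism idea itself is only indicated in comments.

        import numpy as np

        def gaussian_mech(value, sensitivity, rho, rng):
            # standard rho-zCDP Gaussian mechanism
            sigma = sensitivity / np.sqrt(2.0 * rho)
            return value + rng.normal(0.0, sigma, size=np.shape(value))

        rng = np.random.default_rng(0)
        one_way = np.array([40.0, 60.0])          # toy one-way marginal counts

        # Typical workflow: spend rho1 just to peek at the data, rho2 on the answer.
        rho1, rho2 = 0.1, 0.4
        peek = gaussian_mech(one_way, 1.0, rho1, rng)   # only used to pick M1 vs M2
        answer = gaussian_mech(one_way, 1.0, rho2, rng) # final noisy answer

        # Common-mechanism idea (sketched, not implemented): the shared component
        # M* both informs the choice and contributes to the final answer, so the
        # full budget rho1 + rho2 goes toward accuracy instead of being split.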
    DEL-Dock: Molecular Docking-Enabled Modeling of DNA-Encoded Libraries. (arXiv:2212.00136v1 [q-bio.QM])
    DNA-Encoded Library (DEL) technology has enabled significant advances in hit identification by enabling efficient testing of combinatorially-generated molecular libraries. DEL screens measure protein binding affinity through sequencing reads of molecules tagged with unique DNA-barcodes that survive a series of selection experiments. Computational models have been deployed to learn the latent binding affinities that are correlated to the sequenced count data; however, this correlation is often obfuscated by various sources of noise introduced in its complicated data-generation process. In order to denoise DEL count data and screen for molecules with good binding affinity, computational models require the correct assumptions in their modeling structure to capture the true signals underlying the data. Recent advances in DEL models have focused on probabilistic formulations of count data, but existing approaches have thus far been limited to utilizing only 2-D molecule-level representations. We introduce a new paradigm, DEL-Dock, that combines ligand-based descriptors with 3-D spatial information from docked protein-ligand complexes. 3-D spatial information allows our model to learn over the actual binding modality rather than using only structure-based information of the ligand. We show that our model is capable of effectively denoising DEL count data to predict molecule enrichment scores that are better correlated with experimental binding affinity measurements than prior works. Moreover, by learning over a collection of docked poses we demonstrate that our model, trained only on DEL data, implicitly learns to perform good docking pose selection without requiring external supervision from expensive-to-source protein crystal structures.  ( 2 min )
    Evidential Conditional Neural Processes. (arXiv:2212.00131v1 [cs.LG])
    The Conditional Neural Process (CNP) family of models offers a promising direction for tackling few-shot problems by achieving better scalability and competitive predictive performance. However, current CNP models only capture the overall uncertainty of the prediction made on a target data point. They lack a systematic, fine-grained quantification of the distinct sources of uncertainty that are essential for model training and decision-making in the few-shot setting. We propose Evidential Conditional Neural Processes (ECNP), which replace the standard Gaussian distribution used by CNP with a much richer hierarchical Bayesian structure through evidential learning to achieve epistemic-aleatoric uncertainty decomposition. The evidential hierarchical structure also leads to a theoretically justified robustness over noisy training tasks. Theoretical analysis of the proposed ECNP establishes its relationship with CNP while offering deeper insights into the roles of the evidential parameters. Extensive experiments conducted on both synthetic and real-world data demonstrate the effectiveness of our proposed model in various few-shot settings.  ( 2 min )
    Denoising Diffusion for Sampling SAT Solutions. (arXiv:2212.00121v1 [cs.AI])
    Generating diverse solutions to the Boolean Satisfiability Problem (SAT) is a hard computational problem with practical applications for testing and functional verification of software and hardware designs. We explore how to generate such solutions using Denoising Diffusion coupled with a Graph Neural Network that implements the denoising function. We find that the obtained accuracy is similar to the currently best purely neural method, and the produced SAT solutions are highly diverse, even if the system is trained with non-random solutions from a standard solver.  ( 2 min )
    FIESTA: FIber gEneration and bundle Segmentation in Tractography using Autoencoders. (arXiv:2212.00143v1 [cs.CV])
    White matter bundle segmentation is a cornerstone of modern tractography to study the brain's structural connectivity in domains such as neurological disorders, neurosurgery, and aging. In this study, we present FIESTA (FIber gEneration and bundle Segmentation in Tractography using Autoencoders), a reliable and robust, fully automated, and easily semi-automatically calibrated pipeline based on deep autoencoders that can dissect and fully populate white matter (WM) bundles. Our framework allows the transition from one anatomical bundle definition to another with marginal calibration time. This pipeline is built upon the FINTA, CINTA, and GESTA methods that demonstrated how autoencoders can be used successfully for streamline filtering, bundling, and streamline generation in tractography. Our proposed method improves bundling coverage by recovering hard-to-track bundles with generative sampling through latent space seeding of the subject bundle and the atlas bundle. A latent space of streamlines is learned using autoencoder-based modeling combined with contrastive learning. Using an atlas of bundles in standard space (MNI), our proposed method segments new tractograms using the autoencoder latent distance between each tractogram streamline and its closest neighbor bundle in the atlas of bundles. Intra-subject bundle reliability is improved by recovering hard-to-track streamlines, using the autoencoder to generate new streamlines that increase each bundle's spatial coverage while remaining anatomically meaningful. Results show that our method is more reliable than state-of-the-art automated virtual dissection methods such as RecoBundles, RecoBundlesX, TractSeg, White Matter Analysis and XTRACT. Overall, these results show that our framework improves the practicality and usability of current state-of-the-art bundling frameworks.  ( 2 min )
    One Risk to Rule Them All: A Risk-Sensitive Perspective on Model-Based Offline Reinforcement Learning. (arXiv:2212.00124v1 [cs.LG])
    Offline reinforcement learning (RL) is suitable for safety-critical domains where online exploration is too costly or dangerous. In safety-critical settings, decision-making should take into consideration the risk of catastrophic outcomes; in other words, decision-making should be risk-sensitive. Previous works on risk in offline RL combine offline RL techniques (to avoid distributional shift) with risk-sensitive RL algorithms (to achieve risk-sensitivity). In this work, we propose risk-sensitivity as a mechanism to jointly address both of these issues. Our model-based approach is risk-averse to both epistemic and aleatoric uncertainty. Risk-aversion to epistemic uncertainty prevents distributional shift, as areas not covered by the dataset have high epistemic uncertainty. Risk-aversion to aleatoric uncertainty discourages actions that may result in poor outcomes due to environment stochasticity. Our experiments show that our algorithm achieves competitive performance on deterministic benchmarks, and outperforms existing approaches for risk-sensitive objectives in stochastic domains.  ( 2 min )
    Generative Adversarial Learning of Sinkhorn Algorithm Initializations. (arXiv:2212.00133v1 [cs.LG])
    The Sinkhorn algorithm (arXiv:1306.0895) is the state of the art for computing approximations of optimal transport distances between discrete probability distributions, making use of an entropically regularized formulation of the problem. The algorithm is guaranteed to converge no matter its initialization. This has led to little attention being paid to its initialization, and simple starting vectors like the n-dimensional one-vector are common choices. We train a neural network to compute initializations for the algorithm, which significantly outperform standard initializations. The network predicts a potential of the optimal transport dual problem, where training is conducted in an adversarial fashion using a second, generating network. The network is universal in the sense that it is able to generalize to any pair of distributions of fixed dimension. Furthermore, we show that for certain applications the network can be used independently.  ( 2 min )
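    A minimal NumPy Sinkhorn sketch showing where a learned initialization would enter: the dual scaling vector u below is normally all-ones, and a network could instead predict a better starting point. Problem data here are random placeholders.

        import numpy as np

        def sinkhorn(C, a, b, eps=0.1, iters=200, u0=None):
            K = np.exp(-C / eps)                    # Gibbs kernel of the cost matrix
            u = np.ones_like(a) if u0 is None else u0
            for _ in range(iters):
                v = b / (K.T @ u)                   # alternating scaling updates
                u = a / (K @ v)
            P = u[:, None] * K * v[None, :]         # entropic transport plan
            return np.sum(P * C)                    # transport cost of the plan

        rng = np.random.default_rng(0)
        C = rng.random((50, 50))
        a = np.full(50, 1 / 50)
        b = np.full(50, 1 / 50)
        print(sinkhorn(C, a, b))                    # pass u0=<network output> to warm-start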
    Knowledge-augmented Deep Learning and Its Applications: A Survey. (arXiv:2212.00017v1 [cs.LG])
    Deep learning models, though having achieved great success in many different fields over the past years, are usually data hungry, fail to perform well on unseen samples, and lack interpretability. Various kinds of prior knowledge often exist in the target domain, and their use can alleviate these deficiencies of deep learning. To better mimic the behavior of human brains, different advanced methods have been proposed to identify domain knowledge and integrate it into deep models for data-efficient, generalizable, and interpretable deep learning, which we refer to as knowledge-augmented deep learning (KADL). In this survey, we define the concept of KADL and introduce its three major tasks, i.e., knowledge identification, knowledge representation, and knowledge integration. Different from existing surveys that focus on a specific type of knowledge, we provide a broad and complete taxonomy of domain knowledge and its representations. Based on our taxonomy, we provide a systematic review of existing techniques, unlike existing works that survey integration approaches agnostic to the taxonomy of knowledge. This survey subsumes existing works and offers a bird's-eye view of research in the general area of knowledge-augmented deep learning. The thorough and critical reviews of numerous papers help not only in understanding current progress but also in identifying future directions for research on knowledge-augmented deep learning.  ( 2 min )
    Random Copolymer inverse design system orienting on Accurate discovering of Antimicrobial peptide-mimetic copolymers. (arXiv:2212.00023v1 [q-bio.BM])
    Antimicrobial resistance is one of the biggest health problems, especially in the current period of the COVID-19 pandemic. Due to their unique membrane-destruction bactericidal mechanism, antimicrobial peptide-mimetic copolymers have attracted increasing attention, and it is urgent to find more potential candidates with broad-spectrum antibacterial efficacy and low toxicity. Artificial intelligence has shown significant performance on small-molecule and biotech drugs; however, the higher dimensionality of the polymer space and the limited experimental data restrict the application of existing methods to copolymer design. Herein, we develop a universal random copolymer inverse design system via multi-model copolymer representation learning, knowledge distillation, and reinforcement learning. Our system realizes high-precision antimicrobial activity prediction with few-shot data by extracting various chemical information from multi-modal copolymer representations. By pre-training a scaffold-decorator generative model via knowledge distillation, the copolymer space is greatly contracted to the near space of existing data for exploration. Thus, our reinforcement learning algorithm can adapt to customized generation on specific scaffolds and to requirements on properties or structures. We apply our system to collected antimicrobial peptide-mimetic copolymer data and discover candidate copolymers with desired properties.  ( 2 min )
    Semi-Supervised Heterogeneous Graph Learning with Multi-level Data Augmentation. (arXiv:2212.00024v1 [cs.LG])
    Semi-supervised graph learning with data augmentation (DA) is currently the most commonly used and best-performing approach to enhancing model robustness in sparse scenarios with few labeled samples. Compared with homogeneous graphs, DA on heterogeneous graphs faces greater challenges: the heterogeneity of information requires DA strategies to effectively handle heterogeneous relations, considering the information contribution of different types of neighbors and edges to the target nodes. Furthermore, over-squashing of information is caused by the negative curvature formed by the non-uniform distribution and strong clustering in complex graphs. To address these challenges, this paper presents a novel method named Semi-Supervised Heterogeneous Graph Learning with Multi-level Data Augmentation (HG-MDA). For the problem of heterogeneity of information in DA, node and topology augmentation strategies are proposed that fit the characteristics of heterogeneous graphs, and meta-relation-based attention is applied as one of the indexes for selecting augmented nodes and edges. For the problem of over-squashing of information, triangle-based edge adding and removing are designed to alleviate the negative curvature and bring topological gains. Finally, the loss function consists of the cross-entropy loss for labeled data and the consistency regularization for unlabeled data. In order to effectively fuse the prediction results of the various DA strategies, sharpening is used. Experiments on the public datasets ACM, DBLP, and OGB, and the industry dataset MB, show that HG-MDA outperforms current SOTA models. Additionally, HG-MDA is applied to user identification in internet finance scenarios, helping the business add 30% of key users, and increasing loans and balances by 3.6%, 11.1%, and 9.8%.  ( 2 min )
    Scalable Pathogen Detection from Next Generation DNA Sequencing with Deep Learning. (arXiv:2212.00015v1 [cs.LG])
    Next-generation sequencing technologies have enhanced the scope of the Internet-of-Things (IoT) to include genomics for personalized medicine through the increased availability of abundant genome data collected from heterogeneous sources at reduced cost. Given the sheer magnitude of the collected data and the significant challenges posed by the presence of highly similar genomic structures across species, there is a need for robust, scalable analysis platforms to extract actionable knowledge, such as the presence of potentially zoonotic pathogens. The emergence of zoonotic diseases from novel pathogens, such as the influenza virus in 1918 and SARS-CoV-2 in 2019, that can jump species barriers and lead to pandemics underscores the need for scalable metagenome analysis. In this work, we propose MG2Vec, a deep learning-based solution that uses the transformer network as its backbone to learn robust features from raw metagenome sequences for downstream biomedical tasks such as targeted and generalized pathogen detection. Extensive experiments on four increasingly challenging, yet realistic diagnostic settings show that the proposed approach can help detect pathogens from uncurated, real-world clinical samples with minimal human supervision in the form of labels. Further, we demonstrate that the learned representations can generalize to completely unrelated pathogens across diseases and species for large-scale metagenome analysis. We provide a comprehensive evaluation of a novel representation learning framework for metagenome-based disease diagnostics with deep learning, and provide a way forward for extracting and using robust vector representations from low-cost next-generation sequencing to develop generalizable diagnostic tools.  ( 3 min )
    Feature Selection with Distance Correlation. (arXiv:2212.00046v1 [hep-ph])
    Choosing which properties of the data to use as input to multivariate decision algorithms -- a.k.a. feature selection -- is an important step in solving any problem with machine learning. While there is a clear trend towards training sophisticated deep networks on large numbers of relatively unprocessed inputs (so-called automated feature engineering), for many tasks in physics, sets of theoretically well-motivated and well-understood features already exist. Working with such features can bring many benefits, including greater interpretability, reduced training and run time, and enhanced stability and robustness. We develop a new feature selection method based on Distance Correlation (DisCo), and demonstrate its effectiveness on the tasks of boosted top- and $W$-tagging. Using our method to select features from a set of over 7,000 energy flow polynomials, we show that we can match the performance of much deeper architectures, by using only ten features and two orders-of-magnitude fewer model parameters.  ( 2 min )
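    A hedged NumPy sketch of the distance-correlation (DisCo) score the selection method is built on; ranking features by their dCor with the target, as below, is a simplified stand-in for the paper's full selection procedure.

        import numpy as np

        def dcor(x, y):
            # biased (V-statistic) sample distance correlation for 1-D inputs
            def centered(z):
                d = np.abs(z[:, None] - z[None, :])            # pairwise distances
                return d - d.mean(0) - d.mean(1)[:, None] + d.mean()
            A, B = centered(x), centered(y)
            dcov2 = (A * B).mean()
            return np.sqrt(dcov2 / np.sqrt((A * A).mean() * (B * B).mean()))

        rng = np.random.default_rng(0)
        y = rng.normal(size=500)
        features = {"informative": y ** 2 + 0.1 * rng.normal(size=500),
                    "noise": rng.normal(size=500)}
        for name, f in features.items():
            print(name, round(float(dcor(f, y)), 3))   # nonlinear dependence is caught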
    Towards True Lossless Sparse Communication in Multi-Agent Systems. (arXiv:2212.00115v1 [cs.LG])
    Communication enables agents to cooperate to achieve their goals. Learning when to communicate, i.e., sparse (in time) communication, and whom to message is particularly important when bandwidth is limited. Recent work in learning sparse individualized communication, however, suffers from high variance during training, where decreasing communication comes at the cost of decreased reward, particularly in cooperative tasks. We use the information bottleneck to reframe sparsity as a representation learning problem, which we show naturally enables lossless sparse communication at lower budgets than prior art. In this paper, we propose a method for true lossless sparsity in communication via Information Maximizing Gated Sparse Multi-Agent Communication (IMGS-MAC). Our model uses two individualized regularization objectives, an information maximization autoencoder and a sparse communication loss, to create informative and sparse communication. We evaluate the learned communication `language' through direct causal analysis of messages in non-sparse runs to determine the range of lossless sparse budgets, which allow zero-shot sparsity, and the range of sparse budgets that will incur a reward loss, which is minimized by our learned gating function with few-shot sparsity. To demonstrate the efficacy of our results, we experiment in cooperative multi-agent tasks where communication is essential for success. We evaluate our model with both continuous and discrete messages. We focus our analysis on a variety of ablations showing the effect of message representations, including their properties, and the lossless performance of our model.  ( 2 min )
    Optical multi-task learning using multi-wavelength diffractive deep neural networks. (arXiv:2212.00022v1 [cs.LG])
    Photonic neural networks are a brain-inspired information processing technology that uses photons instead of electrons to perform artificial intelligence (AI) tasks. However, existing architectures are designed for a single task and fail to multiplex different tasks in parallel within a single monolithic system, due to task competition that deteriorates model performance. This paper proposes a novel optical multi-task learning system by designing multi-wavelength diffractive deep neural networks (D2NNs) with a joint optimization method. By encoding multi-task inputs into multi-wavelength channels, the system can increase the computing throughput and significantly alleviate the competition to perform multiple tasks in parallel with high accuracy. We design two-task and four-task D2NNs with two and four spectral channels, respectively, for classifying different inputs from the MNIST, FMNIST, KMNIST, and EMNIST databases. The numerical evaluations demonstrate that, at the same network size, multi-wavelength D2NNs achieve significantly higher classification accuracies for multi-task learning than single-wavelength D2NNs. Furthermore, by increasing the network size, multi-wavelength D2NNs simultaneously performing multiple tasks achieve classification accuracies comparable to those of multiple single-wavelength D2NNs trained individually to perform the tasks separately. Our work paves the way for developing wavelength-division multiplexing technology to achieve high-throughput neuromorphic photonic computing and more general AI systems that perform multiple tasks in parallel.  ( 2 min )
    Data-driven Science and Machine Learning Methods in Laser-Plasma Physics. (arXiv:2212.00026v1 [cs.LG])
    Laser-plasma physics has developed rapidly over the past few decades as lasers have become both more powerful and more widely available. Early experimental and numerical research in this field was dominated by single-shot experiments with limited parameter exploration. However, recent technological improvements make it possible to gather data for hundreds or thousands of different settings in both experiments and simulations. This has sparked interest in using advanced techniques from mathematics, statistics and computer science to deal with, and benefit from, big data. At the same time, sophisticated modeling techniques also provide new ways for researchers to deal effectively with situations where only sparse data are available. This paper aims to present an overview of relevant machine learning methods, with a focus on their applicability to laser-plasma physics and its important sub-fields of laser-plasma acceleration and inertial confinement fusion.  ( 2 min )
    Incentivising cooperation by rewarding the weakest member. (arXiv:2212.00119v1 [cs.MA])
    Autonomous agents that act with each other on behalf of humans are becoming more common in many social domains, such as customer service, transportation, and health care. In such social situations, greedy strategies can reduce the positive outcome for all agents, for example by leading to stop-and-go traffic on highways or causing a denial of service on a communications channel. Instead, we desire autonomous decision-making for efficient performance while also considering the equitability of the group to avoid these pitfalls. Unfortunately, in complex situations it is far easier to design machine learning objectives for selfish strategies than for equitable behaviors. Here we present a simple way to reward groups of agents in both evolution and reinforcement learning domains by the performance of their weakest member. We show how this yields ``fairer'', more equitable behavior, while also maximizing individual outcomes, and we show the relationship to the biological selection mechanisms of group-level selection and inclusive fitness theory.  ( 2 min )
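    A tiny sketch of the central idea: shape the group reward as the performance of the weakest member. The environment and per-agent returns are placeholders.

        import numpy as np

        def weakest_member_reward(individual_rewards):
            # every agent receives the minimum reward across the group
            return np.full_like(individual_rewards, individual_rewards.min())

        rewards = np.array([3.0, 0.5, 2.0])      # per-agent returns this episode
        print(weakest_member_reward(rewards))    # -> [0.5 0.5 0.5]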
    MrSARP: A Hierarchical Deep Generative Prior for SAR Image Super-resolution. (arXiv:2212.00069v1 [cs.CV])
    Generative models learned via deep learning can be used as priors in under-determined inverse problems, including imaging from a sparse set of measurements. In this paper, we present a novel hierarchical deep-generative model, MrSARP, for SAR imagery that can jointly synthesize SAR images of a target at different resolutions. MrSARP is trained in conjunction with a critic that scores multi-resolution images jointly to decide whether they are realistic images of a target at different resolutions. We show how this deep generative model can be used to retrieve the high spatial resolution image from low resolution images of the same target. The cost function of the generator is modified to improve its capability to retrieve the input parameters for a given set of resolution images. We evaluate the model's performance using the three standard error metrics for super-resolution on simulated data, and compare it to upsampling and sparsity-based image sharpening approaches.  ( 2 min )
    Location analysis of players in UEFA EURO 2020 and 2022 using generalized valuation of defense by estimating probabilities. (arXiv:2212.00021v1 [cs.LG])
    Analyzing defenses in team sports is generally challenging because of the limited event data. Researchers have previously proposed methods to evaluate football team defense by predicting the events of ball gain and being attacked using the locations of all players and the ball. However, they did not consider the importance of the events, assumed perfect observation of all 22 players, and did not fully investigate the influence of diversity (e.g., nationality and sex). Here, we propose a generalized valuation method for defensive teams by score-scaling the predicted probabilities of the events. Using the open-source location data of all players in broadcast video frames of football games from the men's Euro 2020 and women's Euro 2022, we investigated the effect of the number of players on the prediction and validated our approach by analyzing the games. Results show that for predictions of being attacked, scoring, and conceding, information on all players was not necessary, while predicting ball gain required information on three to four offensive and defensive players. With game analyses, we explain the defensive excellence of the finalist teams in Euro 2020. Our approach may be applicable to location data from broadcast video frames in football games.  ( 2 min )
    Autotuning PID control using Actor-Critic Deep Reinforcement Learning. (arXiv:2212.00013v1 [cs.LG])
    This work is exploratory research concerned with determining how reinforcement learning can be used to predict optimal PID parameters for a robot designed for apple harvest. To study this, the Advantage Actor Critic (A2C) algorithm is implemented on a simulated robot arm. The simulation primarily relies on the ROS framework. Experiments for tuning one actuator at a time and two actuators at a time are run, and both show that the model is able to predict PID gains that perform better than the set baseline. In addition, we study whether the model is able to predict PID parameters based on where an apple is located. Initial tests show that the model is indeed able to adapt its predictions to apple locations, making it an adaptive controller.  ( 2 min )
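    A minimal discrete PID loop illustrating what the A2C policy is tuning, namely the gains kp, ki, kd; the plant model and gain values below are invented, and the returned tracking cost stands in for the (negative) reward the agent would optimize.

        import numpy as np

        def run_pid(kp, ki, kd, setpoint=1.0, dt=0.01, steps=500):
            x, v, integral, prev_err = 0.0, 0.0, 0.0, setpoint
            errs = []
            for _ in range(steps):
                err = setpoint - x
                integral += err * dt
                u = kp * err + ki * integral + kd * (err - prev_err) / dt
                prev_err = err
                v += (u - 0.5 * v) * dt          # toy damped joint dynamics
                x += v * dt
                errs.append(err)
            return np.mean(np.abs(errs))         # tracking cost the agent minimizes

        print(run_pid(kp=20.0, ki=5.0, kd=1.0))  # gains a policy would output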
    A Light-weight, Effective and Efficient Model for Label Aggregation in Crowdsourcing. (arXiv:2212.00007v1 [cs.HC])
    Due to the noise in crowdsourced labels, label aggregation (LA) has emerged as a standard procedure for post-processing crowdsourced labels. LA methods estimate true labels from crowdsourced labels by modeling worker qualities. Most existing LA methods are iterative in nature: they need to traverse all the crowdsourced labels multiple times in order to jointly and iteratively update true labels and worker qualities until convergence. Consequently, these methods have high space and time complexities. In this paper, we treat LA as a dynamic system and model it as a Dynamic Bayesian network. From the dynamic model we derive two light-weight algorithms, LA\textsuperscript{onepass} and LA\textsuperscript{twopass}, which can effectively and efficiently estimate worker qualities and true labels by traversing all the labels at most twice. Due to the dynamic nature, the proposed algorithms can also estimate true labels online without re-visiting historical data. We theoretically prove the convergence property of the proposed algorithms and bound the error of the estimated worker qualities. We also analyze the space and time complexities of the proposed algorithms and show that they are equivalent to those of majority voting. Experiments conducted on 20 real-world datasets demonstrate that the proposed algorithms can effectively and efficiently aggregate labels in both offline and online settings, even when traversing all the labels at most twice.  ( 2 min )
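    A hedged single-pass baseline in the spirit of the paper (not the authors' algorithm): aggregate with majority voting while accumulating per-worker agreement counts in the same traversal, which is the kind of complexity the proposed methods match.

        from collections import Counter, defaultdict

        tasks = {                                 # task -> list of (worker, label)
            "t1": [("w1", 1), ("w2", 1), ("w3", 0)],
            "t2": [("w1", 0), ("w2", 1), ("w3", 0)],
        }

        quality = defaultdict(lambda: [0, 0])     # worker -> [agreements, answers]
        estimates = {}
        for task, answers in tasks.items():       # one pass over all labels
            majority = Counter(lbl for _, lbl in answers).most_common(1)[0][0]
            estimates[task] = majority
            for worker, lbl in answers:
                quality[worker][0] += int(lbl == majority)
                quality[worker][1] += 1

        print(estimates)
        print({w: agree / total for w, (agree, total) in quality.items()})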
    Attentional Ptycho-Tomography (APT) for three-dimensional nanoscale X-ray imaging with minimal data acquisition and computation time. (arXiv:2212.00014v1 [eess.IV])
    Noninvasive X-ray imaging of nanoscale three-dimensional objects, e.g. integrated circuits (ICs), generally requires two types of scanning: ptychographic, which is translational and returns estimates of the complex electromagnetic field through the ICs; and tomographic scanning, which collects complex field projections from multiple angles. Here, we present Attentional Ptycho-Tomography (APT), an approach trained to provide accurate reconstructions of ICs despite incomplete measurements, using a dramatically reduced amount of angular scanning. The training process includes regularizing priors based on typical IC patterns and on the physics of X-ray propagation. We demonstrate that APT with a 12-fold reduction in angles achieves fidelity comparable to the gold standard with the original set of angles. With the same set of reduced angles, APT also outperforms baseline reconstruction methods. In our experiments, APT achieves a 108-fold aggregate reduction in data acquisition and computation without compromising quality. We expect our physics-assisted machine learning framework could also be applied to other branches of nanoscale imaging.  ( 2 min )
  • Open

    Conditional Neural Processes for Molecules. (arXiv:2210.09211v2 [stat.ML] UPDATED)
    Neural processes (NPs) are models for transfer learning with properties reminiscent of Gaussian Processes (GPs). They are adept at modelling data consisting of few observations of many related functions on the same input space and are trained by minimizing a variational objective, which is computationally much less expensive than the Bayesian updating required by GPs. So far, most studies of NPs have focused on low-dimensional datasets which are not representative of realistic transfer learning tasks. Drug discovery is one application area that is characterized by datasets consisting of many chemical properties or functions which are sparsely observed, yet depend on shared features or representations of the molecular inputs. This paper applies the conditional neural process (CNP) to DOCKSTRING, a dataset of docking scores for benchmarking ML models. CNPs show competitive performance in few-shot learning tasks relative to supervised learning baselines common in chemoinformatics, as well as an alternative model for transfer learning based on pre-training and refining neural network regressors. We present a Bayesian optimization experiment which showcases the probabilistic nature of CNPs and discuss shortcomings of the model in uncertainty quantification.  ( 2 min )
    Tail Batch Sampling: Approximating Global Contrastive Losses as Optimization over Batch Assignments. (arXiv:2210.12874v2 [cs.LG] UPDATED)
    Contrastive Learning has recently achieved state-of-the-art performance in a wide range of tasks. Many contrastive learning approaches use mined hard negatives to make batches more informative during training, but these approaches are inefficient as they increase epoch length in proportion to the number of mined negatives and require frequent updates of nearest neighbor indices or mining from recent batches. In this work, we provide an alternative to hard negative mining in supervised contrastive learning, Tail Batch Sampling (TBS), an efficient approximation to the batch assignment problem that upper bounds the gap between the global and training losses, $\mathcal{L}^{Global} - \mathcal{L}^{Train}$. TBS \textbf{improves state-of-the-art performance} in sentence embedding (+0.37 Spearman) and code-search tasks (+2.2\% MRR), is easy to implement (requiring only a few additional lines of code), does not maintain external data structures such as nearest neighbor indices, is more computationally efficient than the most minimal hard negative mining approaches, and makes no changes to the model being trained.  ( 2 min )
    Solar Flare Index Prediction Using SDO/HMI Vector Magnetic Data Products with Statistical and Machine Learning Methods. (arXiv:2209.13779v3 [astro-ph.SR] UPDATED)
    Solar flares, especially M- and X-class flares, are often associated with coronal mass ejections (CMEs). They are the most important sources of space weather effects that can severely impact the near-Earth environment. Thus it is essential to forecast flares (especially the M- and X-class ones) to mitigate their destructive and hazardous consequences. Here, we introduce several statistical and machine learning approaches to predicting an active region's (AR) Flare Index (FI), which quantifies the flare productivity of an AR by taking into account the number of flares of different classes within a certain time interval. Specifically, our sample includes 563 ARs that appeared on the solar disk from May 2010 to December 2017. The 25 magnetic parameters, provided by the Space-weather HMI Active Region Patches (SHARP) from the Helioseismic and Magnetic Imager (HMI) on board the Solar Dynamics Observatory (SDO), characterize by proxy the coronal magnetic energy stored in ARs and are used as the predictors. We investigate the relationship between these SHARP parameters and the FI of ARs with a machine-learning algorithm (spline regression) and a resampling method (Synthetic Minority Over-Sampling Technique for Regression with Gaussian Noise, SMOGN for short). Based on the established relationship, we are able to predict the value of FI for a given AR within the next 1-day period. Compared with 4 other popular machine learning algorithms, our methods improve the accuracy of FI prediction, especially for large FI. In addition, we sort the importance of the SHARP parameters with the Borda count method, computed from the ranks rendered by 9 different machine learning methods.  ( 3 min )
    Convergence of Stochastic Approximation via Martingale and Converse Lyapunov Methods. (arXiv:2205.01303v2 [stat.ML] UPDATED)
    In this paper, we study the almost sure boundedness and the convergence of the stochastic approximation (SA) algorithm. At present, most available convergence proofs are based on the ODE method, and the almost sure boundedness of the iterations is an assumption and not a conclusion. In Borkar-Meyn (2000), it is shown that if the ODE has only one globally attractive equilibrium, then under additional assumptions, the iterations are bounded almost surely, and the SA algorithm converges to the desired solution. Our objective in the present paper is to provide an alternate proof of the above, based on martingale methods, which are simpler and less technical than those based on the ODE method. As a prelude, we prove a new sufficient condition for the global asymptotic stability of an ODE. Next we prove a ``converse'' Lyapunov theorem on the existence of a suitable Lyapunov function with a globally bounded Hessian, for a globally exponentially stable system. Both theorems are of independent interest to researchers in stability theory. Then, using these results, we provide sufficient conditions for the almost sure boundedness and the convergence of the SA algorithm. We show through examples that our theory covers some situations that are not covered by currently known results, specifically Borkar-Meyn (2000).  ( 2 min )
    Unsupervised Learning under Latent Label Shift. (arXiv:2207.13179v2 [cs.LG] UPDATED)
    What sorts of structure might enable a learner to discover classes from unlabeled data? Traditional approaches rely on feature-space similarity and heroic assumptions on the data. In this paper, we introduce unsupervised learning under Latent Label Shift (LLS), where we have access to unlabeled data from multiple domains such that the label marginals $p_d(y)$ can shift across domains but the class conditionals $p(\mathbf{x}|y)$ do not. This work instantiates a new principle for identifying classes: elements that shift together group together. For finite input spaces, we establish an isomorphism between LLS and topic modeling: inputs correspond to words, domains to documents, and labels to topics. Addressing continuous data, we prove that when each label's support contains a separable region, analogous to an anchor word, oracle access to $p(d|\mathbf{x})$ suffices to identify $p_d(y)$ and $p_d(y|\mathbf{x})$ up to permutation. Thus motivated, we introduce a practical algorithm that leverages domain-discriminative models as follows: (i) push examples through domain discriminator $p(d|\mathbf{x})$; (ii) discretize the data by clustering examples in $p(d|\mathbf{x})$ space; (iii) perform non-negative matrix factorization on the discrete data; (iv) combine the recovered $p(y|d)$ with the discriminator outputs $p(d|\mathbf{x})$ to compute $p_d(y|\mathbf{x}) \; \forall d$. With semi-synthetic experiments, we show that our algorithm can leverage domain information to improve upon competitive unsupervised classification methods. We reveal a failure mode of standard unsupervised classification methods when feature-space similarity does not indicate true groupings, and show empirically that our method better handles this case. Our results establish a deep connection between distribution shift and topic modeling, opening promising lines for future work.  ( 2 min )
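    A hedged end-to-end sketch of steps (i)-(iv) using scikit-learn stand-ins: a logistic-regression domain discriminator, clustering in $p(d|\mathbf{x})$ space, NMF, and a simplified version of the combination step. The data, the number of domains, cells, and classes are all placeholders.

        import numpy as np
        from sklearn.linear_model import LogisticRegression
        from sklearn.cluster import KMeans
        from sklearn.decomposition import NMF

        rng = np.random.default_rng(0)
        X = rng.normal(size=(900, 5))
        d = rng.integers(0, 6, size=900)                 # domain of each example

        disc = LogisticRegression(max_iter=1000).fit(X, d)
        P = disc.predict_proba(X)                        # (i) p(d|x) per example

        k = 4                                            # assumed number of classes
        cells = KMeans(n_clusters=20, n_init=10,
                       random_state=0).fit_predict(P)    # (ii) discretize p(d|x)

        M = np.zeros((6, 20))                            # (domain, cell) count matrix
        np.add.at(M, (d, cells), 1.0)
        nmf = NMF(n_components=k, init="nndsvda", max_iter=500).fit(M)  # (iii)
        W = nmf.transform(M)
        p_y_given_d = W / W.sum(1, keepdims=True)        # rows approx p(y|d)

        # (iv) crude combination of p(y|d) with p(d|x), up to label permutation
        p_y_given_x = P @ p_y_given_d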
    Solving a Special Type of Optimal Transport Problem by a Modified Hungarian Algorithm. (arXiv:2210.16645v3 [math.OC] UPDATED)
    Computing empirical Wasserstein distance in the independence test is an optimal transport (OT) problem with a special structure. This observation inspires us to study a special type of OT problem and propose a modified Hungarian algorithm to solve it exactly. For an OT problem involving two marginals with $m$ and $n$ atoms ($m\geq n$), respectively, the computational complexity of the proposed algorithm is $O(m^2n)$. Computing the empirical Wasserstein distance in the independence test requires solving this special type of OT problem, where $m=n^2$. The associated computational complexity of the proposed algorithm is $O(n^5)$, while the order of applying the classic Hungarian algorithm is $O(n^6)$. In addition to the aforementioned special type of OT problem, it is shown that the modified Hungarian algorithm could be adopted to solve a wider range of OT problems. Broader applications of the proposed algorithm are discussed -- solving the one-to-many and the many-to-many assignment problems. Numerical experiments are conducted to validate our theoretical results. The experiment results demonstrate that the proposed modified Hungarian algorithm compares favorably with the Hungarian algorithm and the well-known Sinkhorn algorithm.  ( 2 min )
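    A small illustration of the assignment view of this problem: for two empirical measures with equally many atoms and uniform weights, the 1-Wasserstein distance reduces to an assignment problem, solvable here with SciPy's Hungarian-style solver; the point clouds are placeholders.

        import numpy as np
        from scipy.optimize import linear_sum_assignment
        from scipy.spatial.distance import cdist

        rng = np.random.default_rng(0)
        x = rng.normal(size=(30, 2))                 # 30 atoms of the first measure
        y = rng.normal(loc=1.0, size=(30, 2))        # 30 atoms of the second

        C = cdist(x, y)                              # pairwise transport costs
        rows, cols = linear_sum_assignment(C)        # optimal one-to-one matching
        print("empirical W1:", C[rows, cols].mean())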
    Decomposing neural networks as mappings of correlation functions. (arXiv:2202.04925v2 [cond-mat.dis-nn] UPDATED)
    Understanding the functional principles of information processing in deep neural networks continues to be a challenge, in particular for networks with trained and thus non-random weights. To address this issue, we study the mapping between probability distributions implemented by a deep feed-forward network. We characterize this mapping as an iterated transformation of distributions, where the non-linearity in each layer transfers information between different orders of correlation functions. This allows us to identify essential statistics in the data, as well as different information representations that can be used by neural networks. Applied to an XOR task and to MNIST, we show that correlations up to second order predominantly capture the information processing in the internal layers, while the input layer also extracts higher-order correlations from the data. This analysis provides a quantitative and explainable perspective on classification.  ( 2 min )
    Random Graph Embedding and Joint Sparse Regularization for Multi-label Feature Selection. (arXiv:2204.06445v2 [stat.ML] UPDATED)
    Multi-label learning is often used to mine the correlation between variables and multiple labels, and its research focuses on fully extracting the information between variables and labels. The $\ell_{2,1}$ regularization is often used to obtain a sparse coefficient matrix, but it cannot effectively address multicollinearity among variables. In this paper, the proposed model chooses the most relevant variables by solving a jointly constrained optimization problem that combines $\ell_{2,1}$ regularization and Frobenius regularization. For manifold regularization, we carry out a random walk strategy based on the joint structure to construct a neighborhood graph, which is highly robust to outliers. In addition, we give an iterative algorithm for the proposed method and prove its convergence. Experiments on real-world data sets also show that the comprehensive performance of our method is consistently better than that of classical methods.  ( 2 min )
    Semantic uncertainty intervals for disentangled latent spaces. (arXiv:2207.10074v2 [cs.CV] UPDATED)
    Meaningful uncertainty quantification in computer vision requires reasoning about semantic information -- say, the hair color of the person in a photo or the location of a car on the street. To this end, recent breakthroughs in generative modeling allow us to represent semantic information in disentangled latent spaces, but providing uncertainties on the semantic latent variables has remained challenging. In this work, we provide principled uncertainty intervals that are guaranteed to contain the true semantic factors for any underlying generative model. The method does the following: (1) it uses quantile regression to output a heuristic uncertainty interval for each element in the latent space; and (2) it calibrates these uncertainties such that they contain the true value of the latent for a new, unseen input. The endpoints of these calibrated intervals can then be propagated through the generator to produce interpretable uncertainty visualizations for each semantic factor. This technique reliably communicates semantically meaningful, principled, and instance-adaptive uncertainty in inverse problems like image super-resolution and image completion.  ( 2 min )
    The SKIM-FA Kernel: High-Dimensional Variable Selection and Nonlinear Interaction Discovery in Linear Time. (arXiv:2106.12408v4 [stat.CO] UPDATED)
    Many scientific problems require identifying a small set of covariates that are associated with a target response and estimating their effects. Often, these effects are nonlinear and include interactions, so linear and additive methods can lead to poor estimation and variable selection. Unfortunately, methods that simultaneously express sparsity, nonlinearity, and interactions are computationally intractable -- with runtime at least quadratic in the number of covariates, and often worse. In the present work, we solve this computational bottleneck. We show that suitable interaction models have a kernel representation, namely there exists a "kernel trick" to perform variable selection and estimation in $O$(# covariates) time. Our resulting fit corresponds to a sparse orthogonal decomposition of the regression function in a Hilbert space (i.e., a functional ANOVA decomposition), where interaction effects represent all variation that cannot be explained by lower-order effects. On a variety of synthetic and real data sets, our approach outperforms existing methods used for large, high-dimensional data sets while remaining competitive (or being orders of magnitude faster) in runtime.  ( 2 min )
    Optimal Transport of Classifiers to Fairness. (arXiv:2202.03814v3 [cs.LG] UPDATED)
    In past work on fairness in machine learning, the focus has been on forcing the predictions of classifiers to have similar statistical properties for people of different demographics. To reduce the violation of these properties, fairness methods usually simply rescale the classifier scores, ignoring similarities and dissimilarities between members of different groups. Yet, we hypothesize that such information is relevant in quantifying the unfairness of a given classifier. To validate this hypothesis, we introduce Optimal Transport to Fairness (OTF), a method that quantifies the violation of fairness constraints as the smallest Optimal Transport cost between a probabilistic classifier and any score function that satisfies these constraints. For a flexible class of linear fairness constraints, we construct a practical way to compute OTF as a differentiable fairness regularizer that can be added to any standard classification setting. Experiments show that OTF can be used to achieve an improved trade-off between predictive power and fairness.  ( 2 min )
    Quantum machine learning of large datasets using randomized measurements. (arXiv:2108.01039v3 [quant-ph] UPDATED)
    Quantum computers promise to enhance machine learning for practical applications. Quantum machine learning for real-world data has to handle extensive amounts of high-dimensional data. However, conventional methods for measuring quantum kernels are impractical for large datasets as they scale with the square of the dataset size. Here, we measure quantum kernels using randomized measurements. The quantum computation time scales linearly with dataset size, and the classical post-processing scales quadratically. While our method in general scales exponentially in the qubit number, we gain a substantial speed-up when running on intermediate-sized quantum computers. Further, we efficiently encode high-dimensional data into quantum computers, with the number of features scaling linearly with the circuit depth. The encoding is characterized by the quantum Fisher information metric and is related to the radial basis function kernel. Our approach is robust to noise via a cost-free error mitigation scheme. We demonstrate the advantages of our methods for noisy quantum computers by classifying images with the IBM quantum computer. To achieve further speedups we distribute the quantum computational tasks between different quantum computers. Our method enables benchmarking of quantum machine learning algorithms with large datasets on currently available quantum computers.  ( 2 min )
    Cluster-Specific Predictions with Multi-Task Gaussian Processes. (arXiv:2011.07866v4 [cs.LG] UPDATED)
    A model involving Gaussian processes (GPs) is introduced to simultaneously handle multi-task learning, clustering, and prediction for multiple functional data. This procedure acts as a model-based clustering method for functional data as well as a learning step for subsequent predictions for new tasks. The model is instantiated as a mixture of multi-task GPs with common mean processes. A variational EM algorithm is derived for dealing with the optimisation of the hyper-parameters along with the hyper-posteriors' estimation of latent variables and processes. We establish explicit formulas for integrating the mean processes and the latent clustering variables within a predictive distribution, accounting for uncertainty on both aspects. This distribution is defined as a mixture of cluster-specific GP predictions, which enhances the performance when dealing with group-structured data. The model handles irregular grids of observations and offers different hypotheses on the covariance structure for sharing additional information across tasks. The performance on both clustering and prediction tasks is assessed through various simulated scenarios and real datasets. The overall algorithm, called MagmaClust, is publicly available as an R package.  ( 2 min )
    Approximate Bayesian Computation via Classification. (arXiv:2111.11507v4 [stat.ME] UPDATED)
    Approximate Bayesian Computation (ABC) enables statistical inference in simulator-based models whose likelihoods are difficult to calculate but easy to simulate from. ABC constructs a kernel-type approximation to the posterior distribution through an accept/reject mechanism which compares summary statistics of real and simulated data. To obviate the need for summary statistics, we directly compare empirical distributions with a Kullback-Leibler (KL) divergence estimator obtained via contrastive learning. In particular, we blend flexible machine learning classifiers within ABC to automate fake/real data comparisons. We consider the traditional accept/reject kernel as well as an exponential weighting scheme which does not require the ABC acceptance threshold. Our theoretical results show that the rate at which our ABC posterior distributions concentrate around the true parameter depends on the estimation error of the classifier. We derive limiting posterior shape results and find that, with a properly scaled exponential kernel, asymptotic normality holds. We demonstrate the usefulness of our approach on simulated examples as well as real data in the context of stock volatility estimation.  ( 2 min )
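    A bare-bones ABC accept/reject sketch on a toy Gaussian simulator; here a summary statistic drives the comparison, whereas the paper replaces this step with a classifier-based KL estimate comparing empirical distributions directly. All numbers are illustrative.

        import numpy as np

        rng = np.random.default_rng(0)
        observed = rng.normal(loc=2.0, size=200)        # "real" data, true mu = 2

        def simulate(mu):
            return rng.normal(loc=mu, size=200)

        accepted = []
        for _ in range(5000):
            mu = rng.uniform(-5, 5)                     # draw from the prior
            fake = simulate(mu)
            if abs(fake.mean() - observed.mean()) < 0.1:    # accept/reject kernel
                accepted.append(mu)

        print("approximate posterior mean:", np.mean(accepted))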
    Dendrite Net with Acceleration Module for Faster Nonlinear Mapping and System Identification. (arXiv:2006.02901v2 [cs.LG] UPDATED)
    Nonlinear mapping is an essential and common demand in online systems, such as sensor systems and mobile phones. Accelerating nonlinear mapping will directly speed up online systems. Previously the authors of this paper proposed a Dendrite Net (DD) with enormously lower time complexity than the existing nonlinear mapping algorithms; however, there still are redundant calculations in DD. This paper presents a DD with an acceleration module (AC) to accelerate nonlinear mapping further. We conduct three experiments to verify whether DD with AC has lower time complexity while retaining DD's nonlinear mapping properties and system identification properties: The first experiment is the precision and identification of unary nonlinear mapping, reflecting the calculation performance using DD with AC for basic functions in online systems. The second experiment is the mapping precision and identification of the multi-input nonlinear system, reflecting the performance for designing online systems via DD with AC. Finally, this paper compares the time complexity of DD and DD with AC and analyzes the theoretical reasons through repeated experiments. Results: DD with AC retains DD's excellent mapping and identification properties and has lower time complexity. Significance: DD with AC can be used for most engineering systems, such as sensor systems, and will speed up computation in these online systems. The code of DD with AC is available on https://github.com/liugang1234567/Gang-neuron  ( 2 min )
    Learning Multi-Agent Coordination through Connectivity-driven Communication. (arXiv:2002.05233v4 [cs.LG] UPDATED)
    In artificial multi-agent systems, the ability to learn collaborative policies is predicated upon the agents' communication skills: they must be able to encode the information received from the environment and learn how to share it with other agents as required by the task at hand. We present a deep reinforcement learning approach, Connectivity Driven Communication (CDC), that facilitates the emergence of multi-agent collaborative behaviour through experience alone. The agents are modelled as nodes of a weighted graph whose state-dependent edges encode pair-wise messages that can be exchanged. We introduce a graph-dependent attention mechanism that controls how the agents' incoming messages are weighted. This mechanism takes full account of the current state of the system as represented by the graph, and builds upon a diffusion process that captures how the information flows on the graph. The graph topology is not assumed to be known a priori, but depends dynamically on the agents' observations, and is learnt concurrently with the attention mechanism and policy in an end-to-end fashion. Our empirical results show that CDC is able to learn effective collaborative policies and can outperform competing learning algorithms on cooperative navigation tasks.  ( 2 min )
    High-dimensional density estimation with tensorizing flow. (arXiv:2212.00759v1 [cs.LG])
    We propose the tensorizing flow method for estimating high-dimensional probability density functions from the observed data. The method is based on tensor-train and flow-based generative modeling. Our method first efficiently constructs an approximate density in the tensor-train form via solving the tensor cores from a linear system based on the kernel density estimators of low-dimensional marginals. We then train a continuous-time flow model from this tensor-train density to the observed empirical distribution by performing a maximum likelihood estimation. The proposed method combines the optimization-less feature of the tensor-train with the flexibility of the flow-based generative models. Numerical results are included to demonstrate the performance of the proposed method.  ( 2 min )
    Learning Transition Operators From Sparse Space-Time Samples. (arXiv:2212.00746v1 [cs.IT])
    We consider the nonlinear inverse problem of learning a transition operator $\mathbf{A}$ from partial observations at different times, in particular from sparse observations of entries of its powers $\mathbf{A},\mathbf{A}^2,\cdots,\mathbf{A}^{T}$. This Spatio-Temporal Transition Operator Recovery problem is motivated by the recent interest in learning time-varying graph signals that are driven by graph operators depending on the underlying graph topology. We address the nonlinearity of the problem by embedding it into a higher-dimensional space of suitable block-Hankel matrices, where it becomes a low-rank matrix completion problem, even if $\mathbf{A}$ is of full rank. For both a uniform and an adaptive random space-time sampling model, we quantify the recoverability of the transition operator via suitable measures of incoherence of these block-Hankel embedding matrices. For graph transition operators these measures of incoherence depend on the interplay between the dynamics and the graph topology. We develop a suitable non-convex iterative reweighted least squares (IRLS) algorithm, establish its quadratic local convergence, and show that, in optimal scenarios, no more than $\mathcal{O}(rn \log(nT))$ space-time samples are sufficient to ensure accurate recovery of a rank-$r$ operator $\mathbf{A}$ of size $n \times n$. This establishes that spatial samples can be substituted by a comparable number of space-time samples. We provide an efficient implementation of the proposed IRLS algorithm with space complexity of order $O(r n T)$ and per-iteration time complexity linear in $n$. Numerical experiments for transition operators based on several graph models confirm that the theoretical findings accurately track empirical phase transitions, and illustrate the applicability and scalability of the proposed algorithm.  ( 2 min )
    Penalized Langevin and Hamiltonian Monte Carlo Algorithms for Constrained Sampling. (arXiv:2212.00570v1 [stat.ML])
    We consider the constrained sampling problem where the goal is to sample from a distribution $\pi(x)\propto e^{-f(x)}$ and $x$ is constrained on a convex body $\mathcal{C}\subset \mathbb{R}^d$. Motivated by penalty methods from optimization, we propose penalized Langevin Dynamics (PLD) and penalized Hamiltonian Monte Carlo (PHMC) that convert the constrained sampling problem into an unconstrained one by introducing a penalty function for constraint violations. When $f$ is smooth and the gradient is available, we show $\tilde{\mathcal{O}}(d/\varepsilon^{10})$ iteration complexity for PLD to sample the target up to an $\varepsilon$-error where the error is measured in terms of the total variation distance and $\tilde{\mathcal{O}}(\cdot)$ hides some logarithmic factors. For PHMC, we improve this result to $\tilde{\mathcal{O}}(\sqrt{d}/\varepsilon^{7})$ when the Hessian of $f$ is Lipschitz and the boundary of $\mathcal{C}$ is sufficiently smooth. To our knowledge, these are the first convergence rate results for Hamiltonian Monte Carlo methods in the constrained sampling setting that can handle non-convex $f$ and can provide guarantees with the best dimension dependency among existing methods with deterministic gradients. We then consider the setting where unbiased stochastic gradients are available. We propose PSGLD and PSGHMC that can handle stochastic gradients without Metropolis-Hasting correction steps. When $f$ is strongly convex and smooth, we obtain an iteration complexity of $\tilde{\mathcal{O}}(d/\varepsilon^{18})$ and $\tilde{\mathcal{O}}(d\sqrt{d}/\varepsilon^{39})$ respectively in the 2-Wasserstein distance. For the more general case, when $f$ is smooth and non-convex, we also provide finite-time performance bounds and iteration complexity results. Finally, we test our algorithms on Bayesian LASSO regression and Bayesian constrained deep learning problems.  ( 2 min )
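    To make the penalty idea concrete, here is a minimal sketch of penalized Langevin dynamics for a standard Gaussian target restricted to a box; the quadratic penalty, step size, and penalty weight below are illustrative assumptions, not the paper's tuned choices.

```python
import numpy as np

# Penalized Langevin dynamics (PLD) sketch: sample pi(x) ∝ exp(-||x||^2 / 2)
# restricted to the box C = [-1, 1]^d via a smooth penalty for violations.
rng = np.random.default_rng(0)
d, eta, mu, n_steps = 5, 1e-3, 50.0, 10_000

grad_f = lambda x: x  # gradient of f(x) = ||x||^2 / 2

def grad_penalty(x):
    # gradient of (mu / 2) * dist(x, C)^2 for the box C = [-1, 1]^d
    return mu * (x - np.clip(x, -1.0, 1.0))

x = np.zeros(d)
for _ in range(n_steps):
    noise = np.sqrt(2.0 * eta) * rng.standard_normal(d)
    x = x - eta * (grad_f(x) + grad_penalty(x)) + noise

print(x)  # a sample concentrated near the constraint set
```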
    Dense Hebbian neural networks: a replica symmetric picture of supervised learning. (arXiv:2212.00606v1 [cond-mat.dis-nn])
We consider dense associative neural networks trained by a teacher (i.e., with supervision) and we investigate their computational capabilities analytically, via the statistical mechanics of spin glasses, and numerically, via Monte Carlo simulations. In particular, we obtain a phase diagram summarizing their performance as a function of control parameters such as the quality and quantity of the training dataset, network storage and noise, valid in the limit of large network size and structureless datasets: these networks may work in an ultra-storage regime (where they can handle a huge amount of patterns compared with shallow neural networks) or in an ultra-detection regime (where they can perform pattern recognition at prohibitive signal-to-noise ratios compared with shallow neural networks). Guided by the theory for random datasets as a reference framework, we also numerically test the learning, storing and retrieval capabilities of these networks on structured datasets such as MNIST and Fashion-MNIST. As technical remarks, on the analytic side, we implement large deviations and stability analysis within Guerra's interpolation to tackle the non-Gaussian distributions involved in the post-synaptic potentials, while, on the computational side, we insert the Plefka approximation in the Monte Carlo scheme to speed up the evaluation of the synaptic tensors, overall obtaining a novel and broad approach to investigate supervised learning in neural networks beyond the shallow limit.  ( 2 min )
    Shining light on data: Geometric data analysis through quantum dynamics. (arXiv:2212.00682v1 [quant-ph])
    Experimental sciences have come to depend heavily on our ability to organize, interpret and analyze high-dimensional datasets produced from observations of a large number of variables governed by natural processes. Natural laws, conservation principles, and dynamical structure introduce intricate inter-dependencies among these observed variables, which in turn yield geometric structure, with fewer degrees of freedom, on the dataset. We show how fine-scale features of this structure in data can be extracted from \emph{discrete} approximations to quantum mechanical processes given by data-driven graph Laplacians and localized wavepackets. This data-driven quantization procedure leads to a novel, yet natural uncertainty principle for data analysis induced by limited data. We illustrate the new approach with algorithms and several applications to real-world data, including the learning of patterns and anomalies in social distancing and mobility behavior during the COVID-19 pandemic.  ( 2 min )
    Regularization with Fake Features. (arXiv:2212.00433v1 [cs.LG])
Recent successes of massively overparameterized models have inspired a new line of work investigating the underlying conditions that enable overparameterized models to generalize well. This paper considers a framework where the possibly overparameterized model includes fake features, i.e., features that are present in the model but not in the data. We present a non-asymptotic high-probability bound on the generalization error of the ridge regression problem under the model misspecification of having fake features. Our high-probability results characterize the interplay between the implicit regularization provided by the fake features and the explicit regularization provided by the ridge parameter. We observe that fake features may improve the generalization error, even though they are irrelevant to the data.  ( 2 min )
    Multi-Source Survival Domain Adaptation. (arXiv:2212.00424v1 [cs.LG])
Survival analysis is the branch of statistics that studies the relation between the characteristics of living entities and their respective survival times, taking into account the partial information held by censored cases. A good analysis can, for example, determine whether one medical treatment for a group of patients is better than another. With the rise of machine learning, survival analysis can be modeled as learning a function that maps studied patients to their survival times. To succeed with that, there are three crucial issues to be tackled. First, some patient data is censored: we do not know the true survival times for all patients. Second, data is scarce, which has led past research to treat different illness types as domains in a multi-task setup. Third, there is the need for adaptation to new or extremely rare illness types, where few or no labels are available. In contrast to previous multi-task setups, we want to investigate how to efficiently adapt to a new survival target domain from multiple survival source domains. For this, we introduce a new survival metric and the corresponding discrepancy measure between survival distributions. These allow us to define domain adaptation for survival analysis while incorporating censored data, which would otherwise have to be dropped. Our experiments on two cancer data sets reveal superb performance on target domains, a better treatment recommendation, and a weight matrix with a plausible explanation.  ( 2 min )
    Scalable Variational Bayes methods for Hawkes processes. (arXiv:2212.00293v1 [math.ST])
Multivariate Hawkes processes are temporal point processes extensively applied to model event data with dependence on past occurrences and interaction phenomena. In the generalised nonlinear model, positive and negative interactions between the components of the process are allowed, therefore accounting for so-called excitation and inhibition effects. In the nonparametric setting, learning the temporal dependence structure of Hawkes processes is often a computationally expensive task, all the more with Bayesian estimation methods. In general, the posterior distribution in the nonlinear Hawkes model is non-conjugate and doubly intractable. Moreover, existing Markov chain Monte Carlo methods are often slow and not scalable to high-dimensional processes in practice. Recently, efficient algorithms targeting a mean-field variational approximation of the posterior distribution have been proposed. In this work, we unify existing variational Bayes inference approaches under a general framework, which we theoretically analyse under easily verifiable conditions on the prior, the variational class, and the model. We notably apply our theory to a novel spike-and-slab variational class, which can induce sparsity through the connectivity graph parameter of the multivariate Hawkes model. Then, in the context of the popular sigmoid Hawkes model, we leverage an existing data augmentation technique and design adaptive and sparsity-inducing mean-field variational methods. In particular, we propose a two-step algorithm based on a thresholding heuristic to select the graph parameter. Through an extensive set of numerical simulations, we demonstrate that our approach enjoys several benefits: it is computationally efficient, can reduce the dimensionality of the problem by selecting the graph parameter, and is able to adapt to the smoothness of the underlying parameter.  ( 2 min )
    From CNNs to Shift-Invariant Twin Wavelet Models. (arXiv:2212.00394v1 [cs.CV])
    We propose a novel antialiasing method to increase shift invariance in convolutional neural networks (CNNs). More precisely, we replace the conventional combination "real-valued convolutions + max pooling" ($\mathbb R$Max) by "complex-valued convolutions + modulus" ($\mathbb C$Mod), which produce stable feature representations for band-pass filters with well-defined orientations. In a recent work, we proved that, for such filters, the two operators yield similar outputs. Therefore, $\mathbb C$Mod can be viewed as a stable alternative to $\mathbb R$Max. To separate band-pass filters from other freely-trained kernels, in this paper, we designed a "twin" architecture based on the dual-tree complex wavelet packet transform, which generates similar outputs as standard CNNs with fewer trainable parameters. In addition to improving stability to small shifts, our experiments on AlexNet and ResNet showed increased prediction accuracy on natural image datasets such as ImageNet and CIFAR10. Furthermore, our approach outperformed recent antialiasing methods based on low-pass filtering by preserving high-frequency information, while reducing memory usage.  ( 2 min )
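    A toy contrast of the two operators, with random kernels standing in for the band-pass filters (the actual twin architecture uses dual-tree complex wavelet packet filters); both paths subsample by a factor of 2.

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 64, 64)
w_real = torch.randn(1, 1, 7, 7)
w_imag = torch.randn(1, 1, 7, 7)

# RMax: real-valued convolution + max pooling
rmax = F.max_pool2d(F.conv2d(x, w_real, padding=3), 2)

# CMod: complex-valued convolution + pointwise modulus
re = F.conv2d(x, w_real, padding=3, stride=2)
im = F.conv2d(x, w_imag, padding=3, stride=2)
cmod = torch.sqrt(re**2 + im**2)

print(rmax.shape, cmod.shape)  # both (1, 1, 32, 32)
```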
    Generative Adversarial Learning of Sinkhorn Algorithm Initializations. (arXiv:2212.00133v1 [cs.LG])
The Sinkhorn algorithm (arXiv:1306.0895) is the state of the art for computing approximations of optimal transport distances between discrete probability distributions, making use of an entropically regularized formulation of the problem. The algorithm is guaranteed to converge, no matter its initialization. This has led to little attention being paid to initializing it, and simple starting vectors like the n-dimensional one-vector are common choices. We train a neural network to compute initializations for the algorithm, which significantly outperform standard initializations. The network predicts a potential of the optimal transport dual problem, where training is conducted in an adversarial fashion using a second, generating network. The network is universal in the sense that it is able to generalize to any pair of distributions of fixed dimension. Furthermore, we show that for certain applications the network can be used independently.  ( 2 min )
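    For reference, a minimal NumPy Sinkhorn loop with a warm-start hook; in the paper's setting a network would predict a dual potential g, from which u_init = exp(g / eps) (our reading; problem sizes are illustrative).

```python
import numpy as np

def sinkhorn(C, a, b, eps=0.1, n_iters=200, u_init=None):
    """Entropically regularized OT; u_init lets a learned potential
    warm-start the iterations instead of the usual all-ones vector."""
    K = np.exp(-C / eps)
    u = np.ones_like(a) if u_init is None else u_init
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]  # transport plan

rng = np.random.default_rng(0)
n = 50
a = b = np.ones(n) / n
P = sinkhorn(rng.random((n, n)), a, b)
print(P.sum())  # ≈ 1, a valid coupling
```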
    Locally Adaptive Hierarchical Cluster Termination With Application To Individual Tree Delineation. (arXiv:2212.00288v1 [stat.ML])
A clustering termination procedure which is locally adaptive (with respect to the hierarchical tree of sets representing the agglomerative merging) is proposed for agglomerative hierarchical clustering on a set equipped with a distance function. It represents a multi-scale alternative to conventional scale-dependent, threshold-based termination criteria.  ( 2 min )
Are you using test log-likelihood correctly? (arXiv:2212.00219v1 [stat.ML])
    Test log-likelihood is commonly used to compare different models of the same data and different approximate inference algorithms for fitting the same probabilistic model. We present simple examples demonstrating how comparisons based on test log-likelihood can contradict comparisons according to other objectives. Specifically, our examples show that (i) conclusions about forecast accuracy based on test log-likelihood comparisons may not agree with conclusions based on other distributional quantities like means; and (ii) that approximate Bayesian inference algorithms that attain higher test log-likelihoods need not also yield more accurate posterior approximations.  ( 2 min )
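    A toy version of the paper's point: model B attains a higher test log-likelihood than model A even though its predictive mean is worse. All numbers below are made up for illustration.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
y = rng.normal(0.0, 1.0, size=1000)  # test data, true mean 0, std 1

ll_a = norm.logpdf(y, loc=0.0, scale=0.5).mean()  # A: correct mean, too confident
ll_b = norm.logpdf(y, loc=0.3, scale=1.0).mean()  # B: biased mean, right spread

print(f"A: {ll_a:.3f}  B: {ll_b:.3f}")  # B wins on log-likelihood,
                                        # A wins on mean accuracy
```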
    Quadratically Regularized Optimal Transport: nearly optimal potentials and convergence of discrete Laplace operators. (arXiv:2212.00103v1 [math.AP])
We consider the conjecture proposed in Matsumoto, Zhang and Schiebinger (2022) suggesting that optimal transport with quadratic regularisation can be used to construct a graph whose discrete Laplace operator converges to the Laplace--Beltrami operator. We derive first order optimal potentials for the problem under consideration and find that the resulting solutions exhibit a surprising resemblance to the well-known Barenblatt--Pattle solution of the porous medium equation. Then, relying on these first order optimal potentials, we derive the pointwise $L^2$-limit of such discrete operators built from an i.i.d. random sample on a smooth compact manifold. Simulation results complementing the limiting distribution results are also presented.  ( 2 min )
    Gated Recurrent Neural Networks with Weighted Time-Delay Feedback. (arXiv:2212.00228v1 [cs.LG])
    We introduce a novel gated recurrent unit (GRU) with a weighted time-delay feedback mechanism in order to improve the modeling of long-term dependencies in sequential data. This model is a discretized version of a continuous-time formulation of a recurrent unit, where the dynamics are governed by delay differential equations (DDEs). By considering a suitable time-discretization scheme, we propose $\tau$-GRU, a discrete-time gated recurrent unit with delay. We prove the existence and uniqueness of solutions for the continuous-time model, and we demonstrate that the proposed feedback mechanism can help improve the modeling of long-term dependencies. Our empirical results show that $\tau$-GRU can converge faster and generalize better than state-of-the-art recurrent units and gated recurrent architectures on a range of tasks, including time-series classification, human activity recognition, and speech recognition.  ( 2 min )
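    A schematic sketch of a gated recurrent unit with weighted time-delay feedback, where the update sees both h[t-1] and h[t-tau]; this is one reading of the abstract, not the authors' exact parameterization.

```python
import torch
import torch.nn as nn

class DelayGRUCell(nn.Module):
    """Schematic GRU with a learnable-weight delayed hidden state."""
    def __init__(self, input_size, hidden_size, tau=5):
        super().__init__()
        self.cell = nn.GRUCell(input_size, hidden_size)
        self.tau = tau
        self.alpha = nn.Parameter(torch.tensor(0.1))  # delay weight

    def forward(self, x):  # x: (seq_len, batch, input_size)
        h_hist = [x.new_zeros(x.size(1), self.cell.hidden_size)]
        for t in range(x.size(0)):
            h_delayed = h_hist[max(0, len(h_hist) - 1 - self.tau)]
            h = self.cell(x[t], h_hist[-1] + self.alpha * h_delayed)
            h_hist.append(h)
        return torch.stack(h_hist[1:])

out = DelayGRUCell(3, 8)(torch.randn(20, 4, 3))
print(out.shape)  # (20, 4, 8)
```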

  • Open

    Off-the-shelf AI got me pretty far this Halloween: Luxonis OAK-D camera w/ Mobilenet for face tracking, CLIP for costume ID. Didn't have to train anything! (xpost /r/computervision)
    submitted by /u/etotheipi_ [link] [comments]  ( 48 min )
    Upcoming Talk: What is Model Serving? Thurs Dec 8 @ 12:30PM EST
    Once you've trained your ML model, the next step towards production deployment is model serving. This tech talk will break down what it means to turn your ML models into microservices and API endpoints that can be deployed and run anywhere. We'll explore the role of containers in model serving and how you can quickly prepare your models for production deployment using an open-source solution, chassis.ml. Tune in Thursday Dec 8 @ 12:30PM EST to the Modzy Discord Server. submitted by /u/modzykirsten [link] [comments]  ( 50 min )
    Nothing to see here, just a normal AI generated image, from starryai lol(z)
    submitted by /u/ThatDumbInternetGuy [link] [comments]  ( 46 min )
Is there an independent non-human entity that builds things in real life?
    Please redirect me to the right place if I'm lost. Is there an independent non-human entity that builds things in real life? Here's an example. Software. I write software that does the following: contact building construction companies; give one of them a job, e.g. a temple, monument, etc., with detailed instructions about design, location, etc.; pay them (the software interacts with online banking, 2FA stuff, and is good enough to work despite UI changes, so the software has control over bank accounts that are funded for this purpose; server costs, 2FA costs, etc. are paid from this account as well); check if the job is complete; do this once a year for the next N years. P.S. Funds in the account are invested in index funds and the software withdraws money only when needed. The server is rented from 5 different large providers in case 4 of them stop working in a way the software knows how to interact with, and the software runs concurrently from 5 locations in case 4 of them go out, etc. I can write 2 more pages of details to make this sound a lot more realistic, but I hope you get the idea. I die and no one else knows that this exists. Yet this entity keeps creating buildings in real life. submitted by /u/TWO-AT-THE-SAME-TIME [link] [comments]  ( 47 min )
    Alpha matrix calculation improvements
I recently read how DeepMind's AlphaTensor AI was used to reduce the number of calculations required to multiply matrices. Since NNs often use matrices, could these new algorithms be applied back into the same AI to make it a little more efficient? submitted by /u/lawless_c [link] [comments]  ( 49 min )
    New Easy Dreambooth Training Service! OpenArt Photo Booth
    submitted by /u/PuppetHere [link] [comments]  ( 46 min )
    Elon Musk Reveals Neuralink "N1" BCI Device And Future Technology Plans
    submitted by /u/kenickh [link] [comments]  ( 45 min )
    Hidden abilities of large language models: Is emergence the norm?
    submitted by /u/Number_5_alive [link] [comments]  ( 55 min )
    AI Dream 95 - InvokeAI is truly INCREDIBLE!
    submitted by /u/LordPewPew777 [link] [comments]  ( 46 min )
    I asked ChatGPT to make me Unity C# code that generates procedural hilly terrain, and a camera controller that allows me to fly around it using the keyboard and mouse.
    submitted by /u/apinanaivot [link] [comments]  ( 57 min )
    ChatGPT Is Mind-Blowing — Everything You Need To Know
    submitted by /u/SupPandaHugger [link] [comments]  ( 67 min )
    Best ai company to invest in?
Looking to invest in an AI company that is up and coming. Was looking at c3.ai as a starter. Are there any other good companies to invest in? submitted by /u/Brumbies5 [link] [comments]  ( 50 min )
    OpenAI invites everyone to test new AI-powered chatbot—with amusing results
    submitted by /u/pollylang [link] [comments]  ( 46 min )
  • Open

[D] In an optimal world, how would you wish variance between runs based on different random seeds were reported in papers?
    In many papers, no confidence estimates are reported at all (one has to assume the best results for one's own method are reported). In other papers, the min/max or the standard deviation as well as the mean are reported. Even more rarely, the mean and the standard error of the mean are reported. Once in a blue moon, an actual statistical test is run. Given that there plainly is no consensus in the field on how to handle this issue, what is the best way to do it in your opinion? submitted by /u/optimized-adam [link] [comments]  ( 69 min )
    [D] This neural network was generated by a neural network
    I want to share that I have been generating a neural network using GPT over the past two days. Here are links to the Autoencoder and VAE for audio, the code that GPT generated: https://github.com/nikuson/AudioVAE https://github.com/nikuson/Audio-Autoencoder submitted by /u/Nik_uson [link] [comments]  ( 60 min )
    [D] Do neural networks take care of feature engineering?
Is it correct that a sufficiently large - in terms of layers and nodes - neural network, when trained, kind of performs feature engineering? I know that would not be the formal way to describe it, but does a neural network find interesting patterns in the data that are kind of like features, maybe even ones that are difficult to describe? Here's an example of what I'm getting at: say I'm trying to predict what picture a person is looking at based upon their brain activity, which I measure with EEG as a time series across 128 electrodes. With a neural network, can I just feed in the raw time-series voltage recordings and trust that it will take care of discerning any valuable features, or should I also create a bunch of features from the data - like mean, std. dev, median, entropy etc? Thanks! submitted by /u/Steve_Sizzou [link] [comments]  ( 61 min )
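    For context, a minimal sketch of the "feed in the raw signal" option: a 1D CNN over raw multi-channel EEG, where the early convolutional layers act as learned filters and so play the role of hand-crafted features. All shapes here are illustrative.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv1d(128, 64, kernel_size=25, stride=4),  # 128 electrodes in
    nn.ReLU(),
    nn.Conv1d(64, 32, kernel_size=9, stride=2),
    nn.ReLU(),
    nn.AdaptiveAvgPool1d(1),
    nn.Flatten(),
    nn.Linear(32, 10),                             # 10 candidate pictures
)
x = torch.randn(8, 128, 1024)  # batch of 8 raw EEG windows
print(model(x).shape)          # (8, 10)
```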
    [D] PyTorch 2.0 Announcement
    PyTorch 2.0 was just announced at the PyTorch Conference: https://pytorch.org/get-started/pytorch-2.0/ See also the accompanying twitter thread: https://twitter.com/PyTorch/status/1598708792598069249 submitted by /u/joshadel [link] [comments]  ( 63 min )
    [P] Tasknet: a library for easy fine-tuning with Huggingface Trainer and Datasets
Hi everyone, I'm annoyed by the amount of code required to train a transformer with a huggingface dataset (compute_metrics, data_collators, etc.), so I made this library: https://github.com/sileod/tasknet You just have to map the HF dataset to pre-defined templates, and tasknet manages the interface between HF datasets and the HF Trainer. It also enables multi-task learning. I would be glad to have some feedback. Thanks submitted by /u/Jean-Porte [link] [comments]  ( 56 min )
    [D] Entropy in feature engineering
    I am thinking about the relative "importance" of the variables in a given data set for a general set of classifiers, and in particular the role that entropy could play here. Below, I'll sketch a heuristic "pseudo argument" that could perhaps motivate the use of entropy to reduce the number of variables. For simplicity, let's assume that we have only three variables, so the vectors in the data set have the form (x, y, z) and are labelled as either 0 or 1. Let's now investigate the role of the z-variable in terms of entropy. Recall that we consider the coordinate hyperplanes H(z, c):={(x,y,z): z=c} for different values c, and we compute the entropies, e(z, c, L), e(z, c, R), defined by the probability distribution of the labelled vectors on both sides of the hyperplane. We then use the pro…  ( 61 min )
    [Project] Awesome Matting Project with SOTA Models and End-to-end Deployment
Hi, I would like to introduce a Matting project, which provides capabilities ranging from data preparation to model training, evaluation, and deployment. This might be of some help to you. Hope you enjoy it. Code: https://github.com/PaddlePaddle/PaddleSeg/tree/release/2.7/Matting Main Features: Hair-level segmentation assisted by color purification makes it achieve perfect foreground extraction. Contains 10+ algorithms including traditional methods, trimap-based models and trimap-free models. Includes about 20 data transformers, enabling rich data preprocessing by combination. Open-sources 7+ pretrained human matting models for different scenarios which can be used directly. submitted by /u/cgw-123 [link] [comments]  ( 56 min )
    [D] Looking for 100 data scientists to interview for video series
    Note: I already posted this on r/datascience yesterday, but I figured this thread also has a lot of data scientists willing to help! I'm starting a YouTube channel focused on helping new learners break into the data science field. I'm sure someone has helped you on your journey (a friend, professor, mentor, etc). Sadly, some people have nobody...I'm making these videos for them. The premise of the channel is simple. I'm asking 100 data scientists questions about their journey (tips, mistakes, challenges, etc) and then turning those questions into YouTube videos. Here's an example: "100 Data Scientists Reveal Common Newbie Mistakes" OR "100 Data Scientists Share Their #1 Interview Tip" There's a lot of data science-related content out there, but there's nothing like this. Many new learners are lost and could really use your help. So if you're a data scientist and have 20-30 minutes to spare, I'd love to ask you a few questions via a Zoom call. The only requirement is that you're employed as a full-time data scientist. If you're interested, please send me a DM! Thanks so much! p.s. When I say 100, I really mean it. I have roughly 28 interviews scheduled this last week by reaching out to people on LinkedIn and r/datascience. But I still got a ways to go! So if you know anybody you think would love to help, please share this with them! p.p.s. I will also be interviewing data engineers and data analysts, but I won't be doing those for another few months. submitted by /u/JohnDS1503 [link] [comments]  ( 60 min )
  • Open

    Introducing one-step classification and entity recognition with Amazon Comprehend for intelligent document processing
    “Intelligent document processing (IDP) solutions extract data to support automation of high-volume, repetitive document processing tasks and for analysis and insight. IDP uses natural language technologies and computer vision to extract data from structured and unstructured content, especially from documents, to support automation and augmentation.”  – Gartner The goal of Amazon’s intelligent document processing (IDP) […]  ( 10 min )
  • Open

    Speech AI Expands Global Reach With Telugu Language Breakthrough
    More than 75 million people speak Telugu, predominantly in India’s southern regions, making it one of the most widely spoken languages in the country. Despite such prevalence, Telugu is considered a low-resource language when it comes to speech AI. This means there aren’t enough hours’ worth of speech datasets to easily and accurately create AI Read article > The post Speech AI Expands Global Reach With Telugu Language Breakthrough appeared first on NVIDIA Blog.  ( 6 min )
  • Open

    Learning to fly a 2D drone through obstacles
    submitted by /u/MajLenn [link] [comments]  ( 64 min )
    For a time limited game, should I put "time left" and "score" as part of the observation space for neural network?
I'm trying to use a deep neural network to learn and play a 1 vs 1 (Tetris) game with a time limit. Basically, the game's rule is simple: with the time counting down, whoever scores the most by the end of the game wins. The game has different levels that come with more time and faster game speed. I'm unsure how to best deal with the "time left" and "score" in the observation space. 1. Since time is a big factor for the game, is it wise to log-normalize it and include it in the observation space? 2. If the time limit increases from Game #1, Game #2 ... to Game #N, this will ultimately lead to a large "score" and a big "time left". Without knowing the upper bound of the score, how do I best design the observation in this case? Please let me know if this is not the best place to ask this kind of question; some pointers on where I can ask or find the answer would be great. submitted by /u/move37th [link] [comments]  ( 66 min )
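    One possible way to handle this (an assumption-laden suggestion, not a known best practice): log-scale both unbounded quantities so early and late games share a range, and add the bounded score difference, which is what actually decides a 1-vs-1 match.

```python
import numpy as np

def normalize_obs(time_left, my_score, opp_score):
    return np.array([
        np.log1p(time_left),                      # log1p is defined at 0
        np.log1p(my_score),
        np.log1p(opp_score),
        np.tanh((my_score - opp_score) / 100.0),  # bounded relative lead
    ], dtype=np.float32)

print(normalize_obs(30.0, 1200, 900))
```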
I have a question about performance evaluation in model-based RL.
    Hi, there are papers using gradient-free agents (such as MPC) in MBRL, but I can't understand why they claim that their controller's performance is good after only presenting their model-training method. I don't mean to disparage their efforts; in my view, I can't find a connection between a good model and a good controller, because the MPC controller doesn't get any state information (even if the state has a good representation). Could anyone explain the connection between them? Example: Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models submitted by /u/Spiritual_Fig3632 [link] [comments]  ( 57 min )
    Parameter sharing vs single policy learning
Possibly another noob question, but I have the impression that I'm not fully grasping what parameter sharing means. In the context of MARL, a centralised approach to learning is to simply train a single policy over a concatenation of agents' observations to produce the joint actions of all the agents. In a paper I'm reading, the authors say they don't do this but train agents independently; since the agents are homogeneous, they do parameter sharing. They continue saying that this amounts to training a separate policy for each agent parametrised by \theta, but they don't explicitly say what this \theta is. So I'm confused: • which parameters are shared? NN weights and biases? Isn't this effectively a single network that is learning, then? That will be conditioned on agents' local observations like in CTDE? • how many policies are actually learnt? Is it the same policy but conditioned on each agent's local observations (like in CTDE)? Or is there actually one policy for each agent? (But then I don't get what gets shared…) • how many NNs are involved? I have the feeling I am confusing the roles of policy, network, and parameter here… submitted by /u/LostInAcademy [link] [comments]  ( 65 min )
Why is neural evolution not popular?
    One of the bottlenecks I know of is slow training speed, and the GitHub project evojax aims to solve this issue by utilizing GPUs. Are there any other major drawbacks of neural evolution methods for reinforcement learning? Many thanks. submitted by /u/levizhou [link] [comments]  ( 64 min )
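    To make the training-cost point concrete, a minimal evolution-strategies loop on a weight vector: every generation costs pop_size full episode evaluations (the `fitness` stand-in below), which is the slow part that GPU-vectorized environments such as evojax amortize.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, pop_size, sigma, lr = 100, 64, 0.1, 0.05
fitness = lambda w: -np.sum(w**2)  # stand-in for an episode return

theta = rng.standard_normal(dim)
for _ in range(200):
    noise = rng.standard_normal((pop_size, dim))
    scores = np.array([fitness(theta + sigma * eps) for eps in noise])
    ranks = (scores - scores.mean()) / (scores.std() + 1e-8)
    theta += lr / (pop_size * sigma) * noise.T @ ranks  # ES gradient estimate

print(fitness(theta))  # moves toward the optimum, 0
```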
  • Open

    Arabic numerals and numerals that are Arabic
    The characters 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9 are called Arabic numerals, but there are a lot of other numerals that are Arabic. I discovered this when reading the documentation on Perl regular expressions, perlre. Here’s the excerpt from that page that caught my eye. Many scripts have their own […] Arabic numerals and numerals that are Arabic first appeared on John D. Cook.  ( 7 min )
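    The same behavior can be checked in Python's `re`, which (like Perl) gives \d full Unicode scope on str patterns: Eastern Arabic digits match \d just as the ASCII "Arabic numerals" do, and int() parses them too.

```python
import re

# ASCII, Arabic-Indic (U+0660..), and Extended Arabic-Indic (U+06F0..) digits
print(re.findall(r"\d", "0123 ٠١٢٣ ۴۵"))
print(int("٣٤"))  # 34
```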
  • Open

    From Discrimination to Generation: Knowledge Graph Completion with Generative Transformer. (arXiv:2202.02113v6 [cs.CL] CROSS LISTED)
Knowledge graph completion aims to address the problem of extending a KG with missing triples. In this paper, we provide an approach, GenKGC, which converts knowledge graph completion into a sequence-to-sequence generation task with a pre-trained language model. We further introduce relation-guided demonstration and entity-aware hierarchical decoding for better representation learning and fast inference. Experimental results on three datasets show that our approach can obtain better or comparable performance than baselines and achieves faster inference compared with previous methods using pre-trained language models. We also release a new large-scale Chinese knowledge graph dataset, AliopenKG500, for research purposes. Code and datasets are available in https://github.com/zjunlp/PromptKG/tree/main/GenKGC.  ( 2 min )
    Beyond CAGE: Investigating Generalization of Learned Autonomous Network Defense Policies. (arXiv:2211.15557v2 [cs.LG] UPDATED)
Advancements in reinforcement learning (RL) have inspired new directions in intelligent automation of network defense. However, many of these advancements have either outpaced their application to network security or have not considered the challenges associated with implementing them in the real world. To understand these problems, this work evaluates several RL approaches implemented in the second edition of the CAGE Challenge, a public competition to build an autonomous network defender agent in a high-fidelity network simulator. Our approaches all build on the Proximal Policy Optimization (PPO) family of algorithms, and include hierarchical RL, action masking, custom training, and ensemble RL. We find that the ensemble RL technique performs strongest, outperforming our other models and taking second place in the competition. To understand applicability to real environments, we evaluate each method's ability to generalize to unseen networks and against an unknown attack strategy. In unseen environments, all of our approaches perform worse, with degradation varying based on the type of environmental change. Against an unknown attacker strategy, we found that our models had reduced overall performance even though the new strategy was less efficient than the ones our models trained on. Together, these results highlight promising research directions for autonomous network defense in the real world.  ( 2 min )
    Eliminating The Impossible, Whatever Remains Must Be True. (arXiv:2206.09551v2 [cs.AI] UPDATED)
    The rise of AI methods to make predictions and decisions has led to a pressing need for more explainable artificial intelligence (XAI) methods. One common approach for XAI is to produce a post-hoc explanation, explaining why a black box ML model made a certain prediction. Formal approaches to post-hoc explanations provide succinct reasons for why a prediction was made, as well as why not another prediction was made. But these approaches assume that features are independent and uniformly distributed. While this means that "why" explanations are correct, they may be longer than required. It also means the "why not" explanations may be suspect as the counterexamples they rely on may not be meaningful. In this paper, we show how one can apply background knowledge to give more succinct "why" formal explanations, that are presumably easier to interpret by humans, and give more accurate "why not" explanations. In addition, we show how to use existing rule induction techniques to efficiently extract background information from a dataset, and also how to report which background information was used to make an explanation, allowing a human to examine it if they doubt the correctness of the explanation.  ( 2 min )
    Stop Measuring Calibration When Humans Disagree. (arXiv:2210.16133v2 [cs.CL] UPDATED)
    Calibration is a popular framework to evaluate whether a classifier knows when it does not know - i.e., its predictive probabilities are a good indication of how likely a prediction is to be correct. Correctness is commonly estimated against the human majority class. Recently, calibration to human majority has been measured on tasks where humans inherently disagree about which class applies. We show that measuring calibration to human majority given inherent disagreements is theoretically problematic, demonstrate this empirically on the ChaosNLI dataset, and derive several instance-level measures of calibration that capture key statistical properties of human judgements - class frequency, ranking and entropy.  ( 2 min )
    Good Visual Guidance Makes A Better Extractor: Hierarchical Visual Prefix for Multimodal Entity and Relation Extraction. (arXiv:2205.03521v1 [cs.CL] CROSS LISTED)
Multimodal named entity recognition and relation extraction (MNER and MRE) is a fundamental and crucial branch of information extraction. However, existing approaches for MNER and MRE usually suffer from error sensitivity when irrelevant object images are incorporated in texts. To deal with these issues, we propose a novel Hierarchical Visual Prefix fusion NeTwork (HVPNeT) for visual-enhanced entity and relation extraction, aiming to achieve more effective and robust performance. Specifically, we regard visual representation as a pluggable visual prefix to guide the textual representation for error-insensitive forecasting decisions. We further propose a dynamic gated aggregation strategy to achieve hierarchical multi-scaled visual features as the visual prefix for fusion. Extensive experiments on three benchmark datasets demonstrate the effectiveness of our method, which achieves state-of-the-art performance. Code is available in https://github.com/zjunlp/HVPNeT.  ( 2 min )
    Efficient Use of Large Pre-Trained Models for Low Resource ASR. (arXiv:2210.15445v2 [eess.AS] UPDATED)
Automatic speech recognition (ASR) has been established as a well-performing technique for many scenarios where lots of labeled data is available. Additionally, unsupervised representation learning has recently helped to tackle tasks with limited data. Following this, hardware limitations and applications give rise to the question of how to efficiently take advantage of large pretrained models and reduce their complexity for downstream tasks. In this work, we study a challenging low-resource conversational telephony speech corpus from the medical domain in Vietnamese and German. We show the benefits of using unsupervised techniques beyond simple fine-tuning of large pre-trained models, discuss how to adapt them to a practical telephony task including bandwidth transfer, and investigate different data conditions for pre-training and fine-tuning. We outperform the project baselines by 22% relative using pretraining techniques. Further gains of 29% can be achieved by refinements of architecture and training, and 6% by adding 0.8 h of in-domain adaptation data.  ( 2 min )
    Learning to Scaffold: Optimizing Model Explanations for Teaching. (arXiv:2204.10810v2 [cs.LG] UPDATED)
    Modern machine learning models are opaque, and as a result there is a burgeoning academic subfield on methods that explain these models' behavior. However, what is the precise goal of providing such explanations, and how can we demonstrate that explanations achieve this goal? Some research argues that explanations should help teach a student (either human or machine) to simulate the model being explained, and that the quality of explanations can be measured by the simulation accuracy of students on unexplained examples. In this work, leveraging meta-learning techniques, we extend this idea to improve the quality of the explanations themselves, specifically by optimizing explanations such that student models more effectively learn to simulate the original model. We train models on three natural language processing and computer vision tasks, and find that students trained with explanations extracted with our framework are able to simulate the teacher significantly more effectively than ones produced with previous methods. Through human annotations and a user study, we further find that these learned explanations more closely align with how humans would explain the required decisions in these tasks. Our code is available at https://github.com/coderpat/learning-scaffold  ( 2 min )
    Degeneracy is OK: Logarithmic Regret for Network Revenue Management with Indiscrete Distributions. (arXiv:2210.07996v2 [cs.LG] UPDATED)
    We study the classical Network Revenue Management (NRM) problem with accept/reject decisions and $T$ IID arrivals. We consider a distributional form where each arrival must fall under a finite number of possible categories, each with a deterministic resource consumption vector, but a random value distributed continuously over an interval. We develop an online algorithm that achieves $O(\log^2 T)$ regret under this model, with no further assumptions. We develop another online algorithm that achieves an improved $O(\log T)$ regret, with only a second-order growth assumption. To our knowledge, these are the first results achieving logarithmic-level regret in a continuous-distribution NRM model without further "non-degeneracy" assumptions. Our results are achieved via new techniques including: a new method of bounding myopic regret, a "semi-fluid" relaxation of the offline allocation, and an improved bound on the "dual convergence".  ( 2 min )
    Detecting Multivariate Time Series Anomalies with Zero Known Label. (arXiv:2208.02108v2 [cs.LG] UPDATED)
Multivariate time series anomaly detection has been extensively studied under the semi-supervised setting, where a training dataset with all normal instances is required. However, preparing such a dataset is very laborious since each single data instance should be fully guaranteed to be normal. It is, therefore, desirable to explore multivariate time series anomaly detection methods based on datasets without any label knowledge. In this paper, we propose MTGFlow, an unsupervised anomaly detection approach for multivariate time series via dynamic graph and entity-aware normalizing flow, leaning only on the widely accepted hypothesis that abnormal instances exhibit sparser densities than normal ones. However, the complex interdependencies among entities and the diverse inherent characteristics of each entity pose significant challenges for density estimation, let alone detecting anomalies based on the estimated probability distribution. To tackle these problems, we propose to learn the mutual and dynamic relations among entities via a graph structure learning model, which helps to model the accurate distribution of the multivariate time series. Moreover, taking into account the distinct characteristics of individual entities, an entity-aware normalizing flow is developed to describe each entity by a parameterized normal distribution, thereby producing fine-grained density estimation. Incorporating these two strategies, MTGFlow achieves superior anomaly detection performance. Experiments on five public datasets with seven baselines are conducted; MTGFlow outperforms the SOTA methods by up to 5.0% AUROC. Codes will be released at https://github.com/zqhang/Detecting-Multivariate-Time-Series-Anomalies-with-Zero-Known-Label.
    Image Projective Transformation Rectification with Synthetic Data for Smartphone-captured Chest X-ray Photos Classification. (arXiv:2210.05954v2 [cs.CV] UPDATED)
Classification on smartphone-captured chest X-ray (CXR) photos to detect pathologies is challenging due to the projective transformation caused by the non-ideal camera position. Recently, various rectification methods have been proposed for different photo rectification tasks such as document photos, license plate photos, etc. Unfortunately, we found that none of them is suitable for CXR photos, due to their specific transformation type, image appearance, annotation type, etc. In this paper, we propose an innovative deep learning-based Projective Transformation Rectification Network (PTRN) to automatically rectify CXR photos by predicting the projective transformation matrix. To the best of our knowledge, it is the first work to predict the projective transformation matrix as the learning goal for photo rectification. Additionally, to avoid the expensive collection of natural data, synthetic CXR photos are generated under the consideration of natural perturbations, extra screens, etc. We evaluated the proposed approach in the CheXphoto smartphone-captured CXR photo classification competition hosted by the Stanford University Machine Learning Group; our approach won first place with a large performance improvement (ours 0.850 vs. second-best 0.762, in AUC). A deeper study demonstrates that PTRN brings the classification performance on spatially transformed CXR photos to the same level as on high-quality digital CXR images, indicating that PTRN can eliminate the negative impacts of projective transformation on CXR photos.
    Multimodal Analogical Reasoning over Knowledge Graphs. (arXiv:2210.00312v2 [cs.CL] CROSS LISTED)
    Analogical reasoning is fundamental to human cognition and holds an important place in various fields. However, previous studies mainly focus on single-modal analogical reasoning and ignore taking advantage of structure knowledge. Notably, the research in cognitive psychology has demonstrated that information from multimodal sources always brings more powerful cognitive transfer than single modality sources. To this end, we introduce the new task of multimodal analogical reasoning over knowledge graphs, which requires multimodal reasoning ability with the help of background knowledge. Specifically, we construct a Multimodal Analogical Reasoning dataSet (MARS) and a multimodal knowledge graph MarKG. We evaluate with multimodal knowledge graph embedding and pre-trained Transformer baselines, illustrating the potential challenges of the proposed task. We further propose a novel model-agnostic Multimodal analogical reasoning framework with Transformer (MarT) motivated by the structure mapping theory, which can obtain better performance.
    Private Stochastic Optimization With Large Worst-Case Lipschitz Parameter: Optimal Rates for (Non-Smooth) Convex Losses and Extension to Non-Convex Losses. (arXiv:2209.07403v3 [cs.LG] UPDATED)
    We study differentially private (DP) stochastic optimization (SO) with loss functions whose worst-case Lipschitz parameter over all data points may be extremely large. To date, the vast majority of work on DP SO assumes that the loss is uniformly Lipschitz continuous over data (i.e. stochastic gradients are uniformly bounded over all data points). While this assumption is convenient, it often leads to pessimistic excess risk bounds. In many practical problems, the worst-case Lipschitz parameter of the loss over all data points may be extremely large due to outliers. In such cases, the error bounds for DP SO, which scale with the worst-case Lipschitz parameter of the loss, are vacuous. To address these limitations, this work provides near-optimal excess risk bounds that do not depend on the uniform Lipschitz parameter of the loss. Building on a recent line of work [WXDX20, KLZ22], we assume that stochastic gradients have bounded $k$-th order moments for some $k \geq 2$. Compared with works on uniformly Lipschitz DP SO, our excess risk scales with the $k$-th moment bound instead of the uniform Lipschitz parameter of the loss, allowing for significantly faster rates in the presence of outliers and/or heavy-tailed data. For convex and strongly convex loss functions, we provide the first asymptotically optimal excess risk bounds (up to a logarithmic factor). In contrast to [WXDX20, KLZ22], our bounds do not require the loss function to be differentiable/smooth. We also devise an accelerated algorithm for smooth losses that runs in linear time and has excess risk that is tight in certain practical parameter regimes. Additionally, our work is the first to address non-convex non-uniformly Lipschitz loss functions satisfying the Proximal-PL inequality; this covers some practical machine learning models. Our Proximal-PL algorithm has near-optimal excess risk.
    An Experiment Design Paradigm using Joint Feature Selection and Task Optimization. (arXiv:2210.06891v2 [cs.LG] UPDATED)
This paper presents a subsampling-task paradigm for data-driven task-specific experiment design (ED) and a novel method in population-wide supervised feature selection (FS). Optimal ED, the choice of sampling points under constraints of limited acquisition time, arises in a wide variety of scientific and engineering contexts. However, the continuous optimization used in classical approaches depends on a priori parameter choices and challenging non-convex optimization landscapes. This paper proposes to replace this strategy with a subsampling-task paradigm, analogous to population-wide supervised FS. In particular, we introduce JOFSTO, which performs JOint Feature Selection and Task Optimization. JOFSTO jointly optimizes two coupled networks: one for feature scoring, which provides the ED, the other for execution of a downstream task or process. Unlike most FS problems, e.g. selecting protein expressions for classification, ED problems typically select from highly correlated globally informative candidates rather than seeking a small number of highly informative features among many uninformative features. JOFSTO's construction efficiently identifies potentially correlated, but effective subsets and returns a trained task network. We demonstrate the approach using parameter estimation and mapping problems in clinically-relevant applications in quantitative MRI and in hyperspectral imaging. Results from simulations and empirical data show the subsampling-task paradigm strongly outperforms classical ED, and within our paradigm, JOFSTO outperforms state-of-the-art supervised FS techniques. JOFSTO extends immediately to wider image-based ED problems and other scenarios where the design must be specified globally across large numbers of acquisitions. Code will be released.
    MinUn: Accurate ML Inference on Microcontrollers. (arXiv:2210.16556v2 [cs.LG] UPDATED)
    Running machine learning inference on tiny devices, known as TinyML, is an emerging research area. This task requires generating inference code that uses memory frugally, a task that standard ML frameworks are ill-suited for. A deployment framework for TinyML must be a) parametric in the number representation to take advantage of the emerging representations like posits, b) carefully assign high-precision to a few tensors so that most tensors can be kept in low-precision while still maintaining model accuracy, and c) avoid memory fragmentation. We describe MinUn, the first TinyML framework that holistically addresses these issues to generate efficient code for ARM microcontrollers (e.g., Arduino Uno, Due and STM32H747) that outperforms the prior TinyML frameworks.
    On Distillation of Guided Diffusion Models. (arXiv:2210.03142v2 [cs.CV] UPDATED)
    Classifier-free guided diffusion models have recently been shown to be highly effective at high-resolution image generation, and they have been widely used in large-scale diffusion frameworks including DALLE-2, Stable Diffusion and Imagen. However, a downside of classifier-free guided diffusion models is that they are computationally expensive at inference time since they require evaluating two diffusion models, a class-conditional model and an unconditional model, tens to hundreds of times. To deal with this limitation, we propose an approach to distilling classifier-free guided diffusion models into models that are fast to sample from: Given a pre-trained classifier-free guided model, we first learn a single model to match the output of the combined conditional and unconditional models, and then we progressively distill that model to a diffusion model that requires much fewer sampling steps. For standard diffusion models trained on the pixel-space, our approach is able to generate images visually comparable to that of the original model using as few as 4 sampling steps on ImageNet 64x64 and CIFAR-10, achieving FID/IS scores comparable to that of the original model while being up to 256 times faster to sample from. For diffusion models trained on the latent-space (e.g., Stable Diffusion), our approach is able to generate high-fidelity images using as few as 1 to 4 denoising steps, accelerating inference by at least 10-fold compared to existing methods on ImageNet 256x256 and LAION datasets. We further demonstrate the effectiveness of our approach on text-guided image editing and inpainting, where our distilled model is able to generate high-quality results using as few as 2-4 denoising steps.
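    A schematic of the inference cost that the distillation targets: with classifier-free guidance, every denoising step evaluates the network twice (conditional and unconditional) and mixes the outputs. `model` below is a dummy stand-in and the guidance scale w is an illustrative value.

```python
import torch

def guided_eps(model, x_t, t, cond, w=7.5):
    eps_cond = model(x_t, t, cond)    # conditional pass
    eps_uncond = model(x_t, t, None)  # unconditional pass
    return eps_uncond + w * (eps_cond - eps_uncond)

model = lambda x, t, c: 0.1 * x if c is None else 0.2 * x
x_t = torch.randn(1, 3, 64, 64)
print(guided_eps(model, x_t, t=10, cond="a photo of a cat").shape)
```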
    Taxonomy of Benchmarks in Graph Representation Learning. (arXiv:2206.07729v4 [cs.LG] UPDATED)
    Graph Neural Networks (GNNs) extend the success of neural networks to graph-structured data by accounting for their intrinsic geometry. While extensive research has been done on developing GNN models with superior performance according to a collection of graph representation learning benchmarks, it is currently not well understood what aspects of a given model are probed by them. For example, to what extent do they test the ability of a model to leverage graph structure vs. node features? Here, we develop a principled approach to taxonomize benchmarking datasets according to a $\textit{sensitivity profile}$ that is based on how much GNN performance changes due to a collection of graph perturbations. Our data-driven analysis provides a deeper understanding of which benchmarking data characteristics are leveraged by GNNs. Consequently, our taxonomy can aid in selection and development of adequate graph benchmarks, and better informed evaluation of future GNN methods. Finally, our approach and implementation in $\texttt{GTaxoGym}$ package are extendable to multiple graph prediction task types and future datasets.
    Efficient Quantized Sparse Matrix Operations on Tensor Cores. (arXiv:2209.06979v2 [cs.DC] UPDATED)
    The exponentially growing model size drives the continued success of deep learning, but it brings prohibitive computation and memory cost. From the algorithm perspective, model sparsification and quantization have been studied to alleviate the problem. From the architecture perspective, hardware vendors provide Tensor cores for acceleration. However, it is very challenging to gain practical speedups from sparse, low-precision matrix operations on Tensor cores, because of the strict requirements for data layout and lack of support for efficiently manipulating the low-precision integers. We propose Magicube, a high-performance sparse-matrix library for low-precision integers on Tensor cores. Magicube supports SpMM and SDDMM, two major sparse operations in deep learning with mixed precision. Experimental results on an NVIDIA A100 GPU show that Magicube achieves on average 1.44x (up to 2.37x) speedup over the vendor-optimized library for sparse kernels, and 1.43x speedup over the state-of-the-art with a comparable accuracy for end-to-end sparse Transformer inference.
    A Systematic Evaluation of Node Embedding Robustness. (arXiv:2209.08064v3 [cs.LG] UPDATED)
    Node embedding methods map network nodes to low dimensional vectors that can be subsequently used in a variety of downstream prediction tasks. The popularity of these methods has grown significantly in recent years, yet, their robustness to perturbations of the input data is still poorly understood. In this paper, we assess the empirical robustness of node embedding models to random and adversarial poisoning attacks. Our systematic evaluation covers representative embedding methods based on Skip-Gram, matrix factorization, and deep neural networks. We compare edge addition, deletion and rewiring attacks computed using network properties as well as node labels. We also investigate the performance of popular node classification attack baselines that assume full knowledge of the node labels. We report qualitative results via embedding visualization and quantitative results in terms of downstream node classification and network reconstruction performances. We find that node classification results are impacted more than network reconstruction ones, that degree-based and label-based attacks are on average the most damaging and that label heterophily can strongly influence attack performance.
    Relphormer: Relational Graph Transformer for Knowledge Graph Representations. (arXiv:2205.10852v4 [cs.CL] CROSS LISTED)
    Transformers have achieved remarkable performance in widespread fields, including natural language processing, computer vision and graph mining. However, vanilla Transformer architectures have not yielded promising improvements in the Knowledge Graph (KG) representations, where the translational distance paradigm dominates this area. Note that vanilla Transformer architectures struggle to capture the intrinsically heterogeneous semantic and structural information of knowledge graphs. To this end, we propose a new variant of Transformer for knowledge graph representations dubbed Relphormer. Specifically, we introduce Triple2Seq which can dynamically sample contextualized sub-graph sequences as the input to alleviate the heterogeneity issue. We propose a novel structure-enhanced self-attention mechanism to encode the relational information and keep the globally semantic information among sub-graphs. Moreover, we propose masked knowledge modeling as a new paradigm for knowledge graph representation learning. We apply Relphormer to three tasks, namely, knowledge graph completion, KG-based question answering and KG-based recommendation for evaluation. Experimental results show that Relphormer can obtain better performance on benchmark datasets compared with baselines. Code is available in https://github.com/zjunlp/Relphormer.
    Transferring Fairness under Distribution Shifts via Fair Consistency Regularization. (arXiv:2206.12796v2 [cs.LG] UPDATED)
    The increasing reliance on ML models in high-stakes tasks has raised a major concern on fairness violations. Although there has been a surge of work that improves algorithmic fairness, most of them are under the assumption of an identical training and test distribution. In many real-world applications, however, such an assumption is often violated as previously trained fair models are often deployed in a different environment, and the fairness of such models has been observed to collapse. In this paper, we study how to transfer model fairness under distribution shifts, a widespread issue in practice. We conduct a fine-grained analysis of how the fair model is affected under different types of distribution shifts and find that domain shifts are more challenging than subpopulation shifts. Inspired by the success of self-training in transferring accuracy under domain shifts, we derive a sufficient condition for transferring group fairness. Guided by it, we propose a practical algorithm with a fair consistency regularization as the key component. A synthetic dataset benchmark, which covers all types of distribution shifts, is deployed for experimental verification of the theoretical findings. Experiments on synthetic and real datasets including image and tabular data demonstrate that our approach effectively transfers fairness and accuracy under various distribution shifts.
    Accurate Fairness: Improving Individual Fairness without Trading Accuracy. (arXiv:2205.08704v2 [cs.LG] UPDATED)
    Accuracy and individual fairness are both crucial for trustworthy machine learning, but the two are often at odds: enhancing one aspect may inevitably sacrifice the other, with side effects of true bias or false fairness. We propose in this paper a new fairness criterion, accurate fairness, to align individual fairness with accuracy. Informally, it requires the treatments of an individual and the individual's similar counterparts to conform to a uniform target, i.e., the ground truth of the individual. We prove that accurate fairness also implies typical group fairness criteria over a union of similar sub-populations. We then present a Siamese fairness in-processing approach to minimize the accuracy and fairness losses of a machine learning model under the accurate fairness constraints. To the best of our knowledge, this is the first time that a Siamese approach has been adapted for bias mitigation. We also propose fairness confusion matrix-based metrics, fair-precision, fair-recall, and fair-F1 score, to quantify the trade-off between accuracy and individual fairness. Comparative case studies with popular fairness datasets show that our Siamese fairness approach can achieve on average 1.02%-8.78% higher individual fairness (in terms of fairness through awareness) and 8.38%-13.69% higher accuracy, as well as 10.09%-20.57% higher true fair rate, and 5.43%-10.01% higher fair-F1 score, than the state-of-the-art bias mitigation techniques. This demonstrates that our Siamese fairness approach can indeed improve individual fairness without trading accuracy. Finally, the accurate fairness criterion and Siamese fairness approach are applied to mitigate possible service discrimination on a real Ctrip dataset, on average fairly serving 112.33% more customers (specifically, 81.29% more customers in an accurately fair way) than baseline models.
    Metric Residual Networks for Sample Efficient Goal-Conditioned Reinforcement Learning. (arXiv:2208.08133v3 [cs.LG] UPDATED)
    Goal-conditioned reinforcement learning (GCRL) has a wide range of potential real-world applications, including manipulation and navigation problems in robotics. Especially in such robotics tasks, sample efficiency is of the utmost importance for GCRL since, by default, the agent is only rewarded when it reaches its goal. While several methods have been proposed to improve the sample efficiency of GCRL, one relatively under-studied approach is the design of neural architectures to support sample efficiency. In this work, we introduce a novel neural architecture for GCRL that achieves significantly better sample efficiency than the commonly-used monolithic network architecture. The key insight is that the optimal action-value function Q^*(s, a, g) must satisfy the triangle inequality in a specific sense. Building on this insight, we introduce the metric residual network (MRN), which deliberately decomposes the action-value function Q(s,a,g) into the negated sum of a metric plus a residual asymmetric component. MRN provably approximates any optimal action-value function Q^*(s,a,g), thus making it a fitting neural architecture for GCRL. We conduct comprehensive experiments across 12 standard benchmark environments in GCRL. The empirical results demonstrate that MRN uniformly outperforms other state-of-the-art GCRL neural architectures in terms of sample efficiency.
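    To make the decomposition concrete, here is a minimal PyTorch sketch of a metric-residual Q-network in the spirit of the abstract; the layer sizes, the Euclidean metric, and the softplus squashing of the residual are illustrative assumptions, not the authors' exact architecture.

        import torch
        import torch.nn as nn

        class MetricResidualQ(nn.Module):
            def __init__(self, state_dim, action_dim, goal_dim, embed_dim=64):
                super().__init__()
                self.phi = nn.Sequential(  # embeds (state, action)
                    nn.Linear(state_dim + action_dim, 128), nn.ReLU(),
                    nn.Linear(128, embed_dim))
                self.psi = nn.Sequential(  # embeds the goal
                    nn.Linear(goal_dim, 128), nn.ReLU(),
                    nn.Linear(128, embed_dim))
                self.residual = nn.Sequential(  # asymmetric residual term
                    nn.Linear(state_dim + action_dim + goal_dim, 128), nn.ReLU(),
                    nn.Linear(128, 1))

            def forward(self, s, a, g):
                sa = torch.cat([s, a], dim=-1)
                # Symmetric metric part: distance between the two embeddings.
                dist = torch.norm(self.phi(sa) - self.psi(g), dim=-1, keepdim=True)
                # Nonnegative asymmetric residual (the squashing is an assumption).
                res = nn.functional.softplus(self.residual(torch.cat([sa, g], dim=-1)))
                return -(dist + res)  # Q = -(metric + residual)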
    Explain My Surprise: Learning Efficient Long-Term Memory by Predicting Uncertain Outcomes. (arXiv:2207.13649v2 [cs.LG] UPDATED)
    In many sequential tasks, a model needs to remember relevant events from the distant past to make correct predictions. Unfortunately, a straightforward application of gradient-based training requires intermediate computations to be stored for every element of a sequence. The intermediate data become prohibitively large when a sequence consists of thousands or even millions of elements, which makes learning very long-term dependencies infeasible. However, the majority of sequence elements can usually be predicted by taking into account only temporally local information. On the other hand, predictions affected by long-term dependencies are sparse and characterized by high uncertainty given only local information. We propose MemUP, a new training method that learns long-term dependencies without backpropagating gradients through the whole sequence at once. This method can potentially be applied to any recurrent architecture. An LSTM network trained with MemUP performs better than or comparably to baselines while storing less intermediate data.
    Neural modal ordinary differential equations: Integrating physics-based modeling with neural ordinary differential equations for modeling high-dimensional monitored structures. (arXiv:2207.07883v2 [cs.LG] UPDATED)
    The order/dimension of models derived on the basis of data is commonly restricted by the number of observations, or in the context of monitored systems, sensing nodes. This is particularly true for structural systems (e.g., civil or mechanical structures), which are typically high-dimensional in nature. In the scope of physics-informed machine learning, this paper proposes a framework -- termed Neural Modal ODEs -- to integrate physics-based modeling with deep learning for modeling the dynamics of monitored and high-dimensional engineered systems. Neural Ordinary Differential Equations (Neural ODEs) are exploited as the deep learning operator. In this initial exploration, we restrict ourselves to linear or mildly nonlinear systems. We propose an architecture that couples a dynamic version of variational autoencoders with physics-informed Neural ODEs (Pi-Neural ODEs). An encoder, as part of the autoencoder, learns the abstract mappings from the first few items of observational data to the initial values of the latent variables, which drive the learning of embedded dynamics via physics-informed Neural ODEs, imposing a modal model structure on that latent space. The decoder of the proposed model adopts the eigenmodes derived from an eigen-analysis applied to the linearized portion of a physics-based model: a process implicitly carrying the spatial relationship between degrees-of-freedom (DOFs). The framework is validated on a numerical example, and an experimental dataset of a scaled cable-stayed bridge, where the learned hybrid model is shown to outperform a purely physics-based approach to modeling. We further show the functionality of the proposed scheme within the context of virtual sensing, i.e., the recovery of generalized response quantities in unmeasured DOFs from spatially sparse data.
    Zero-Shot Assistance in Sequential Decision Problems. (arXiv:2202.07364v3 [cs.LG] UPDATED)
    We consider the problem of creating assistants that can help agents solve new sequential decision problems, assuming the agent is not able to specify the reward function explicitly to the assistant. Instead of acting in place of the agent as in current automation-based approaches, we give the assistant an advisory role and keep the agent in the loop as the main decision maker. The difficulty is that we must account for potential biases of the agent which may cause it to seemingly irrationally reject advice. To do this we introduce a novel formalization of assistance that models these biases, allowing the assistant to infer and adapt to them. We then introduce a new method for planning the assistant's actions which can scale to large decision making problems. We show experimentally that our approach adapts to these agent biases, and results in higher cumulative reward for the agent than automation-based alternatives. Lastly, we show that an approach combining advice and automation outperforms advice alone at the cost of losing some safety guarantees.
    Scalable and Efficient Training of Large Convolutional Neural Networks with Differential Privacy. (arXiv:2205.10683v4 [cs.LG] UPDATED)
    Large convolutional neural networks (CNN) can be difficult to train in the differentially private (DP) regime, since the optimization algorithms require a computationally expensive operation, known as per-sample gradient clipping. We propose an efficient and scalable implementation of this clipping on convolutional layers, termed mixed ghost clipping, that significantly eases private training in terms of both time and space complexities, without affecting the accuracy. The improvement in efficiency is rigorously studied through the first complexity analysis for the mixed ghost clipping and existing DP training algorithms. Extensive experiments on vision classification tasks, with large ResNet, VGG, and Vision Transformers, demonstrate that DP training with mixed ghost clipping adds $1\sim 10\%$ memory overhead and $<2\times$ slowdown to the standard non-private training. Specifically, when training VGG19 on CIFAR10, the mixed ghost clipping is $3\times$ faster than state-of-the-art Opacus library with $18\times$ larger maximum batch size. To emphasize the significance of efficient DP training on convolutional layers, we achieve 96.7\% accuracy on CIFAR10 and 83.0\% on CIFAR100 at $\epsilon=1$ using BEiT, while the previous best results are 94.8\% and 67.4\%, respectively. We open-source a privacy engine (\url{https://github.com/woodyx218/private_vision}) that implements DP training of CNN with a few lines of code.
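    For context, the expensive operation being optimized is per-sample gradient clipping. A naive (slow) reference implementation looks like the sketch below; the paper's mixed ghost clipping computes the same clipped aggregate far more efficiently for convolutional layers, and that trick is not shown here.

        import torch

        def clipped_grad_mean(model, loss_fn, xs, ys, C=1.0):
            """Average of per-example gradients, each clipped to L2 norm C."""
            agg = [torch.zeros_like(p) for p in model.parameters()]
            for x, y in zip(xs, ys):  # one example at a time: the slow part
                model.zero_grad()
                loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)).backward()
                norm = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))
                scale = min(1.0, C / (norm.item() + 1e-12))
                for a, p in zip(agg, model.parameters()):
                    a.add_(scale * p.grad)
            return [a / len(xs) for a in agg]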
    Quantum Kerr Learning. (arXiv:2205.12004v2 [quant-ph] UPDATED)
    Quantum machine learning is a rapidly evolving field of research that could facilitate important applications for quantum computing and also significantly impact data-driven sciences. In our work, based on various arguments from complexity theory and physics, we demonstrate that a single Kerr mode can provide some "quantum enhancements" when dealing with kernel-based methods. Using kernel properties, neural tangent kernel theory, first-order perturbation theory of the Kerr non-linearity, and non-perturbative numerical simulations, we show that quantum enhancements could happen in terms of convergence time and generalization error. Furthermore, we explicitly indicate how higher-dimensional input data could be handled. Finally, we propose an experimental protocol, that we call \emph{quantum Kerr learning}, based on circuit QED.
    Simplifying Node Classification on Heterophilous Graphs with Compatible Label Propagation. (arXiv:2205.09389v3 [cs.LG] UPDATED)
    Graph Neural Networks (GNNs) have been predominant for graph learning tasks; however, recent studies showed that a well-known graph algorithm, Label Propagation (LP), combined with a shallow neural network can achieve comparable performance to GNNs in semi-supervised node classification on graphs with high homophily. In this paper, we show that this approach falls short on graphs with low homophily, where nodes often connect to nodes of the opposite classes. To overcome this, we carefully design a combination of a base predictor with the LP algorithm that enjoys a closed-form solution as well as convergence guarantees. Our algorithm first learns the class compatibility matrix and then aggregates label predictions using the LP algorithm, weighted by class compatibilities. On a wide variety of benchmarks, we show that our approach achieves leading performance on graphs with various levels of homophily. Meanwhile, it has orders of magnitude fewer parameters and requires less execution time. Empirical evaluations demonstrate that simple adaptations of LP can be competitive in semi-supervised node classification in both homophily and heterophily regimes.
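    A minimal sketch of the propagation step described above, assuming only the general recipe (base soft predictions, a learned row-stochastic class compatibility matrix H, and LP weighted by H); the closed-form solution and the estimation of H are omitted, and all names here are illustrative.

        import numpy as np

        def compatible_label_propagation(A, Y0, H, alpha=0.5, n_iters=50):
            """A: (n, n) adjacency; Y0: (n, c) base soft predictions;
            H: (c, c) row-stochastic class compatibility matrix."""
            A_norm = A / A.sum(axis=1, keepdims=True).clip(min=1)
            Y = Y0.copy()
            for _ in range(n_iters):
                # A neighbor predicted as class j votes with compatibility row H[j],
                # so heterophilous edges can still carry useful label information.
                Y = (1 - alpha) * Y0 + alpha * (A_norm @ Y @ H)
                Y = Y / Y.sum(axis=1, keepdims=True).clip(min=1e-12)
            return Y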
    Micro Batch Streaming: Allowing the Training of DNN Models to Use a large Batch Size in Memory Constrained Environments. (arXiv:2110.12484v2 [cs.LG] UPDATED)
    Recent deep learning models are difficult to train using a large batch size, because commodity machines may not have enough memory to accommodate both the model and a large batch of data. The batch size is one of the training hyper-parameters, and it is limited by the target machine's memory capacity, since a batch can only occupy the memory remaining after the model is loaded. The size of each data item also matters: the larger each item, the fewer items fit into the remaining memory. This paper proposes a framework called Micro-Batch Streaming (MBS) to address this problem. MBS splits a batch into micro-batches small enough to fit in the remaining memory and streams them sequentially. A loss normalization algorithm based on gradient accumulation is used to maintain performance. The purpose of our method is to allow deep learning models to train using batch sizes that exceed the memory capacity of a system, without increasing the memory size or using multiple devices (GPUs).
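    A minimal sketch of the streaming idea, assuming the usual gradient-accumulation recipe: each micro-batch loss is weighted by its share of the logical batch so the accumulated gradient matches a single large-batch step (the function names and the mean-reduced loss are assumptions of this illustration).

        import torch

        def mbs_train_step(model, optimizer, loss_fn, x, y, micro_batch_size):
            optimizer.zero_grad()
            n = x.shape[0]  # logical (large) batch size
            for i in range(0, n, micro_batch_size):
                xb, yb = x[i:i + micro_batch_size], y[i:i + micro_batch_size]
                # Normalize so the summed micro-batch gradients equal the
                # gradient of one full large-batch step.
                loss = loss_fn(model(xb), yb) * (xb.shape[0] / n)
                loss.backward()  # gradients accumulate in .grad across micro-batches
            optimizer.step()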
    PiCO+: Contrastive Label Disambiguation for Robust Partial Label Learning. (arXiv:2201.08984v3 [cs.LG] UPDATED)
    Partial label learning (PLL) is an important problem that allows each training example to be labeled with a coarse candidate set, which well suits many real-world data annotation scenarios with label ambiguity. Despite the promise, the performance of PLL often lags behind its fully supervised counterpart. In this work, we bridge the gap by addressing two key research challenges in PLL -- representation learning and label disambiguation -- in one coherent framework. Specifically, our proposed framework PiCO consists of a contrastive learning module along with a novel class prototype-based label disambiguation algorithm. PiCO produces closely aligned representations for examples from the same classes and facilitates label disambiguation. Theoretically, we show that these two components are mutually beneficial, and can be rigorously justified from an expectation-maximization (EM) algorithm perspective. Moreover, we study a challenging yet practical noisy partial label learning setup, where the ground-truth may not be included in the candidate set. To remedy this problem, we present an extension PiCO+ that performs distance-based clean sample selection and learns robust classifiers by a semi-supervised contrastive learning algorithm. Extensive experiments demonstrate that our proposed methods significantly outperform the current state-of-the-art approaches in standard and noisy PLL tasks and even achieve comparable results to fully supervised learning.
    DropMessage: Unifying Random Dropping for Graph Neural Networks. (arXiv:2204.10037v2 [cs.LG] UPDATED)
    Graph Neural Networks (GNNs) are powerful tools for graph representation learning. Despite their rapid development, GNNs also face some challenges, such as over-fitting, over-smoothing, and non-robustness. Previous works indicate that these problems can be alleviated by random dropping methods, which integrate augmented data into models by randomly masking parts of the input. However, some open problems of random dropping on GNNs remain to be solved. First, it is challenging to find a universal method that is suitable for all cases, considering the divergence of different datasets and models. Second, augmented data introduced to GNNs causes incomplete parameter coverage and an unstable training process. Third, there is no theoretical analysis of the effectiveness of random dropping methods on GNNs. In this paper, we propose a novel random dropping method called DropMessage, which performs dropping operations directly on the propagated messages during the message-passing process. More importantly, we find that DropMessage provides a unified framework for most existing random dropping methods, based on which we give a theoretical analysis of their effectiveness. Furthermore, we elaborate on the advantages of DropMessage: it stabilizes the training process by reducing sample variance, and it preserves information diversity from the perspective of information theory, making it a theoretical upper bound of the other methods. To evaluate our proposed method, we conduct experiments aimed at multiple tasks on five public datasets and two industrial datasets with various backbone models. The experimental results show that DropMessage has the advantages of both effectiveness and generalization, and can significantly alleviate the problems mentioned above.
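    A minimal sketch of message-level dropping, assuming only the core idea stated above: dropout is applied to individual entries of the propagated message matrix rather than to node features or edges (the layer shape and sum aggregation are illustrative assumptions).

        import torch
        import torch.nn as nn

        class DropMessageLayer(nn.Module):
            def __init__(self, in_dim, out_dim, drop_p=0.5):
                super().__init__()
                self.lin = nn.Linear(in_dim, out_dim)
                self.drop_p = drop_p

            def forward(self, x, edge_index):
                src, dst = edge_index           # (2, E) COO edge list (long tensors)
                messages = self.lin(x)[src]     # one message per edge: (E, out_dim)
                if self.training:
                    # Drop independent entries of the message matrix, then rescale
                    # so the expected message is unchanged.
                    mask = (torch.rand_like(messages) >= self.drop_p).float()
                    messages = messages * mask / (1 - self.drop_p)
                out = torch.zeros(x.size(0), messages.size(1), device=x.device)
                out.index_add_(0, dst, messages)  # sum aggregation per target node
                return out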
    Probabilistic Symmetry for Multi-Agent Dynamics. (arXiv:2205.01927v2 [cs.LG] UPDATED)
    Learning multi-agent dynamics is a core AI problem with broad applications in robotics and autonomous driving. While most existing works focus on deterministic prediction, producing probabilistic forecasts to quantify uncertainty and assess risks is critical for downstream decision-making tasks such as motion planning and collision avoidance. Multi-agent dynamics often contains internal symmetry. By leveraging symmetry, specifically rotation equivariance, we can improve not only the prediction accuracy but also uncertainty calibration. We introduce Energy Score, a proper scoring rule, to evaluate probabilistic predictions. We propose a novel deep dynamics model, Probabilistic Equivariant Continuous COnvolution (PECCO) for probabilistic prediction of multi-agent trajectories. PECCO extends equivariant continuous convolution to model the joint velocity distribution of multiple agents. It uses dynamics integration to propagate the uncertainty from velocity to position. On both synthetic and real-world datasets, PECCO shows significant improvements in accuracy and calibration compared to non-equivariant baselines.
    Causal de Finetti: On the Identification of Invariant Causal Structure in Exchangeable Data. (arXiv:2203.15756v2 [stat.ML] UPDATED)
    Learning causal structure from observational data often assumes that we observe independent and identically distributed (i.i.d.) data. The traditional approach aims to find a graphical representation that encodes the same set of conditional independence relationships as those present in the observed distribution. It is known that under the i.i.d. assumption, even with infinite data, there is a limit to how fine-grained a causal structure we can identify. To overcome this limitation, recent work has explored using data originating from different, related environments to learn richer causal structure. These approaches implicitly rely on the independent causal mechanisms (ICM) principle, which postulates that the mechanism giving rise to an effect given its causes and the mechanism which generates the causes do not inform or influence each other. Thus, components of the causal model can independently change from environment to environment. Despite its wide application in machine learning and causal inference, there is a lack of statistical formalization of the ICM principle and how it enables identification of richer causal structures from grouped data. Here we present new causal de Finetti theorems which offer a first statistical formalization of the ICM principle and show how causal structure identification is possible from exchangeable data. Our work provides theoretical justification for a broad range of techniques leveraging multi-environment data to learn causal structure.
    Estimation under Model Misspecification with Fake Features. (arXiv:2203.03398v2 [eess.SP] UPDATED)
    We consider estimation under model misspecification where there is a model mismatch between the underlying system, which generates the data, and the model used during estimation. We propose a model misspecification framework which enables a joint treatment of the model misspecification types of having fake features as well as incorrect covariance assumptions on the unknowns and the noise. We present a decomposition of the output error into components that relate to different subsets of the model parameters corresponding to underlying, fake and missing features. Here, fake features are features which are included in the model but are not present in the underlying system. Under this framework, we characterize the estimation performance and reveal trade-offs between the number of samples, number of fake features, and the possibly incorrect noise level assumption. In contrast to existing work focusing on incorrect covariance assumptions or missing features, fake features are a central component of our framework. Our results show that fake features can significantly improve the estimation performance, even though they are not correlated with the features in the underlying system. In particular, we show that the estimation error can be decreased by including more fake features in the model, even to the point where the model is overparametrized, i.e., the model contains more unknowns than observations.
    CowClip: Reducing CTR Prediction Model Training Time from 12 hours to 10 minutes on 1 GPU. (arXiv:2204.06240v3 [cs.LG] UPDATED)
    The click-through rate (CTR) prediction task is to predict whether a user will click on the recommended item. As mind-boggling amounts of data are produced online daily, accelerating CTR prediction model training is critical to ensuring an up-to-date model and reducing the training cost. One approach to increase the training speed is to apply large batch training. However, as shown in computer vision and natural language processing tasks, training with a large batch easily suffers from a loss of accuracy. Our experiments show that previous scaling rules fail in the training of CTR prediction neural networks. To tackle this problem, we first theoretically show that different id frequencies make it challenging to scale hyperparameters when scaling the batch size. To stabilize the training process in a large batch size setting, we develop adaptive Column-wise Clipping (CowClip). It enables an easy and effective scaling rule for the embeddings, which keeps the learning rate unchanged and scales the L2 loss. We conduct extensive experiments with four CTR prediction networks on two real-world datasets and successfully scale the batch size to 128 times the original without accuracy loss. In particular, for the CTR prediction model DeepFM trained on the Criteo dataset, our optimization framework enlarges the batch size from 1K to 128K with over 0.1% AUC improvement and reduces training time from 12 hours to 10 minutes on a single V100 GPU. Our code is available at https://github.com/bytedance/LargeBatchCTR.
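    As a rough illustration of column-wise clipping only (the adaptive threshold and the full scaling rule of the actual method are not reproduced here; the fixed constant below is an assumption of this sketch):

        import torch

        def columnwise_clip_(embedding_grad, clip_value=1.0):
            """embedding_grad: (num_ids, dim) gradient of an embedding table.
            Clips each id's gradient row to at most clip_value in L2 norm,
            so rare and frequent ids are clipped independently."""
            norms = embedding_grad.norm(dim=1, keepdim=True).clamp(min=1e-12)
            embedding_grad.mul_((clip_value / norms).clamp(max=1.0))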
    Transfer Learning for Quantum Classifiers: An Information-Theoretic Generalization Analysis. (arXiv:2201.06297v3 [quant-ph] UPDATED)
    A key component of a quantum machine learning model operating on classical inputs is the design of an embedding circuit mapping inputs to a quantum state. This paper studies a transfer learning setting in which classical-to-quantum embedding is carried out by an arbitrary parametric quantum circuit that is pre-trained based on data from a source task. At run time, a binary quantum classifier of the embedding is optimized based on data from the target task of interest. The average excess risk, i.e., the optimality gap, of the resulting classifier depends on how (dis)similar the source and target tasks are. We introduce a new measure of (dis)similarity between the binary quantum classification tasks via the trace distances. An upper bound on the optimality gap is derived in terms of the proposed task (dis)similarity measure, two Rényi mutual information terms between classical input and quantum embedding under source and target tasks, as well as a measure of complexity of the combined space of quantum embeddings and classifiers under the source task. The theoretical results are validated on a simple binary classification example.
    Parameters or Privacy: A Provable Tradeoff Between Overparameterization and Membership Inference. (arXiv:2202.01243v2 [stat.ML] UPDATED)
    A surprising phenomenon in modern machine learning is the ability of a highly overparameterized model to generalize well (small error on the test data) even when it is trained to memorize the training data (zero error on the training data). This has led to an arms race towards increasingly overparameterized models (cf. deep learning). In this paper, we study an underexplored hidden cost of overparameterization: the fact that overparameterized models may be more vulnerable to privacy attacks, in particular the membership inference attack that predicts the (potentially sensitive) examples used to train a model. We significantly extend the relatively few empirical results on this problem by theoretically proving for an overparameterized linear regression model in the Gaussian data setting that membership inference vulnerability increases with the number of parameters. Moreover, a range of empirical studies indicates that more complex, nonlinear models exhibit the same behavior. Finally, we extend our analysis towards ridge-regularized linear regression and show in the Gaussian data setting that increased regularization also increases membership inference vulnerability in the overparameterized regime.
    Are Commercial Face Detection Models as Biased as Academic Models?. (arXiv:2201.10047v2 [cs.CV] UPDATED)
    As facial recognition systems are deployed more widely, scholars and activists have studied their biases and harms. Audits are commonly used to accomplish this and compare the algorithmic facial recognition systems' performance against datasets with various metadata labels about the subjects of the images. Seminal works have found discrepancies in performance by gender expression, age, perceived race, skin type, etc. These studies and audits often examine algorithms which fall into two categories: academic models or commercial models. We present a detailed comparison between academic and commercial face detection systems, specifically examining robustness to noise. We find that state-of-the-art academic face detection models exhibit demographic disparities in their noise robustness, specifically by having statistically significant decreased performance on older individuals and those who present their gender in a masculine manner. When we compare the size of these disparities to that of commercial models, we conclude that commercial models - despite their relatively larger development budgets and industry-level fairness commitments - are always as biased as or more biased than the academic models.
    Fine-grained TLS services classification with reject option. (arXiv:2202.11984v2 [cs.LG] UPDATED)
    The recent success and proliferation of machine learning and deep learning have provided powerful tools, which are also utilized for encrypted traffic analysis, classification, and threat detection in computer networks. These methods, neural networks in particular, are often complex and require a huge corpus of training data. Therefore, this paper focuses on collecting a large up-to-date dataset with almost 200 fine-grained service labels and 140 million network flows extended with packet-level metadata. The number of flows is three orders of magnitude higher than in other existing public labeled datasets of encrypted traffic. The number of service labels, which is important to make the problem hard and realistic, is four times higher than in the public dataset with the most class labels. The published dataset is intended as a benchmark for identifying services in encrypted traffic. Service identification can be further extended with the task of "rejecting" unknown services, i.e., the traffic not seen during the training phase. Neural networks offer superior performance for tackling this more challenging problem. To showcase the dataset's usefulness, we implemented a neural network with a multi-modal architecture, which is the state-of-the-art approach, and achieved 97.04% classification accuracy and detected 91.94% of unknown services at a 5% false positive rate.
    Optimistic search: Change point estimation for large-scale data via adaptive logarithmic queries. (arXiv:2010.10194v3 [stat.ME] UPDATED)
    Change point estimation is often formulated as a search for the maximum of a gain function describing improved fits when segmenting the data. Searching through all candidates requires $O(n)$ evaluations of the gain function for an interval with $n$ observations. If each evaluation is computationally demanding (e.g. in high-dimensional models), this can become infeasible. Instead, we propose optimistic search methods with $O(\log n)$ evaluations exploiting specific structure of the gain function. Towards a solid understanding of our strategy, we investigate in detail the $p$-dimensional Gaussian changing means setup, including high-dimensional scenarios. For some of our proposals, we prove asymptotic minimax optimality for detecting change points and derive their asymptotic localization rate. These rates (up to a possible log factor) are optimal for the univariate and multivariate scenarios, and are by far the fastest in the literature under the weakest possible detection condition on the signal-to-noise ratio in the high-dimensional scenario. Computationally, our proposed methodology has the worst case complexity of $O(np)$, which can be improved to be sublinear in $n$ if some a-priori knowledge on the length of the shortest segment is available. Our search strategies generalize far beyond the theoretically analyzed setup. We illustrate, as an example, massive computational speedup in change point detection for high-dimensional Gaussian graphical models.
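    A minimal sketch of the search strategy under the simplest assumption (an approximately unimodal gain over candidate split points), using $O(\log n)$ gain evaluations in place of a full scan; the paper's actual variants and their guarantees are more refined than this ternary-style search.

        def optimistic_search(gain, lo, hi):
            """Return an (approximate) maximizer of gain on the integers [lo, hi]."""
            while hi - lo > 2:
                m1 = lo + (hi - lo) // 3
                m2 = hi - (hi - lo) // 3
                if gain(m1) < gain(m2):
                    lo = m1 + 1   # under unimodality, the maximum lies right of m1
                else:
                    hi = m2 - 1   # otherwise it lies left of m2
            return max(range(lo, hi + 1), key=gain)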
    Dataset correlation inference attacks against machine learning models. (arXiv:2112.08806v2 [cs.LG] UPDATED)
    Machine learning models are often trained on sensitive and proprietary datasets. Yet what a model leaks about its dataset -- and under which conditions -- is not well understood. Most previous works study the leakage of information about an individual record. Yet in many situations, global dataset information, such as its underlying distribution, e.g. $k$-way marginals or correlations, is similarly sensitive or secret. We here explore for the first time whether a model leaks information about the correlations between the input variables of its training dataset, something we name a correlation inference attack. We first propose a model-less attack, showing how an attacker can exploit the spherical parametrization of correlation matrices to make an informed guess based on the correlations between the input variables and the target variable alone. Second, we propose a model-based attack, showing how an attacker can exploit black-box access to the model to infer the correlations using shadow models trained on synthetic datasets. Our synthetic data generation approach combines Gaussian copula-based generative modeling with a carefully adapted procedure for sampling correlation matrices under constraints. Third, we evaluate our model-based attack against Logistic Regression and Multilayer Perceptron models and show it to strongly outperform the model-less attack on three real-world tabular datasets, indicating that the models leak information about the correlations. We also propose a novel correlation inference-based attribute inference attack (CI-AIA), and show it to obtain state-of-the-art performance. Taken together, our results show how attackers can use the model to extract information about the dataset distribution, and use it to improve their prior on sensitive attributes of individual records.
    GARNET: Reduced-Rank Topology Learning for Robust and Scalable Graph Neural Networks. (arXiv:2201.12741v3 [cs.LG] UPDATED)
    Graph neural networks (GNNs) have been increasingly deployed in various applications that involve learning on non-Euclidean data. However, recent studies show that GNNs are vulnerable to graph adversarial attacks. Although there are several defense methods to improve GNN robustness by eliminating adversarial components, they may also impair the underlying clean graph structure that contributes to GNN training. In addition, few of those defense models can scale to large graphs due to their high computational complexity and memory usage. In this paper, we propose GARNET, a scalable spectral method to boost the adversarial robustness of GNN models. GARNET first leverages weighted spectral embedding to construct a base graph, which is not only resistant to adversarial attacks but also contains critical (clean) graph structure for GNN training. Next, GARNET further refines the base graph by pruning additional uncritical edges based on a probabilistic graphical model. GARNET has been evaluated on various datasets, including a large graph with millions of nodes. Our extensive experiment results show that GARNET achieves adversarial accuracy improvement and runtime speedup over state-of-the-art GNN (defense) models by up to 13.27% and 14.7x, respectively.
    AutoInit: Analytic Signal-Preserving Weight Initialization for Neural Networks. (arXiv:2109.08958v2 [cs.LG] UPDATED)
    Neural networks require careful weight initialization to prevent signals from exploding or vanishing. Existing initialization schemes solve this problem in specific cases by assuming that the network has a certain activation function or topology. It is difficult to derive such weight initialization strategies, and modern architectures therefore often use these same initialization schemes even though their assumptions do not hold. This paper introduces AutoInit, a weight initialization algorithm that automatically adapts to different neural network architectures. By analytically tracking the mean and variance of signals as they propagate through the network, AutoInit appropriately scales the weights at each layer to avoid exploding or vanishing signals. Experiments demonstrate that AutoInit improves performance of convolutional, residual, and transformer networks across a range of activation function, dropout, weight decay, learning rate, and normalizer settings, and does so more reliably than data-dependent initialization methods. This flexibility allows AutoInit to initialize models for everything from small tabular tasks to large datasets such as ImageNet. Such generality turns out particularly useful in neural architecture search and in activation function discovery. In these settings, AutoInit initializes each candidate appropriately, making performance evaluations more accurate. AutoInit thus serves as an automatic configuration tool that makes design of new neural network architectures more robust. The AutoInit package provides a wrapper around TensorFlow models and is available at https://github.com/cognizant-ai-labs/autoinit.
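    A minimal analytic sketch in the spirit of the abstract, handling only Linear + ReLU stacks: the signal variance is tracked layer by layer and each weight matrix is drawn with the variance that restores a unit-variance signal. The real method covers many more layer types and also tracks means; everything below is an illustrative assumption.

        import math
        import torch.nn as nn

        def autoinit_like(model, input_var=1.0):
            """model: an nn.Sequential of Linear and ReLU layers (an assumption)."""
            var = input_var
            for layer in model:
                if isinstance(layer, nn.Linear):
                    fan_in = layer.in_features
                    std = math.sqrt(1.0 / (fan_in * var))  # pre-activation variance -> 1
                    nn.init.normal_(layer.weight, std=std)
                    nn.init.zeros_(layer.bias)
                    var = 1.0
                elif isinstance(layer, nn.ReLU):
                    var *= 0.5  # ReLU halves the second moment of a zero-mean signal
            return model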
    Federated Noisy Client Learning. (arXiv:2106.13239v3 [cs.LG] UPDATED)
    Federated learning (FL) collaboratively trains a shared global model depending on multiple local clients, while keeping the training data decentralized in order to preserve data privacy. However, standard FL methods ignore the noisy client issue, which may harm the overall performance of the shared model. We first investigate a critical issue caused by noisy clients in FL and quantify the negative impact of the noisy clients in terms of the representations learned by different layers. We have the following two key observations: (1) the noisy clients can severely impact the convergence and performance of the global model in FL, and (2) the noisy clients can induce greater bias in the deeper layers than in the shallower layers of the global model. Based on the above observations, we propose Fed-NCL, a framework that conducts robust federated learning with noisy clients. Specifically, Fed-NCL first identifies the noisy clients by estimating the data quality and model divergence. Then a robust layer-wise aggregation is proposed to adaptively aggregate the local models of each client to deal with the data heterogeneity caused by the noisy clients. We further perform label correction on the noisy clients to improve the generalization of the global model. Experimental results on various datasets demonstrate that our algorithm boosts the performance of different state-of-the-art systems with noisy clients. Our code is available at https://github.com/TKH666/Fed-NCL
    Computing Divergences between Discrete Decomposable Models. (arXiv:2112.04583v2 [cs.LG] UPDATED)
    There are many applications that benefit from computing the exact divergence between two discrete probability measures, including machine learning. Unfortunately, in the absence of any assumptions on the structure or independencies within these distributions, computing the divergence between them is an intractable problem in high dimensions. We show that we are able to compute a wide family of functionals and divergences, such as the alpha-beta divergence, between two decomposable models, i.e. chordal Markov networks, in time exponential in the treewidth of these models. The alpha-beta divergence is a family of divergences that includes popular divergences such as the Kullback-Leibler divergence, the Hellinger distance, and the chi-squared divergence. Thus, we can accurately compute the exact values of any of this broad class of divergences to the extent to which we can accurately model the two distributions using decomposable models.
    Interpretability with full complexity by constraining feature information. (arXiv:2211.17264v1 [cs.LG])
    Interpretability is a pressing issue for machine learning. Common approaches to interpretable machine learning constrain interactions between features of the input, rendering the effects of those features on a model's output comprehensible but at the expense of model complexity. We approach interpretability from a new angle: constrain the information about the features without restricting the complexity of the model. Borrowing from information theory, we use the Distributed Information Bottleneck to find optimal compressions of each feature that maximally preserve information about the output. The learned information allocation, by feature and by feature value, provides rich opportunities for interpretation, particularly in problems with many features and complex feature interactions. The central object of analysis is not a single trained model, but rather a spectrum of models serving as approximations that leverage variable amounts of information about the inputs. Information is allocated to features by their relevance to the output, thereby solving the problem of feature selection by constructing a learned continuum of feature inclusion-to-exclusion. The optimal compression of each feature -- at every stage of approximation -- allows fine-grained inspection of the distinctions among feature values that are most impactful for prediction. We develop a framework for extracting insight from the spectrum of approximate models and demonstrate its utility on a range of tabular datasets.
    Learning Efficiently Function Approximation for Contextual MDP. (arXiv:2203.00995v2 [cs.LG] UPDATED)
    We study learning contextual MDPs using function approximation for both the rewards and the dynamics. We consider both the case in which the dynamics depend on the context and the case in which they are independent of it. For both models we derive polynomial sample and time complexity (assuming an efficient ERM oracle). Our methodology gives a general reduction from learning contextual MDPs to supervised learning.
    Safe Model-Free Reinforcement Learning using Disturbance-Observer-Based Control Barrier Functions. (arXiv:2211.17250v1 [cs.RO])
    Safe reinforcement learning (RL) with assured satisfaction of hard state constraints during training has recently received a lot of attention. Safety filters, e.g., based on control barrier functions (CBFs), provide a promising way for safe RL via modifying the unsafe actions of an RL agent on the fly. Existing safety filter-based approaches typically involve learning of uncertain dynamics and quantifying the learned model error, which leads to conservative filters before a large amount of data is collected to learn a good model, thereby preventing efficient exploration. This paper presents a method for safe and efficient model-free RL using disturbance observers (DOBs) and control barrier functions (CBFs). Unlike most existing safe RL methods that deal with hard state constraints, our method does not involve model learning, and leverages DOBs to accurately estimate the pointwise value of the uncertainty, which is then incorporated into a robust CBF condition to generate safe actions. The DOB-based CBF can be used as a safety filter with any model-free RL algorithms by minimally modifying the actions of an RL agent whenever necessary to ensure safety throughout the learning process. Simulation results on a unicycle and a 2D quadrotor demonstrate that the proposed method outperforms a state-of-the-art safe RL algorithm using CBFs and Gaussian processes-based model learning, in terms of safety violation rate, and sample and computational efficiency.
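    A minimal sketch of the safety-filter step for the special case of a single affine CBF constraint a(x)^T u >= b(x), with the disturbance estimate assumed to be already folded into b. Minimally modifying the action then reduces to a closed-form half-space projection; real systems typically solve a small QP with several constraints, and all names below are illustrative.

        import numpy as np

        def safety_filter(u_rl, a, b):
            """Project u_rl onto {u : a @ u >= b}, changing it as little as possible."""
            violation = b - a @ u_rl
            if violation <= 0:
                return u_rl  # the RL action already satisfies the CBF condition
            return u_rl + (violation / (a @ a)) * a  # minimal L2 correction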
    Adaptive Zeroing-Type Neural Dynamics for Solving Quadratic Minimization and Applied to Target Tracking. (arXiv:2112.01773v2 [math.OC] UPDATED)
    The time-varying quadratic minimization (TVQM) problem, a current research hotspot, urgently demands a more reliable and faster-solving model. To this end, a novel adaptive-coefficient construction framework is presented and realized to improve the performance of the solution model, leading to the adaptive zeroing-type neural dynamics (AZTND) model. The AZTND model is then applied to solve the TVQM problem. The adaptive coefficients adjust the step size of the model online so that the solution model converges faster. At the same time, an integration term is developed to enhance the robustness of the model in a perturbed environment. Experiments demonstrate that the proposed model shows faster convergence and more reliable robustness than existing approaches. Finally, the AZTND model is applied in a target tracking scheme, demonstrating the practicality of our proposed model.
    Nonlinear Monte Carlo Method for Imbalanced Data Learning. (arXiv:2010.14060v3 [cs.LG] UPDATED)
    For basic machine learning problems, expected error is used to evaluate model performance. Since the distribution of the data is usually unknown, we can make the simple hypothesis that the data are sampled independently and identically distributed (i.i.d.) and use the mean value of the loss function as the empirical risk, as justified by the Law of Large Numbers (LLN). This is known as the Monte Carlo method. However, when the LLN is not applicable, as in imbalanced data problems, empirical risk causes overfitting and may decrease robustness and generalization ability. Inspired by the framework of nonlinear expectation theory, we substitute the mean value of the loss function with the maximum of the subgroup mean losses. We call this the nonlinear Monte Carlo method. In order to use numerical optimization methods, we linearize and smooth the functional of the maximum empirical risk and obtain the descent direction via quadratic programming. With the proposed method, we achieve better performance than SOTA backbone models with fewer training steps, and more robustness for basic regression and imbalanced classification tasks.
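    The substitution at the heart of the method can be sketched in a few lines; the grouping variable (an integer subgroup label per sample) and the omission of the paper's linearization and smoothing steps are assumptions of this illustration.

        import torch

        def max_subgroup_loss(per_sample_loss, group_ids):
            """per_sample_loss: (n,) losses; group_ids: (n,) integer subgroup labels.
            Returns the maximum over subgroup mean losses instead of the global mean."""
            group_means = [per_sample_loss[group_ids == g].mean()
                           for g in torch.unique(group_ids)]
            return torch.stack(group_means).max()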
    Variational Laplace Autoencoders. (arXiv:2211.17267v1 [cs.LG])
    Variational autoencoders employ an amortized inference model to approximate the posterior of latent variables. However, such amortized variational inference faces two challenges: (1) the limited posterior expressiveness of fully-factorized Gaussian assumption and (2) the amortization error of the inference model. We present a novel approach that addresses both challenges. First, we focus on ReLU networks with Gaussian output and illustrate their connection to probabilistic PCA. Building on this observation, we derive an iterative algorithm that finds the mode of the posterior and apply full-covariance Gaussian posterior approximation centered on the mode. Subsequently, we present a general framework named Variational Laplace Autoencoders (VLAEs) for training deep generative models. Based on the Laplace approximation of the latent variable posterior, VLAEs enhance the expressiveness of the posterior while reducing the amortization error. Empirical results on MNIST, Omniglot, Fashion-MNIST, SVHN and CIFAR10 show that the proposed approach significantly outperforms other recent amortized or iterative methods on the ReLU networks.
    Efficient Reinforcement Learning Through Trajectory Generation. (arXiv:2211.17249v1 [cs.LG])
    A key barrier to using reinforcement learning (RL) in many real-world applications is the requirement of a large number of system interactions to learn a good control policy. Off-policy and offline RL methods have been proposed to reduce the number of interactions with the physical environment by learning control policies from historical data. However, their performance suffers from the lack of exploration and the distributional shifts in trajectories once controllers are updated. Moreover, most RL methods require that all states are directly observed, which is difficult to attain in many settings. To overcome these challenges, we propose a trajectory generation algorithm, which adaptively generates new trajectories as if the system is being operated and explored under the updated control policies. Motivated by the fundamental lemma for linear systems, assuming sufficient excitation, we generate trajectories from linear combinations of historical trajectories. For linear feedback control, we prove that the algorithm generates trajectories with the exact distribution as if they were sampled from the real system using the updated control policy. In particular, the algorithm extends to systems where the states are not directly observed. Experiments show that the proposed method significantly reduces the number of sampled data needed for RL algorithms.
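    A minimal sketch of the generation step for the deterministic linear case, assuming the textbook form of the fundamental lemma: with persistently exciting historical data, any input/output trajectory lies in the column span of the data Hankel matrices, so a new trajectory can be read off from any coefficient vector g that reproduces the desired input segment. The function names and shapes are illustrative.

        import numpy as np

        def hankel(signal, L):
            """Stack length-L windows of a (T, d) signal into an (L*d, T-L+1) matrix."""
            T, _ = signal.shape
            return np.stack([signal[i:i + L].reshape(-1) for i in range(T - L + 1)], axis=1)

        def generate_trajectory(u_hist, y_hist, u_new, L):
            """Find g with H_u g = vec(u_new); H_y g is then a consistent output."""
            H_u, H_y = hankel(u_hist, L), hankel(y_hist, L)
            g, *_ = np.linalg.lstsq(H_u, u_new.reshape(-1), rcond=None)
            return (H_y @ g).reshape(L, -1)  # output for some consistent initial condition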
    Heterogeneous Graph Neural Network with Multi-view Representation Learning. (arXiv:2108.13650v3 [cs.LG] UPDATED)
    Graph neural networks for heterogeneous graph embedding project nodes into a low-dimensional space by exploring the heterogeneity and semantics of the heterogeneous graph. However, on the one hand, most existing heterogeneous graph embedding methods either insufficiently model the local structure under a specific semantic, or neglect the heterogeneity when aggregating information from it. On the other hand, representations from multiple semantics are not comprehensively integrated to obtain versatile node embeddings. To address the problem, we propose a Heterogeneous Graph Neural Network with Multi-View Representation Learning (named MV-HetGNN) for heterogeneous graph embedding by introducing the idea of multi-view representation learning. The proposed model consists of node feature transformation, view-specific ego graph encoding and auto multi-view fusion to thoroughly learn complex structural and semantic information for generating comprehensive node representations. Extensive experiments on three real-world heterogeneous graph datasets show that the proposed MV-HetGNN model consistently outperforms all the state-of-the-art GNN baselines in various downstream tasks, e.g., node classification, node clustering, and link prediction.
    Overcoming the Convex Relaxation Barrier for Neural Network Verification via Nonconvex Low-Rank Semidefinite Relaxations. (arXiv:2211.17244v1 [cs.LG])
    To rigorously certify the robustness of neural networks to adversarial perturbations, most state-of-the-art techniques rely on a triangle-shaped linear programming (LP) relaxation of the ReLU activation. While the LP relaxation is exact for a single neuron, recent results suggest that it faces an inherent "convex relaxation barrier" as additional activations are added, and as the attack budget is increased. In this paper, we propose a nonconvex relaxation for the ReLU activation, based on a low-rank restriction of a semidefinite programming (SDP) relaxation. We show that the nonconvex relaxation has a similar complexity to the LP relaxation, but enjoys improved tightness that is comparable to the much more expensive SDP relaxation. Despite nonconvexity, we prove that the verification problem satisfies constraint qualification, and therefore a Riemannian staircase approach is guaranteed to compute a near-globally optimal solution in polynomial time. Our experiments provide evidence that our nonconvex relaxation almost completely overcomes the "convex relaxation barrier" faced by the LP relaxation.
    Global Convergence of Localized Policy Iteration in Networked Multi-Agent Reinforcement Learning. (arXiv:2211.17116v1 [cs.LG])
    We study a multi-agent reinforcement learning (MARL) problem where the agents interact over a given network. The goal of the agents is to cooperatively maximize the average of their entropy-regularized long-term rewards. To overcome the curse of dimensionality and to reduce communication, we propose a Localized Policy Iteration (LPI) algorithm that provably learns a near-globally-optimal policy using only local information. In particular, we show that, despite restricting each agent's attention to only its $\kappa$-hop neighborhood, the agents are able to learn a policy with an optimality gap that decays polynomially in $\kappa$. In addition, we show the finite-sample convergence of LPI to the global optimal policy, which explicitly captures the trade-off between optimality and computational complexity in choosing $\kappa$. Numerical simulations demonstrate the effectiveness of LPI.
    SinGRAF: Learning a 3D Generative Radiance Field for a Single Scene. (arXiv:2211.17260v1 [cs.CV])
    Generative models have shown great promise in synthesizing photorealistic 3D objects, but they require large amounts of training data. We introduce SinGRAF, a 3D-aware generative model that is trained with a few input images of a single scene. Once trained, SinGRAF generates different realizations of this 3D scene that preserve the appearance of the input while varying scene layout. For this purpose, we build on recent progress in 3D GAN architectures and introduce a novel progressive-scale patch discrimination approach during training. With several experiments, we demonstrate that the results produced by SinGRAF outperform the closest related works in both quality and diversity by a large margin.
    Automated Play-Testing Through RL Based Human-Like Play-Styles Generation. (arXiv:2211.17188v1 [cs.LG])
    The increasing complexity of gameplay mechanisms in modern video games is leading to the emergence of a wider range of ways to play games. The variety of possible play-styles needs to be anticipated by designers, through automated tests. Reinforcement Learning is a promising answer to the need for automating video game testing. To that effect, one needs to train an agent to play the game while ensuring this agent will generate the same play-styles as the players, in order to give meaningful feedback to the designers. We present CARMI: a Configurable Agent with Relative Metrics as Input, an agent able to emulate the players' play-styles, even on previously unseen levels. Unlike current methods, it does not rely on having full trajectories, but only summary data. Moreover, it requires only a small amount of human data, and is thus compatible with the constraints of modern video game production. This novel agent could be used to investigate behaviors and balancing during the production of a video game with a realistic amount of training time.
    Pex: Memory-efficient Microcontroller Deep Learning through Partial Execution. (arXiv:2211.17246v1 [cs.LG])
    Embedded and IoT devices, largely powered by microcontroller units (MCUs), could be made more intelligent by leveraging on-device deep learning. One of the main challenges of neural network inference on an MCU is the extremely limited amount of read-write on-chip memory (SRAM, < 512 kB). SRAM is consumed by the neural network layer (operator) input and output buffers, which, traditionally, must be in memory (materialised) for an operator to execute. We discuss a novel execution paradigm for microcontroller deep learning, which modifies the execution of neural networks to avoid materialising full buffers in memory, drastically reducing SRAM usage with no computation overhead. This is achieved by exploiting the properties of operators, which can consume/produce a fraction of their input/output at a time. We describe a partial execution compiler, Pex, which produces memory-efficient execution schedules automatically by identifying subgraphs of operators whose execution can be split along the feature ("channel") dimension. Memory usage is reduced further by targeting memory bottlenecks with structured pruning, leading to the co-design of the network architecture and its execution schedule. Our evaluation of image and audio classification models: (a) establishes state-of-the-art performance in low SRAM usage regimes for considered tasks with up to +2.9% accuracy increase; (b) finds that a 4x memory reduction is possible by applying partial execution alone, or up to 10.5x when using the compiler-pruning co-design, while maintaining the classification accuracy compared to prior work; (c) uses the recovered SRAM to process higher resolution inputs instead, increasing accuracy by up to +3.9% on Visual Wake Words.
    Fast Inference from Transformers via Speculative Decoding. (arXiv:2211.17192v1 [cs.LG])
    Inference from large autoregressive models like Transformers is slow - decoding K tokens takes K serial runs of the model. In this work we introduce speculative decoding - an algorithm to sample from autoregressive models faster without any changes to the outputs, by computing several tokens in parallel. At the heart of our approach lie the observations that (1) hard language-modeling tasks often include easier subtasks that can be approximated well by more efficient models, and (2) using speculative execution and a novel sampling method, we can make exact decoding from the large models faster, by running them in parallel on the outputs of the approximation models, potentially generating several tokens concurrently, and without changing the distribution. Our method supports existing off-the-shelf models without retraining or architecture changes. We demonstrate it on T5-XXL and show a 2X-3X acceleration compared to the standard T5X implementation, with identical outputs.
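    The accept/reject rule that keeps the output distribution exactly equal to the target model's can be sketched on toy categorical distributions; this is the published speculative sampling scheme in its simplest single-token form, with the two models reduced to plain probability vectors for illustration.

        import numpy as np

        rng = np.random.default_rng(0)

        def speculative_step(p_target, q_draft):
            """p_target, q_draft: (vocab,) probability vectors.
            Returns one token distributed exactly according to p_target."""
            x = rng.choice(len(q_draft), p=q_draft)  # cheap draft proposal
            if rng.random() < min(1.0, p_target[x] / q_draft[x]):
                return x  # accept the draft token
            # On rejection, resample from the normalized residual max(p - q, 0),
            # which exactly corrects the acceptance bias.
            residual = np.maximum(p_target - q_draft, 0.0)
            return rng.choice(len(residual), p=residual / residual.sum())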
    ObjCAViT: Improving Monocular Depth Estimation Using Natural Language Models And Image-Object Cross-Attention. (arXiv:2211.17232v1 [cs.CV])
    While monocular depth estimation (MDE) is an important problem in computer vision, it is difficult due to the ambiguity that results from the compression of a 3D scene into only 2 dimensions. It is common practice in the field to treat it as simple image-to-image translation, without consideration for the semantics of the scene and the objects within it. In contrast, humans and animals have been shown to use higher-level information to solve MDE: prior knowledge of the nature of the objects in the scene, their positions and likely configurations relative to one another, and their apparent sizes have all been shown to help resolve this ambiguity. In this paper, we present a novel method to enhance MDE performance by encouraging use of known-useful information about the semantics of objects and inter-object relationships within a scene. Our novel ObjCAViT module sources world-knowledge from language models and learns inter-object relationships in the context of the MDE problem using transformer attention, incorporating apparent size information. Our method produces highly accurate depth maps, and we obtain competitive results on the NYUv2 and KITTI datasets. Our ablation experiments show that the use of language and cross-attention within the ObjCAViT module increases performance. Code is released at https://github.com/DylanAuty/ObjCAViT.
    BudgetLongformer: Can we Cheaply Pretrain a SotA Legal Language Model From Scratch?. (arXiv:2211.17135v1 [cs.CL])
    Pretrained transformer models have achieved state-of-the-art results in many tasks and benchmarks recently. Many state-of-the-art Language Models (LMs), however, do not scale well above the threshold of 512 input tokens. In specialized domains though (such as legal, scientific or biomedical), models often need to process very long text (sometimes well above 10000 tokens). Even though many efficient transformers have been proposed (such as Longformer, BigBird or FNet), so far, only very few such efficient models are available for specialized domains. Additionally, since the pretraining process is extremely costly in general - but even more so as the sequence length increases - it is often only in reach of large research labs. One way of making pretraining cheaper is the Replaced Token Detection (RTD) task, by providing more signal during training, since the loss can be computed over all tokens. In this work, we train Longformer models with the efficient RTD task on legal data to showcase that pretraining efficient LMs is possible using much less compute. We evaluate the trained models on challenging summarization tasks requiring the model to summarize long texts to show to what extent the models can achieve good performance on downstream tasks. We find that both the small and base models outperform their baselines on the in-domain BillSum and out-of-domain PubMed tasks in their respective parameter range. We publish our code and models for research purposes.
    A Tutorial on Neural Networks and Gradient-free Training. (arXiv:2211.17217v1 [eess.SY])
    This paper presents a compact, matrix-based representation of neural networks in a self-contained tutorial fashion. Specifically, we develop neural networks as a composition of several vector-valued functions. Although neural networks are well understood pictorially in terms of interconnected neurons, they are, mathematically, nonlinear functions constructed by composing several such vector-valued maps. Using basic results from linear algebra, we represent a neural network as an alternating sequence of linear maps and scalar nonlinear functions, also known as activation functions. The training of neural networks requires the minimization of a cost function, which in turn requires the computation of a gradient. Using basic multivariable calculus results, the cost gradient is also shown to be a function composed of a sequence of linear maps and nonlinear functions. In addition to the analytical gradient computation, we consider two gradient-free training methods and compare the three training methods in terms of convergence rate and prediction accuracy.
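    A minimal sketch of this compositional view, assuming fully-connected layers and a $\tanh$ activation; the dimensions and random weights are illustrative only.

    ```python
    import numpy as np

    def forward(x, weights, biases, act=np.tanh):
        """y = W_L(act(... act(W_1 x + b_1) ...)) + b_L; the final layer is linear."""
        for W, b in zip(weights[:-1], biases[:-1]):
            x = act(W @ x + b)                  # linear map, then nonlinearity
        return weights[-1] @ x + biases[-1]

    rng = np.random.default_rng(0)
    dims = [3, 8, 8, 1]                         # input, two hidden layers, output
    Ws = [rng.standard_normal((m, n)) for n, m in zip(dims[:-1], dims[1:])]
    bs = [rng.standard_normal(m) for m in dims[1:]]
    print(forward(rng.standard_normal(3), Ws, bs))
    ```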
    Semisoft Task Clustering for Multi-Task Learning. (arXiv:2211.17204v1 [cs.LG])
    Multi-task learning (MTL) aims to improve the performance of multiple related prediction tasks by leveraging useful information from them. Due to their flexibility and ability to reduce unknown coefficients substantially, the task-clustering-based MTL approaches have attracted considerable attention. Motivated by the idea of semisoft clustering of data, we propose a semisoft task clustering approach, which can simultaneously reveal the task cluster structure for both pure and mixed tasks as well as select the relevant features. The main assumption behind our approach is that each cluster has some pure tasks, and each mixed task can be represented by a linear combination of pure tasks in different clusters. To solve the resulting non-convex constrained optimization problem, we design an efficient three-step algorithm. The experimental results based on synthetic and real-world datasets validate the effectiveness and efficiency of the proposed approach. Finally, we extend the proposed approach to a robust task clustering problem.
    On Regret-optimal Cooperative Nonstochastic Multi-armed Bandits. (arXiv:2211.17154v1 [stat.ML])
    We consider the nonstochastic multi-agent multi-armed bandit problem with agents collaborating via a communication network with delays. We show a lower bound on the individual regret of all agents. We show that with suitable regularizers and communication protocols, a collaborative multi-agent \emph{follow-the-regularized-leader} (FTRL) algorithm has an individual regret upper bound that matches the lower bound up to a constant factor when the number of arms is large enough relative to the degrees of the agents in the communication graph. We also show that an FTRL algorithm with a suitable regularizer is regret optimal with respect to the scaling with the edge-delay parameter. We present numerical experiments validating our theoretical results and demonstrate cases when our algorithms outperform previously proposed algorithms.
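    As a point of reference for the FTRL template the paper builds on, the sketch below shows single-agent FTRL with an entropic regularizer (equivalent to exponential weights, here with importance-weighted loss estimates as in Exp3); the communication graph, delays, and the paper's specific regularizers are omitted.

    ```python
    import numpy as np

    def ftrl_weights(cum_loss, eta):
        """argmin_p <p, L> + (1/eta) sum_i p_i log p_i  ==  softmax(-eta * L)."""
        w = np.exp(-eta * (cum_loss - cum_loss.min()))   # shift for stability
        return w / w.sum()

    K, T, eta = 5, 2000, 0.05
    rng = np.random.default_rng(0)
    L = np.zeros(K)                                  # cumulative loss estimates
    for _ in range(T):
        p = ftrl_weights(L, eta)
        arm = rng.choice(K, p=p)
        loss = rng.uniform() * (0.2 + 0.1 * arm)     # arm 0 is best in expectation
        L[arm] += loss / p[arm]                      # importance-weighted estimate
    print(ftrl_weights(L, eta))                      # mass concentrates on arm 0
    ```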
    Investigation of Proper Orthogonal Decomposition for Echo State Networks. (arXiv:2211.17179v1 [cs.LG])
    Echo State Networks (ESN) are a type of Recurrent Neural Network that yields promising results in representing time series and nonlinear dynamic systems. Although they are equipped with a very efficient training procedure, Reservoir Computing strategies such as the ESN require the use of high-order networks, i.e., a large number of layers, resulting in a number of states that is orders of magnitude higher than the number of model inputs and outputs. This not only makes the computation of a time step more costly, but may also pose robustness issues when applying ESNs to problems such as Model Predictive Control (MPC) and other optimal control problems. One way to circumvent this is through Model Order Reduction strategies such as the Proper Orthogonal Decomposition (POD) and its variants (POD-DEIM), whereby we find an equivalent lower-order representation of an already trained high-dimensional ESN. The objective of this work is to investigate and analyze the performance of POD methods in Echo State Networks, evaluating their effectiveness. To this end, we evaluate the Memory Capacity (MC) of the POD-reduced network in comparison to the original (full-order) ESN. We also perform experiments on two different numerical case studies: a NARMA10 difference equation and an oil platform containing two wells and one riser. The results show that there is little loss of performance when comparing the original ESN to its POD-reduced counterpart, and that the performance of a POD-reduced ESN tends to be superior to a normal ESN of the same size. We also attain speedups of around $80\%$ in comparison to the original ESN.
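    A hedged sketch of the reduction idea: collect reservoir-state snapshots, compute a POD basis via SVD, and project the state update onto the leading modes (POD-DEIM would additionally approximate the nonlinearity, which is omitted here). All names and the random stand-in data are illustrative, not the paper's setup.

    ```python
    import numpy as np

    def pod_basis(snapshots, r):
        """snapshots: (n_states, n_samples) matrix of recorded reservoir states;
        returns the leading r left singular vectors as an orthonormal basis."""
        U, _, _ = np.linalg.svd(snapshots, full_matrices=False)
        return U[:, :r]

    n, r = 500, 30
    rng = np.random.default_rng(1)
    W = rng.standard_normal((n, n)) / np.sqrt(n)   # reservoir weights (toy)
    W_in = rng.standard_normal((n, 1))             # input weights (toy)
    X = rng.standard_normal((n, 200))              # stand-in for real snapshots
    V = pod_basis(X, r)
    u = np.array([0.7])                            # current input sample
    z = V.T @ rng.standard_normal(n)               # reduced state (r dims)
    # Reduced update: lift to full space, apply the nonlinearity, project back.
    z_next = V.T @ np.tanh(W @ (V @ z) + W_in @ u)
    print(z_next.shape)                            # (30,)
    ```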
    Airfoil Shape Optimization using Deep Q-Network. (arXiv:2211.17189v1 [cs.LG])
    The feasibility of using reinforcement learning for airfoil shape optimization is explored. A Deep Q-Network (DQN) is used over a Markov decision process to find the optimal shape by learning the best changes to the initial shape for achieving the required goal. The airfoil profile is generated using Bezier control points to reduce the number of control variables. The changes in the position of the control points are restricted to the direction normal to the chordline so as to reduce the complexity of the optimization. The process is designed as a search for an episode of changes applied to each control point of a profile. The DQN essentially learns the episode of best changes by updating the temporal difference of the Bellman optimality equation. The drag and lift coefficients are calculated from the distribution of the pressure coefficient along the profile, computed using the XFoil potential flow solver. These coefficients are used to give a reward to every change during the learning process, where the ultimate aim is to maximize the cumulative reward of an episode.
    ExtremeBERT: A Toolkit for Accelerating Pretraining of Customized BERT. (arXiv:2211.17201v1 [cs.CL])
    In this paper, we present ExtremeBERT, a toolkit for accelerating and customizing BERT pretraining. Our goal is to provide an easy-to-use BERT pretraining toolkit for the research community and industry, so that the pretraining of popular language models on customized datasets becomes affordable with limited resources. Experiments show that, to achieve the same or better GLUE scores, the time cost of our toolkit is over $6\times$ less for BERT Base and $9\times$ less for BERT Large when compared with the original BERT paper. The documentation and code are released at https://github.com/extreme-bert/extreme-bert under the Apache-2.0 license.
    Targets in Reinforcement Learning to solve Stackelberg Security Games. (arXiv:2211.17132v1 [cs.LG])
    Reinforcement Learning (RL) algorithms have been successfully applied to real world situations like illegal smuggling, poaching, deforestation, climate change, airport security, etc. These scenarios can be framed as Stackelberg security games (SSGs) where defenders and attackers compete to control target resources. The algorithm's competency is assessed by which agent is controlling the targets. This review investigates modeling of SSGs in RL with a focus on possible improvements of target representations in RL algorithms.
    Weisfeiler and Leman Go Relational. (arXiv:2211.17113v1 [cs.LG])
    Knowledge graphs, modeling multi-relational data, improve numerous applications such as question answering or graph logical reasoning. Many graph neural networks for such data emerged recently, often outperforming shallow architectures. However, the design of such multi-relational graph neural networks is ad-hoc, driven mainly by intuition and empirical insights. Up to now, their expressivity, their relation to each other, and their (practical) learning performance are poorly understood. Here, we initiate the study of deriving a more principled understanding of multi-relational graph neural networks. Namely, we investigate the limitations in the expressive power of the well-known Relational GCN and Compositional GCN architectures and shed some light on their practical learning performance. By aligning both architectures with a suitable version of the Weisfeiler-Leman test, we establish under which conditions both models have the same expressive power in distinguishing non-isomorphic (multi-relational) graphs or vertices with different structural roles. Further, by leveraging recent progress in designing expressive graph neural networks, we introduce the $k$-RN architecture that provably overcomes the expressiveness limitations of the above two architectures. Empirically, we confirm our theoretical findings in a vertex classification setting over small and large multi-relational graphs.
    Proximal Residual Flows for Bayesian Inverse Problems. (arXiv:2211.17158v1 [cs.LG])
    Normalizing flows are a powerful tool for generative modelling, density estimation and posterior reconstruction in Bayesian inverse problems. In this paper, we introduce proximal residual flows, a new architecture of normalizing flows. Based on the fact that proximal neural networks are by definition averaged operators, we ensure the invertibility of certain residual blocks. Moreover, we extend the architecture to conditional proximal residual flows for posterior reconstruction within Bayesian inverse problems. We demonstrate the performance of proximal residual flows on numerical examples.
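    The invertibility argument can be made concrete with a toy example: if the residual branch g is contractive, the block y = x + g(x) can be inverted by a Banach fixed-point iteration. This is a generic residual-flow sketch under that assumption, not the paper's proximal construction.

    ```python
    import numpy as np

    def invert_residual(g, y, n_iter=50):
        """Invert y = x + g(x) via x_{k+1} = y - g(x_k); converges when g is
        a contraction (Lipschitz constant < 1)."""
        x = np.array(y, dtype=float)
        for _ in range(n_iter):
            x = y - g(x)
        return x

    g = lambda x: 0.5 * np.tanh(x)          # contraction with constant 0.5
    y = np.array([0.3, -1.1])
    x = invert_residual(g, y)
    print(np.allclose(x + g(x), y))         # True: the block is inverted
    ```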
    Multidimensional analysis using sensor arrays with deep learning for high-precision and high-accuracy diagnosis. (arXiv:2211.17139v1 [cs.LG])
    In the upcoming years, artificial intelligence (AI) is going to transform the practice of medicine in most of its specialties. Deep learning can help achieve better and earlier problem detection, while reducing errors in diagnosis. By feeding a deep neural network (DNN) with the data from a low-cost and low-accuracy sensor array, we demonstrate that it becomes possible to significantly improve the measurements' precision and accuracy. The data collection is done with an array composed of 32 temperature sensors, including 16 analog and 16 digital sensors. All sensors have accuracies between 0.5 and 2.0$^\circ$C. 800 vectors are extracted, covering a range from 30 to 45$^\circ$C. In order to improve the temperature readings, we use machine learning to perform a linear regression analysis through a DNN. In an attempt to minimize the model's complexity and eventually run inferences locally, the network with the best results involves only three layers, using the hyperbolic tangent activation function and the Adam Stochastic Gradient Descent (SGD) optimizer. The model is trained with a randomly-selected dataset of 640 vectors (80% of the data) and tested with 160 vectors (20%). Using the mean squared error as a loss function between the data and the model's prediction, we achieve a loss of only 1.47x10$^{-4}$ on the training set and 1.22x10$^{-4}$ on the test set. As such, we believe this appealing approach offers a new pathway towards significantly better datasets using readily-available, ultra-low-cost sensors.
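    A minimal stand-in for the described experiment, assuming a feature matrix of 32 raw sensor readings per sample and a reference temperature target; the synthetic data below only makes the snippet runnable and does not reproduce the paper's dataset.

    ```python
    import numpy as np
    from sklearn.neural_network import MLPRegressor
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_squared_error

    rng = np.random.default_rng(0)
    # 800 samples of 32 noisy sensor readings in the 30-45 degree range.
    truth = rng.uniform(30, 45, size=800)
    X = truth[:, None] + rng.normal(0, 1.0, (800, 32))   # per-sensor noise
    X_tr, X_te, y_tr, y_te = train_test_split(X, truth, test_size=0.2,
                                              random_state=0)
    model = MLPRegressor(hidden_layer_sizes=(32, 32), activation="tanh",
                         solver="adam", max_iter=2000, random_state=0)
    model.fit(X_tr, y_tr)
    print(mean_squared_error(y_te, model.predict(X_te)))  # well below raw noise
    ```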
    High-Dimensional Wide Gap $k$-Means Versus Clustering Axioms. (arXiv:2211.17036v1 [cs.LG])
    Kleinberg's axioms for distance based clustering proved to be contradictory. Various efforts have been made to overcome this problem. Here we make an attempt to handle the issue by embedding in high-dimensional space and granting wide gaps between clusters.
    PAC Verification of Statistical Algorithms. (arXiv:2211.17096v1 [stat.ML])
    Goldwasser et al.\ (2021) recently proposed the setting of PAC verification, where a hypothesis (machine learning model) that purportedly satisfies the agnostic PAC learning objective is verified using an interactive proof. In this paper we develop this notion further in a number of ways. First, we prove a lower bound for PAC verification of $\Omega(\sqrt{d})$ i.i.d.\ samples for hypothesis classes of VC dimension $d$. Second, we present a protocol for PAC verification of unions of intervals over $\mathbb{R}$ that improves upon their proposed protocol for that task, and matches our lower bound. Third, we introduce a natural generalization of their definition to verification of general statistical algorithms, which is applicable to a wider variety of practical algorithms beyond agnostic PAC learning. Showcasing our proposed definition, our final result is a protocol for the verification of statistical query algorithms that satisfy a combinatorial constraint on their queries.
    Optimizing Explanations by Network Canonization and Hyperparameter Search. (arXiv:2211.17174v1 [cs.CV])
    Explainable AI (XAI) is slowly becoming a key component of many AI applications. Rule-based and modified backpropagation XAI approaches, however, often face challenges when being applied to modern model architectures including innovative layer building blocks, which is caused by two reasons. Firstly, the high flexibility of rule-based XAI methods leads to numerous potential parameterizations. Secondly, many XAI methods break the implementation-invariance axiom because they struggle with certain model components, e.g., BatchNorm layers. The latter can be addressed with model canonization, which is the process of re-structuring the model to disregard problematic components without changing the underlying function. While model canonization is straightforward for simple architectures (e.g., VGG, ResNet), it can be challenging for more complex and highly interconnected models (e.g., DenseNet). Moreover, there is only little quantifiable evidence that model canonization is beneficial for XAI. In this work, we propose canonizations for currently relevant model blocks applicable to popular deep neural network architectures, including VGG, ResNet, EfficientNet, DenseNet, and Relation Networks. We further suggest an XAI evaluation framework with which we quantify and compare the effects of model canonization for various XAI methods in image classification tasks on the Pascal-VOC and ILSVRC2017 datasets, as well as for Visual Question Answering using CLEVR-XAI. Moreover, addressing the former issue outlined above, we demonstrate how our evaluation framework can be applied to perform hyperparameter search for XAI methods to optimize the quality of explanations.
    Handling and extracting key entities from customer conversations using Speech recognition and Named Entity recognition. (arXiv:2211.17107v1 [cs.CL])
    In this modern era of technology, with e-commerce developing at a rapid pace, it is very important to understand customer requirements and details from business conversations, which is crucial for customer retention and satisfaction. Extracting key insights from these conversations is essential when developing a product or resolving a customer's issue. Understanding customer feedback, responses, and important details of the product can be accomplished using Named Entity Recognition (NER). To extract the entities, we first convert the conversations to text using an optimal speech-to-text model. The model is thus a two-stage network: the conversation is converted to text, and then suitable entities are extracted using a robust NER BERT transformer model. This aids in enriching the customer experience when an issue arises: when a customer calls to register a complaint, the model extracts the key features of the conversation needed to look into the problem, such as the order number and the exact issue, directly from the conversation, reducing the effort of going through it again.
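    A hedged sketch of the second stage using a publicly available BERT NER checkpoint; the checkpoint name is an assumption, not the authors' model, and extracting domain-specific entities such as order numbers would require fine-tuning on labelled conversations.

    ```python
    from transformers import pipeline

    # aggregation_strategy="simple" merges word-piece tokens into whole entities.
    ner = pipeline("ner", model="dslim/bert-base-NER",
                   aggregation_strategy="simple")
    transcript = ("Hi, I ordered a blender from Acme last week, "
                  "order number 48213, and it arrived broken.")
    for ent in ner(transcript):
        print(ent["entity_group"], ent["word"], round(float(ent["score"]), 3))
    ```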
    High-Fidelity Guided Image Synthesis with Latent Diffusion Models. (arXiv:2211.17084v1 [cs.CV])
    Controllable image synthesis with user scribbles has gained huge public interest with the recent advent of text-conditioned latent diffusion models. The user scribbles control the color composition while the text prompt provides control over the overall image semantics. However, we note that prior works in this direction suffer from an intrinsic domain shift problem, wherein the generated outputs often lack details and resemble simplistic representations of the target domain. In this paper, we propose a novel guided image synthesis framework, which addresses this problem by modeling the output image as the solution of a constrained optimization problem. We show that while computing an exact solution to the optimization is infeasible, an approximation of the same can be achieved while just requiring a single pass of the reverse diffusion process. Additionally, we show that by simply defining a cross-attention based correspondence between the input text tokens and the user stroke-painting, the user is also able to control the semantics of different painted regions without requiring any conditional training or finetuning. Human user study results show that the proposed approach outperforms the previous state-of-the-art by over 85.32% on the overall user satisfaction scores. Project page for our paper is available at https://1jsingh.github.io/gradop.
    Refining Generative Process with Discriminator Guidance in Score-based Diffusion Models. (arXiv:2211.17091v1 [cs.CV])
    While the success of diffusion models has been witnessed in various domains, only a few works have investigated the variation of the generative process. In this paper, we introduce a new generative process that is closer to the reverse process than the original generative process, given the identical score checkpoint. Specifically, we adjust the generative process with the auxiliary discriminator between the real data and the generated data. Consequently, the adjusted generative process with the discriminator generates more realistic samples than the original process. In experiments, we achieve new SOTA FIDs of 1.74 on CIFAR-10, 1.33 on CelebA, and 1.88 on FFHQ in the unconditional generation.
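    The adjustment can be written compactly: the pretrained score is shifted by the gradient of the discriminator's log-odds, $s_\theta(x,t) + \nabla_x \log\frac{D(x,t)}{1-D(x,t)}$. The one-dimensional finite-difference sketch below is purely illustrative; all function names are assumptions, not the paper's code.

    ```python
    import numpy as np

    def adjusted_score(score_fn, disc_logit_fn, x, t, eps=1e-4):
        """Refined score: s_theta + d/dx log(D/(1-D)), i.e. plus the gradient
        of the discriminator's logit (finite differences for a scalar x)."""
        grad_logit = (disc_logit_fn(x + eps, t) - disc_logit_fn(x - eps, t)) / (2 * eps)
        return score_fn(x, t) + grad_logit

    score = lambda x, t: -x                        # score of a standard Gaussian
    disc_logit = lambda x, t: -0.5 * x ** 2 + x    # toy discriminator logit
    print(adjusted_score(score, disc_logit, 0.5, 0.1))
    ```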
    Learning Label Modular Prompts for Text Classification in the Wild. (arXiv:2211.17142v1 [cs.LG])
    Machine learning models usually assume i.i.d. data during training and testing, but data and tasks in the real world often change over time. To emulate the transient nature of the real world, we propose a challenging but practical task: text classification in-the-wild, which introduces different non-stationary training/testing stages. Decomposing a complex task into modular components can enable robust generalisation under such non-stationary environments. However, current modular approaches in NLP do not take advantage of recent advances in parameter-efficient tuning of pretrained language models. To close this gap, we propose MODULARPROMPT, a label-modular prompt tuning framework for text classification tasks. In MODULARPROMPT, the input prompt consists of a sequence of soft label prompts, each encoding modular knowledge related to the corresponding class label. In two of the most formidable settings, MODULARPROMPT outperforms relevant baselines by a large margin, demonstrating strong generalisation ability. We also conduct comprehensive analysis to validate whether the learned prompts satisfy the properties of a modular representation.
    Optimizing time-shifts for reservoir computing using a rank-revealing QR algorithm. (arXiv:2211.17095v1 [cs.LG])
    Reservoir computing is a recurrent neural network paradigm in which only the output layer is trained. Recently, it was demonstrated that adding time-shifts to the signals generated by a reservoir can provide large improvements in performance accuracy. In this work, we present a technique to choose the optimal time-shifts. Our technique maximizes the rank of the reservoir matrix using a rank-revealing QR algorithm and is not task dependent. Further, our technique does not require a model of the system, and is therefore directly applicable to analog hardware reservoir computers. We demonstrate our time-shift optimization technique on two types of reservoir computer: one based on an opto-electronic oscillator and a traditional recurrent network with a $\tanh$ activation function. We find that our technique provides improved accuracy over random time-shift selection in essentially all cases.
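    A minimal sketch of the selection step under stated assumptions: build a candidate matrix whose columns are every reservoir signal at every candidate shift, then let a pivoted (rank-revealing) QR order the columns by how much new rank each contributes. The toy data and variable names are illustrative, not the paper's setup.

    ```python
    import numpy as np
    from scipy.linalg import qr

    rng = np.random.default_rng(0)
    T, n_nodes, max_shift, k = 1000, 20, 10, 20
    base = rng.standard_normal((T + max_shift, n_nodes))  # stand-in reservoir output
    # Candidate matrix: every node's signal at every candidate shift.
    cands = np.column_stack([base[s:s + T, i]
                             for i in range(n_nodes) for s in range(max_shift)])
    _, _, piv = qr(cands, mode="economic", pivoting=True)
    chosen = piv[:k]                      # best columns by contributed rank
    print([(c // max_shift, c % max_shift) for c in chosen])  # (node, shift) pairs
    ```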
    Continuous Methods : Adaptively intrusive reduced order model closure. (arXiv:2211.16999v1 [cs.LG])
    Reduced order modeling methods are often used as a means to reduce simulation costs in industrial applications. Despite their computational advantages, reduced order models (ROMs) often fail to accurately reproduce the complex dynamics encountered in real-life applications. To address this challenge, we leverage NeuralODEs to propose a novel ROM correction approach based on a time-continuous memory formulation. Experimental results show that our proposed method provides a high level of accuracy while retaining the low computational costs inherent to reduced models.
    Hint-dynamic Knowledge Distillation. (arXiv:2211.17059v1 [cs.CV])
    Knowledge Distillation (KD) transfers the knowledge from a high-capacity teacher model to promote a smaller student model. Existing efforts guide the distillation by matching their prediction logits, feature embeddings, etc., while leaving how to efficiently utilize them in conjunction less explored. In this paper, we propose Hint-dynamic Knowledge Distillation, dubbed HKD, which excavates the knowledge from the teacher's hints in a dynamic scheme. The guidance effect from the knowledge hints usually varies across instances and learning stages, which motivates us to customize a specific hint-learning manner for each instance adaptively. Specifically, a meta-weight network is introduced to generate instance-wise weight coefficients for the knowledge hints, informed by the dynamic learning progress of the student model. We further present a weight ensembling strategy to eliminate the potential bias of coefficient estimation by exploiting historical statistics. Experiments on the standard benchmarks CIFAR-100 and Tiny-ImageNet show that the proposed HKD effectively boosts knowledge distillation.
    Self-Supervised Continual Graph Learning in Adaptive Riemannian Spaces. (arXiv:2211.17068v1 [cs.LG])
    Continual graph learning routinely finds its role in a variety of real-world applications where graph data with different tasks arrive sequentially. Despite the success of prior works, it still faces great challenges. On the one hand, existing methods work with the zero-curvature Euclidean space, and largely ignore the fact that curvature varies over the incoming graph sequence. On the other hand, continual learners in the literature rely on abundant labels, but labeling graphs in practice is particularly hard, especially for continuously emerging graphs on-the-fly. To address the aforementioned challenges, we propose to explore a challenging yet practical problem: self-supervised continual graph learning in adaptive Riemannian spaces. In this paper, we propose a novel self-supervised Riemannian Graph Continual Learner (RieGrace). In RieGrace, we first design an Adaptive Riemannian GCN (AdaRGCN), a unified GCN coupled with a neural curvature adapter, so that the Riemannian space is shaped by the learnt curvature adaptive to each graph. Then, we present a Label-free Lorentz Distillation approach, in which we create a teacher-student AdaRGCN pair for the graph sequence. The student successively performs intra-distillation from itself and inter-distillation from the teacher so as to consolidate knowledge without catastrophic forgetting. In particular, we propose a theoretically grounded Generalized Lorentz Projection for contrastive distillation in Riemannian space. Extensive experiments on benchmark datasets show the superiority of RieGrace; additionally, we investigate how curvature changes over the graph sequence.
    Towards Interpreting Vulnerability of Multi-Instance Learning via Customized and Universal Adversarial Perturbations. (arXiv:2211.17071v1 [cs.CV])
    Multi-instance learning (MIL) is a popular paradigm for dealing with complex data and has achieved impressive results in a number of fields, including image classification, video anomaly detection, and more. Each data sample is referred to as a bag containing several unlabeled instances, and supervised information is only provided at the bag level. The safety of MIL learners is concerning, though, as we can greatly fool them by introducing a few adversarial perturbations. This can be fatal in some cases, such as when users are unable to access desired images or when criminals attempt to trick surveillance cameras. In this paper, we design two adversarial perturbations to interpret the vulnerability of MIL methods. The first method can efficiently generate a bag-specific perturbation (called customized) with the aim of pushing the bag outside its original classification region. The second method builds on the first one by investigating an image-agnostic perturbation (called universal) that aims to affect all bags in a given dataset, thus achieving some generalizability. We conduct various experiments to verify the performance of these two perturbations, and the results show that both of them can effectively fool MIL learners. We additionally propose a simple strategy to lessen the effects of adversarial perturbations. Source codes are available at https://github.com/InkiInki/MI-UAP.
    Explaining machine learning models for age classification in human gait analysis. (arXiv:2211.17016v1 [cs.LG])
    Machine learning (ML) models have proven effective in classifying gait analysis data, e.g., binary classification of young vs. older adults. ML models, however, lack in providing human understandable explanations for their predictions. This "black-box" behavior impedes the understanding of which input features the model predictions are based on. We investigated an Explainable Artificial Intelligence method, i.e., Layer-wise Relevance Propagation (LRP), for gait analysis data. The research question was: Which input features are used by ML models to classify age-related differences in walking patterns? We utilized a subset of the AIST Gait Database 2019 containing five bilateral ground reaction force (GRF) recordings per person during barefoot walking of healthy participants. Each input signal was min-max normalized before concatenation and fed into a Convolutional Neural Network (CNN). Participants were divided into three age groups: young (20-39 years), middle-aged (40-64 years), and older (65-79 years) adults. The classification accuracy and relevance scores (derived using LRP) were averaged over a stratified ten-fold cross-validation. The mean classification accuracy of 60.1% was clearly higher than the zero-rule baseline of 37.3%. The confusion matrix shows that the CNN distinguished younger and older adults well, but had difficulty modeling the middle-aged adults.
    Integrating wind variability to modelling wind-ramp events using a non-binary ramp function and deep learning models. (arXiv:2211.17017v1 [cs.LG])
    The forecasting of large ramps in wind power output, known as ramp events, is crucial for the incorporation of large volumes of wind energy into national electricity grids. Large variations in wind power supply must be compensated by ancillary energy sources, which can include the use of fossil fuels. Improved prediction of wind power will help to reduce dependency on supplemental energy sources, along with their associated costs and emissions. In this paper, we discuss limitations of current predictive practices and explore the use of machine learning methods to enhance wind ramp event classification and prediction. We additionally outline a design for a novel approach to wind ramp prediction, in which high-resolution wind fields are incorporated into the modelling of wind power.
    Climate Change Policy Exploration using Reinforcement Learning. (arXiv:2211.17013v1 [cs.LG])
    Climate Change is an incredibly complicated problem that humanity faces. When many variables interact with each other, it can be difficult for humans to grasp the causes and effects of the very large-scale problem of climate change. The climate is a dynamical system, where small changes can have considerable and unpredictable repercussions in the long term. Understanding how to nudge this system in the right ways could help us find creative solutions to climate change. In this research, we combine Deep Reinforcement Learning and a World-Earth system model to find, and explain, creative strategies for a sustainable future. This work extends that of Strnad et al., broadening the method and analysis in multiple directions. We use four different Reinforcement Learning agents, varying in complexity, to probe the environment in different ways and to find various strategies. The environment is a low-complexity World-Earth system model where the goal is to reach a future in which all the energy for the economy is produced by renewables, by enacting different policies. We use a reward function based on planetary boundaries, which we modify to force the agents to find a wider range of strategies. To favour applicability, we slightly modify the environment by injecting noise and making it fully observable, to understand the impact of these factors on the agents' learning.
    Neural Network Representation of Time Integrators. (arXiv:2211.17039v1 [math.NA])
    Deep neural network (DNN) architectures are constructed that are the exact equivalent of explicit Runge-Kutta schemes for numerical time integration. The network weights and biases are given, i.e., no training is needed. In this way, the only task left for physics-based integrators is the DNN approximation of the right-hand side. This allows one to clearly delineate the approximation estimates for right-hand-side errors and time integration errors. The architecture required for the integration of a simple mass-damper-stiffness case is included as an example.
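    The correspondence is easy to see for the classical RK4 scheme: the stage combinations act as fixed, untrained weights, and the only approximated ingredient is the right-hand side f. A minimal sketch, using a mass-damper-stiffness system as in the abstract (the specific coefficients below are illustrative assumptions):

    ```python
    import numpy as np

    def rk4_step(f, y, t, h):
        """Classical RK4: the stage coefficients are the fixed 'network weights'."""
        k1 = f(t, y)
        k2 = f(t + h / 2, y + h / 2 * k1)
        k3 = f(t + h / 2, y + h / 2 * k2)
        k4 = f(t + h, y + h * k3)
        return y + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

    # Mass-damper-stiffness system: m y'' + c y' + k y = 0, as first-order ODEs.
    m, c, k = 1.0, 0.4, 2.0
    f = lambda t, y: np.array([y[1], -(c * y[1] + k * y[0]) / m])
    y, h = np.array([1.0, 0.0]), 0.01
    for i in range(1000):
        y = rk4_step(f, y, i * h, h)
    print(y)   # damped oscillation after 10 time units
    ```

    In the paper's construction, f itself would be the DNN approximation of the right-hand side, so the integrator's error splits cleanly into the f-approximation error and the time-integration error.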
    Reinforcement Learning for Multi-Truck Vehicle Routing Problems. (arXiv:2211.17078v1 [cs.LG])
    Vehicle routing problems and other combinatorial optimization problems have been approximately solved by reinforcement learning agents with policies based on encoder-decoder models with attention mechanisms. These techniques are of substantial interest but still cannot solve the complex routing problems that arise in a realistic setting which can have many trucks and complex requirements. With the aim of making reinforcement learning a viable technique for supply chain optimization, we develop new extensions to encoder-decoder models for vehicle routing that allow for complex supply chains using classical computing today and quantum computing in the future. We make two major generalizations. First, our model allows for routing problems with multiple trucks. Second, we move away from the simple requirement of having a truck deliver items from nodes to one special depot node, and instead allow for a complex tensor demand structure. We show how our model, even if trained only for a small number of trucks, can be embedded into a large supply chain to yield viable solutions.
    Predicting Properties of Quantum Systems with Conditional Generative Models. (arXiv:2211.16943v1 [quant-ph])
    Machine learning has emerged recently as a powerful tool for predicting properties of quantum many-body systems. For many ground states of gapped Hamiltonians, generative models can learn from measurements of a single quantum state to reconstruct the state accurately enough to predict local observables. Alternatively, kernel methods can predict local observables by learning from measurements on different but related states. In this work, we combine the benefits of both approaches and propose the use of conditional generative models to simultaneously represent a family of states, by learning shared structures of different quantum states from measurements. The trained model allows us to predict arbitrary local properties of ground states, even for states not present in the training data, and without necessitating further training for new observables. We numerically validate our approach (with simulations of up to 45 qubits) for two quantum many-body problems, 2D random Heisenberg models and Rydberg atom systems.
    Hybrid Transformer with Multi-level Fusion for Multimodal Knowledge Graph Completion. (arXiv:2205.02357v4 [cs.CL] CROSS LISTED)
    Multimodal Knowledge Graphs (MKGs), which organize visual-text factual knowledge, have recently been successfully applied to tasks such as information retrieval, question answering, and recommendation systems. Since most MKGs are far from complete, extensive knowledge graph completion studies have been proposed, focusing on multimodal entity and relation extraction and link prediction. However, different tasks and modalities require changes to the model architecture, and not all images/objects are relevant to the text input, which hinders applicability to diverse real-world scenarios. In this paper, we propose a hybrid transformer with multi-level fusion to address those issues. Specifically, we leverage a hybrid transformer architecture with unified input-output for diverse multimodal knowledge graph completion tasks. Moreover, we propose multi-level fusion, which integrates visual and text representations via coarse-grained prefix-guided interaction and fine-grained correlation-aware fusion modules. We conduct extensive experiments to validate that our MKGformer can obtain SOTA performance on four datasets covering multimodal link prediction, multimodal RE, and multimodal NER. Code is available at https://github.com/zjunlp/MKGformer.
    BASiS: Batch Aligned Spectral Embedding Space. (arXiv:2211.16960v1 [cs.CV])
    Graphs are a highly generic and diverse representation, suitable for almost any data processing problem. Spectral graph theory has been shown to provide powerful algorithms, backed by solid linear algebra theory. It can thus be extremely instrumental to design deep network building blocks with spectral graph characteristics. For instance, such a network allows the design of optimal graphs for certain tasks or obtaining a canonical orthogonal low-dimensional embedding of the data. Recent attempts to solve this problem were based on minimizing Rayleigh-quotient type losses. We propose a different approach: directly learning the eigenspace. A severe problem of the direct approach, applied in batch learning, is the inconsistent mapping of features to eigenspace coordinates across different batches. We analyze the degrees of freedom of learning this task using batches and propose a stable alignment mechanism that can work both with batch changes and with graph-metric changes. We show that our learnt spectral embedding is better in terms of NMI, ACC, Grassmann distance, orthogonality and classification accuracy, compared to SOTA. In addition, the learning is more stable.
    Directed Acyclic Graph Structure Learning from Dynamic Graphs. (arXiv:2211.17029v1 [cs.LG])
    Estimating the structure of directed acyclic graphs (DAGs) of features (variables) plays a vital role in revealing the latent data generation process and providing causal insights in various applications. Although there have been many studies on structure learning with various types of data, structure learning on dynamic graphs has not been explored yet; we therefore study the problem of learning the node feature generation mechanism on such ubiquitous dynamic graph data. In a dynamic graph, we propose to simultaneously estimate contemporaneous relationships and time-lagged interaction relationships between the node features. These two kinds of relationships form a DAG, which could effectively characterize the feature generation process in a concise way. To learn such a DAG, we cast the learning problem as a continuous score-based optimization problem, which consists of a differentiable score function to measure the validity of the learned DAGs and a smooth acyclicity constraint to ensure the acyclicity of the learned DAGs. These two components are translated into an unconstrained augmented Lagrangian objective which can be minimized by mature continuous optimization techniques. The resulting algorithm, named GraphNOTEARS, outperforms baselines on simulated data across a wide range of settings that may be encountered in real-world applications. We also apply the proposed approach to two dynamic graphs constructed from the real-world Yelp dataset, demonstrating that our method can learn the connections between node features, which conform with the domain knowledge.
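    The smooth acyclicity constraint at the heart of such score-based formulations (following NOTEARS) can be sketched directly: $h(W) = \mathrm{tr}(e^{W \circ W}) - d$ vanishes exactly when the weighted adjacency matrix $W$ encodes a DAG. GraphNOTEARS additionally handles time-lagged terms, which this illustration omits.

    ```python
    import numpy as np
    from scipy.linalg import expm

    def acyclicity(W):
        """h(W) = tr(exp(W o W)) - d; zero iff W has no directed cycles."""
        d = W.shape[0]
        return np.trace(expm(W * W)) - d        # W * W is elementwise in numpy

    dag = np.array([[0, 1.0, 0], [0, 0, 0.5], [0, 0, 0]])    # a DAG
    cyc = np.array([[0, 1.0, 0], [0, 0, 0.5], [0.3, 0, 0]])  # contains a cycle
    print(acyclicity(dag), acyclicity(cyc))      # ~0 vs. strictly positive
    ```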
    Infinite-width limit of deep linear neural networks. (arXiv:2211.16980v1 [cs.LG])
    This paper studies the infinite-width limit of deep linear neural networks initialized with random parameters. We obtain that, when the number of neurons diverges, the training dynamics converge (in a precise sense) to the dynamics obtained from a gradient descent on an infinitely wide deterministic linear neural network. Moreover, even if the weights remain random, we get their precise law along the training dynamics, and prove a quantitative convergence result of the linear predictor in terms of the number of neurons. We finally study the continuous-time limit obtained for infinitely wide linear neural networks and show that the linear predictors of the neural network converge at an exponential rate to the minimal $\ell_2$-norm minimizer of the risk.
    Correlation of the importances of neural network weights calculated by modern methods of overcoming catastrophic forgetting. (arXiv:2211.17012v1 [cs.LG])
    Following the invention of the EWC method in 2017, several methods have been proposed to calculate the importance of neural network weights for use in the EWC method. Despite the significant differences in how they calculate the importance of weights, they have all proved to be effective. Accordingly, a reasonable question arises as to how similar the importances of the weights calculated by different methods are. To answer this question, we calculated layer-by-layer correlations of the importances of weights calculated by all those methods. As a result, it turned out that the importances produced by several of the methods correlate with each other quite strongly, and we are able to present an explanation for this correlation. At the same time, for other methods, the correlation can vary from strong on some layers of the network to negative on other layers. This raises a reasonable question: why, despite the very different calculation methods, do all these importances allow the EWC method to overcome catastrophic forgetting so well?
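    For context, the EWC regularizer that all of these importance estimates feed into has the form $L = L_{\text{new}} + \frac{\lambda}{2}\sum_i F_i(\theta_i - \theta_i^*)^2$, where $F_i$ is the importance of weight $i$ and $\theta_i^*$ its value after the old task. A minimal sketch with illustrative numbers:

    ```python
    import numpy as np

    def ewc_penalty(theta, theta_star, importance, lam=1.0):
        """theta, theta_star, importance: flat parameter vectors of equal length."""
        return 0.5 * lam * np.sum(importance * (theta - theta_star) ** 2)

    theta_star = np.array([0.5, -1.2, 2.0])    # weights after the old task
    importance = np.array([3.0, 0.1, 1.5])     # e.g. diagonal Fisher estimates
    theta = np.array([0.7, -0.2, 2.1])         # weights while learning the new task
    print(ewc_penalty(theta, theta_star, importance))
    ```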
    Universal Feature Selection Tool (UniFeat): An Open-Source Tool for Dimensionality Reduction. (arXiv:2211.16846v1 [cs.LG])
    The Universal Feature Selection Tool (UniFeat) is an open-source tool developed entirely in Java for performing feature selection processes in various research areas. It provides a set of well-known and advanced feature selection methods, together with significant auxiliary tools, allowing users to compare the performance of feature selection methods. Moreover, due to the open-source nature of UniFeat, researchers can use and modify it in their research, which facilitates the rapid development of new feature selection algorithms.
    Continual Learning with Distributed Optimization: Does COCOA Forget?. (arXiv:2211.16994v1 [stat.ML])
    We focus on the continual learning problem where tasks arrive sequentially and the aim is to perform well on the newly arrived task without performance degradation on the previously seen tasks. In contrast to the continual learning literature focusing on the centralized setting, we investigate the distributed estimation framework. We consider the well-established distributed learning algorithm COCOA. We derive closed-form expressions for the iterations in the overparametrized case. We illustrate the convergence and the error performance of the algorithm based on the over/under-parametrization of the problem. Our results show that, depending on the problem dimensions and data generation assumptions, COCOA can perform continual learning over a sequence of tasks, i.e., it can learn a new task without forgetting previously learned tasks, with access to only one task at a time.
    Differentiable optimization of the Debye-Wolf integral for light shaping and adaptive optics in two-photon microscopy. (arXiv:2211.16930v1 [physics.optics])
    Control of light through a microscope objective with a high numerical aperture is a common requirement in applications such as optogenetics, adaptive optics, or laser processing. Light propagation, including polarization effects, can be described under these conditions using the Debye-Wolf diffraction integral. Here, we take advantage of differentiable optimization and machine learning for efficiently optimizing the Debye-Wolf integral for such applications. For light shaping we show that this optimization approach is suitable for engineering arbitrary three-dimensional point spread functions in a two-photon microscope. For differentiable model-based adaptive optics (DAO), the developed method can find aberration corrections with intrinsic image features, for example neurons labeled with genetically encoded calcium indicators, without requiring guide stars. Using computational modeling we further discuss the range of spatial frequencies and magnitudes of aberrations which can be corrected with this approach.
    VideoDubber: Machine Translation with Speech-Aware Length Control for Video Dubbing. (arXiv:2211.16934v1 [cs.CL])
    Video dubbing aims to translate the original speech in a film or television program into speech in a target language, which can be achieved with a cascaded system consisting of speech recognition, machine translation and speech synthesis. To ensure that the translated speech is well aligned with the corresponding video, the length/duration of the translated speech should be as close as possible to that of the original speech, which requires strict length control. Previous works usually control the number of words or characters generated by the machine translation model to be similar to the source sentence, without considering the isochronicity of speech, as the speech duration of words/characters varies across languages. In this paper, we propose a machine translation system tailored for the task of video dubbing, which directly considers the speech duration of each token in translation to match the length of source and target speech. Specifically, we control the speech length of the generated sentence by guiding the prediction of each word with duration information, including the speech duration of the word itself as well as how much duration is left for the remaining words. We design experiments on four language directions (German -> English, Spanish -> English, Chinese <-> English), and the results show that the proposed method achieves better length control on the generated speech than baseline methods. To make up for the lack of real-world datasets, we also construct a real-world test set collected from films to provide comprehensive evaluations of the video dubbing task.
    A Deep Learning Approach to the Prediction of Drug Side-Effects on Molecular Graphs. (arXiv:2211.16871v1 [stat.ML])
    Predicting drug side-effects before they occur is a key task in keeping the number of drug-related hospitalizations low and to improve drug discovery processes. Automatic predictors of side-effects generally are not able to process the structure of the drug, resulting in a loss of information. Graph neural networks have seen great success in recent years, thanks to their ability of exploiting the information conveyed by the graph structure and labels. These models have been used in a wide variety of biological applications, among which the prediction of drug side-effects on a large knowledge graph. Exploiting the molecular graph encoding the structure of the drug represents a novel approach, in which the problem is formulated as a multi-class multi-label graph-focused classification. We developed a methodology to carry out this task, using recurrent Graph Neural Networks, and building a dataset from freely accessible and well established data sources. The results show that our method has an improved classification capability, under many parameters and metrics, with respect to previously available predictors.
    Autonomous Assessment of Demonstration Sufficiency via Bayesian Inverse Reinforcement Learning. (arXiv:2211.15542v2 [cs.LG] UPDATED)
    In this paper we examine the problem of determining demonstration sufficiency for AI agents that learn from demonstrations: how can an AI agent self-assess whether it has received enough demonstrations from an expert to ensure a desired level of performance? To address this problem we propose a novel self-assessment approach based on Bayesian inverse reinforcement learning and value-at-risk to enable agents that learn from demonstrations to compute high-confidence bounds on their performance and use these bounds to determine when they have a sufficient number of demonstrations. We propose and evaluate two definitions of sufficiency: (1) normalized expected value difference, which measures regret with respect to the expert's unobserved reward function, and (2) improvement over a baseline policy. We demonstrate how to formulate high-confidence bounds on both of these metrics. We evaluate our approach in simulation and demonstrate the feasibility of developing an AI system that can accurately evaluate whether it has received sufficient training data to guarantee, with high confidence, that it can match an expert's performance or surpass the performance of a baseline policy within some desired safety threshold.
    On the Design of Communication-Efficient Federated Learning for Health Monitoring. (arXiv:2211.16952v1 [cs.LG])
    With the booming deployment of the Internet of Things, health monitoring applications have gradually prospered. With the recent COVID-19 pandemic, interest in permanent remote health monitoring solutions has risen, with the goal of reducing contact and preserving limited medical resources. Among the technological methods to realize efficient remote health monitoring, federated learning (FL) has drawn particular attention due to its robustness in preserving data privacy. However, FL can incur high communication costs due to frequent transmissions between the FL server and clients. To tackle this problem, we propose in this paper a communication-efficient federated learning (CEFL) framework that involves client clustering and transfer learning. First, we propose to group clients by calculating similarity factors based on neural network characteristics. Then, a representative client in each cluster is selected to be the leader of the cluster. Unlike conventional FL, our method performs FL training only among the cluster leaders. Subsequently, transfer learning is adopted by each leader to update its cluster members with the trained FL model. Finally, each member fine-tunes the received model with its own data. To further reduce communication costs, we opt for a partial-layer FL aggregation approach, in which the neural network model is updated partially rather than fully. Through experiments, we show that CEFL can save up to 98.45% in communication costs while conceding less than 3% in accuracy loss, when compared to conventional FL. Finally, CEFL demonstrates high accuracy for clients with small or unbalanced datasets.
    OpenRAN Gym: AI/ML Development, Data Collection, and Testing for O-RAN on PAWR Platforms. (arXiv:2207.12362v2 [cs.NI] UPDATED)
    Open Radio Access Network (RAN) architectures will enable interoperability, openness and programmable data-driven control in next generation cellular networks. However, developing and testing efficient solutions that generalize across heterogeneous cellular deployments and scales, and that optimize network performance in such diverse environments is a complex task that is still largely unexplored. In this paper we present OpenRAN Gym, a unified, open, and O-RAN-compliant experimental toolbox for data collection, design, prototyping and testing of end-to-end data-driven control solutions for next generation Open RAN systems. OpenRAN Gym extends and combines into a unique solution several software frameworks for data collection of RAN statistics and RAN control, and a lightweight O-RAN near-real-time RAN Intelligent Controller (RIC) tailored to run on experimental wireless platforms. We first provide an overview of the various architectural components of OpenRAN Gym and describe how it is used to collect data and design, train and test artificial intelligence and machine learning O-RAN-compliant applications (xApps) at scale. We then describe in detail how to test the developed xApps on softwarized RANs and provide an example of two xApps developed with OpenRAN Gym that are used to control a network with 7 base stations and 42 users deployed on the Colosseum testbed. Finally, we show how solutions developed with OpenRAN Gym on Colosseum can be exported to real-world, heterogeneous wireless platforms, such as the Arena testbed and the POWDER and COSMOS platforms of the PAWR program. OpenRAN Gym and its software components are open-source and publicly-available to the research community. By guiding the readers through running experiments with OpenRAN Gym, we aim at providing a key reference for researchers and practitioners working on experimental Open RAN systems.
    A Unifying Theory of Distance from Calibration. (arXiv:2211.16886v1 [cs.LG])
    We study the fundamental question of how to define and measure the distance from calibration for probabilistic predictors. While the notion of perfect calibration is well-understood, there is no consensus on how to quantify the distance from perfect calibration. Numerous calibration measures have been proposed in the literature, but it is unclear how they compare to each other, and many popular measures such as Expected Calibration Error (ECE) fail to satisfy basic properties like continuity. We present a rigorous framework for analyzing calibration measures, inspired by the literature on property testing. We propose a ground-truth notion of distance from calibration: the $\ell_1$ distance to the nearest perfectly calibrated predictor. We define a consistent calibration measure as one that is a polynomial-factor approximation to this distance. Applying our framework, we identify three calibration measures that are consistent and can be estimated efficiently: smooth calibration, interval calibration, and Laplace kernel calibration. The former two give quadratic approximations to the ground-truth distance, which we show is information-theoretically optimal. Our work thus establishes fundamental lower and upper bounds on measuring the distance to calibration, and also provides theoretical justification for preferring certain metrics (like Laplace kernel calibration) in practice.
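    To make the continuity issue concrete, here is the standard binned ECE, one of the measures criticized above: tiny changes in predictions can move points across bin boundaries and change the value discontinuously, unlike the smooth ground-truth distance the paper proposes. The implementation is a generic sketch, not the paper's code.

    ```python
    import numpy as np

    def ece(probs, labels, n_bins=10):
        """Expected Calibration Error with equal-width confidence bins."""
        bins = np.linspace(0.0, 1.0, n_bins + 1)
        total = 0.0
        for lo, hi in zip(bins[:-1], bins[1:]):
            mask = (probs > lo) & (probs <= hi)
            if mask.any():
                conf = probs[mask].mean()      # average predicted probability
                acc = labels[mask].mean()      # empirical frequency in the bin
                total += mask.mean() * abs(conf - acc)
        return total

    rng = np.random.default_rng(0)
    p = rng.uniform(size=10000)
    y = (rng.uniform(size=10000) < p).astype(float)  # perfectly calibrated data
    print(ece(p, y))                                 # close to 0
    ```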
    ALARM: Active LeArning of Rowhammer Mitigations. (arXiv:2211.16942v1 [cs.CR])
    Rowhammer is a serious security problem in contemporary dynamic random-access memory (DRAM) where reads or writes of bits can flip other bits. DRAM manufacturers add mitigations, but don't disclose details, making it difficult for customers to evaluate their efficacy. We present a tool, based on active learning, that automatically infers the parameters of Rowhammer mitigations against synthetic models of modern DRAM.
    Federated deep clustering with GAN-based data synthesis. (arXiv:2211.16965v1 [cs.LG])
    Clustering has been extensively studied in centralized settings, but remains relatively unexplored in federated settings, where data are distributed among multiple clients and can only be kept local at the clients. The necessity to invest more resources in improving federated clustering methods is twofold: 1) the performance of supervised federated learning models can benefit from clustering; 2) it is non-trivial to extend centralized methods to perform federated clustering tasks. In centralized settings, various deep clustering methods that perform dimensionality reduction and clustering jointly have achieved great success. To obtain high-quality cluster information, it is natural but non-trivial to extend these methods to federated settings. For this purpose, we propose a simple but effective federated deep clustering method. It requires only one communication round between the central server and the clients, can run asynchronously, and can handle device failures. Moreover, although most studies have highlighted the adverse effects of non-independent and identically distributed (non-IID) data across clients, our experimental results indicate that the proposed method can significantly benefit from this scenario.
    A Pipeline for Generating, Annotating and Employing Synthetic Data for Real World Question Answering. (arXiv:2211.16971v1 [cs.CL])
    Question Answering (QA) is a growing area of research, often used to facilitate the extraction of information from within documents. State-of-the-art QA models are usually pre-trained on domain-general corpora like Wikipedia and thus tend to struggle on out-of-domain documents without fine-tuning. We demonstrate that synthetic domain-specific datasets can be generated easily using domain-general models, while still providing significant improvements to QA performance. We present two new tools for this task: A flexible pipeline for validating the synthetic QA data and training downstream models on it, and an online interface to facilitate human annotation of this generated data. Using this interface, crowdworkers labelled 1117 synthetic QA pairs, which we then used to fine-tune downstream models and improve domain-specific QA performance by 8.75 F1.
    Toward Robust Diagnosis: A Contour Attention Preserving Adversarial Defense for COVID-19 Detection. (arXiv:2211.16806v1 [eess.IV])
    As the COVID-19 pandemic puts pressure on healthcare systems worldwide, computed tomography image based AI diagnostic systems have become a sustainable solution for early diagnosis. However, their model-wise vulnerability under adversarial perturbation hinders deployment in practical situations. Existing adversarial training strategies are difficult to generalize to the medical imaging field, which is challenged by complex medical texture features. To overcome this challenge, we propose a Contour Attention Preserving (CAP) method based on lung cavity edge extraction. The contour prior features are injected into the attention layer via parameter regularization, and we optimize the robust empirical risk with a hybrid distance metric. We then introduce a new cross-nation CT scan dataset to evaluate the generalization capability of adversarial robustness under distribution shift. Experimental results indicate that the proposed method achieves state-of-the-art performance in multiple adversarial defense and generalization tasks. The code and dataset are available at https://github.com/Quinn777/CAP.
    Quadapter: Adapter for GPT-2 Quantization. (arXiv:2211.16912v1 [cs.LG])
    Transformer language models such as GPT-2 are difficult to quantize because of outliers in activations leading to a large quantization error. To adapt to the error, one must use quantization-aware training, which entails a fine-tuning process based on the dataset and the training pipeline identical to those for the original model. Pretrained language models, however, often do not grant access to their datasets and training pipelines, forcing us to rely on arbitrary ones for fine-tuning. In that case, it is observed that quantization-aware training overfits the model to the fine-tuning data. For quantization without overfitting, we introduce a quantization adapter (Quadapter), a small set of parameters that are learned to make activations quantization-friendly by scaling them channel-wise. It keeps the model parameters unchanged. By applying our method to the challenging task of quantizing GPT-2, we demonstrate that it effectively prevents the overfitting and improves the quantization performance.
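    A hedged sketch of the channel-wise scaling idea: learnable per-channel scales bring outlier channels into a quantization-friendly range before a shared fake quantizer and are undone afterwards, leaving the original model weights untouched. The class name, the uniform quantizer, and the omission of straight-through gradient estimation are all simplifications, not the paper's implementation.

    ```python
    import torch

    class Quadapter(torch.nn.Module):
        """Learnable per-channel scales wrapped around a shared fake quantizer;
        the surrounding model weights stay frozen."""
        def __init__(self, n_channels, n_bits=8):
            super().__init__()
            self.scale = torch.nn.Parameter(torch.ones(n_channels))
            self.levels = 2 ** n_bits - 1

        def fake_quant(self, x):
            s = x.abs().max().clamp(min=1e-8) / (self.levels / 2)
            return torch.round(x / s) * s        # uniform symmetric quantizer

        def forward(self, x):                    # x: (batch, channels)
            x = x * self.scale                   # squeeze outlier channels
            x = self.fake_quant(x)               # quantize in a friendly range
            return x / self.scale                # undo the scaling

    x = torch.randn(4, 16) * torch.linspace(0.1, 10.0, 16)  # outlier channels
    print(Quadapter(16)(x).shape)                            # torch.Size([4, 16])
    ```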
    Prediction of Oral Food Challenge Outcomes via Ensemble Learning. (arXiv:2208.08268v2 [cs.LG] UPDATED)
    Oral Food Challenges (OFCs) are essential to accurately diagnosing food allergy due to the limitations of existing clinical testing. However, some patients are hesitant to undergo OFCs, while those willing suffer from limited access to allergists in rural/community healthcare settings. Despite its success in predicting patient outcomes in other clinical settings, few applications of machine learning to food allergy have been developed. Thus, in this study, we seek to leverage machine learning methodologies for OFC outcome prediction. Retrospective data was gathered from 1,112 patients who collectively underwent a total of 1,284 OFCs, and consisted of clinical factors including serum-specific Immunoglobulin E (IgE), total IgE, skin prick tests (SPTs), comorbidities, sex, and age. Using these features, multiple machine learning models were constructed to predict OFC outcomes for three common allergens: peanut, egg, and milk. The best performing model for each allergen was an ensemble of random forest (egg) or Learning Using Concave and Convex Kernels (LUCCK) (peanut, milk) models, which achieved an Area under the Curve (AUC) of 0.91, 0.96, and 0.94, in predicting OFC outcomes for peanut, egg, and milk, respectively. Moreover, all such models had sensitivity and specificity values of at least 89%. Model interpretation via SHapley Additive exPlanations (SHAP) indicates that specific IgE, along with wheal and flare values from SPTs, are highly predictive of OFC outcomes. The results of this analysis suggest that ensemble learning has the potential to predict OFC outcomes and reveal relevant clinical factors for further study.
    Self-Supervised Learning for Anomalous Channel Detection in EEG Graphs: Application to Seizure Analysis. (arXiv:2208.07448v2 [cs.LG] UPDATED)
    Electroencephalogram (EEG) signals are effective tools for seizure analysis, where one of the most important challenges is accurate detection of seizure events and of the brain regions in which a seizure happens or initiates. However, all existing machine learning-based algorithms for seizure analysis require access to labeled seizure data, and acquiring such labels is labor-intensive, expensive, and clinician-dependent given the subjective nature of the visual qualitative interpretation of EEG signals. In this paper, we propose to detect seizure channels and clips in a self-supervised manner where no access to seizure data is needed. The proposed method considers local structural and contextual information embedded in EEG graphs by employing positive and negative sub-graphs. We train our method by minimizing contrastive and generative losses. The use of local EEG sub-graphs makes the algorithm an appropriate choice when access to all EEG channels is impossible due to complications such as skull fractures. We conduct an extensive set of experiments on the largest seizure dataset and demonstrate that our proposed framework outperforms the state-of-the-art methods in EEG-based seizure study. The proposed method is the only study that requires no access to seizure data in its training phase, yet it establishes a new state of the art in the field and outperforms all related supervised methods.
    Corneal endothelium assessment in specular microscopy images with Fuchs' dystrophy via deep regression of signed distance maps. (arXiv:2210.07102v2 [eess.IV] UPDATED)
    Specular microscopy assessment of the human corneal endothelium (CE) in Fuchs' dystrophy is challenging due to the presence of dark image regions called guttae. This paper proposes a UNet-based segmentation approach that requires minimal post-processing and achieves reliable CE morphometric assessment and guttae identification across all degrees of Fuchs' dystrophy. We cast the segmentation problem as a regression task of the cell and gutta signed distance maps instead of a pixel-level classification task as typically done with UNets. Compared to the conventional UNet classification approach, the distance-map regression approach converges faster in clinically relevant parameters. It also produces morphometric parameters that agree with the manually-segmented ground-truth data, namely the average cell density difference of -41.9 cells/mm$^2$ (95% confidence interval (CI) [-306.2, 222.5]) and the average difference of mean cell area of 14.8 $\mu$m$^2$ (95% CI [-41.9, 71.5]). These results suggest a promising alternative for CE assessment.
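    For intuition, here is a small sketch of how a signed distance map regression target can be built from a binary cell mask (positive inside, negative outside); it assumes scipy and is not necessarily the authors' exact preprocessing.

```python
# Sketch: turn a binary mask into the signed distance map used as a
# regression target (positive inside the object, negative outside).
import numpy as np
from scipy.ndimage import distance_transform_edt

def signed_distance_map(mask):
    """mask: boolean array, True inside cells/guttae."""
    inside = distance_transform_edt(mask)    # distance to boundary, inside
    outside = distance_transform_edt(~mask)  # distance to boundary, outside
    return inside - outside

mask = np.zeros((64, 64), dtype=bool)
mask[20:40, 20:40] = True
sdm = signed_distance_map(mask)
print(sdm.max(), sdm.min())  # positive inside, negative outside
```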
    Pattern Attention Transformer with Doughnut Kernel. (arXiv:2211.16961v1 [cs.CV])
    We present in this paper a new architecture, the Pattern Attention Transformer (PAT), composed of a new doughnut kernel. In contrast with tokens in NLP, Transformers in computer vision must handle the high resolution of pixels in images. Inheriting the patch/window idea from ViT and its follow-ups, the doughnut kernel enhances the design of patches: it replaces the line-cut boundaries with two types of areas, sensor and updating, based on an analysis of self-attention (named the QKVA grid). The doughnut kernel also raises a new question about the shape of kernels. To verify its performance on image classification, PAT is designed with Transformer blocks of regular-octagon-shaped doughnut kernels. Its performance on ImageNet-1K surpasses the Swin Transformer (+0.7 top-1 accuracy).
    Understanding transit ridership in an equity context through a comparison of statistical and machine learning algorithms. (arXiv:2211.16736v1 [cs.LG])
    Building an accurate model of travel behaviour based on individuals' characteristics and built environment attributes is of importance for policy-making and transportation planning. Recent experiments with big data and Machine Learning (ML) algorithms toward a better travel behaviour analysis have mainly overlooked socially disadvantaged groups. Accordingly, in this study, we explore the travel behaviour responses of low-income individuals to transit investments in the Greater Toronto and Hamilton Area, Canada, using statistical and ML models. We first investigate how the model choice affects the prediction of transit use by the low-income group. This step includes comparing the predictive performance of traditional and ML algorithms and then evaluating a transit investment policy by contrasting the predicted activities and the spatial distribution of transit trips generated by vulnerable households after improving accessibility. We also empirically investigate the proposed transit investment by each algorithm and compare it with the city of Brampton's future transportation plan. While, unsurprisingly, the ML algorithms outperform classical models, there are still doubts about using them due to interpretability concerns. Hence, we adopt recent local and global model-agnostic interpretation tools to interpret how the model arrives at its predictions. Our findings reveal the great potential of ML algorithms for enhanced travel behaviour predictions for low-income strata without considerably sacrificing interpretability.
    Efficient Adversarial Input Generation via Neural Net Patching. (arXiv:2211.16808v1 [cs.LG])
    The adversarial input generation problem has become central in establishing the robustness and trustworthiness of deep neural nets, especially when they are used in safety-critical application domains such as autonomous vehicles and precision medicine. This is also practically challenging for multiple reasons: scalability is a common issue owing to large-sized networks, and the generated adversarial inputs often lack important qualities such as naturalness and output-impartiality. We relate this problem to the task of patching neural nets, i.e., applying small changes in some of the network's weights so that the modified net satisfies a given property. Intuitively, a patch can be used to produce an adversarial input because the effect of changing the weights can also be brought about by changing the inputs instead. This work presents a novel technique to patch neural networks and an innovative approach of using it to produce perturbations of inputs which are adversarial for the original net. We note that the proposed solution is significantly more effective than the prior state-of-the-art techniques.
    Interpretability and accessibility of machine learning in selected food processing, agriculture and health applications. (arXiv:2211.16699v1 [cs.LG])
    Artificial Intelligence (AI) and its data-centric branch of machine learning (ML) have greatly evolved over the last few decades. However, as AI is used increasingly in real world use cases, the importance of the interpretability of and accessibility to AI systems have become major research areas. The lack of interpretability of ML based systems is a major hindrance to widespread adoption of these powerful algorithms. This is due to many reasons including ethical and regulatory concerns, which have resulted in poorer adoption of ML in some areas. The recent past has seen a surge in research on interpretable ML. Generally, designing a ML system requires good domain understanding combined with expert knowledge. New techniques are emerging to improve ML accessibility through automated model design. This paper provides a review of the work done to improve interpretability and accessibility of machine learning in the context of global problems while also being relevant to developing countries. We review work under multiple levels of interpretability including scientific and mathematical interpretation, statistical interpretation and partial semantic interpretation. This review includes applications in three areas, namely food processing, agriculture and health.
    WeatherFusionNet: Predicting Precipitation from Satellite Data. (arXiv:2211.16824v1 [cs.CV])
    The short-term prediction of precipitation is critical in many areas of life. Recently, a large body of work was devoted to forecasting radar reflectivity images. The radar images are available only in areas with ground weather radars. Thus, we aim to predict high-resolution precipitation from lower-resolution satellite radiance images. A neural network called WeatherFusionNet is employed to predict severe rain up to eight hours in advance. WeatherFusionNet is a U-Net architecture that fuses three different ways to process the satellite data: predicting future satellite frames, extracting rain information from the current frames, and using the input sequence directly. Using the presented method, we achieved 1st place in the NeurIPS 2022 Weather4Cast Core challenge. The code and trained parameters are available at \url{https://github.com/Datalab-FIT-CTU/weather4cast-2022}.
    Towards Improving Exploration in Self-Imitation Learning using Intrinsic Motivation. (arXiv:2211.16838v1 [cs.LG])
    Reinforcement Learning has emerged as a strong alternative to solve optimization tasks efficiently. The use of these algorithms highly depends on the feedback signals provided by the environment in charge of informing about how good (or bad) the decisions made by the learned agent are. Unfortunately, in a broad range of problems the design of a good reward function is not trivial, so in such cases sparse reward signals are instead adopted. The lack of a dense reward function poses new challenges, mostly related to exploration. Imitation Learning has addressed those problems by leveraging demonstrations from experts. In the absence of an expert (and its subsequent demonstrations), an option is to prioritize well-suited exploration experiences collected by the agent in order to bootstrap its learning process with good exploration behaviors. However, this solution highly depends on the ability of the agent to discover such trajectories in the early stages of its learning process. To tackle this issue, we propose to combine imitation learning with intrinsic motivation, two of the most widely adopted techniques to address problems with sparse reward. In this work intrinsic motivation is used to encourage the agent to explore the environment based on its curiosity, whereas imitation learning allows repeating the most promising experiences to accelerate the learning process. This combination is shown to yield an improved performance and better generalization in procedurally-generated environments, outperforming previously reported self-imitation learning methods and achieving equal or better sample efficiency with respect to intrinsic motivation in isolation.
    Coordinating Cross-modal Distillation for Molecular Property Prediction. (arXiv:2211.16712v1 [cs.LG])
    In recent years, molecular graph representation learning (GRL) has drawn increasing attention in molecular property prediction (MPP) problems. Existing graph methods have demonstrated that 3D geometric information is significant for better performance in MPP. However, accurate 3D structures are often costly and time-consuming to obtain, limiting the large-scale application of GRL. An intuitive solution is to train with 3D-to-2D knowledge distillation and predict with only 2D inputs. But some challenging problems remain open for 3D-to-2D distillation. One is that the 3D view is quite distinct from the 2D view, and the other is that the gradient magnitudes of atoms in distillation are discrepant and unstable due to variable molecular size. To address these problems, we propose a distillation framework that contains global molecular distillation and local atom distillation. We also provide a theoretical insight to justify how to coordinate atom and molecular information, which tackles the drawback of variable molecular size for atom information distillation. Experimental results on two popular molecular datasets demonstrate that our proposed model achieves superior performance over other methods. Specifically, on PCQM4Mv2, the largest MPP dataset, which serves as an "ImageNet Large Scale Visual Recognition Challenge" for graph ML, the proposed method achieved a 6.9% improvement over the best prior work. We also obtained fourth place, with an MAE of 0.0734 on the test-challenge set, in the OGB-LSC 2022 Graph Regression Task. We will release the code soon.
    Generating Realistic Synthetic Relational Data through Graph Variational Autoencoders. (arXiv:2211.16889v1 [cs.LG])
    Synthetic data generation has recently gained widespread attention as a more reliable alternative to traditional data anonymization. The involved methods are originally developed for image synthesis. Hence, their application to the typically tabular and relational datasets from healthcare, finance and other industries is non-trivial. While substantial research has been devoted to the generation of realistic tabular datasets, the study of synthetic relational databases is still in its infancy. In this paper, we combine the variational autoencoder framework with graph neural networks to generate realistic synthetic relational databases. We then apply the obtained method to two publicly available databases in computational experiments. The results indicate that real databases' structures are accurately preserved in the resulting synthetic datasets, even for large datasets with advanced data types.
    Policy Optimization over General State and Action Spaces. (arXiv:2211.16715v1 [cs.LG])
    Reinforcement learning (RL) problems over general state and action spaces are notoriously challenging. In contrast to the tabular setting, one cannot enumerate all the states and then iteratively update the policies for each state. This prevents the application of many well-studied RL methods, especially those with provable convergence guarantees. In this paper, we first present a substantial generalization of the recently developed policy mirror descent method to deal with general state and action spaces. We introduce new approaches to incorporate function approximation into this method, so that we do not need to use explicit policy parameterization at all. Moreover, we present a novel policy dual averaging method for which possibly simpler function approximation techniques can be applied. We establish linear convergence rate to global optimality or sublinear convergence to stationarity for these methods applied to solve different classes of RL problems under exact policy evaluation. We then define proper notions of the approximation errors for policy evaluation and investigate their impact on the convergence of these methods applied to general-state RL problems with either finite-action or continuous-action spaces. To the best of our knowledge, the development of these algorithmic frameworks as well as their convergence analysis appear to be new in the literature.
    Evaluating Digital Agriculture Recommendations with Causal Inference. (arXiv:2211.16938v1 [cs.LG])
    In contrast to the rapid digitalization of several industries, agriculture suffers from low adoption of smart farming tools. While AI-driven digital agriculture tools can offer high-performing predictive functionalities, they lack tangible quantitative evidence on their benefits to the farmers. Field experiments can derive such evidence, but are often costly, time consuming and hence limited in scope and scale of application. To this end, we propose an observational causal inference framework for the empirical evaluation of the impact of digital tools on target farm performance indicators (e.g., yield in this case). This way, we can increase farmers' trust via enhancing the transparency of the digital agriculture market and accelerate the adoption of technologies that aim to secure farmer income resilience and global agricultural sustainability. As a case study, we designed and implemented a recommendation system for the optimal sowing time of cotton based on numerical weather predictions, which was used by a farmers' cooperative during the growing season of 2021. We then leverage agricultural knowledge, collected yield data, and environmental information to develop a causal graph of the farm system. Using the back-door criterion, we identify the impact of sowing recommendations on the yield and subsequently estimate it using linear regression, matching, inverse propensity score weighting and meta-learners. The results reveal that a field sown according to our recommendations exhibited a statistically significant yield increase that ranged from 12% to 17%, depending on the method. The effect estimates were robust, as indicated by the agreement among the estimation methods and four successful refutation tests. We argue that this approach can be implemented for decision support systems of other fields, extending their evaluation beyond a performance assessment of internal functionalities.
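    As one concrete instance of the estimators mentioned above, the following sketch computes an inverse-propensity-weighted treatment-effect estimate on synthetic data; the data-generating process and variable names are hypothetical stand-ins for the farm covariates, not the study's actual data.

```python
# Minimal inverse-propensity-weighting sketch for the effect of following a
# recommendation (treated=1) on yield; synthetic data for illustration only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000
confounder = rng.normal(size=n)                     # e.g., a weather/soil proxy
treated = (confounder + rng.normal(size=n) > 0).astype(int)
yield_ = 2.0 + 0.5 * confounder + 0.3 * treated + rng.normal(scale=0.1, size=n)

# Propensity scores from the observed confounder (the back-door adjustment set).
e = LogisticRegression().fit(confounder[:, None], treated).predict_proba(
    confounder[:, None])[:, 1]

# Horvitz-Thompson style IPW estimate of the average treatment effect.
ate = np.mean(treated * yield_ / e) - np.mean((1 - treated) * yield_ / (1 - e))
print(f"IPW ATE estimate: {ate:.3f} (true effect 0.3)")
```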
    Score-based Continuous-time Discrete Diffusion Models. (arXiv:2211.16750v1 [cs.LG])
    Score-based modeling through stochastic differential equations (SDEs) has provided a new perspective on diffusion models, and demonstrated superior performance on continuous data. However, the gradient of the log-likelihood function, i.e., the score function, is not properly defined for discrete spaces. This makes it non-trivial to adapt score-based modeling to categorical data. In this paper, we extend diffusion models to discrete variables by introducing a stochastic jump process where the reverse process denoises via a continuous-time Markov chain. This formulation admits an analytical simulation during backward sampling. To learn the reverse process, we extend score matching to general categorical data and show that an unbiased estimator can be obtained via simple matching of the conditional marginal distributions. We demonstrate the effectiveness of the proposed method on a set of synthetic and real-world music and image benchmarks.
    Continual Learning with Optimal Transport based Mixture Model. (arXiv:2211.16780v1 [cs.LG])
    Online Class Incremental Learning (CIL) is a challenging setting in Continual Learning (CL), wherein data of new tasks arrive in incoming streams and online learning models need to handle them without revisiting previous ones. Existing works use a single centroid, adapted with the incoming data stream, to characterize a class. This approach exposes limitations when the incoming data stream of a class is naturally multimodal. To address this issue, we first propose an online mixture model learning approach based on properties of optimal transport theory (OT-MM). Specifically, the centroids and covariance matrices of the mixture model are adapted incrementally according to incoming data streams. The advantages are two-fold: (i) we can characterize complex data streams more accurately and (ii) using the per-class centroids produced by OT-MM, we can estimate the similarity of an unseen example to each class more reasonably during inference. Moreover, to combat catastrophic forgetting in the CIL scenario, we further propose Dynamic Preservation. In particular, after performing the dynamic preservation technique across data streams, the latent representations of the classes in the old and new tasks become more condensed and better separated from each other. Together with a contraction feature extractor, this technique helps the model mitigate catastrophic forgetting. Experimental results on real-world datasets show that our proposed method can significantly outperform the current state-of-the-art baselines.
    VI-PINNs: Variance-involved Physics-informed Neural Networks for Fast and Accurate Prediction of Partial Differential Equations. (arXiv:2211.16753v1 [cs.LG])
    Although physics-informed neural networks (PINNs) have recently made great progress in many real applications, problems remain to be further studied, such as achieving more accurate results, reducing training time, and quantifying the uncertainty of the predictions. Recent advances have significantly improved the performance of PINNs in many respects, but few have considered the effect of variance in the training process. In this work, we take the effect of variance into consideration and propose VI-PINNs to give better predictions. The final layer of the network outputs two values, representing the predicted mean and variance respectively; the latter represents the uncertainty of the output. A modified negative log-likelihood loss and an auxiliary task are introduced for fast and accurate training. We perform several experiments on a wide range of problems to highlight the advantages of our approach. The results show that our method not only gives more accurate predictions but also converges faster.
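    A minimal sketch of the two-output mean/variance idea with a Gaussian negative log-likelihood loss; the tiny network and target function are illustrative assumptions, not the paper's architecture or its auxiliary task.

```python
# Sketch: a two-headed output trained with a Gaussian negative
# log-likelihood, so the second head quantifies predictive uncertainty.
import torch
import torch.nn as nn

class MeanVarNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 2))

    def forward(self, x):
        mu, log_var = self.body(x).chunk(2, dim=-1)
        return mu, log_var.exp()  # predicted mean and (positive) variance

net, nll = MeanVarNet(), nn.GaussianNLLLoss()
x = torch.rand(128, 1)
u_true = torch.sin(torch.pi * x)    # stand-in for a PDE solution value
mu, var = net(x)
loss = nll(mu, u_true, var)         # -log N(u_true; mu, var), averaged
loss.backward()
```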
    Taming Hyperparameter Tuning in Continuous Normalizing Flows Using the JKO Scheme. (arXiv:2211.16757v1 [math.OC])
    A normalizing flow (NF) is a mapping that transforms a chosen probability distribution to a normal distribution. Such flows are a common technique used for data generation and density estimation in machine learning and data science. The density estimate obtained with a NF requires a change of variables formula that involves the computation of the Jacobian determinant of the NF transformation. In order to tractably compute this determinant, continuous normalizing flows (CNF) estimate the mapping and its Jacobian determinant using a neural ODE. Optimal transport (OT) theory has been successfully used to assist in finding CNFs by formulating them as OT problems with a soft penalty for enforcing the standard normal distribution as a target measure. A drawback of OT-based CNFs is the addition of a hyperparameter, $\alpha$, that controls the strength of the soft penalty and requires significant tuning. We present JKO-Flow, an algorithm to solve OT-based CNF without the need of tuning $\alpha$. This is achieved by integrating the OT CNF framework into a Wasserstein gradient flow framework, also known as the JKO scheme. Instead of tuning $\alpha$, we repeatedly solve the optimization problem for a fixed $\alpha$ effectively performing a JKO update with a time-step $\alpha$. Hence we obtain a "divide and conquer" algorithm by repeatedly solving simpler problems instead of solving a potentially harder problem with large $\alpha$.
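    For reference, one JKO step with time-step $\alpha$ takes the standard proximal form below, where $F$ denotes the objective functional driving the flow and $W_2$ the Wasserstein-2 distance:

```latex
% One JKO step: a proximal-point update of F in the Wasserstein-2 metric.
\rho_{k+1} \;=\; \operatorname*{arg\,min}_{\rho}\;
  F(\rho) \;+\; \frac{1}{2\alpha}\, W_2^2\!\left(\rho,\, \rho_k\right)
```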
    Robust and Fast Measure of Information via Low-rank Representation. (arXiv:2211.16784v1 [cs.LG])
    The matrix-based Rényi's entropy allows us to directly quantify information measures from given data, without explicit estimation of the underlying probability distribution. This intriguing property makes it widely applied in statistical inference and machine learning tasks. However, this information theoretical quantity is not robust against noise in the data, and is computationally prohibitive in large-scale applications. To address these issues, we propose a novel measure of information, termed low-rank matrix-based Rényi's entropy, based on low-rank representations of infinitely divisible kernel matrices. The proposed entropy functional inherits the specialty of the original definition to directly quantify information from data, but enjoys additional advantages including robustness and efficient computation. Specifically, our low-rank variant is more sensitive to informative perturbations induced by changes in underlying distributions, while being insensitive to uninformative ones caused by noise. Moreover, low-rank Rényi's entropy can be efficiently approximated by random projection and Lanczos iteration techniques, reducing the overall complexity from $\mathcal{O}(n^3)$ to $\mathcal{O}(n^2 s)$ or even $\mathcal{O}(ns^2)$, where $n$ is the number of data samples and $s \ll n$. We conduct large-scale experiments to evaluate the effectiveness of this new information measure, demonstrating superior results compared to matrix-based Rényi's entropy in terms of both performance and computational efficiency.
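    For context, the standard matrix-based Rényi's entropy of order $\alpha$ for a trace-normalized kernel Gram matrix $A$ with eigenvalues $\lambda_i(A)$ is given below; the low-rank variant approximates the dominant part of this spectrum instead of computing all $n$ eigenvalues:

```latex
% Standard matrix-based Renyi entropy of order \alpha; A is the
% trace-one kernel Gram matrix, \lambda_i(A) its eigenvalues.
S_\alpha(A) \;=\; \frac{1}{1-\alpha}\,
  \log_2\!\left[\sum_{i=1}^{n} \lambda_i(A)^{\alpha}\right]
```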
    DimenFix: A novel meta-dimensionality reduction method for feature preservation. (arXiv:2211.16752v1 [cs.LG])
    Dimensionality reduction has become an important research topic as demand for interpreting high-dimensional datasets has been increasing rapidly in recent years. There have been many dimensionality reduction methods with good performance in preserving the overall relationship among data points when mapping them to a lower-dimensional space. However, these existing methods fail to incorporate the difference in importance among features. To address this problem, we propose a novel meta-method, DimenFix, which can be applied on top of any base dimensionality reduction method that involves a gradient-descent-like process. By allowing users to define the importance of different features, which is considered in dimensionality reduction, DimenFix creates new possibilities to visualize and understand a given dataset. Meanwhile, DimenFix does not increase the time cost or reduce the quality of dimensionality reduction with respect to the base dimensionality reduction used.
    HEAT: Hardware-Efficient Automatic Tensor Decomposition for Transformer Compression. (arXiv:2211.16749v1 [cs.LG])
    Transformers have attained superior performance in natural language processing and computer vision. Their self-attention and feedforward layers are overparameterized, limiting inference speed and energy efficiency. Tensor decomposition is a promising technique to reduce parameter redundancy by leveraging tensor algebraic properties to express the parameters in a factorized form. Prior efforts used manual or heuristic factorization settings without hardware-aware customization, resulting in poor hardware efficiencies and large performance degradation. In this work, we propose a hardware-aware tensor decomposition framework, dubbed HEAT, that enables efficient exploration of the exponential space of possible decompositions and automates the choice of tensorization shape and decomposition rank with hardware-aware co-optimization. We jointly investigate tensor contraction path optimizations and a fused Einsum mapping strategy to bridge the gap between theoretical benefits and real hardware efficiency improvement. Our two-stage knowledge distillation flow resolves the trainability bottleneck and thus significantly boosts the final accuracy of factorized Transformers. Overall, we experimentally show that our hardware-aware factorized BERT variants reduce the energy-delay product by 5.7x with less than 1.1% accuracy loss and achieve a better efficiency-accuracy Pareto frontier than hand-tuned and heuristic baselines.
    Dr.3D: Adapting 3D GANs to Artistic Drawings. (arXiv:2211.16798v1 [cs.CV])
    While 3D GANs have recently demonstrated the high-quality synthesis of multi-view consistent images and 3D shapes, they are mainly restricted to photo-realistic human portraits. This paper aims to extend 3D GANs to a different, but meaningful visual form: artistic portrait drawings. However, extending existing 3D GANs to drawings is challenging due to the inevitable geometric ambiguity present in drawings. To tackle this, we present Dr.3D, a novel adaptation approach that adapts an existing 3D GAN to artistic drawings. Dr.3D is equipped with three novel components to handle the geometric ambiguity: a deformation-aware 3D synthesis network, an alternating adaptation of pose estimation and image synthesis, and geometric priors. Experiments show that our approach can successfully adapt 3D GANs to drawings and enable multi-view consistent semantic editing of drawings.
    Efficient Reinforcement Learning (ERL): Targeted Exploration Through Action Saturation. (arXiv:2211.16691v1 [cs.LG])
    Reinforcement Learning (RL) generally suffers from poor sample complexity, mostly due to the need to exhaustively explore the state space to find good policies. On the other hand, we postulate that expert knowledge of the system to control often allows us to design simple rules we expect good policies to follow at all times. In this work, we hence propose a simple yet effective modification of continuous actor-critic RL frameworks to incorporate such prior knowledge in the learned policies and constrain them to regions of the state space that are deemed interesting, thereby significantly accelerating their convergence. Concretely, we saturate the actions chosen by the agent if they do not comply with our intuition and, critically, modify the gradient update step of the policy to ensure the learning process does not suffer from the saturation step. On a room temperature control simulation case study, these modifications allow agents to converge to well-performing policies up to one order of magnitude faster than classical RL agents while retaining good final performance.
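    One common way to keep gradients useful through a saturation step is a straight-through clamp, sketched below; the paper's actual gradient correction may differ, and the bounds here are hypothetical expert-designed limits.

```python
# Hedged sketch: saturate actions to an expert-designed admissible interval
# while letting gradients flow through the pre-saturation action via a
# straight-through trick. Bounds and shapes are hypothetical.
import torch

def saturate(action, low, high):
    clipped = action.clamp(low, high)
    return action + (clipped - action).detach()  # identity gradient w.r.t. action

a = torch.tensor([1.7, -0.2, 0.4], requires_grad=True)
a_sat = saturate(a, low=0.0, high=1.0)
a_sat.sum().backward()
print(a_sat, a.grad)  # saturated values; a gradient of ones passes through
```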
    Boosted Dynamic Neural Networks. (arXiv:2211.16726v1 [cs.LG])
    Early-exiting dynamic neural networks (EDNNs), as one type of dynamic neural network, have been widely studied recently. A typical EDNN has multiple prediction heads at different layers of the network backbone. During inference, the model exits at either the last prediction head or the first intermediate prediction head whose prediction confidence exceeds a predefined threshold. To optimize the model, these prediction heads, together with the network backbone, are trained on every batch of training data. This causes a train-test mismatch: all prediction heads are optimized on all types of data during training, while the deeper heads only see difficult inputs at test time. Treating inputs differently in the two phases causes a mismatch between the training and testing data distributions. To mitigate this problem, we formulate an EDNN as an additive model inspired by gradient boosting and propose multiple training techniques to optimize the model effectively. We name our method BoostNet. Our experiments show it achieves state-of-the-art performance on the CIFAR100 and ImageNet datasets in both anytime and budgeted-batch prediction modes. Our code is released at https://github.com/SHI-Labs/Boosted-Dynamic-Networks.
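    A toy sketch of the additive (gradient-boosting) view of early exits: each head contributes a residual to the running ensemble logits, and inference stops once confidence clears a threshold. The layer sizes and exit rule are illustrative assumptions, not BoostNet's full training recipe.

```python
# Sketch of boosted early exits: head k adds a residual to the running
# ensemble logits; inference exits once the prediction is confident enough.
import torch
import torch.nn as nn

class BoostedExits(nn.Module):
    def __init__(self, dim=32, num_classes=10, num_exits=3):
        super().__init__()
        self.blocks = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_exits))
        self.heads = nn.ModuleList(nn.Linear(dim, num_classes) for _ in range(num_exits))

    def forward(self, x, threshold=0.9):
        logits = 0.0
        for block, head in zip(self.blocks, self.heads):
            x = torch.relu(block(x))
            logits = logits + head(x)            # additive (boosted) prediction
            conf = logits.softmax(-1).amax(-1)
            if conf.min() > threshold:           # early exit once confident
                break
        return logits

print(BoostedExits()(torch.randn(4, 32)).shape)  # torch.Size([4, 10])
```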
    The multi-modal universe of fast-fashion: the Visuelle 2.0 benchmark. (arXiv:2204.06972v2 [cs.CV] UPDATED)
    We present Visuelle 2.0, the first dataset useful for tackling the diverse prediction problems that a fast-fashion company must manage routinely. We also demonstrate that computer vision is substantial in this scenario. Visuelle 2.0 contains data for 6 seasons / 5,355 clothing products of Nuna Lie, a famous Italian company with hundreds of shops located across the country. In particular, we focus on a specific prediction problem: short-observation new product sale forecasting (SO-fore). SO-fore assumes that the season has started and a set of new products is on the shelves of the different stores. The goal is to forecast sales over a particular horizon given a short available past (a few weeks), since no earlier statistics are available. To be successful, SO-fore approaches should capture this short past and exploit other modalities or exogenous data. To these ends, Visuelle 2.0 is equipped with disaggregated data at the item-shop level and multi-modal information for each clothing item, allowing computer vision approaches to come into play. The main message we deliver is that using image data with deep networks boosts the performance obtained when using the time series alone in long-term forecasting scenarios, improving WAPE and MAE by up to 5.48% and 7% respectively compared to competitive baseline methods. The dataset is available at https://humaticslab.github.io/forecasting/visuelle
    Towards Training GNNs using Explanation Directed Message Passing. (arXiv:2211.16731v1 [cs.LG])
    With the increasing use of Graph Neural Networks (GNNs) in critical real-world applications, several post hoc explanation methods have been proposed to understand their predictions. However, there has been no work in generating explanations on the fly during model training and utilizing them to improve the expressive power of the underlying GNN models. In this work, we introduce a novel explanation-directed neural message passing framework for GNNs, EXPASS (EXplainable message PASSing), which aggregates only embeddings from nodes and edges identified as important by a GNN explanation method. EXPASS can be used with any existing GNN architecture and subgraph-optimizing explainer to learn accurate graph embeddings. We theoretically show that EXPASS alleviates the oversmoothing problem in GNNs by slowing the layer-wise loss of Dirichlet energy and that the embedding difference between the vanilla message passing and EXPASS framework can be upper bounded by the difference of their respective model weights. Our empirical results show that graph embeddings learned using EXPASS improve the predictive performance and alleviate the oversmoothing problems of GNNs, opening up new frontiers in graph machine learning to develop explanation-based training frameworks.
    Protein Language Models and Structure Prediction: Connection and Progression. (arXiv:2211.16742v1 [q-bio.QM])
    The prediction of protein structures from sequences is an important task for function prediction, drug design, and related biological processes understanding. Recent advances have proved the power of language models (LMs) in processing the protein sequence databases, which inherit the advantages of attention networks and capture useful information in learning representations for proteins. The past two years have witnessed remarkable success in tertiary protein structure prediction (PSP), including evolution-based and single-sequence-based PSP. It seems that instead of using energy-based models and sampling procedures, protein language model (pLM)-based pipelines have emerged as mainstream paradigms in PSP. Despite the fruitful progress, the PSP community needs a systematic and up-to-date survey to help bridge the gap between LMs in the natural language processing (NLP) and PSP domains and introduce their methodologies, advancements and practical applications. To this end, in this paper, we first introduce the similarities between protein and human languages that allow LMs to be extended to pLMs and applied to protein databases. Then, we systematically review recent advances in LMs and pLMs from the perspectives of network architectures, pre-training strategies, applications, and commonly-used protein databases. Next, different types of methods for PSP are discussed, particularly how the pLM-based architectures function in the process of protein folding. Finally, we identify challenges faced by the PSP community and foresee promising research directions along with the advances of pLMs. This survey aims to be a hands-on guide for researchers to understand PSP methods, develop pLMs and tackle challenging problems in this field for practical purposes.
    Handling Missing Data via Max-Entropy Regularized Graph Autoencoder. (arXiv:2211.16771v1 [cs.LG])
    Graph neural networks (GNNs) are popular tools for modeling relational data. Existing GNNs are not specified for attribute-incomplete graphs, making missing attribute imputation a burning issue. Recently, many works have noticed that GNNs are coupled with spectral concentration, meaning the spectrum obtained by GNNs concentrates on a local part of the spectral domain, e.g., the low frequencies, due to the oversmoothing issue. As a consequence, GNNs may be seriously flawed for reconstructing graph attributes, as graph spectral concentration tends to cause low imputation precision. In this work, we present a regularized graph autoencoder for graph attribute imputation, named MEGAE, which aims to mitigate the spectral concentration problem by maximizing the graph spectral entropy. Notably, we first present a method for estimating the graph spectral entropy without eigen-decomposition of the Laplacian matrix and provide a theoretical upper error bound. A maximum entropy regularization then acts in the latent space, which directly increases the graph spectral entropy. Extensive experiments show that MEGAE outperforms all other state-of-the-art imputation methods on a variety of benchmark datasets.
    Automated anomaly-aware 3D segmentation of bones and cartilages in knee MR images from the Osteoarthritis Initiative. (arXiv:2211.16696v1 [eess.IV])
    In medical image analysis, automated segmentation of multi-component anatomical structures, which often have a spectrum of potential anomalies and pathologies, is a challenging task. In this work, we develop a multi-step approach using U-Net-based neural networks to initially detect anomalies (bone marrow lesions, bone cysts) in the distal femur, proximal tibia and patella from 3D magnetic resonance (MR) images of the knee in individuals with varying grades of osteoarthritis. Subsequently, the extracted data are used for downstream tasks involving semantic segmentation of individual bone and cartilage volumes as well as bone anomalies. For anomaly detection, the U-Net-based models were developed to reconstruct the bone profiles of the femur and tibia in images via inpainting so anomalous bone regions could be replaced with close to normal appearances. The reconstruction error was used to detect bone anomalies. A second anomaly-aware network, which was compared to anomaly-naïve segmentation networks, was used to provide a final automated segmentation of the femoral, tibial and patellar bones and cartilages from the knee MR images containing a spectrum of bone anomalies. The anomaly-aware segmentation approach provided up to 58% reduction in Hausdorff distances for bone segmentations compared to the results from the anomaly-naïve segmentation networks. In addition, the anomaly-aware networks were able to detect bone lesions in the MR images with greater sensitivity and specificity (area under the receiver operating characteristic curve [AUC] up to 0.896) compared to the anomaly-naïve segmentation networks (AUC up to 0.874).
    N-HiTS: Neural Hierarchical Interpolation for Time Series Forecasting. (arXiv:2201.12886v6 [cs.LG] UPDATED)
    Recent progress in neural forecasting accelerated improvements in the performance of large-scale forecasting systems. Yet, long-horizon forecasting remains a very difficult task. Two common challenges afflicting the task are the volatility of the predictions and their computational complexity. We introduce N-HiTS, a model which addresses both challenges by incorporating novel hierarchical interpolation and multi-rate data sampling techniques. These techniques enable the proposed method to assemble its predictions sequentially, emphasizing components with different frequencies and scales while decomposing the input signal and synthesizing the forecast. We prove that the hierarchical interpolation technique can efficiently approximate arbitrarily long horizons in the presence of smoothness. Additionally, we conduct extensive large-scale dataset experiments from the long-horizon forecasting literature, demonstrating the advantages of our method over the state-of-the-art methods, where N-HiTS provides an average accuracy improvement of almost 20% over the latest Transformer architectures while reducing the computation time by an order of magnitude (50 times). Our code is available at bit.ly/3VA5DoT
    Transfer Entropy Bottleneck: Learning Sequence to Sequence Information Transfer. (arXiv:2211.16607v1 [cs.LG])
    When presented with a data stream of two statistically dependent variables, predicting the future of one of the variables (the target stream) can benefit from information about both its history and the history of the other variable (the source stream). For example, fluctuations in temperature at a weather station can be predicted using both temperatures and barometric readings. However, a challenge when modelling such data is that it is easy for a neural network to rely on the greatest joint correlations within the target stream, which may ignore a crucial but small information transfer from the source to the target stream. There are also often situations where the target stream may have previously been modelled independently and it would be useful to use that model to inform a new joint model. Here, we develop an information bottleneck approach for conditional learning on two dependent streams of data. Our method, which we call Transfer Entropy Bottleneck (TEB), allows one to learn a model that bottlenecks the directed information transferred from the source variable to the target variable, while quantifying this information transfer within the model. As such, TEB provides a useful new information bottleneck approach for modelling two statistically dependent streams of data in order to make predictions about one of them.
    Extracting Semantic Knowledge from GANs with Unsupervised Learning. (arXiv:2211.16710v1 [cs.CV])
    Recently, unsupervised learning has made impressive progress on various tasks. Despite the dominance of discriminative models, increasing attention is drawn to representations learned by generative models and in particular, Generative Adversarial Networks (GANs). Previous works on the interpretation of GANs reveal that GANs encode semantics in feature maps in a linearly separable form. In this work, we further find that GAN's features can be well clustered with the linear separability assumption. We propose a novel clustering algorithm, named KLiSH, which leverages the linear separability to cluster GAN's features. KLiSH succeeds in extracting fine-grained semantics of GANs trained on datasets of various objects, e.g., car, portrait, animals, and so on. With KLiSH, we can sample images from GANs along with their segmentation masks and synthesize paired image-segmentation datasets. Using the synthesized datasets, we enable two downstream applications. First, we train semantic segmentation networks on these datasets and test them on real images, realizing unsupervised semantic segmentation. Second, we train image-to-image translation networks on the synthesized datasets, enabling semantic-conditional image synthesis without human annotations.
    Capturing long-range interaction with reciprocal space neural network. (arXiv:2211.16684v1 [cond-mat.mtrl-sci])
    Machine Learning (ML) interatomic models and potentials have been widely employed in simulations of materials. Long-range interactions often dominate in ionic systems, whose dynamical behavior they significantly influence. However, long-range effects such as the Coulomb and van der Waals potentials are not considered in most ML interatomic potentials. To address this issue, we put forward a method that can take long-range effects into account for most local ML interatomic models via a reciprocal-space neural network. The structural information in real space is first transformed into reciprocal space and then encoded into a reciprocal-space potential or a global descriptor with full atomic interactions. The reciprocal-space potential and descriptor retain full invariance to Euclidean symmetry and the choice of cell. Benefiting from the reciprocal-space information, ML interatomic models can be extended to describe long-range potentials, including not only Coulomb but any other long-range interaction. A model NaCl system considering Coulomb interaction and a Ga$_x$N$_y$ system with defects are used to illustrate the advantage of our approach. At the same time, our approach helps to improve the prediction accuracy of some global properties, such as the band gap, where full atomic interactions beyond local atomic environments play a very important role. In summary, our work expands the ability of current ML interatomic models and potentials to deal with long-range effects, paving a new way for accurate prediction of global properties and large-scale dynamic simulations of systems with defects.
    Low-rank Tensor Estimation via Riemannian Gauss-Newton: Statistical Optimality and Second-Order Convergence. (arXiv:2104.12031v3 [stat.ML] UPDATED)
    In this paper, we consider the estimation of a low Tucker rank tensor from a number of noisy linear measurements. The general problem covers many specific examples arising from applications, including tensor regression, tensor completion, and tensor PCA/SVD. We consider an efficient Riemannian Gauss-Newton (RGN) method for low Tucker rank tensor estimation. Different from the generic (super)linear convergence guarantee of RGN in the literature, we prove the first local quadratic convergence guarantee of RGN for low-rank tensor estimation in the noisy setting under some regularity conditions and provide the corresponding estimation error upper bounds. A deterministic estimation error lower bound, which matches the upper bound, is provided that demonstrates the statistical optimality of RGN. The merit of RGN is illustrated through two machine learning applications: tensor regression and tensor SVD. Finally, we provide the simulation results to corroborate our theoretical findings.
    Reinforcement Learning with Dynamic Convex Risk Measures. (arXiv:2112.13414v3 [cs.LG] UPDATED)
    We develop an approach for solving time-consistent risk-sensitive stochastic optimization problems using model-free reinforcement learning (RL). Specifically, we assume agents assess the risk of a sequence of random variables using dynamic convex risk measures. We employ a time-consistent dynamic programming principle to determine the value of a particular policy, and develop policy gradient update rules that aid in obtaining optimal policies. We further develop an actor-critic style algorithm using neural networks to optimize over policies. Finally, we demonstrate the performance and flexibility of our approach by applying it to three optimization problems: statistical arbitrage trading strategies, financial hedging, and obstacle avoidance robot control.
    Kalman Bayesian Neural Networks for Closed-form Online Learning. (arXiv:2110.00944v2 [cs.LG] UPDATED)
    Compared to point estimates calculated by standard neural networks, Bayesian neural networks (BNN) provide probability distributions over the output predictions and model parameters, i.e., the weights. Training the weight distribution of a BNN, however, is more involved due to the intractability of the underlying Bayesian inference problem and thus, requires efficient approximations. In this paper, we propose a novel approach for BNN learning via closed-form Bayesian inference. For this purpose, the calculation of the predictive distribution of the output and the update of the weight distribution are treated as Bayesian filtering and smoothing problems, where the weights are modeled as Gaussian random variables. This allows closed-form expressions for training the network's parameters in a sequential/online fashion without gradient descent. We demonstrate our method on several UCI datasets and compare it to the state of the art.
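    To make the filtering view concrete, the standard Kalman measurement update is shown below; applying it to BNN weights presumes the network output has been (locally) linearized as $y = Hw + v$ with $v \sim \mathcal{N}(0, R)$ and weights $w \sim \mathcal{N}(\mu, \Sigma)$, a simplification of the paper's full treatment:

```latex
% Standard Kalman measurement update for Gaussian weights w ~ N(\mu, \Sigma)
% under a linearized observation y = Hw + v, v ~ N(0, R).
K = \Sigma H^{\top}\left(H \Sigma H^{\top} + R\right)^{-1}, \qquad
\mu^{+} = \mu + K\left(y - H\mu\right), \qquad
\Sigma^{+} = \Sigma - K H \Sigma
```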
    Riemannian Metric Learning via Optimal Transport. (arXiv:2205.09244v3 [cs.LG] UPDATED)
    We introduce an optimal transport-based model for learning a metric tensor from cross-sectional samples of evolving probability measures on a common Riemannian manifold. We neurally parametrize the metric as a spatially-varying matrix field and efficiently optimize our model's objective using a simple alternating scheme. Using this learned metric, we can nonlinearly interpolate between probability measures and compute geodesics on the manifold. We show that metrics learned using our method improve the quality of trajectory inference on scRNA and bird migration data at the cost of little additional cross-sectional data.
    A Concentration Bound for LSPE($\lambda$). (arXiv:2111.02644v5 [cs.LG] UPDATED)
    The popular LSPE($\lambda$) algorithm for policy evaluation is revisited to derive a concentration bound that gives high-probability performance guarantees from some time onward.
    GENNAPE: Towards Generalized Neural Architecture Performance Estimators. (arXiv:2211.17226v1 [cs.LG])
    Predicting neural architecture performance is a challenging task and is crucial to neural architecture design and search. Existing approaches either rely on neural performance predictors which are limited to modeling architectures in a predefined design space involving specific sets of operators and connection rules, and cannot generalize to unseen architectures, or resort to zero-cost proxies which are not always accurate. In this paper, we propose GENNAPE, a Generalized Neural Architecture Performance Estimator, which is pretrained on open neural architecture benchmarks, and aims to generalize to completely unseen architectures through combined innovations in network representation, contrastive pretraining, and fuzzy clustering-based predictor ensemble. Specifically, GENNAPE represents a given neural network as a Computation Graph (CG) of atomic operations which can model an arbitrary architecture. It first learns a graph encoder via Contrastive Learning to encourage network separation by topological features, and then trains multiple predictor heads, which are soft-aggregated according to the fuzzy membership of a neural network. Experiments show that GENNAPE pretrained on NAS-Bench-101 can achieve superior transferability to 5 different public neural network benchmarks, including NAS-Bench-201, NAS-Bench-301, MobileNet and ResNet families under no or minimum fine-tuning. We further introduce 3 challenging newly labelled neural network benchmarks: HiAML, Inception and Two-Path, which can concentrate in narrow accuracy ranges. Extensive experiments show that GENNAPE can correctly discern high-performance architectures in these families. Finally, when paired with a search algorithm, GENNAPE can find architectures that improve accuracy while reducing FLOPs on three families.
    AIO-P: Expanding Neural Performance Predictors Beyond Image Classification. (arXiv:2211.17228v1 [cs.CV])
    Evaluating neural network performance is critical to deep neural network design but a costly procedure. Neural predictors provide an efficient solution by treating architectures as samples and learning to estimate their performance on a given task. However, existing predictors are task-dependent, predominantly estimating neural network performance on image classification benchmarks. They are also search-space dependent; each predictor is designed to make predictions for a specific architecture search space with predefined topologies and set of operations. In this paper, we propose a novel All-in-One Predictor (AIO-P), which aims to pretrain neural predictors on architecture examples from multiple, separate computer vision (CV) task domains and multiple architecture spaces, and then transfer to unseen downstream CV tasks or neural architectures. We describe our proposed techniques for general graph representation, efficient predictor pretraining and knowledge infusion techniques, as well as methods to transfer to downstream tasks/spaces. Extensive experimental results show that AIO-P can achieve Mean Absolute Error (MAE) and Spearman's Rank Correlation (SRCC) below 1% and above 0.5, respectively, on a breadth of target downstream CV tasks with or without fine-tuning, outperforming a number of baselines. Moreover, AIO-P can directly transfer to new architectures not seen during training, accurately rank them and serve as an effective performance estimator when paired with an algorithm designed to preserve performance while reducing FLOPs.
    Topological Data Analysis for Speech Processing. (arXiv:2211.17223v1 [cs.SD])
    We apply topological data analysis (TDA) to speech classification problems and to the introspection of a pretrained speech model, HuBERT. To this end, we introduce a number of topological and algebraic features derived from Transformer attention maps and embeddings. We show that a simple linear classifier built on top of such features outperforms a fine-tuned classification head. In particular, we achieve an improvement of about $9\%$ accuracy and $5\%$ EER on four common datasets; on CREMA-D, the proposed feature set reaches a new state-of-the-art performance with accuracy $80.155$. We also show that topological features are able to reveal functional roles of speech Transformer heads; e.g., we find heads capable of distinguishing between pairs of sample sources (natural/synthetic) or voices without any downstream fine-tuning. Our results demonstrate that TDA is a promising new approach for speech analysis, especially for tasks that require structural prediction.
    Average Path Length: Sparsification of Nonlinearities Creates Surprisingly Shallow Networks. (arXiv:2211.17180v1 [cs.LG])
    We perform an empirical study of the behaviour of deep networks when pushing their activation functions to become fully linear in some of their feature channels through a sparsity prior on the overall number of nonlinear units in the network. To measure the depth of the resulting partially linearized network, we compute the average number of active nonlinearities encountered along a path in the network graph. In experiments on CNNs with sparsified PReLUs on typical image classification tasks, we make several observations: Under sparsity pressure, the remaining nonlinear units organize into distinct structures, forming core-networks of near constant effective depth and width, which in turn depend on task difficulty. We consistently observe a slow decay of performance with depth until the onset of a rapid collapse in accuracy, allowing for surprisingly shallow networks at moderate losses in accuracy that outperform baseline networks of similar depth, even after increasing width to a comparable number of parameters. In terms of training, we observe a nonlinear advantage: Reducing nonlinearity after training leads to a better performance than before, in line with previous findings in linearized training, but with a gap depending on task difficulty that vanishes for easy problems.
    An Interpretable Hybrid Predictive Model of COVID-19 Cases using Autoregressive Model and LSTM. (arXiv:2211.17014v1 [cs.LG])
    The Coronavirus Disease 2019 (COVID-19) has posed a severe threat to global human health and the economy. Building reliable data-driven prediction models for COVID-19 cases is an urgent task for improving public policy making. However, COVID-19 data shows special transmission characteristics such as significant fluctuations and non-stationarity, which may be difficult to capture with a single predictive model and which pose grand challenges to effective forecasting. In this paper, we propose a novel Hybrid data-driven model combining an autoregressive model (AR) and long short-term memory neural networks (LSTM). It can be viewed as a new neural network model in which the contributions of AR and LSTM are auto-tuned during the training procedure. We conduct extensive numerical experiments on data collected from 8 counties of California that display various trends. The numerical results show the Hybrid model's advantages over AR and LSTM in predictive power: the Hybrid model achieved 4.195\% MAPE, outperforming AR (5.629\%) and LSTM (5.070\%) on average, and we provide a discussion of interpretability.
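    A hedged sketch of the hybrid idea: a learnable gate blends an AR prediction with an LSTM prediction, so the mix is tuned during training. The gating mechanism, window length, and layer sizes are assumptions for illustration; the paper's auto-tuning scheme may differ.

```python
# Sketch: blend an AR(p) head and an LSTM head with a learned mixing weight.
import torch
import torch.nn as nn

class HybridARLSTM(nn.Module):
    def __init__(self, p=7, hidden=32):
        super().__init__()
        self.ar = nn.Linear(p, 1)                    # AR(p) as a linear map
        self.lstm = nn.LSTM(1, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)
        self.w = nn.Parameter(torch.tensor(0.5))     # learned mixing weight

    def forward(self, window):                       # window: (batch, p)
        ar_pred = self.ar(window)
        h, _ = self.lstm(window.unsqueeze(-1))
        lstm_pred = self.head(h[:, -1])
        g = torch.sigmoid(self.w)
        return g * ar_pred + (1 - g) * lstm_pred

print(HybridARLSTM()(torch.randn(8, 7)).shape)  # torch.Size([8, 1])
```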
    Multiresolution Textual Inversion. (arXiv:2211.17115v1 [cs.CV])
    We extend Textual Inversion to learn pseudo-words that represent a concept at different resolutions. This allows us to generate images that use the concept with different levels of detail and also to manipulate different resolutions using language. Once learned, the user can generate images at different levels of agreement to the original concept; "A photo of $S^*(0)$" produces the exact object while the prompt "A photo of $S^*(0.8)$" only matches the rough outlines and colors. Our framework allows us to generate images that use different resolutions of an image (e.g. details, textures, styles) as separate pseudo-words that can be composed in various ways. We open-source our code at the following URL: https://github.com/giannisdaras/multires_textual_inversion
    sEHR-CE: Language modelling of structured EHR data for efficient and generalizable patient cohort expansion. (arXiv:2211.17121v1 [cs.CL])
    Electronic health records (EHR) offer unprecedented opportunities for in-depth clinical phenotyping and prediction of clinical outcomes. Combining multiple data sources is crucial to generate a complete picture of disease prevalence, incidence and trajectories. The standard approach to combining clinical data involves collating clinical terms across different terminology systems using curated maps, which are often inaccurate and/or incomplete. Here, we propose sEHR-CE, a novel framework based on transformers to enable integrated phenotyping and analyses of heterogeneous clinical datasets without relying on these mappings. We unify clinical terminologies using textual descriptors of concepts, and represent individuals' EHR as sections of text. We then fine-tune pre-trained language models to predict disease phenotypes more accurately than non-text and single terminology approaches. We validate our approach using primary and secondary care data from the UK Biobank, a large-scale research study. Finally, we illustrate in a type 2 diabetes use case how sEHR-CE identifies individuals without diagnosis that share clinical characteristics with patients.
    Carbon Emission Prediction on the World Bank Dataset for Canada. (arXiv:2211.17010v1 [cs.LG])
    The continuous rise in CO2 emission into the environment is one of the most crucial issues facing the whole world. Many countries are making crucial decisions to control their carbon footprints to escape some of their catastrophic outcomes. There has been a lot of research going on to project the amount of carbon emissions in the future, which can help us to develop innovative techniques to deal with it in advance. Machine learning is one of the most advanced and efficient techniques for predicting the amount of carbon emissions from current data. This paper provides methods for predicting carbon emissions (CO2 emissions) for the next few years. The predictions are based on data from the past 50 years. The dataset used for making the prediction is collected from World Bank datasets and contains CO2 emissions (metric tons per capita) of all countries from 1960 to 2018. Our method applies machine learning techniques to this dataset to project what carbon emissions will look like over the next ten years. The purpose of this research is to compare how different machine learning models (Decision Tree, Linear Regression, Random Forest, and Support Vector Machine) perform on the same dataset and measure the difference between their predictions.
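    As a rough illustration of the model comparison the abstract describes (the four model names come from the abstract; the synthetic stand-in series and the year-as-feature encoding are assumptions for the sketch):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# Toy stand-in for the World Bank series (1960-2018 per-capita CO2).
years = np.arange(1960, 2019)
co2 = 10 + 0.05 * (years - 1960) + np.random.default_rng(0).normal(0, 0.3, years.size)

X = years.reshape(-1, 1).astype(float)   # predict emissions from the year
X_tr, X_te, y_tr, y_te = train_test_split(X, co2, test_size=0.2, shuffle=False)

models = {
    "Decision Tree": DecisionTreeRegressor(max_depth=4),
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=200, random_state=0),
    "SVM": SVR(kernel="rbf", C=10.0),
}
for name, m in models.items():
    m.fit(X_tr, y_tr)
    print(f"{name:18s} MAE = {mean_absolute_error(y_te, m.predict(X_te)):.3f}")
```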
    Neural Integro-Differential Equations. (arXiv:2206.14282v4 [cs.LG] UPDATED)
    Modeling continuous dynamical systems from discretely sampled observations is a fundamental problem in data science. Often, such dynamics are the result of non-local processes that present an integral over time. As such, these systems are modeled with Integro-Differential Equations (IDEs); generalizations of differential equations that comprise both an integral and a differential component. For example, brain dynamics are not accurately modeled by differential equations since their behavior is non-Markovian, i.e. dynamics are in part dictated by history. Here, we introduce the Neural IDE (NIDE), a novel deep learning framework based on the theory of IDEs where integral operators are learned using neural networks. We test NIDE on several toy and brain activity datasets and demonstrate that NIDE outperforms other models. These tasks include time extrapolation as well as predicting dynamics from unseen initial conditions, which we test on whole-cortex activity recordings in freely behaving mice. Further, we show that NIDE can decompose dynamics into their Markovian and non-Markovian constituents via the learned integral operator, which we test on fMRI brain activity recordings of people on ketamine. Finally, the integrand of the integral operator provides a latent space that gives insight into the underlying dynamics, which we demonstrate on wide-field brain imaging recordings. Altogether, NIDE is a novel approach that enables modeling of complex non-local dynamics with neural networks.
    Fair Ranking with Noisy Protected Attributes. (arXiv:2211.17067v1 [cs.LG])
    The fair-ranking problem, which asks to rank a given set of items to maximize utility subject to group fairness constraints, has received attention in the fairness, information retrieval, and machine learning literature. Recent works, however, observe that errors in socially-salient (including protected) attributes of items can significantly undermine fairness guarantees of existing fair-ranking algorithms and raise the problem of mitigating the effect of such errors. We study the fair-ranking problem under a model where socially-salient attributes of items are randomly and independently perturbed. We present a fair-ranking framework that incorporates group fairness requirements along with probabilistic information about perturbations in socially-salient attributes. We provide provable guarantees on the fairness and utility attainable by our framework and show that it is information-theoretically impossible to significantly beat these guarantees. Our framework works for multiple non-disjoint attributes and a general class of fairness constraints that includes proportional and equal representation. Empirically, we observe that our algorithm outputs rankings with higher fairness than the baselines and achieves a similar or better fairness-utility trade-off.
    Explaining automated gender classification of human gait. (arXiv:2211.17015v1 [cs.LG])
    State-of-the-art machine learning (ML) models are highly effective in classifying gait analysis data; however, they fall short in providing explanations for their predictions. This "black-box" characteristic makes it impossible to understand which input patterns ML models base their predictions on. The present study investigates whether Explainable Artificial Intelligence methods, i.e., Layer-wise Relevance Propagation (LRP), can be useful to enhance the explainability of ML predictions in gait classification. The research question was: Which input patterns are most relevant for an automated gender classification model, and do they correspond to characteristics identified in the literature? We utilized a subset of the GAITREC dataset containing five bilateral ground reaction force (GRF) recordings per person during barefoot walking of 62 healthy participants: 34 females and 28 males. Each input signal (right and left side) was min-max normalized before concatenation and fed into a multi-layer Convolutional Neural Network (CNN). The classification accuracy was obtained over a stratified ten-fold cross-validation. To identify gender-specific patterns, the input relevance scores were derived using LRP. The mean classification accuracy of the CNN of 83.3% showed a clear superiority over the zero-rule baseline of 54.8%.
    T2G-Former: Organizing Tabular Features into Relation Graphs Promotes Heterogeneous Feature Interaction. (arXiv:2211.16887v1 [cs.LG])
    Recent development of deep neural networks (DNNs) for tabular learning has largely benefited from the capability of DNNs for automatic feature interaction. However, the heterogeneous nature of tabular features makes such features relatively independent, and developing effective methods to promote tabular feature interaction still remains an open problem. In this paper, we propose a novel Graph Estimator, which automatically estimates the relations among tabular features and builds graphs by assigning edges between related features. Such relation graphs organize independent tabular features into a kind of graph data such that interaction of nodes (tabular features) can be conducted in an orderly fashion. Based on our proposed Graph Estimator, we present a bespoke Transformer network tailored for tabular learning, called T2G-Former, which processes tabular data by performing tabular feature interaction guided by the relation graphs. A specific Cross-level Readout collects salient features predicted by the layers in T2G-Former across different levels, and attains global semantics for final prediction. Comprehensive experiments show that our T2G-Former achieves superior performance among DNNs and is competitive with non-deep Gradient Boosted Decision Tree models.
    Context-Aware Ensemble Learning for Time Series. (arXiv:2211.16884v1 [cs.LG])
    We investigate ensemble methods for prediction in an online setting. Unlike the existing ensembling literature, we introduce, for the first time, a meta learner that combines the base model predictions using a superset of the features, namely the union of the base models' feature vectors, rather than the predictions themselves. Hence, our model does not feed the base models' predictions into a machine learning algorithm as inputs, but chooses the best possible combination at each time step based on the state of the problem. We explore three constraint spaces for linearly combining the base predictions: convex combinations, where the components of the ensembling vector are nonnegative and sum to 1; affine combinations, where the weight vector components are only required to sum to 1; and unconstrained combinations, where the components are free to take any real value. The constraints are both theoretically analyzed under known statistics and integrated into the learning procedure of the meta learner as a part of the optimization in an automated manner. To show the practical efficiency of the proposed method, we employ a gradient-boosted decision tree and a multi-layer perceptron separately as the meta learners. Our framework is generic, so one can use other machine learning architectures as the ensembler as long as they allow for a custom differentiable loss for minimization. We demonstrate the learning behavior of our algorithm on synthetic data and show significant performance improvements over conventional methods on various real-life datasets extensively used in well-known data competitions. Furthermore, we openly share the source code of the proposed method to facilitate further research and comparison.
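    A minimal sketch of the three constraint spaces, assuming the meta learner has already produced a raw weight vector (the sort-based simplex projection is a standard algorithm, not necessarily the paper's optimizer):

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection onto {w : w >= 0, sum w = 1} (convex combinations),
    using the standard sort-based algorithm."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1 - css) / (np.arange(len(v)) + 1) > 0)[0][-1]
    theta = (1 - css[rho]) / (rho + 1.0)
    return np.maximum(v + theta, 0.0)

def project_affine(v):
    """Projection onto {w : sum w = 1} (affine combinations; weights may be negative)."""
    return v + (1.0 - v.sum()) / len(v)

def combine(base_preds, raw_weights, space="convex"):
    """Linearly combine base-model predictions under one of the three spaces."""
    w = {"convex": project_simplex,
         "affine": project_affine,
         "unconstrained": lambda x: x}[space](raw_weights)
    return base_preds @ w

preds = np.array([1.0, 1.4, 0.7])   # three base models' predictions
raw = np.array([0.9, 0.5, -0.2])    # weights emitted by the meta learner
for s in ("convex", "affine", "unconstrained"):
    print(s, combine(preds, raw, s))
```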
    Learning non-stationary and discontinuous functions using clustering, classification and Gaussian process modelling. (arXiv:2211.16909v1 [stat.ML])
    Surrogate models have been shown to be an extremely efficient aid in solving engineering problems that require repeated evaluations of an expensive computational model. They are built by sparsely evaluating the costly original model and have provided a way to solve otherwise intractable problems. A crucial aspect of surrogate modelling is the assumption of smoothness and regularity of the model to approximate. This assumption is, however, not always met in reality. For instance, in civil or mechanical engineering, some models may present discontinuities or non-smoothness, e.g., in case of instability patterns such as buckling or snap-through. Building a single surrogate model capable of accounting for these fundamentally different behaviors or discontinuities is not an easy task. In this paper, we propose a three-stage approach for the approximation of non-smooth functions which combines clustering, classification and regression. The idea is to split the space following the localized behaviors or regimes of the system and build local surrogates that are eventually assembled. A sequence of well-known machine learning techniques is used: Dirichlet process mixture models (DPMM), support vector machines and Gaussian process modelling. The approach is tested and validated on two analytical functions and a finite element model of a tensile membrane structure.
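    A compact sketch of the three-stage pipeline using scikit-learn counterparts of the named techniques (BayesianGaussianMixture with a Dirichlet process prior, an SVM classifier, and per-cluster Gaussian processes); the toy discontinuous function is illustrative:

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture   # DPMM-style clustering
from sklearn.svm import SVC                           # classify regime membership
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (200, 1))
y = np.where(X[:, 0] < 0, np.sin(4 * X[:, 0]), 2 + np.cos(4 * X[:, 0]))  # jump at 0

# Stage 1: cluster (x, y) jointly to separate the regimes.
dpmm = BayesianGaussianMixture(
    n_components=5, weight_concentration_prior_type="dirichlet_process", random_state=0)
labels = dpmm.fit_predict(np.column_stack([X, y]))

# Stage 2: learn the regime boundary in input space only.
clf = SVC(kernel="rbf").fit(X, labels)

# Stage 3: one local GP surrogate per discovered regime.
gps = {c: GaussianProcessRegressor().fit(X[labels == c], y[labels == c])
       for c in np.unique(labels)}

X_new = np.linspace(-1, 1, 5).reshape(-1, 1)
y_hat = np.array([gps[c].predict(x[None, :])[0]
                  for x, c in zip(X_new, clf.predict(X_new))])
print(y_hat)
```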
    Statistical treatment of convolutional neural network super-resolution of inland surface wind for subgrid-scale variability quantification. (arXiv:2211.16708v1 [physics.ao-ph])
    Machine learning models are frequently employed to perform either purely physics-free or hybrid downscaling of climate data. However, the majority of these implementations operate over relatively small downscaling factors of about 4--6x. This study examines the ability of convolutional neural networks (CNN) to downscale surface wind speed data from three different coarse resolutions (25km, 48km, and 100km side-length grid cells) to 3km and additionally focuses on the ability to recover subgrid-scale variability. Within each downscaling factor, namely 8x, 16x, and 32x, we consider models that produce fine-scale wind speed predictions as functions of different input features: coarse wind fields only; coarse wind and fine-scale topography; and coarse wind, topography, and temporal information in the form of a timestamp. Furthermore, we train one model at 25km to 3km resolution whose fine-scale outputs are probability density function parameters through which sample wind speeds can be generated. All CNN predictions, performed on out-of-sample data, outperform classical interpolation. Models with coarse wind and fine topography are shown to exhibit the best performance compared to other models operating across the same downscaling factor. Our timestamp encoding results in lower out-of-sample generalizability compared to other input configurations. Overall, the downscaling factor plays the largest role in model performance.
    MLC at HECKTOR 2022: The Effect and Importance of Training Data when Analyzing Cases of Head and Neck Tumors using Machine Learning. (arXiv:2211.16834v1 [eess.IV])
    Head and neck cancers are the fifth most common cancer worldwide, and recently, analysis of Positron Emission Tomography (PET) and Computed Tomography (CT) images has been proposed to estimate patient prognosis. Even though the results look promising, more research is needed to further validate and improve them. This paper presents the work done by team MLC for the 2022 version of the HECKTOR grand challenge held at MICCAI 2022. For Task 1, the automatic segmentation task, our approach was, in contrast to earlier solutions using 3D segmentation, to keep it as simple as possible using a 2D model, analyzing every slice as a standalone image. In addition, we were interested in understanding how different modalities influence the results. We proposed two approaches: one using only the CT scans to make predictions and another using a combination of the CT and PET scans. For Task 2, the prediction of recurrence-free survival, we first proposed two approaches: one using only patient data and one combining the patient data with segmentations from the image model. For the predictions of these first two approaches, we used Random Forest. In our third approach, we combined patient data and image data using XGBoost. Because low kidney function might worsen cancer prognosis, in this approach we also estimated the kidney function of the patients and included it as a feature. Overall, we conclude that our simple methods were not able to compete with the highest-ranking submissions, but we still obtained reasonably good scores. We also got interesting insights into how the combination of different modalities can influence the segmentation and predictions.
    General policy mapping: online continual reinforcement learning inspired by the insect brain. (arXiv:2211.16759v1 [cs.LG])
    We have developed a model for online continual or lifelong reinforcement learning (RL) inspired by the insect brain. Our model leverages the offline training of a feature extraction layer and a common general policy layer to enable the convergence of RL algorithms in online settings. Sharing a common policy layer across tasks leads to positive backward transfer, where the agent continuously improves on older tasks that share the same underlying general policy. Biologically inspired restrictions to the agent's network are key for the convergence of RL algorithms. This provides a pathway towards efficient online RL in resource-constrained scenarios.
    Adaptive adversarial training method for improving multi-scale GAN based on generalization bound theory. (arXiv:2211.16791v1 [cs.CV])
    In recent years, multi-scale generative adversarial networks (GANs) have been proposed to build generalized image processing models from a single sample. Constrained by the sample size, multi-scale GANs have much difficulty converging to the global optimum, which ultimately limits their capabilities. In this paper, we pioneer the introduction of PAC-Bayes generalization bound theory into the training analysis of specific models under different adversarial training methods, obtaining a non-vacuous upper bound on the generalization error for the specified multi-scale GAN structure. Based on the drastic changes we found in the generalization error bound under different adversarial attacks and different training states, we propose an adaptive training method which can greatly improve the image manipulation ability of multi-scale GANs. The final experimental results show that our adaptive training method greatly improves the quality of the images generated by multi-scale GANs on several image manipulation tasks. In particular, for the image super-resolution restoration task, the multi-scale GAN model trained by the proposed method achieves a 100% reduction in natural image quality evaluator (NIQE) and a 60% reduction in root mean squared error (RMSE), which is better than many models trained on large-scale datasets.
    Offline Policy Evaluation and Optimization under Confounding. (arXiv:2211.16583v1 [stat.ML])
    With a few exceptions, work in offline reinforcement learning (RL) has so far assumed that there is no confounding. In a classical regression setting, confounders introduce omitted variable bias and inhibit the identification of causal effects. In offline RL, they prevent the identification of a policy's value, and therefore make it impossible to perform policy improvement. Using conventional methods in offline RL in the presence of confounding can therefore not only lead to poor decisions and poor policies, but can also have disastrous effects in applications such as healthcare and education. We provide approaches for both off-policy evaluation (OPE) and local policy optimization in the settings of i.i.d. and global confounders. Theoretical and empirical results confirm the validity and viability of these methods.
    Multimodal Learning for Multi-Omics: A Survey. (arXiv:2211.16509v1 [q-bio.GN])
    With advanced imaging, sequencing, and profiling technologies, multiple omics data become increasingly available and hold promises for many healthcare applications such as cancer diagnosis and treatment. Multimodal learning for integrative multi-omics analysis can help researchers and practitioners gain deep insights into human diseases and improve clinical decisions. However, several challenges are hindering the development in this area, including the availability of easily accessible open-source tools. This survey aims to provide an up-to-date overview of the data challenges, fusion approaches, datasets, and software tools from several new perspectives. We identify and investigate various omics data challenges that can help us understand the field better. We categorize fusion approaches comprehensively to cover existing methods in this area. We collect existing open-source tools to facilitate their broader utilization and development. We explore a broad range of omics data modalities and a list of accessible datasets. Finally, we summarize future directions that can potentially address existing gaps and answer the pressing need to advance multimodal learning for multi-omics data analysis.  ( 2 min )
    CRU: A Novel Neural Architecture for Improving the Predictive Performance of Time-Series Data. (arXiv:2211.16653v1 [cs.LG])
    The time-series forecasting (TSF) problem is a traditional problem in the field of artificial intelligence. Models such as Recurrent Neural Networks (RNN), Long Short-Term Memory (LSTM), and Gated Recurrent Units (GRU) have contributed to improving the predictive accuracy of TSF. Furthermore, model structures have been proposed that combine time-series decomposition methods, such as seasonal-trend decomposition using Loess (STL), to ensure improved predictive accuracy. However, because this approach is learned in an independent model for each component, it cannot learn the relationships between time-series components. In this study, we propose a new neural architecture called a correlation recurrent unit (CRU) that can perform time-series decomposition within a neural cell and learn correlations (autocorrelation and correlation) between each decomposition component. The proposed neural architecture was evaluated through comparative experiments with previous studies using five univariate time-series datasets and four multivariate time-series datasets. The results showed that long- and short-term predictive performance was improved by more than 10%. The experimental results show that the proposed CRU is an excellent method for TSF problems compared to other neural architectures.  ( 2 min )
    Every Node Counts: Improving the Training of Graph Neural Networks on Node Classification. (arXiv:2211.16631v1 [cs.LG])
    Graph Neural Networks (GNNs) are prominent in handling sparse and unstructured data efficiently and effectively. Specifically, GNNs were shown to be highly effective for node classification tasks, where labelled information is available for only a fraction of the nodes. Typically, the optimization process, through the objective function, considers only labelled nodes while ignoring the rest. In this paper, we propose novel objective terms for the training of GNNs for node classification, aiming to exploit all the available data and improve accuracy. Our first term seeks to maximize the mutual information between node and label features, considering both labelled and unlabelled nodes in the optimization process. Our second term promotes anisotropic smoothness in the prediction maps. Lastly, we propose a cross-validating gradients approach to enhance the learning from labelled data. Our proposed objectives are general and can be applied to various GNNs and require no architectural modifications. Extensive experiments demonstrate our approach using popular GNNs like GCN, GAT and GCNII, reaching a consistent and significant accuracy improvement on 10 real-world node classification datasets.  ( 2 min )
    FedGPO: Heterogeneity-Aware Global Parameter Optimization for Efficient Federated Learning. (arXiv:2211.16669v1 [cs.LG])
    Federated learning (FL) has emerged as a solution to deal with the risk of privacy leaks in machine learning training. This approach allows a variety of mobile devices to collaboratively train a machine learning model without sharing the raw on-device training data with the cloud. However, efficient edge deployment of FL is challenging because of the system/data heterogeneity and runtime variance. This paper optimizes the energy-efficiency of FL use cases while guaranteeing model convergence, by accounting for the aforementioned challenges. We propose FedGPO, based on reinforcement learning, which learns how to identify optimal global parameters (B, E, K) for each FL aggregation round, adapting to the system/data heterogeneity and stochastic runtime variance. In our experiments, FedGPO improves the model convergence time by 2.4 times and achieves 3.6 times higher energy efficiency than the baseline settings.  ( 2 min )
    A Novel Statistical Independence Test for Dynamic Causal Discovery with Rare Events. (arXiv:2211.16596v1 [stat.ML])
    Causal phenomena associated with rare events frequently occur across a wide range of engineering and mathematical problems, such as risk-sensitive safety analysis, accident analysis and prevention, and extreme value theory. However, current methods for causal discovery are often unable to uncover causal links between random variables that manifest only when the variables first experience low-probability realizations. To address this issue, we introduce a novel algorithm that performs statistical independence tests on data collected from time-invariant dynamical systems in which rare but consequential events occur. We seek to understand if the state of the dynamical system causally affects the likelihood of the rare event. In particular, we exploit the time-invariance of the underlying data to superimpose the occurrences of rare events, thus creating a new dataset, in which rare events are better represented, on which conditional independence tests can be more efficiently performed. We provide non-asymptotic bounds for the consistency of our algorithm, and validate the performance of our algorithm across various simulated scenarios, with applications to traffic accidents.  ( 2 min )
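    A hedged sketch of the superposition idea as we read it: align a window of the state trajectory before each rare-event occurrence and pool the windows into one dataset (the window length and the simple test used here are illustrative assumptions, not the paper's procedure):

```python
import numpy as np
from scipy import stats

def superimpose_rare_events(states, events, window=10):
    """Pool state windows preceding each rare-event occurrence.
    Time-invariance of the system justifies aligning occurrences at t=0."""
    idx = np.flatnonzero(events)
    idx = idx[idx >= window]                      # keep events with a full history
    return np.stack([states[i - window:i] for i in idx])  # (n_events, window)

rng = np.random.default_rng(1)
states = rng.normal(size=5000)
# Toy dynamics: the rare event is more likely after a large positive state.
events = rng.random(5000) < 0.01 * (1 + (states > 1.5))

windows = superimpose_rare_events(states, events)
# Crude check: does the pre-event state differ from the marginal distribution?
print(stats.ttest_1samp(windows[:, -1], popmean=states.mean()))
```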
    A Node-collaboration-informed Graph Convolutional Network for Precise Representation to Undirected Weighted Graphs. (arXiv:2211.16689v1 [cs.LG])
    An undirected weighted graph (UWG) is frequently adopted to describe the interactions among a solo set of nodes from real applications, such as the user contact frequency in a social network services system. A graph convolutional network (GCN) is widely adopted to perform representation learning on a UWG for subsequent pattern analysis tasks such as clustering or missing data estimation. However, existing GCNs mostly neglect the latent collaborative information hidden in connected node pairs. To address this issue, this study proposes to model the node collaborations via a symmetric latent factor analysis model, and then regards it as a node-collaboration module for supplementing the collaboration loss in a GCN. Based on this idea, a Node-collaboration-informed Graph Convolutional Network (NGCN) is proposed with three-fold ideas: a) learning latent collaborative information from the interaction of node pairs via a node-collaboration module; b) building the residual connection and weighted representation propagation to obtain high representation capacity; and c) implementing the model optimization in an end-to-end fashion to achieve precise representation of the target UWG. Empirical studies on UWGs emerging from real applications demonstrate that, owing to its efficient incorporation of node collaborations, the proposed NGCN significantly outperforms state-of-the-art GCNs in addressing the task of missing weight estimation. Meanwhile, its good scalability ensures its compatibility with more advanced GCN extensions, which will be further investigated in our future studies.  ( 2 min )
    Hierarchically Clustered PCA and CCA via a Convex Clustering Penalty. (arXiv:2211.16553v1 [cs.LG])
    We introduce an unsupervised learning approach that combines the truncated singular value decomposition with convex clustering to estimate within-cluster directions of maximum variance/covariance (in the variables) while simultaneously hierarchically clustering (on observations). In contrast to previous work on joint clustering and embedding, our approach has a straightforward formulation, is readily scalable via distributed optimization, and admits a direct interpretation as hierarchically clustered principal component analysis (PCA) or hierarchically clustered canonical correlation analysis (CCA). Through numerical experiments and real-world examples relevant to precision medicine, we show that our approach outperforms traditional and contemporary clustering methods on underdetermined problems ($p \gg N$ with tens of observations) and scales to large datasets (e.g., $N=100,000$; $p=1,000$) while yielding interpretable dendrograms of hierarchical per-cluster principal components or canonical variates.  ( 2 min )
    Reinforced Genetic Algorithm for Structure-based Drug Design. (arXiv:2211.16508v1 [q-bio.QM])
    Structure-based drug design (SBDD) aims to discover drug candidates by finding molecules (ligands) that bind tightly to a disease-related protein (the target), which is the primary approach to computer-aided drug discovery. Recently, applying deep generative models for three-dimensional (3D) molecular design conditioned on protein pockets to solve SBDD has attracted much attention, but their formulation as probabilistic modeling often leads to unsatisfactory optimization performance. On the other hand, traditional combinatorial optimization methods such as genetic algorithms (GA) have demonstrated state-of-the-art performance in various molecular optimization tasks. However, they do not utilize protein target structure to inform design steps but rely on a random-walk-like exploration, which leads to unstable performance and no knowledge transfer between different tasks despite the similar binding physics. To achieve a more stable and efficient SBDD, we propose Reinforced Genetic Algorithm (RGA) that uses neural models to prioritize the profitable design steps and suppress random-walk behavior. The neural models take the 3D structure of the targets and ligands as inputs and are pre-trained using native complex structures to utilize the knowledge of the shared binding physics from different targets and then fine-tuned during optimization. We conduct thorough empirical studies on optimizing binding affinity to various disease targets and show that RGA outperforms the baselines in terms of docking scores and is more robust to random initializations. The ablation study also indicates that the training on different targets helps improve performance by leveraging the shared underlying physics of the binding processes. The code is available at https://github.com/futianfan/reinforced-genetic-algorithm.  ( 2 min )
    SPARTAN: Sparse Hierarchical Memory for Parameter-Efficient Transformers. (arXiv:2211.16634v1 [cs.CL])
    Fine-tuning pre-trained language models (PLMs) achieves impressive performance on a range of downstream tasks, and their sizes have consequently been getting bigger. Since a different copy of the model is required for each task, this paradigm is infeasible for storage-constrained edge devices like mobile phones. In this paper, we propose SPARTAN, a parameter efficient (PE) and computationally fast architecture for edge devices that adds hierarchically organized sparse memory after each Transformer layer. SPARTAN freezes the PLM parameters and fine-tunes only its memory, thus significantly reducing storage costs by re-using the PLM backbone for different tasks. SPARTAN contains two levels of memory, with only a sparse subset of parents being chosen in the first level for each input, and children cells corresponding to those parents being used to compute an output representation. This sparsity combined with other architecture optimizations improves SPARTAN's throughput by over 90% during inference on a Raspberry Pi 4 when compared to PE baselines (adapters) while also outperforming the latter by 0.1 points on the GLUE benchmark. Further, it can be trained 34% faster in a few-shot setting, while performing within 0.9 points of adapters. Qualitative analysis shows that different parent cells in SPARTAN specialize in different topics, thus dividing responsibility efficiently.  ( 2 min )
    Relative Sparsity for Medical Decision Problems. (arXiv:2211.16566v1 [stat.ME])
    Existing statistical methods can be used to estimate a policy, or a mapping from covariates to decisions, which can then instruct decision makers. There is great interest in using such data-driven policies in healthcare. In healthcare, however, it is often important to explain to the healthcare provider, and to the patient, how a new policy differs from the current standard of care. This end is facilitated if one can pinpoint the aspects (i.e., parameters) of the policy that change most when moving from the standard of care to the new, suggested policy. To this end, we adapt ideas from Trust Region Policy Optimization. In our work, however, unlike in Trust Region Policy Optimization, the difference between the suggested policy and standard of care is required to be sparse, aiding with interpretability. In particular, we trade off between maximizing expected reward and minimizing the $L_1$ norm divergence between the parameters of the two policies. This yields "relative sparsity," where, as a function of a tuning parameter, $\lambda$, we can approximately control the number of parameters in our suggested policy that differ from their counterparts in the standard of care. We develop our methodology for the observational data setting. We propose a problem-specific criterion for selecting $\lambda$, perform simulations, and illustrate our method with a real, observational healthcare dataset, deriving a policy that is easy to explain in the context of the current standard of care. Our work promotes the adoption of data-driven decision aids, which have great potential to improve health outcomes.  ( 2 min )
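    In our notation (not necessarily the paper's), with $\theta$ the suggested policy's parameters, $\theta_{sc}$ those of the standard of care, and $V(\theta)$ the expected reward, the trade-off described above can be written as $\hat{\theta}_\lambda = \arg\max_\theta \; V(\theta) - \lambda \lVert \theta - \theta_{sc} \rVert_1$, where increasing the tuning parameter $\lambda$ forces more coordinates of $\hat{\theta}_\lambda$ to coincide with their standard-of-care counterparts; this is precisely the "relative sparsity" knob described in the abstract.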
    SinDDM: A Single Image Denoising Diffusion Model. (arXiv:2211.16582v1 [cs.CV])
    Denoising diffusion models (DDMs) have led to staggering performance leaps in image generation, editing and restoration. However, existing DDMs use very large datasets for training. Here, we introduce a framework for training a DDM on a single image. Our method, which we coin SinDDM, learns the internal statistics of the training image by using a multi-scale diffusion process. To drive the reverse diffusion process, we use a fully-convolutional light-weight denoiser, which is conditioned on both the noise level and the scale. This architecture allows generating samples of arbitrary dimensions, in a coarse-to-fine manner. As we illustrate, SinDDM generates diverse high-quality samples, and is applicable in a wide array of tasks, including style transfer and harmonization. Furthermore, it can be easily guided by external supervision. Particularly, we demonstrate text-guided generation from a single image using a pre-trained CLIP model.  ( 2 min )
    Automatic Discovery of Multi-perspective Process Model using Reinforcement Learning. (arXiv:2211.16687v1 [cs.LG])
    Process mining is a methodology for the derivation and analysis of process models based on event logs. When process mining is employed to analyze business processes, the process discovery, conformance checking, and enhancement steps are repeated. If a user wants to analyze a process from multiple perspectives (such as activity perspectives, originator perspectives, and time perspectives), the above procedure, inconveniently, has to be repeated over and over again. Although past studies involving process mining have applied detailed stepwise methodologies, no attempt has been made to incorporate and optimize multi-perspective process mining procedures. This paper contributes to developing a solution approach to this problem. First, we propose an automatic discovery framework for a multi-perspective process model based on deep Q-learning. Our Dual Experience Replay with Experience Distribution (DERED) approach can automatically perform the process model discovery, conformance checking, and enhancement steps. Second, we propose a new method that further optimizes experience replay (ER), one of the key algorithms of deep Q-learning, to improve the learning performance of reinforcement learning agents. Finally, we validate our approach using six real-world event datasets collected in port logistics, steel manufacturing, finance, IT, and government administration. We show that our DERED approach can provide users with multi-perspective, high-quality process models that can be employed more conveniently for multi-perspective process mining.  ( 2 min )
    Stochastic Parameterization of Column Physics using Generative Adversarial Networks. (arXiv:2211.16654v1 [physics.ao-ph])
    We demonstrate the use of a probabilistic machine learning technique to develop stochastic parameterizations of atmospheric column physics. After suitable preprocessing of NASA's Modern-Era Retrospective analysis for Research and Applications, version 2 (MERRA2) data to minimize the effects of the high-frequency, high-wavenumber component of the MERRA2 estimate of vertical velocity, we use generative adversarial networks to learn the probability distribution of vertical profiles of diabatic sources conditioned on vertical profiles of temperature and humidity. This may be viewed as an improvement over previous, similar but deterministic, approaches that seek to alleviate both the shortcomings of human-designed physics parameterizations and the computational demand of the "physics" step in climate models.  ( 2 min )
    COMET: A Comprehensive Cluster Design Methodology for Distributed Deep Learning Training. (arXiv:2211.16648v1 [cs.DC])
    Modern Deep Learning (DL) models have grown to sizes requiring massive clusters of specialized, high-end nodes to train. Designing such clusters to maximize both performance and utilization to amortize their steep cost is a challenging task requiring careful balance of compute, memory, and network resources. Moreover, a plethora of each model's tuning knobs drastically affect the performance, with optimal values often depending on the underlying cluster's characteristics, which necessitates a complex cluster-workload co-design process. To facilitate the design space exploration of such massive DL training clusters, we introduce COMET, a holistic cluster design methodology and workflow to jointly study the impact of parallelization strategies and key cluster resource provisioning on the performance of distributed DL training. We develop a step-by-step process to establish a reusable and flexible methodology, and demonstrate its application with a case study of training a Transformer-1T model on a cluster of variable compute, memory, and network resources. Our case study demonstrates COMET's utility in identifying promising architectural optimization directions and guiding system designers in configuring key model and cluster parameters.  ( 2 min )
    ButterflyNet2D: Bridging Classical Methods and Neural Network Methods in Image Processing. (arXiv:2211.16578v1 [cs.CV])
    Both classical Fourier transform-based methods and neural network methods are widely used in image processing tasks. The former has better interpretability, whereas the latter often achieves better performance in practice. This paper introduces ButterflyNet2D, a regular CNN with sparse cross-channel connections. A Fourier initialization strategy for ButterflyNet2D is proposed to approximate Fourier transforms. Numerical experiments validate the accuracy of ButterflyNet2D approximating both the Fourier and the inverse Fourier transforms. Moreover, through four image processing tasks and image datasets, we show that training ButterflyNet2D from Fourier initialization does achieve better performance than random initialized neural networks.  ( 2 min )
    Numerical evidence against advantage with quantum fidelity kernels on classical data. (arXiv:2211.16551v1 [quant-ph])
    Quantum machine learning techniques are commonly considered one of the most promising candidates for demonstrating practical quantum advantage. In particular, quantum kernel methods have been demonstrated to be able to learn certain classically intractable functions efficiently if the kernel is well-aligned with the target function. In the more general case, quantum kernels are known to suffer from exponential "flattening" of the spectrum as the number of qubits grows, preventing generalization and necessitating the control of the inductive bias by hyperparameters. We show that the general-purpose hyperparameter tuning techniques proposed to improve the generalization of quantum kernels lead to the kernel becoming well-approximated by a classical kernel, removing the possibility of quantum advantage. We provide extensive numerical evidence for this phenomenon utilizing multiple previously studied quantum feature maps and both synthetic and real data. Our results show that unless novel techniques are developed to control the inductive bias of quantum kernels, they are unlikely to provide a quantum advantage on classical data.  ( 2 min )
    Dynamic Sparse Training via Balancing the Exploration-Exploitation Trade-off. (arXiv:2211.16667v1 [cs.LG])
    Over-parameterization of deep neural networks (DNNs) has shown high prediction accuracy for many applications. Although effective, the large number of parameters hinders its popularity on resource-limited devices and has an outsize environmental impact. Sparse training (using a fixed number of nonzero weights in each iteration) could significantly mitigate the training costs by reducing the model size. However, existing sparse training methods mainly use either random-based or greedy-based drop-and-grow strategies, resulting in local minima and low accuracy. In this work, to assist explainable sparse training, we propose important-weight Exploitation and coverage-driven Exploration to characterize Dynamic Sparse Training (DST-EE), and provide quantitative analysis of these two metrics. We further design an acquisition function, provide theoretical guarantees for the proposed method, and clarify its convergence property. Experimental results show that sparse models (up to 98\% sparsity) obtained by our proposed method outperform the SOTA sparse training methods on a wide variety of deep learning tasks. On VGG-19 / CIFAR-100, ResNet-50 / CIFAR-10, and ResNet-50 / CIFAR-100, our method attains even higher accuracy than dense models. On ResNet-50 / ImageNet, the proposed method yields up to 8.2\% accuracy improvement over SOTA sparse training methods.  ( 2 min )
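    For context, a generic drop-and-grow update of the kind dynamic sparse training builds on is sketched below (magnitude-based drop, gradient-based grow); DST-EE would replace the growth criterion with its exploitation-plus-exploration acquisition function, which the abstract does not spell out:

```python
import numpy as np

def drop_and_grow(weights, grads, mask, update_frac=0.1):
    """One generic dynamic-sparse-training update: drop the smallest-magnitude
    active weights, then grow the same number of inactive connections with the
    largest gradient magnitude (DST-EE would score growth with its
    exploitation + exploration acquisition function instead)."""
    n_update = int(update_frac * mask.sum())

    # Drop: smallest |w| among currently active connections.
    active = np.flatnonzero(mask)
    drop = active[np.argsort(np.abs(weights[active]))[:n_update]]
    mask[drop] = False
    weights[drop] = 0.0

    # Grow: largest |grad| among currently inactive connections.
    inactive = np.flatnonzero(~mask)
    grow = inactive[np.argsort(-np.abs(grads[inactive]))[:n_update]]
    mask[grow] = True          # new connections start at zero and get trained
    return weights, mask

rng = np.random.default_rng(0)
w = rng.normal(size=1000)
mask = rng.random(1000) < 0.02           # 98% sparsity, as in the abstract
w[~mask] = 0.0
w, mask = drop_and_grow(w, rng.normal(size=1000), mask)
print(mask.sum())                        # the sparsity level is preserved
```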
    Brain Tumor MRI Classification using a Novel Deep Residual and Regional CNN. (arXiv:2211.16571v1 [eess.IV])
    Brain tumor classification is crucial for clinical analysis and an effective treatment plan to cure patients. Deep learning models help radiologists to accurately and efficiently analyze tumors without manual intervention. However, brain tumor analysis is challenging because of its complex structure, texture, size, location, and appearance. Therefore, a novel deep residual and regional-based Res-BRNet Convolutional Neural Network (CNN) is developed for effective brain tumor Magnetic Resonance Imaging (MRI) classification. The developed Res-BRNet employs regional and boundary-based operations in a systematic order within the modified spatial and residual blocks. The spatial blocks extract homogeneity and boundary-defined features at the abstract level, while the residual blocks employed at the target level significantly learn local and global texture variations of different classes of brain tumors. The efficiency of the developed Res-BRNet is evaluated on a standard dataset collected from Kaggle and Figshare containing various tumor categories, including meningioma, glioma, pituitary, and healthy images. Experiments show that the developed Res-BRNet outperforms standard CNN models and attains excellent performance (accuracy: 98.22%, sensitivity: 0.9811, F-score: 0.9841, and precision: 0.9822) on challenging datasets. Additionally, the performance of the proposed Res-BRNet indicates a strong potential for medical image-based disease analyses.  ( 2 min )
    Hierarchical Transformer for Survival Prediction Using Multimodality Whole Slide Images and Genomics. (arXiv:2211.16632v1 [cs.CV])
    Learning good representation of giga-pixel level whole slide pathology images (WSI) for downstream tasks is critical. Previous studies employ multiple instance learning (MIL) to represent WSIs as bags of sampled patches because, for most occasions, only slide-level labels are available, and only a tiny region of the WSI is the disease-positive area. However, WSI representation learning still remains an open problem due to: (1) patch sampling at a higher resolution may be incapable of depicting microenvironment information such as the relative position between the tumor cells and surrounding tissues, while patches at lower resolution lose the fine-grained detail; (2) extracting patches from a giant WSI results in large bag size, which tremendously increases the computational cost. To solve these problems, this paper proposes a hierarchical-based multimodal transformer framework that learns a hierarchical mapping between pathology images and corresponding genes. Precisely, we randomly extract instance-level patch features from WSIs with different magnification. Then a co-attention mapping between imaging and genomics is learned to uncover the pairwise interaction and reduce the space complexity of imaging features. Such early fusion makes it computationally feasible to use MIL Transformer for the survival prediction task. Our architecture requires fewer GPU resources compared with benchmark methods while maintaining better WSI representation ability. We evaluate our approach on five cancer types from the Cancer Genome Atlas database and achieved an average c-index of $0.673$, outperforming the state-of-the-art multimodality methods.  ( 2 min )
    Testing GLOM's ability to infer wholes from ambiguous parts. (arXiv:2211.16564v1 [cs.CV])
    The GLOM architecture proposed by Hinton [2021] is a recurrent neural network for parsing an image into a hierarchy of wholes and parts. When a part is ambiguous, GLOM assumes that the ambiguity can be resolved by allowing the part to make multi-modal predictions for the pose and identity of the whole to which it belongs and then using attention to similar predictions coming from other possibly ambiguous parts to settle on a common mode that is predicted by several different parts. In this study, we describe a highly simplified version of GLOM that allows us to assess the effectiveness of this way of dealing with ambiguity. Our results show that, with supervised training, GLOM is able to successfully form islands of very similar embedding vectors for all of the locations occupied by the same object and it is also robust to strong noise injections in the input and to out-of-distribution input transformations.  ( 2 min )
  • Open

    Kalman Bayesian Neural Networks for Closed-form Online Learning. (arXiv:2110.00944v2 [cs.LG] UPDATED)
    Compared to point estimates calculated by standard neural networks, Bayesian neural networks (BNN) provide probability distributions over the output predictions and model parameters, i.e., the weights. Training the weight distribution of a BNN, however, is more involved due to the intractability of the underlying Bayesian inference problem and thus, requires efficient approximations. In this paper, we propose a novel approach for BNN learning via closed-form Bayesian inference. For this purpose, the calculation of the predictive distribution of the output and the update of the weight distribution are treated as Bayesian filtering and smoothing problems, where the weights are modeled as Gaussian random variables. This allows closed-form expressions for training the network's parameters in a sequential/online fashion without gradient descent. We demonstrate our method on several UCI datasets and compare it to the state of the art.
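    For intuition, here is the exact closed-form (Kalman-style) posterior update for the special case of a single linear-Gaussian layer, i.e., Bayesian linear regression; the paper's method extends this filtering view to full nonlinear networks, which this sketch does not attempt:

```python
import numpy as np

def kalman_weight_update(mu, Sigma, x, y, obs_noise=0.01):
    """Closed-form (Kalman-style) update of a Gaussian weight posterior
    N(mu, Sigma) for a linear observation y = w^T x + noise.
    Exact in the linear-Gaussian case; BNN methods approximate this for
    nonlinear layers via filtering/smoothing."""
    x = x.reshape(-1, 1)
    S = float(x.T @ Sigma @ x) + obs_noise        # innovation variance
    K = (Sigma @ x) / S                           # Kalman gain, shape (d, 1)
    resid = y - float(mu @ x)                     # innovation
    mu_new = mu + K.ravel() * resid
    Sigma_new = Sigma - K @ (x.T @ Sigma)
    return mu_new, Sigma_new

d = 3
mu, Sigma = np.zeros(d), np.eye(d)
rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])
for _ in range(200):                              # sequential/online updates
    x = rng.normal(size=d)
    y = w_true @ x + 0.1 * rng.normal()
    mu, Sigma = kalman_weight_update(mu, Sigma, x, y)
print(np.round(mu, 2))                            # approaches w_true, no gradients
```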
    Causal de Finetti: On the Identification of Invariant Causal Structure in Exchangeable Data. (arXiv:2203.15756v2 [stat.ML] UPDATED)
    Learning causal structure from observational data often assumes that we observe independent and identically distributed (i.i.d.) data. The traditional approach aims to find a graphical representation that encodes the same set of conditional independence relationships as those present in the observed distribution. It is known that under the i.i.d. assumption, even with infinite data, there is a limit to how fine-grained a causal structure we can identify. To overcome this limitation, recent work has explored using data originating from different, related environments to learn richer causal structure. These approaches implicitly rely on the independent causal mechanisms (ICM) principle, which postulates that the mechanism giving rise to an effect given its causes and the mechanism which generates the causes do not inform or influence each other. Thus, components of the causal model can independently change from environment to environment. Despite its wide application in machine learning and causal inference, there is a lack of statistical formalization of the ICM principle and how it enables identification of richer causal structures from grouped data. Here we present new causal de Finetti theorems which offer a first statistical formalization of the ICM principle and show how causal structure identification is possible from exchangeable data. Our work provides theoretical justification for a broad range of techniques leveraging multi-environment data to learn causal structure.
    High-Fidelity Guided Image Synthesis with Latent Diffusion Models. (arXiv:2211.17084v1 [cs.CV])
    Controllable image synthesis with user scribbles has gained huge public interest with the recent advent of text-conditioned latent diffusion models. The user scribbles control the color composition while the text prompt provides control over the overall image semantics. However, we note that prior works in this direction suffer from an intrinsic domain shift problem, wherein the generated outputs often lack details and resemble simplistic representations of the target domain. In this paper, we propose a novel guided image synthesis framework, which addresses this problem by modeling the output image as the solution of a constrained optimization problem. We show that while computing an exact solution to the optimization is infeasible, an approximation of the same can be achieved while just requiring a single pass of the reverse diffusion process. Additionally, we show that by simply defining a cross-attention based correspondence between the input text tokens and the user stroke-painting, the user is also able to control the semantics of different painted regions without requiring any conditional training or finetuning. Human user study results show that the proposed approach outperforms the previous state-of-the-art by over 85.32% on the overall user satisfaction scores. Project page for our paper is available at https://1jsingh.github.io/gradop.
    Data fission: splitting a single data point. (arXiv:2112.11079v6 [stat.ME] UPDATED)
    Suppose we observe a random vector $X$ from some distribution $P$ in a known family with unknown parameters. We ask the following question: when is it possible to split $X$ into two parts $f(X)$ and $g(X)$ such that neither part is sufficient to reconstruct $X$ by itself, but both together can recover $X$ fully, and the joint distribution of $(f(X),g(X))$ is tractable? As one example, if $X=(X_1,\dots,X_n)$ and $P$ is a product distribution, then for any $m<n$, we can split the sample to define $f(X)=(X_1,\dots,X_m)$ and $g(X)=(X_{m+1},\dots,X_n)$. Rasines and Young (2021) offer an alternative route to accomplishing this task through randomization of $X$ with additive Gaussian noise, which enables post-selection inference in finite samples for Gaussian distributed data and asymptotically for non-Gaussian additive models. In this paper, we offer a more general methodology for achieving such a split in finite samples by borrowing ideas from Bayesian inference to yield a (frequentist) solution that can be viewed as a continuous analog of data splitting. We call our method data fission, as an alternative to data splitting, data carving and p-value masking. We exemplify the method on a few prototypical applications, such as post-selection inference for trend filtering and other regression problems.  ( 2 min )
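    For the Gaussian case, the construction (as we recall it; verify against the paper) draws external noise $Z \sim N(0, \sigma^2)$ and sets $f(X) = X + \tau Z$, $g(X) = X - Z/\tau$, giving two independent parts that jointly recover $X$:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, tau, n = 1.0, 0.5, 200_000
X = 3.0 + sigma * rng.normal(size=n)          # X ~ N(3, sigma^2)
Z = sigma * rng.normal(size=n)                # external noise, same variance

fX = X + tau * Z                              # f(X) ~ N(mu, (1 + tau^2) sigma^2)
gX = X - Z / tau                              # g(X) ~ N(mu, (1 + 1/tau^2) sigma^2)

# Both parts together recover X exactly:
X_rec = (fX / tau + tau * gX) / (tau + 1 / tau)
print(np.allclose(X_rec, X))                  # True
# And they are uncorrelated (hence independent, being jointly Gaussian):
print(np.round(np.corrcoef(fX, gX)[0, 1], 3))  # ~ 0.0
```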
    Semisoft Task Clustering for Multi-Task Learning. (arXiv:2211.17204v1 [cs.LG])
    Multi-task learning (MTL) aims to improve the performance of multiple related prediction tasks by leveraging useful information from them. Due to their flexibility and ability to reduce unknown coefficients substantially, the task-clustering-based MTL approaches have attracted considerable attention. Motivated by the idea of semisoft clustering of data, we propose a semisoft task clustering approach, which can simultaneously reveal the task cluster structure for both pure and mixed tasks as well as select the relevant features. The main assumption behind our approach is that each cluster has some pure tasks, and each mixed task can be represented by a linear combination of pure tasks in different clusters. To solve the resulting non-convex constrained optimization problem, we design an efficient three-step algorithm. The experimental results based on synthetic and real-world datasets validate the effectiveness and efficiency of the proposed approach. Finally, we extend the proposed approach to a robust task clustering problem.  ( 2 min )
    Overcoming the Convex Relaxation Barrier for Neural Network Verification via Nonconvex Low-Rank Semidefinite Relaxations. (arXiv:2211.17244v1 [cs.LG])
    To rigorously certify the robustness of neural networks to adversarial perturbations, most state-of-the-art techniques rely on a triangle-shaped linear programming (LP) relaxation of the ReLU activation. While the LP relaxation is exact for a single neuron, recent results suggest that it faces an inherent "convex relaxation barrier" as additional activations are added, and as the attack budget is increased. In this paper, we propose a nonconvex relaxation of the ReLU, based on a low-rank restriction of a semidefinite programming (SDP) relaxation. We show that the nonconvex relaxation has a similar complexity to the LP relaxation, but enjoys improved tightness that is comparable to the much more expensive SDP relaxation. Despite nonconvexity, we prove that the verification problem satisfies constraint qualification, and therefore a Riemannian staircase approach is guaranteed to compute a near-globally optimal solution in polynomial time. Our experiments provide evidence that our nonconvex relaxation almost completely overcomes the "convex relaxation barrier" faced by the LP relaxation.  ( 2 min )
    High-Dimensional Wide Gap $k$-Means Versus Clustering Axioms. (arXiv:2211.17036v1 [cs.LG])
    Kleinberg's axioms for distance-based clustering proved to be contradictory. Various efforts have been made to overcome this problem. Here we make an attempt to handle the issue by embedding in high-dimensional space and granting wide gaps between clusters.  ( 2 min )
    A Deep Learning Approach to the Prediction of Drug Side-Effects on Molecular Graphs. (arXiv:2211.16871v1 [stat.ML])
    Predicting drug side-effects before they occur is a key task in keeping the number of drug-related hospitalizations low and in improving drug discovery processes. Automatic predictors of side-effects are generally unable to process the structure of the drug, resulting in a loss of information. Graph neural networks have seen great success in recent years, thanks to their ability to exploit the information conveyed by the graph structure and labels. These models have been used in a wide variety of biological applications, among which the prediction of drug side-effects on a large knowledge graph. Exploiting the molecular graph encoding the structure of the drug represents a novel approach, in which the problem is formulated as a multi-class multi-label graph-focused classification. We developed a methodology to carry out this task, using recurrent Graph Neural Networks, and building a dataset from freely accessible and well-established data sources. The results show that our method has an improved classification capability, under many parameters and metrics, with respect to previously available predictors.  ( 2 min )
    Riemannian Metric Learning via Optimal Transport. (arXiv:2205.09244v3 [cs.LG] UPDATED)
    We introduce an optimal transport-based model for learning a metric tensor from cross-sectional samples of evolving probability measures on a common Riemannian manifold. We neurally parametrize the metric as a spatially-varying matrix field and efficiently optimize our model's objective using a simple alternating scheme. Using this learned metric, we can nonlinearly interpolate between probability measures and compute geodesics on the manifold. We show that metrics learned using our method improve the quality of trajectory inference on scRNA and bird migration data at the cost of little additional cross-sectional data.  ( 2 min )
    Private Stochastic Optimization With Large Worst-Case Lipschitz Parameter: Optimal Rates for (Non-Smooth) Convex Losses and Extension to Non-Convex Losses. (arXiv:2209.07403v3 [cs.LG] UPDATED)
    We study differentially private (DP) stochastic optimization (SO) with loss functions whose worst-case Lipschitz parameter over all data points may be extremely large. To date, the vast majority of work on DP SO assumes that the loss is uniformly Lipschitz continuous over data (i.e. stochastic gradients are uniformly bounded over all data points). While this assumption is convenient, it often leads to pessimistic excess risk bounds. In many practical problems, the worst-case Lipschitz parameter of the loss over all data points may be extremely large due to outliers. In such cases, the error bounds for DP SO, which scale with the worst-case Lipschitz parameter of the loss, are vacuous. To address these limitations, this work provides near-optimal excess risk bounds that do not depend on the uniform Lipschitz parameter of the loss. Building on a recent line of work [WXDX20, KLZ22], we assume that stochastic gradients have bounded $k$-th order moments for some $k \geq 2$. Compared with works on uniformly Lipschitz DP SO, our excess risk scales with the $k$-th moment bound instead of the uniform Lipschitz parameter of the loss, allowing for significantly faster rates in the presence of outliers and/or heavy-tailed data. For convex and strongly convex loss functions, we provide the first asymptotically optimal excess risk bounds (up to a logarithmic factor). In contrast to [WXDX20, KLZ22], our bounds do not require the loss function to be differentiable/smooth. We also devise an accelerated algorithm for smooth losses that runs in linear time and has excess risk that is tight in certain practical parameter regimes. Additionally, our work is the first to address non-convex non-uniformly Lipschitz loss functions satisfying the Proximal-PL inequality; this covers some practical machine learning models. Our Proximal-PL algorithm has near-optimal excess risk.  ( 3 min )
    Estimation and Inference on Heterogeneous Treatment Effects in High-Dimensional Dynamic Panels under Weak Dependence. (arXiv:1712.09988v5 [stat.ML] UPDATED)
    This paper provides estimation and inference methods for a conditional average treatment effects (CATE) characterized by a high-dimensional parameter in both homogeneous cross-sectional and unit-heterogeneous dynamic panel data settings. In our leading example, we model CATE by interacting the base treatment variable with explanatory variables. The first step of our procedure is orthogonalization, where we partial out the controls and unit effects from the outcome and the base treatment and take the cross-fitted residuals. This step uses a novel generic cross-fitting method we design for weakly dependent time series and panel data. This method "leaves out the neighbors" when fitting nuisance components, and we theoretically power it by using Strassen's coupling. As a result, we can rely on any modern machine learning method in the first step, provided it learns the residuals well enough. Second, we construct an orthogonal (or residual) learner of CATE -- the Lasso CATE -- that regresses the outcome residual on the vector of interactions of the residualized treatment with explanatory variables. If the complexity of CATE function is simpler than that of the first-stage regression, the orthogonal learner converges faster than the single-stage regression-based learner. Third, we perform simultaneous inference on parameters of the CATE function using debiasing. We also can use ordinary least squares in the last two steps when CATE is low-dimensional. In heterogeneous panel data settings, we model the unobserved unit heterogeneity as a weakly sparse deviation from Mundlak (1978)'s model of correlated unit effects as a linear function of time-invariant covariates and make use of L1-penalization to estimate these models. We demonstrate our methods by estimating price elasticities of groceries based on scanner data. We note that our results are new even for the cross-sectional (i.i.d) case.  ( 3 min )
    Continual Learning with Distributed Optimization: Does COCOA Forget?. (arXiv:2211.16994v1 [stat.ML])
    We focus on the continual learning problem where the tasks arrive sequentially and the aim is to perform well on the newly arrived task without performance degradation on the previously seen tasks. In contrast to the continual learning literature focusing on the centralized setting, we investigate the distributed estimation framework. We consider the well-established distributed learning algorithm COCOA. We derive closed-form expressions for the iterations in the overparametrized case. We illustrate the convergence and the error performance of the algorithm based on the over/under-parametrization of the problem. Our results show that depending on the problem dimensions and data generation assumptions, COCOA can perform continual learning over a sequence of tasks, i.e., it can learn a new task without forgetting previously learned tasks, with access only to one task at a time.  ( 2 min )
    Low-rank Tensor Estimation via Riemannian Gauss-Newton: Statistical Optimality and Second-Order Convergence. (arXiv:2104.12031v3 [stat.ML] UPDATED)
    In this paper, we consider the estimation of a low Tucker rank tensor from a number of noisy linear measurements. The general problem covers many specific examples arising from applications, including tensor regression, tensor completion, and tensor PCA/SVD. We consider an efficient Riemannian Gauss-Newton (RGN) method for low Tucker rank tensor estimation. Different from the generic (super)linear convergence guarantee of RGN in the literature, we prove the first local quadratic convergence guarantee of RGN for low-rank tensor estimation in the noisy setting under some regularity conditions and provide the corresponding estimation error upper bounds. A deterministic estimation error lower bound, which matches the upper bound, is provided that demonstrates the statistical optimality of RGN. The merit of RGN is illustrated through two machine learning applications: tensor regression and tensor SVD. Finally, we provide the simulation results to corroborate our theoretical findings.  ( 2 min )
    Learning non-stationary and discontinuous functions using clustering, classification and Gaussian process modelling. (arXiv:2211.16909v1 [stat.ML])
    Surrogate models have been shown to be an extremely efficient aid in solving engineering problems that require repeated evaluations of an expensive computational model. They are built by sparsely evaluating the costly original model and have provided a way to solve otherwise intractable problems. A crucial aspect of surrogate modelling is the assumption of smoothness and regularity of the model to approximate. This assumption is, however, not always met in reality. For instance, in civil or mechanical engineering, some models may present discontinuities or non-smoothness, e.g., in the case of instability patterns such as buckling or snap-through. Building a single surrogate model capable of accounting for these fundamentally different behaviors or discontinuities is not an easy task. In this paper, we propose a three-stage approach for the approximation of non-smooth functions which combines clustering, classification and regression. The idea is to split the space following the localized behaviors or regimes of the system and build local surrogates that are eventually assembled. A sequence of well-known machine learning techniques is used: Dirichlet process mixture models (DPMM), support vector machines and Gaussian process modelling. The approach is tested and validated on two analytical functions and a finite element model of a tensile membrane structure.  ( 2 min )
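    A minimal sketch of this three-stage idea using scikit-learn stand-ins (BayesianGaussianMixture as the DPMM, an SVM router, and per-regime Gaussian-process surrogates; the toy function and all hyperparameters are illustrative, not the paper's setup):

        import numpy as np
        from sklearn.mixture import BayesianGaussianMixture
        from sklearn.svm import SVC
        from sklearn.gaussian_process import GaussianProcessRegressor

        def fit_local_surrogates(X, y, max_clusters=5):
            # Stage 1: cluster (input, output) pairs to separate regimes; the
            # Dirichlet-process prior prunes unused mixture components.
            joint = np.column_stack([X, y])
            mix = BayesianGaussianMixture(
                n_components=max_clusters,
                weight_concentration_prior_type="dirichlet_process",
                random_state=0,
            ).fit(joint)
            labels = mix.predict(joint)
            # Stage 2: a classifier routes new inputs to a regime from X alone.
            clf = SVC().fit(X, labels)
            # Stage 3: one local Gaussian-process surrogate per regime.
            gps = {k: GaussianProcessRegressor().fit(X[labels == k], y[labels == k])
                   for k in np.unique(labels)}
            return clf, gps

        def predict(clf, gps, X_new):
            return np.array([gps[k].predict(x[None, :])[0]
                             for k, x in zip(clf.predict(X_new), X_new)])

        # toy function with a jump at x = 0
        rng = np.random.default_rng(0)
        X = rng.uniform(-1, 1, size=(200, 1))
        y = np.sin(4 * X[:, 0]) + 2.0 * (X[:, 0] > 0)
        clf, gps = fit_local_surrogates(X, y)
        print(predict(clf, gps, np.array([[-0.5], [0.5]])))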
    Quantum Kerr Learning. (arXiv:2205.12004v2 [quant-ph] UPDATED)
    Quantum machine learning is a rapidly evolving field of research that could facilitate important applications for quantum computing and also significantly impact data-driven sciences. In our work, based on various arguments from complexity theory and physics, we demonstrate that a single Kerr mode can provide some "quantum enhancements" when dealing with kernel-based methods. Using kernel properties, neural tangent kernel theory, first-order perturbation theory of the Kerr non-linearity, and non-perturbative numerical simulations, we show that quantum enhancements could happen in terms of convergence time and generalization error. Furthermore, we indicate explicitly how higher-dimensional input data could be considered. Finally, we propose an experimental protocol, that we call \emph{quantum Kerr learning}, based on circuit QED.  ( 2 min )
    Parameters or Privacy: A Provable Tradeoff Between Overparameterization and Membership Inference. (arXiv:2202.01243v2 [stat.ML] UPDATED)
    A surprising phenomenon in modern machine learning is the ability of a highly overparameterized model to generalize well (small error on the test data) even when it is trained to memorize the training data (zero error on the training data). This has led to an arms race towards increasingly overparameterized models (c.f., deep learning). In this paper, we study an underexplored hidden cost of overparameterization: the fact that overparameterized models may be more vulnerable to privacy attacks, in particular the membership inference attack that predicts the (potentially sensitive) examples used to train a model. We significantly extend the relatively few empirical results on this problem by theoretically proving for an overparameterized linear regression model in the Gaussian data setting that membership inference vulnerability increases with the number of parameters. Moreover, a range of empirical studies indicates that more complex, nonlinear models exhibit the same behavior. Finally, we extend our analysis towards ridge-regularized linear regression and show in the Gaussian data setting that increased regularization also increases membership inference vulnerability in the overparameterized regime.  ( 2 min )
    Fair Ranking with Noisy Protected Attributes. (arXiv:2211.17067v1 [cs.LG])
    The fair-ranking problem, which asks to rank a given set of items to maximize utility subject to group fairness constraints, has received attention in the fairness, information retrieval, and machine learning literature. Recent works, however, observe that errors in socially-salient (including protected) attributes of items can significantly undermine fairness guarantees of existing fair-ranking algorithms and raise the problem of mitigating the effect of such errors. We study the fair-ranking problem under a model where socially-salient attributes of items are randomly and independently perturbed. We present a fair-ranking framework that incorporates group fairness requirements along with probabilistic information about perturbations in socially-salient attributes. We provide provable guarantees on the fairness and utility attainable by our framework and show that it is information-theoretically impossible to significantly beat these guarantees. Our framework works for multiple non-disjoint attributes and a general class of fairness constraints that includes proportional and equal representation. Empirically, we observe that, compared to baselines, our algorithm outputs rankings with higher fairness and achieves a similar or better fairness-utility trade-off.  ( 2 min )
    Hierarchically Clustered PCA and CCA via a Convex Clustering Penalty. (arXiv:2211.16553v1 [cs.LG])
    We introduce an unsupervised learning approach that combines the truncated singular value decomposition with convex clustering to estimate within-cluster directions of maximum variance/covariance (in the variables) while simultaneously hierarchically clustering (on observations). In contrast to previous work on joint clustering and embedding, our approach has a straightforward formulation, is readily scalable via distributed optimization, and admits a direct interpretation as hierarchically clustered principal component analysis (PCA) or hierarchically clustered canonical correlation analysis (CCA). Through numerical experiments and real-world examples relevant to precision medicine, we show that our approach outperforms traditional and contemporary clustering methods on underdetermined problems ($p \gg N$ with tens of observations) and scales to large datasets (e.g., $N=100,000$; $p=1,000$) while yielding interpretable dendrograms of hierarchical per-cluster principal components or canonical variates.  ( 2 min )
    Transfer Learning with Uncertainty Quantification: Random Effect Calibration of Source to Target (RECaST). (arXiv:2211.16557v1 [stat.ME])
    Transfer learning uses a data model, trained to make predictions or inferences on data from one population, to make reliable predictions or inferences on data from another population. Most existing transfer learning approaches are based on fine-tuning pre-trained neural network models, and fail to provide crucial uncertainty quantification. We develop a statistical framework for model predictions based on transfer learning, called RECaST. The primary mechanism is a Cauchy random effect that recalibrates a source model to a target population; we mathematically and empirically demonstrate the validity of our RECaST approach for transfer learning between linear models, in the sense that prediction sets will achieve their nominal stated coverage, and we numerically illustrate the method's robustness to asymptotic approximations for nonlinear models. Whereas many existing techniques are built on particular source models, RECaST is agnostic to the choice of source model. For example, our RECaST transfer learning approach can be applied to a continuous or discrete data model with linear or logistic regression, deep neural network architectures, etc. Furthermore, RECaST provides uncertainty quantification for predictions, which is mostly absent in the literature. We examine our method's performance in a simulation study and in an application to real hospital data.  ( 2 min )
    Optimistic search: Change point estimation for large-scale data via adaptive logarithmic queries. (arXiv:2010.10194v3 [stat.ME] UPDATED)
    Change point estimation is often formulated as a search for the maximum of a gain function describing improved fits when segmenting the data. Searching through all candidates requires $O(n)$ evaluations of the gain function for an interval with $n$ observations. If each evaluation is computationally demanding (e.g. in high-dimensional models), this can become infeasible. Instead, we propose optimistic search methods with $O(\log n)$ evaluations exploiting specific structure of the gain function. Towards a solid understanding of our strategy, we investigate in detail the $p$-dimensional Gaussian changing means setup, including high-dimensional scenarios. For some of our proposals, we prove asymptotic minimax optimality for detecting change points and derive their asymptotic localization rate. These rates (up to a possible log factor) are optimal for the univariate and multivariate scenarios, and are by far the fastest in the literature under the weakest possible detection condition on the signal-to-noise ratio in the high-dimensional scenario. Computationally, our proposed methodology has the worst case complexity of $O(np)$, which can be improved to be sublinear in $n$ if some a-priori knowledge on the length of the shortest segment is available. Our search strategies generalize far beyond the theoretically analyzed setup. We illustrate, as an example, massive computational speedup in change point detection for high-dimensional Gaussian graphical models.  ( 2 min )
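    To illustrate the flavor of such a search (a minimal sketch, not the authors' exact algorithm: it assumes the gain curve is roughly unimodal around the true change point and uses golden-section-style probing to spend only $O(\log n)$ gain evaluations):

        import math
        import numpy as np

        INV_PHI = (math.sqrt(5) - 1) / 2  # ~0.618

        def optimistic_search(gain, left, right, min_gap=2):
            # Approximate maximizer of `gain` over split candidates in [left, right]
            # using O(log n) evaluations; one interior probe is reused per step.
            a, b = left, right
            x1 = round(b - INV_PHI * (b - a))
            x2 = round(a + INV_PHI * (b - a))
            g1, g2 = gain(x1), gain(x2)
            while b - a > min_gap:
                if g1 >= g2:                       # maximum lies left of x2
                    b, x2, g2 = x2, x1, g1
                    x1 = round(b - INV_PHI * (b - a))
                    g1 = gain(x1)
                else:                              # maximum lies right of x1
                    a, x1, g1 = x1, x2, g2
                    x2 = round(a + INV_PHI * (b - a))
                    g2 = gain(x2)
            return max(range(a, b + 1), key=gain)  # exhaust the tiny final interval

        # toy example: mean shift at t = 700 in n = 1000 noisy points
        x = np.concatenate([np.zeros(700), np.ones(300)])
        x += np.random.default_rng(1).normal(0, 1, 1000)
        csum, n = np.cumsum(x), len(x)
        def gain(s):  # squared CUSUM statistic at split s
            return (csum[s - 1] - s / n * csum[-1]) ** 2 * n / (s * (n - s))
        print(optimistic_search(gain, 1, n - 1))   # ~700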
    Infinite-width limit of deep linear neural networks. (arXiv:2211.16980v1 [cs.LG])
    This paper studies the infinite-width limit of deep linear neural networks initialized with random parameters. We show that, when the number of neurons diverges, the training dynamics converge (in a precise sense) to the dynamics obtained from a gradient descent on an infinitely wide deterministic linear neural network. Moreover, even if the weights remain random, we get their precise law along the training dynamics, and prove a quantitative convergence result of the linear predictor in terms of the number of neurons. We finally study the continuous-time limit obtained for infinitely wide linear neural networks and show that the linear predictors of the neural network converge at an exponential rate to the minimal $\ell_2$-norm minimizer of the risk.  ( 2 min )
    Variational Autoencoders for Anomalous Jet Tagging. (arXiv:2007.01850v4 [hep-ph] UPDATED)
    We present a detailed study on Variational Autoencoders (VAEs) for anomalous jet tagging at the Large Hadron Collider. By taking in low-level jet constituents' information, and training with background QCD jets in an unsupervised manner, the VAE is able to encode important information for reconstructing jets, while learning an expressive posterior distribution in the latent space. When using the VAE as an anomaly detector, we present different approaches to detect anomalies: directly comparing in the input space or, instead, working in the latent space. In order to facilitate general search approaches such as bump-hunt, mass-decorrelated VAEs based on distance correlation regularization are also studied. We find that the naive mass-decorrelated VAEs fail at maintaining proper detection performance, by assigning higher probabilities to some anomalous samples. To build a performant mass-decorrelated anomalous jet tagger, we propose the Outlier Exposed VAE (OE-VAE), for which some outlier samples are introduced in the training process to guide the learned information. OE-VAEs are employed to achieve two goals at the same time: increasing sensitivity of outlier detection and decorrelating jet mass from the anomaly score. We succeed in reaching excellent results from both aspects. Code implementation of this work can be found at https://github.com/taolicheng/VAE-Jet  ( 2 min )
    Estimation under Model Misspecification with Fake Features. (arXiv:2203.03398v2 [eess.SP] UPDATED)
    We consider estimation under model misspecification where there is a model mismatch between the underlying system, which generates the data, and the model used during estimation. We propose a model misspecification framework which enables a joint treatment of the model misspecification types of having fake features as well as incorrect covariance assumptions on the unknowns and the noise. We present a decomposition of the output error into components that relate to different subsets of the model parameters corresponding to underlying, fake and missing features. Here, fake features are features which are included in the model but are not present in the underlying system. Under this framework, we characterize the estimation performance and reveal trade-offs between the number of samples, number of fake features, and the possibly incorrect noise level assumption. In contrast to existing work focusing on incorrect covariance assumptions or missing features, fake features are a central component of our framework. Our results show that fake features can significantly improve the estimation performance, even though they are not correlated with the features in the underlying system. In particular, we show that the estimation error can be decreased by including more fake features in the model, even to the point where the model is overparametrized, i.e., the model contains more unknowns than observations.  ( 2 min )
    Weisfeiler and Leman Go Relational. (arXiv:2211.17113v1 [cs.LG])
    Knowledge graphs, modeling multi-relational data, improve numerous applications such as question answering or graph logical reasoning. Many graph neural networks for such data emerged recently, often outperforming shallow architectures. However, the design of such multi-relational graph neural networks is ad-hoc, driven mainly by intuition and empirical insights. Up to now, their expressivity, their relation to each other, and their (practical) learning performance are poorly understood. Here, we initiate the study of deriving a more principled understanding of multi-relational graph neural networks. Namely, we investigate the limitations in the expressive power of the well-known Relational GCN and Compositional GCN architectures and shed some light on their practical learning performance. By aligning both architectures with a suitable version of the Weisfeiler-Leman test, we establish under which conditions both models have the same expressive power in distinguishing non-isomorphic (multi-relational) graphs or vertices with different structural roles. Further, by leveraging recent progress in designing expressive graph neural networks, we introduce the $k$-RN architecture that provably overcomes the expressiveness limitations of the above two architectures. Empirically, we confirm our theoretical findings in a vertex classification setting over small and large multi-relational graphs.  ( 2 min )
    Universal Feature Selection Tool (UniFeat): An Open-Source Tool for Dimensionality Reduction. (arXiv:2211.16846v1 [cs.LG])
    The Universal Feature Selection Tool (UniFeat) is an open-source tool developed entirely in Java for performing feature selection processes in various research areas. It provides a set of well-known and advanced feature selection methods, along with significant auxiliary tools, allowing users to compare the performance of feature selection methods. Moreover, due to the open-source nature of UniFeat, researchers can use and modify it in their research, which facilitates the rapid development of new feature selection algorithms.  ( 2 min )
    PAC Verification of Statistical Algorithms. (arXiv:2211.17096v1 [stat.ML])
    Goldwasser et al. (2021) recently proposed the setting of PAC verification, where a hypothesis (machine learning model) that purportedly satisfies the agnostic PAC learning objective is verified using an interactive proof. In this paper we develop this notion further in a number of ways. First, we prove a lower bound for PAC verification of $\Omega(\sqrt{d})$ i.i.d. samples for hypothesis classes of VC dimension $d$. Second, we present a protocol for PAC verification of unions of intervals over $\mathbb{R}$ that improves upon their proposed protocol for that task, and matches our lower bound. Third, we introduce a natural generalization of their definition to verification of general statistical algorithms, which is applicable to a wider variety of practical algorithms beyond agnostic PAC learning. Showcasing our proposed definition, our final result is a protocol for the verification of statistical query algorithms that satisfy a combinatorial constraint on their queries.  ( 2 min )
    Statistical treatment of convolutional neural network super-resolution of inland surface wind for subgrid-scale variability quantification. (arXiv:2211.16708v1 [physics.ao-ph])
    Machine learning models are frequently employed to perform either purely physics-free or hybrid downscaling of climate data. However, the majority of these implementations operate over relatively small downscaling factors of about 4--6x. This study examines the ability of convolutional neural networks (CNN) to downscale surface wind speed data from three different coarse resolutions (25km, 48km, and 100km side-length grid cells) to 3km and additionally focuses on the ability to recover subgrid-scale variability. Within each downscaling factor, namely 8x, 16x, and 32x, we consider models that produce fine-scale wind speed predictions as functions of different input features: coarse wind fields only; coarse wind and fine-scale topography; and coarse wind, topography, and temporal information in the form of a timestamp. Furthermore, we train one model at 25km to 3km resolution whose fine-scale outputs are probability density function parameters through which sample wind speeds can be generated. All CNN predictions, evaluated on out-of-sample data, outperform classical interpolation. Models with coarse wind and fine topography are shown to exhibit the best performance compared to other models operating across the same downscaling factor. Our timestamp encoding results in lower out-of-sample generalizability compared to other input configurations. Overall, the downscaling factor plays the largest role in model performance.  ( 2 min )
    Targets in Reinforcement Learning to solve Stackelberg Security Games. (arXiv:2211.17132v1 [cs.LG])
    Reinforcement Learning (RL) algorithms have been successfully applied to real world situations like illegal smuggling, poaching, deforestation, climate change, airport security, etc. These scenarios can be framed as Stackelberg security games (SSGs) where defenders and attackers compete to control target resources. The algorithm's competency is assessed by which agent is controlling the targets. This review investigates modeling of SSGs in RL with a focus on possible improvements of target representations in RL algorithms.  ( 2 min )
    Offline Policy Evaluation and Optimization under Confounding. (arXiv:2211.16583v1 [stat.ML])
    With a few exceptions, work in offline reinforcement learning (RL) has so far assumed that there is no confounding. In a classical regression setting, confounders introduce omitted variable bias and inhibit the identification of causal effects. In offline RL, they prevent the identification of a policy's value, and therefore make it impossible to perform policy improvement. Using conventional methods in offline RL in the presence of confounding can therefore not only lead to poor decisions and poor policies, but can also have disastrous effects in applications such as healthcare and education. We provide approaches for both off-policy evaluation (OPE) and local policy optimization in the settings of i.i.d. and global confounders. Theoretical and empirical results confirm the validity and viability of these methods.  ( 2 min )
    A Novel Statistical Independence Test for Dynamic Causal Discovery with Rare Events. (arXiv:2211.16596v1 [stat.ML])
    Causal phenomena associated with rare events frequently occur across a wide range of engineering and mathematical problems, such as risk-sensitive safety analysis, accident analysis and prevention, and extreme value theory. However, current methods for causal discovery are often unable to uncover causal links between random variables that manifest only when the variables first experience low-probability realizations. To address this issue, we introduce a novel algorithm that performs statistical independence tests on data collected from time-invariant dynamical systems in which rare but consequential events occur. We seek to understand if the state of the dynamical system causally affects the likelihood of the rare event. In particular, we exploit the time-invariance of the underlying data to superimpose the occurrences of rare events, thus creating a new dataset in which rare events are better represented and on which conditional independence tests can be performed more efficiently. We provide non-asymptotic bounds for the consistency of our algorithm, and validate its performance across various simulated scenarios, with applications to traffic accidents.  ( 2 min )
    On Regret-optimal Cooperative Nonstochastic Multi-armed Bandits. (arXiv:2211.17154v1 [stat.ML])
    We consider the nonstochastic multi-agent multi-armed bandit problem with agents collaborating via a communication network with delays. We show a lower bound for individual regret of all agents. We show that with suitable regularizers and communication protocols, a collaborative multi-agent \emph{follow-the-regularized-leader} (FTRL) algorithm has an individual regret upper bound that matches the lower bound up to a constant factor when the number of arms is large enough relative to degrees of agents in the communication graph. We also show that an FTRL algorithm with a suitable regularizer is regret optimal with respect to the scaling with the edge-delay parameter. We present numerical experiments validating our theoretical results and demonstrate cases when our algorithms outperform previously proposed algorithms.  ( 2 min )

  • Open

    [p] Really Dumb Idea (bear with me)
    Really Dumb Question (bear with me): I am an avid outdoorsman and sometimes use camouflage when outdoors for airsoft. It’s been difficult finding a great camo for my environment. I thought maybe, just maybe, someone out there could make a program that finds the best camo possible for a given environment. I am very limited when it comes to programming, but I would assume the program would work like this: the AI finds color values and ratios from photos, then compares those values to camo patterns, and finally gives results of which camos match the environment. Is this possible? submitted by /u/poobispoob [link] [comments]  ( 59 min )
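    A rough prototype of that pipeline needs no AI at all to start with; a minimal sketch comparing coarse color histograms (the image file names are placeholders):

        import numpy as np
        from PIL import Image

        def color_histogram(path, bins=8):
            # coarse RGB histogram: which colors appear and in what ratios
            rgb = np.asarray(Image.open(path).convert("RGB")).reshape(-1, 3)
            hist, _ = np.histogramdd(rgb, bins=(bins,) * 3, range=[(0, 256)] * 3)
            return hist.ravel() / hist.sum()

        def histogram_intersection(h1, h2):
            return np.minimum(h1, h2).sum()   # 1.0 means an identical color mix

        env = color_histogram("my_environment.jpg")                       # placeholder photo
        camos = {"woodland": "woodland.jpg", "multicam": "multicam.jpg"}  # placeholders
        scores = {name: histogram_intersection(env, color_histogram(p))
                  for name, p in camos.items()}
        print(sorted(scores.items(), key=lambda kv: -kv[1]))              # best match first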
    I made a Short form + AI based Spotify tool [Project]
    My friend and I got annoyed with trying to find new music on Spotify. So for class we built a program that shortens a song to its 'best' 10-60 second snippet, to help you find new songs/artists and add them to your playlists faster. Some factors include valence/energy/bpm/key/wave/bridge/genre etc. App Store link: https://apps.apple.com/us/app/smores-music-discovery/id1626768775 Would love any feedback/criticisms/feature requests, thanks :) submitted by /u/Aromatic_Hat2715 [link] [comments]  ( 59 min )
    [D] can a MODEL (not code) created using licensed code (StyleGAN) be used for commercial use?
    Hi All, I am not using any of the licensed code in my product, just a model trained on custom data for inference! Is this legal under the creative commons license? Thanks! submitted by /u/willowill5 [link] [comments]  ( 59 min )
    [D] What are promising research areas of machine learning in the humanities?
    How could we strengthen the interdisciplinary exchange between the departments in academia? What professorships in the development of ML applications might foster collaboration? Any thoughts welcome. submitted by /u/hogfd [link] [comments]  ( 56 min )
    [R] Coder Reviewer Reranking for Code Generation - Facebook Research 2022 Tianyi Zhang et al - Coder-Reviewer reranking leads to up to 17% absolute accuracy gain!
    Paper: https://arxiv.org/abs/2211.16490 Github: https://github.com/facebookresearch/coder_reviewer_reranking Twitter: https://twitter.com/Tianyi_Zh/status/1598105103244103680 Abstract: "Sampling diverse programs from a code language model and reranking with model likelihood is a popular method for code generation but it is prone to preferring degenerate solutions. Inspired by collaborative programming, we propose Coder-Reviewer reranking. We augment Coder language models from past work, which generate programs given language instructions, with Reviewer models, which evaluate the likelihood of the instruction given the generated programs. We perform an extensive study across six datasets with eight models from three model families. Experimental results show that Coder-Reviewer reranking leads to consistent and significant improvement (up to 17% absolute accuracy gain) over reranking with the Coder model only. When combined with executability filtering, Coder-Reviewer reranking can often outperform the minimum Bayes risk method. Coder-Reviewer reranking is easy to implement by prompting, can generalize to different programming languages, and works well with off-the-shelf hyperparameters." submitted by /u/Singularian2501 [link] [comments]  ( 60 min )
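    The reranking rule itself is tiny once the two likelihoods are in hand; a minimal sketch (the log-probabilities below are placeholders for scores produced by the Coder and Reviewer models, assumed length-normalized upstream as in the paper):

        def coder_reviewer_rerank(candidates):
            # candidates: list of dicts with
            #   logp_coder    = log p(program | instruction)   from the Coder model
            #   logp_reviewer = log p(instruction | program)   from the Reviewer model
            # Pick the program maximizing the sum of the two log-likelihoods.
            return max(candidates, key=lambda c: c["logp_coder"] + c["logp_reviewer"])

        best = coder_reviewer_rerank([
            {"program": "def add(a, b): return a + b", "logp_coder": -1.2, "logp_reviewer": -0.8},
            {"program": "def add(a, b): return a - b", "logp_coder": -1.0, "logp_reviewer": -3.5},
        ])
        print(best["program"])  # the Reviewer penalizes the degenerate solution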
    [D] This book was entirely generated using ChatGPT from OpenAI!
    The Man Who Taught His Fish to Talk Generated using ChatGPT Chapter 1: The Beginning It all started on a warm summer afternoon, when I was sitting on my porch with my trusty old fishing rod in hand. I had been trying to catch some fish in the nearby pond for hours, but I hadn't had any luck. I was starting to lose hope, when I noticed a small goldfish swimming by. I quickly baited my hook and cast it into the water, and to my amazement, the little fish took the bait. I pulled it out of the water and examined it closely. It was a small, unremarkable fish, with a dull gold color and a pair of beady black eyes. But as I looked at the fish, I realized that it was unlike any fish I had ever seen. It had a curious, intelligent expression on its face, and it seemed to be looking at me with a…  ( 66 min )
    [P] Step by step guide to perform A/B test and measure ML models performance
    Hello, Recently I worked on a recommendation system for a media company, that helped us increase engagement. I have written an article summarizing how we setup the experiments and measured statistically the performance. Enjoy the article and let me know if you have any questions ! https://medium.com/@kaislar17/measure-machine-learning-models-live-performance-through-a-b-testing-7398f0a3edcc submitted by /u/Spirited-Singer-6150 [link] [comments]  ( 59 min )
    [P] Demo for gpt chat "frontend", which adds a talking face w/ audio via realtime vid generation. really brings chat to life!
    https://www.spacemonkey.ai/demo submitted by /u/willowill5 [link] [comments]  ( 67 min )
    [R] Latest Trigger Word Detection?
    Can anyone point me to the latest neural net model architectures (and maybe models) for trigger word detection? Searches on both reddit and arxiv yielded old results but I know there is research in this area. My use case is actually looking for specific sounds, but the trigger word architectures should work for this too. Think of triggering on a specific phoneme rather than a word. submitted by /u/ugeb318 [link] [comments]  ( 57 min )
    [R] Statistical vs Deep Learning forecasting methods
    Machine learning progress is plagued by the conflict between competing ideas, with no shortage of failed reviews, underdelivering models, and failed investments in expensive over-engineered solutions. We don't subscribe to the deep learning hype for time series and present a fully reproducible experiment that shows that: A simple statistical ensemble outperforms most individual deep-learning models. A simple statistical ensemble is 25,000 times faster and only slightly less accurate than an ensemble of deep learning models. In other words, deep-learning ensembles outperform statistical ensembles by just 0.36 points in SMAPE. However, the DL ensemble takes more than 14 days to run and costs around USD 11,000, while the statistical ensemble takes 6 minutes to run and costs $0.5c. Full results for the 3,003 series of M3 are in the report linked below. In conclusion: in terms of speed, costs, simplicity and interpretability, deep learning is far behind the simple statistical ensemble. In terms of accuracy, they are rather close. You can read the full report and reproduce the experiments in this Github repo: https://github.com/Nixtla/statsforecast/tree/main/experiments/m3 submitted by /u/fedegarzar [link] [comments]  ( 79 min )
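    For a sense of how cheap the statistical side can be, here is a toy median ensemble over classic baselines (a sketch only; the post's actual ensemble uses AutoARIMA/ETS-style models from statsforecast, not these naive rules):

        import numpy as np

        def naive(y, h):
            return np.full(h, y[-1])              # repeat the last value

        def seasonal_naive(y, h, season=12):
            reps = -(-h // season)                # ceil division
            return np.tile(y[-season:], reps)[:h]

        def drift(y, h):
            slope = (y[-1] - y[0]) / (len(y) - 1)
            return y[-1] + slope * np.arange(1, h + 1)

        def statistical_ensemble(y, h, season=12):
            # element-wise median across the baseline forecasts
            forecasts = np.vstack([naive(y, h), seasonal_naive(y, h, season), drift(y, h)])
            return np.median(forecasts, axis=0)

        y = np.sin(np.arange(120) * 2 * np.pi / 12) + np.arange(120) * 0.01
        print(statistical_ensemble(y, h=12))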
    [Discussion] - "data sourcing will be more important than model building in the era of foundational model fine-tuning"
    I was recently having this debate with a data engineering friend. My position was that, as foundational models "eat the world", it will become more valuable to be good at sourcing high-quality training data for finetuning than at building new models. Would love to trigger a wider debate here! submitted by /u/fourcornerclub [link] [comments]  ( 60 min )
    [P] Releasing customized language model pre-training acceleration toolkit: ExtremeBERT
    Language model pre-training demonstrates great promise in Natural Language Processing (NLP). However, language model pre-training incurs large, even staggering, costs. We present ExtremeBERT, a toolkit for accelerating and customizing BERT pretraining. Our goal is to provide an easy-to-use BERT pretraining toolkit for the research community and industry, so that the pretraining of popular language models on customized datasets is affordable with limited resources. Experiments show that, to achieve the same or better GLUE scores, the time cost of our toolkit is over 6x less for BERT Base and 9x less for BERT Large when compared with the original BERT paper. Three highlighted features: 🥳Easy-to-use Pipeline: one-line command pipeline without pain 🚀Acceleration: train your own BERT in one day 🌐Customized Datasets: compatible with huggingface datasets, support customization as well Checkout ExtremeBERT: 📃Paper: https://arxiv.org/abs/2211.17201 ⭐️Code: https://github.com/extreme-bert/extreme-bert 🔍Documentation: https://extreme-bert.github.io/extreme-bert-page Give it a ⭐ if you loved it :) submitted by /u/Snoo_97274 [link] [comments]  ( 56 min )
    [P] Probably the Fastest Open Source Stable Diffusion is released
    Hi everyone, we just released probably the fastest Stable Diffusion. Benchmarks on A100 GPUs (both PCIe 40GB and SXM 80GB) show that OneFlow Stable Diffusion leads the performance results compared to other deep learning frameworks/compilers. GitHub URL: https://github.com/Oneflow-Inc/diffusers/wiki/How-to-Run-OneFlow-Stable-Diffusion OneFlow URL: https://github.com/Oneflow-Inc/oneflow/ Before that, on November 7th, OneFlow accelerated Stable Diffusion to the era of "generating in one second" for the first time. On A100 SXM 80GB, OneFlow Stable Diffusion reaches a groundbreaking inference speed of 50 it/s, which means that the required 50 rounds of sampling to generate an image can be done in exactly 1 second. Now, OneFlow has refreshed the SOTA record again. You might wonder how OneFlow Stable Diffusion achieved this exciting result. Actually, OneFlow's compiler has played a pivotal role in accelerating the model. The compiler allows any PyTorch frontend-built model to run faster on NVIDIA GPUs. Welcome to try OneFlow Stable Diffusion and make your own masterpiece using Docker! All you need is to execute the following snippet:
        docker run --rm -it \
          --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
          -v ${HF_HOME}:${HF_HOME} \
          -v ${PWD}:${PWD} \
          -w ${PWD} \
          -e HF_HOME=${HF_HOME} \
          -e HUGGING_FACE_HUB_TOKEN=${HUGGING_FACE_HUB_TOKEN} \
          oneflowinc/oneflow-sd:cu112 \
          python3 /demos/oneflow-t2i.py # --prompt "a photo of an astronaut riding a horse on mars"
    Check out OneFlow on GitHub. We'd love to hear your feedback! submitted by /u/Just0by [link] [comments]  ( 62 min )
    [Project] I used whisper to transcribe 2500 episodes from around 80 podcasts and made it searchable.
    Hi all, This is similar to some other posts about transcribing podcast episodes. I used Whisper models to download and transcribe them, and then made them full-text searchable. The architecture is simple: RSS -> Download -> Transcribe -> Segment -> Ingest to DB for search. For the fully available transcripts, I also use auto-highlighting to highlight important segments of each podcast using Wink NLP. Here is the URL: https://www.castdop.com I can add around 1400 hours of content per day. Any feedback/comments/questions are appreciated. P.S.: let me know if this violates some rules, I just posted because I saw similar posts before. submitted by /u/t0mkaka [link] [comments]  ( 64 min )
    [D] Pretraining a visual model
    Hello, I’m actually trying to pre-train an encoder-decoder model and I have many questions to which I didn’t find answers. So I’m wondering if there is a source that collects good practices for pre-training models. Some of the questions I have in mind: In many papers (Swin, ConvNeXt, ...) they use a certain base learning rate and a cosine decay, but they never mention the value of the final learning rate. So do we generally go to small learning rates for pre-training models? Should we use dropout for pre-training? (It depends on the architecture, but typically, for an architecture where we use dropout when training from scratch on a small dataset, should we still use dropout at smaller values for pre-training?) Are there any hyperparameters that change from pre-training to fine-tuning apart from the learning rate? Thanks in advance 🙏 submitted by /u/Meddhouib10 [link] [comments]  ( 58 min )
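    On the final-learning-rate question: cosine schedules decay from the base rate to an explicit minimum that papers often leave implicit (common choices are 0 or something tiny like 1e-6). A sketch of the usual warmup-plus-cosine shape, with all values illustrative:

        import math

        def cosine_lr(step, total_steps, base_lr=1e-3, min_lr=1e-6, warmup_steps=1000):
            # linear warmup to base_lr, then cosine decay down to min_lr
            if step < warmup_steps:
                return base_lr * step / max(1, warmup_steps)
            t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
            return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * t))

        for s in [0, 1000, 50_000, 100_000]:
            print(s, cosine_lr(s, total_steps=100_000))  # ends exactly at min_lr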
    [D] Annotations Tools' Bounding Box to Mask Feature Implementation
    Hello, In many AI-assisted annotation tools, there is a feature in which the annotator creates a bounding box around an object, and the tool creates a mask of whatever object is inside this box. I was wondering what computer vision approaches could be running under the hood here. Is there any paper/blog exploring this? Thanks! submitted by /u/TryingToGeek [link] [comments]  ( 60 min )
    IEEE ICASSP Clairty Challenge for ML/AI Speech Enhancement [R]
    This ICASSP SP Clarity Challenge (Speech Enhancement for Hearing Aids) is about improving the performance of hearing aids for speech-in-noise. According to the World Health Organization, 430 million people worldwide require rehabilitation to address hearing loss. By 2050, this will increase to one in ten people having disabling hearing loss. Yet even in developed countries, only 40% of people who could benefit from hearing aids have them. A major reason for low uptake and use is the perception that hearing aids perform poorly. Speech enhancement is a major research area with thousands of papers each year, yet only a tiny percentage of these explicitly consider improvements for listeners who have a hearing loss. Consequently, this signal processing challenge is designed to get the latest advancements in speech enhancement applied to hearing aids. Entrants are tasked to enhance speech-in-noise for input into a hearing aid amplification stage. The hearing aid will be tuned to the hearing characteristics of particular people. Thus you can enter without in-depth knowledge of hearing aids, and just concentrate on the task of de-noising. The scenario is listening to speech in the presence of typical domestic noise. We provide the signals captured by the microphones on a pair of behind-the-ear hearing aids and those captured at the eardrum. The target speech will be a short sentence. The interfering noises will be a mix of speech, domestic appliance noise and music. The audio includes the simulation of the acoustic of typical small living rooms. The challenge is to improve the speech intelligibility without excessive loss of quality. To this end, entries will be evaluated using an objective metric that is an average of the Hearing Aid Speech Perception Index (HASPI) and Hearing Aid Speech Quality Index (HASQI). ​ Link in the comments for more info and to register. submitted by /u/clarity_challenges [link] [comments]  ( 60 min )
    [D] Is it possible to get a confusion matrix for Optical Character Recognition
    Is there an established analytical method to find out whether an OCR system is confusing specific text classes with one another? The normal confusion-matrix approach does not work, since an arbitrary number of classes can exist in the output prediction for word- or sentence-level OCR. submitted by /u/theahmedmustafa [link] [comments]  ( 57 min )
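    One common workaround is to align each prediction with its ground truth via edit-distance-style matching and tally character substitutions; a minimal sketch using Python's difflib (insertions/deletions would need an extra epsilon class):

        from collections import Counter
        from difflib import SequenceMatcher

        def char_confusions(ground_truth, prediction):
            # count character-level substitutions between aligned strings
            confusions = Counter()
            sm = SequenceMatcher(None, ground_truth, prediction, autojunk=False)
            for tag, i1, i2, j1, j2 in sm.get_opcodes():
                if tag == "replace" and (i2 - i1) == (j2 - j1):
                    for a, b in zip(ground_truth[i1:i2], prediction[j1:j2]):
                        confusions[(a, b)] += 1
            return confusions

        print(char_confusions("HELLO WORLD", "HELL0 W0RLD"))  # {('O', '0'): 2}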
    [D] Cloud providers for hobby use
    I am looking for ML cloud providers for my hobby projects. I found replicate dot com but I would like to try other providers. What are the best / most used/stable providers out there? I am not looking for free options and am happy to pay. submitted by /u/gyurisc [link] [comments]  ( 61 min )
    [R] On Distillation of Guided Diffusion Models: “For diffusion models trained on the latent-space (Stable Diffusion), our approach is able to generate hi-fidelity images using as few as 1-4 denoising steps, accelerating inference by >10x compared to existing methods on ImageNet and LAION datasets.”
    submitted by /u/hardmaru [link] [comments]  ( 58 min )
    OpenAI ChatGPT [R]
    From the blog: "ChatGPT model interacts in a conversational way. The dialogue format makes it possible for ChatGPT to answer followup questions, admit its mistakes, challenge incorrect premises, and reject inappropriate requests. ChatGPT is a sibling model to InstructGPT, which is trained to follow an instruction in a prompt and provide a detailed response." I tried out ChatGPT and have made a video on it. Seems impressive; it maintains context and memory well. Do check out the video: https://youtu.be/MbzGbqnTctc submitted by /u/Sea-Photo5230 [link] [comments]  ( 64 min )
    [D] Best Practices for Training From Scratch With Large Datasets
    Hello, I'm planning to train a wav2vec2 model from scratch with thousands of hours of audio. I'm used to training smaller scale models with datasets that easily fit on disk and require only 1-2 GPUs. Can anyone recommend resources for learning the modern best practices for this sort of training? My plan is to mount an external disk (or several) to hold the datasets, and attach GPUs to the VM instance. I also plan to experiment with smaller training runs on increasing fractions of the dataset to make sure it trains properly, before training with all the data. I've been working from this as a starting point: https://github.com/huggingface/transformers/tree/main/examples/pytorch/speech-pretraining Anything I'm missing or doing wrong? Thank you! submitted by /u/iocuydi [link] [comments]  ( 57 min )
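    One piece that may help with datasets that don't fit comfortably on disk: Hugging Face datasets can stream examples lazily instead of downloading everything up front (a sketch; the dataset name is just an example, and audio decoding needs the usual soundfile/librosa extras installed):

        from datasets import load_dataset

        # Stream instead of materializing the full corpus locally; examples are
        # fetched lazily, so thousands of hours of audio never need to fit on disk.
        ds = load_dataset("librispeech_asr", "clean", split="train.360", streaming=True)

        for i, example in enumerate(ds):
            audio = example["audio"]["array"]   # decoded waveform (numpy array)
            # ... feed into the wav2vec2 feature extractor / data collator here ...
            if i == 2:
                break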
    [R] Overinterpretation reveals image classification model pathologies - e.g. prominent models classify a black image w/ 4-5 gray pixels as "airplane" with >99% confidence
    submitted by /u/Ok-Cheesecake-1753 [link] [comments]  ( 65 min )
  • Open

    Illustrative notebooks in Amazon SageMaker JumpStart
    Amazon SageMaker JumpStart is the Machine Learning (ML) hub of SageMaker providing pre-trained, publicly available models for a wide range of problem types to help you get started with machine learning. JumpStart also offers example notebooks that use Amazon SageMaker features like spot instance training and experiments over a large variety of model types and […]  ( 11 min )
    Interactive data prep widget for notebooks powered by Amazon SageMaker Data Wrangler
    According to a 2020 survey of data scientists conducted by Anaconda, data preparation is one of the critical steps in machine learning (ML) and data analytics workflows, and often very time consuming for data scientists. Data scientists spend about 66% of their time on data preparation and analysis tasks, including loading (19%), cleaning (26%), and […]  ( 9 min )
    Run notebooks as batch jobs in Amazon SageMaker Studio Lab
    Recently, Amazon SageMaker Studio launched an easy way to run notebooks as batch jobs that can run on a recurring schedule. Amazon SageMaker Studio Lab also supports this feature, enabling you to run notebooks that you develop in SageMaker Studio Lab in your AWS account. This enables you to quickly scale your machine learning […]  ( 8 min )
    Organize machine learning development using shared spaces in SageMaker Studio for real-time collaboration
    Amazon SageMaker Studio is the first fully integrated development environment (IDE) for machine learning (ML). It provides a single, web-based visual interface where you can perform all ML development steps, including preparing data and building, training, and deploying models. Within an Amazon SageMaker Domain, users can provision a personal Amazon SageMaker Studio IDE application, which […]  ( 6 min )
    Minimize the production impact of ML model updates with Amazon SageMaker shadow testing
    Amazon SageMaker now allows you to compare the performance of a new version of a model serving stack with the currently deployed version prior to a full production rollout using a deployment safety practice known as shadow testing. Shadow testing can help you identify potential configuration errors and performance issues before they impact end-users. With […]  ( 11 min )
    Improve governance of your machine learning models with Amazon SageMaker
    As companies are increasingly adopting machine learning (ML) for their mainstream enterprise applications, more of their business decisions are influenced by ML models. As a result of this, having simplified access control and enhanced transparency across all your ML models makes it easier to validate that your models are performing well and take action when […]  ( 10 min )
    Define customized permissions in minutes with Amazon SageMaker Role Manager
    Administrators of machine learning (ML) workloads are focused on ensuring that users are operating in the most secure manner, striving towards a principle of least-privilege design. They have a wide variety of personas to account for, each with their own unique sets of needs, and building the right sets of permissions policies to meet […]  ( 13 min )
    Build an agronomic data platform with Amazon SageMaker geospatial capabilities
    The world is at increasing risk of global food shortage as a consequence of geopolitical conflict, supply chain disruptions, and climate change. Simultaneously, there’s an increase in overall demand from population growth and shifting diets that focus on nutrient- and protein-rich food. To meet the excess demand, farmers need to maximize crop yield and effectively […]  ( 11 min )
    Separate lines of business or teams with multiple Amazon SageMaker domains
    Amazon SageMaker Studio is a fully integrated development environment (IDE) for machine learning (ML) that enables data scientists and developers to perform every step of the ML workflow, from preparing data to building, training, tuning, and deploying models. To access SageMaker Studio, Amazon SageMaker Canvas, or other Amazon ML environments like RStudio on Amazon SageMaker, […]  ( 6 min )
    Operationalize your Amazon SageMaker Studio notebooks as scheduled notebook jobs
    Amazon SageMaker Studio provides a fully managed solution for data scientists to interactively build, train, and deploy machine learning (ML) models. In addition to the interactive ML experience, data workers also seek solutions to run notebooks as ephemeral jobs without the need to refactor code as Python modules or learn DevOps tools and best practices […]  ( 7 min )
    How xarvio Digital Farming Solutions accelerates its development with Amazon SageMaker geospatial capabilities
    This is a guest post co-written by Julian Blau, Data Scientist at xarvio Digital Farming Solutions; BASF Digital Farming GmbH, and Antonio Rodriguez, AI/ML Specialist Solutions Architect at AWS xarvio Digital Farming Solutions is a brand from BASF Digital Farming GmbH, which is part of BASF Agricultural Solutions division. xarvio Digital Farming Solutions offers precision […]  ( 13 min )
    Protecting Consumers and Promoting Innovation – AI Regulation and Building Trust in Responsible AI
    Artificial intelligence (AI) is one of the most transformational technologies of our generation and provides huge opportunities to be a force for good and drive economic growth. It can help scientists cure terminal diseases, engineers build inconceivable structures, and farmers yield more crops. AI allows us to make sense of our world as never before—and […]  ( 5 min )
  • Open

    Robotics/artificial life project ideas?
    Hey guys, I am looking for ideas to start learning robotics and reinforcement learning through a project that I have to do for my masters in AI. I am a CS major and I have experience in ML and DL, but none in robotics. From what I have been reading online, I have found it interesting to combine ROS, Gazebo and some reinforcement learning with OpenAI Gym, but I do not know what a beginner-level project with this stack could look like. I also have around a month and a half to do the project. Any other ideas related to ROS and Gazebo + any ML subfield are welcome, especially if they are related to artificial life. Thanks! submitted by /u/AcD_South [link] [comments]  ( 61 min )
    How much of a MuJoCo simulation or real life robot can you train on a 3090?
    I'm training a few algorithms from Deepmind's acme library on some MuJoCo models and I'm wondering how long this will take to train and what it's going to do to my electric bill. Is a 3090 or two enough to train something to keep its balance, or do a task, or do I need to wait for the 8090 to come out? Also, do you think there would be an advantage to writing everything in C++, from the RL algorithms in Torch to the programming of the actuators and sensors on the (real life) robot? submitted by /u/user_00000000000001 [link] [comments]  ( 53 min )
    [P] Sample Factory 2.0: A lightning-fast production-grade Deep RL library
    submitted by /u/cranthir_ [link] [comments]  ( 53 min )
    Sampled Prioritized Experience Replay (faster solution to PER)
    Dear RL community, if you have trained an off-policy algorithm using Prioritized Experience Replay (PER) with the random.choices function, you may have experienced how the training process slows down as the replay buffer grows. But if you still want to use it, because it goes through the replay buffer efficiently (maybe this was introduced before, maybe the author is first), try this: Take a bigger sample from the whole population of indices (10x batch_size). Take the priorities corresponding to this sample of indices. Take prioritized samples of transitions from the buffer (I call them records in my algorithm). For SPER I do the sampling once every 4 steps with a 2x learning rate, and don't use a batch size higher than 128 or a sample size higher than 10x-20x. This part of the LLPG project was done without university or government funding, solely funded by Jesus Christ, and is used in the code (https://github.com/timurgepard/LLPG) with NumPy's new random generator. submitted by /u/Timur_1988 [link] [comments]  ( 56 min )
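    A minimal NumPy sketch of that two-stage trick (an interpretation of the post, not the author's exact code; the priority exponent alpha is an assumption carried over from standard PER):

        import numpy as np

        def sper_sample(priorities, batch_size, presample_factor=10, alpha=0.6, rng=None):
            # Stage 1: uniformly pre-sample a small pool of buffer indices
            # (with replacement, so duplicates are possible but rare).
            # Stage 2: draw the batch proportionally to priority within that pool,
            # avoiding building the full-buffer distribution on every draw.
            rng = np.random.default_rng() if rng is None else rng
            n = len(priorities)
            pool = rng.integers(0, n, size=min(n, presample_factor * batch_size))
            p = priorities[pool] ** alpha
            p /= p.sum()
            return pool[rng.choice(len(pool), size=batch_size, replace=False, p=p)]

        priorities = np.random.default_rng(0).exponential(1.0, size=1_000_000)
        batch_idx = sper_sample(priorities, batch_size=128)
        print(batch_idx[:5])   # indices of transitions/records to replay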
    Seeking mentor to help me learn and understand RL!
    Hello all, I am a beginner in RL requesting a mentor to help me learn and understand policy gradient methods, invalid action masking, and reward shaping applied to Wordle. I have successfully built a custom Wordle Gym environment that mimics the rules of Wordle. I am now trying to train an agent to strategize how to make optimal guesses that maximize its likelihood of guessing the correct answer. All of my code is developed in Colab, and the environment passes the check_env checks in stable_baselines for Gym compliance. Feel free to DM me if you have any questions or if you are happy to help me out :) Thanks! submitted by /u/WirrryWoo [link] [comments]  ( 61 min )
    In reinforcement learning, slower networks can learn faster
    submitted by /u/amazonscience [link] [comments]  ( 58 min )
    Multi-Agent RL algorithms for discrete actions and partially-observable environments
    I need advice on which algorithms I should try to implement for a multi-agent environment where: - Agents have partial observations, i.e. they don't have a full view of their local environment (the portion of the environment around them, or the one they're directly responsible for) - The reward is strongly dependent on the actual state of the environment, and therefore noisy from the point of view of the agents that cannot directly observe it - Agents can share their observations - The number of agents can be up to 20, but each agent is mostly affected by its neighbors - Actions are discrete, around 8 actions per agent. I'm currently trying QMIX, but it does not learn easily, and I am not confident it can scale to a high number of agents. submitted by /u/fedetask [link] [comments]  ( 60 min )
    PhD at a biotech company
    Hey all, I’m a software engineer at a smaller biotech company where we specialise in building equipment and software so that all the logging and data are available in one place (because biologists track their data by hand). I have an MSc in AI where I did my master's project in lifelong RL. I've really been wanting to do my PhD for the last little while, but have struggled to get interviews or my foot in the door at some universities because my grades are a bit too low. I did an intense bachelor's in bioengineering and computer science while dealing with undiagnosed ADHD and ASD, up until I started working at this company, hence the grades. My question is this: do you see potential in applying RL to the biotech production process setting? The only thing I can see it being applied to is control systems, but I wanted to see if there's potential before I approach my CEO. The company has really strong ties to the best technical university in Europe, so if they agree to it, I wouldn't struggle to find a professor here who could act as the university supervisor. Thanks for reading and sorry for the info dump. submitted by /u/uniqueusername_here_ [link] [comments]  ( 60 min )
    Q Learning Sum of Probabilities for Actions
    Hey guys, this might seem like a noob question but I am struggling with this a bit. I don't want to copy and paste people's code, so I tend to implement algorithms myself from scratch to get a better understanding. Anyway, this is the Bellman equation (which seems to take many forms around a common base): https://preview.redd.it/4xfj0ghuo83a1.jpg?width=732&format=pjpg&auto=webp&s=045b07d8d909d271041d7821fe86e20377805c8b The sum over P(s,a,s') is the part that confuses me. I have the deterministic part working fine. However, without hard-coding the probabilities, I don't understand how to figure them out. I'm told the sum of probabilities is there to account for randomness. So I guess my question is: does the sum over P need to be known for it to be implemented, or is there a way to determine/calculate that there is a random element? Most content uses the example of a move with an 80% chance of going the way you want and a 10% chance of slipping to either side, which ends up as 0.8 * s1 + 0.1 * s2 + 0.1 * s3. That makes complete sense, but again, how do you get those values without hard-coding them in? (By hard-coding I mean you already know the probabilities and insert them via some variable.) Thanks, submitted by /u/Vunpac [link] [comments]  ( 58 min )
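    One way to see the answer: the explicit sum over P(s,a,s') only has to be computed in dynamic-programming methods such as value iteration, where the model is given. Model-free Q-learning never evaluates P; each observed transition is itself a draw from P(s,a,·), so averaging sample-based updates estimates the expectation implicitly. A minimal tabular sketch with a toy environment that keeps its own 80/10/10 slip probabilities hidden from the agent:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.1, 0.95, 0.1

# A toy stochastic chain: the *environment* knows the 80/10/10
# slip probabilities, but the agent below never reads them.
def step(s, a):
    intended = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    slip = rng.random()
    if slip < 0.8:
        s_next = intended            # moves as commanded 80% of the time
    elif slip < 0.9:
        s_next = s                   # slips in place 10% of the time
    else:
        s_next = max(s - 1, 0)       # slips backward 10% of the time
    r = 1.0 if s_next == n_states - 1 else 0.0
    return s_next, r

s = 0
for _ in range(50_000):
    a = int(rng.integers(n_actions)) if rng.random() < eps else int(Q[s].argmax())
    s_next, r = step(s, a)
    # Sample-based backup: each transition is one draw from P(s,a,.),
    # so averaging many updates estimates the sum over P implicitly.
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    s = s_next
```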
    Scaling a set of numbers while preserving the sum
    Hello everyone, I've been googling for quite some time but cannot seem to find an appropriate solution. I want to scale a set of numbers whose sum is 0 and whose current range is, for example, [-0.1666, 0.8334], to a new range [-1, 1], while preserving the sum of 0 and, of course, the relative magnitude of each number. I've been trying different things, but am not even sure it's mathematically possible. I would really appreciate some help (a generic solution would be much appreciated)! Thanks a lot in advance. submitted by /u/Hot-Chair-8304 [link] [comments]  ( 57 min )
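    One observation that may settle this: any additive shift b changes the sum by n·b, so the only linear maps that preserve a zero sum are pure scalings, and a pure scaling cannot hit both -1 and 1 unless the original range is already symmetric about 0. Dividing by max(|x_i|) is therefore the best option: it keeps the sum at 0, preserves relative magnitudes, fits everything inside [-1, 1], and pins one endpoint. A quick numpy illustration with made-up numbers close to the post's range:

```python
import numpy as np

x = np.array([-0.1666, -0.1666, -0.1666, -0.1666, -0.1666, 0.8330])
# sums to ~0; range is roughly [-0.1666, 0.8334] as in the post

y = x / np.abs(x).max()   # pure scaling: sum stays 0, ratios preserved
print(y.sum())            # ~0.0 (up to float error)
print(y.min(), y.max())   # -0.2, 1.0 -- inside [-1, 1], max pinned at 1
```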
    Augmenting a Model to traditionally “model free algorithms”
    Can you add a model of your environment to traditionally model-free algorithms? How do you know when you're "doing" model-based RL versus just crafting a specific reward function? For example, if the goal is to fully light a field using an array of lights, and you model the light transfer from each light to the surface of the field and come up with some equation that describes the uniformity of light on the field in order to craft rewards based on the level of uniformity, are you then creating a model or just crafting a reward function? Edit: I think perhaps I had a misconception about model-based vs. model-free. I thought model-free meant there was no model of the environment at all, but then the agent would have nothing to interact with. It actually seems that model-based reinforcement learning is when the agent itself tries to learn a model, rather than learning purely from experience. If this understanding in the edit is correct, I think this question no longer makes sense. submitted by /u/tmt22459 [link] [comments]  ( 54 min )
    how to take gradient of value function?
    In Sutton's REINFORCE with baseline, if my value function is linearly approximated, how do I take its gradient? submitted by /u/Mammoth-Refuse5846 [link] [comments]  ( 63 min )
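    For the linear case the answer is short: the gradient of a linear value function is just the feature vector, which is what makes the REINFORCE-with-baseline update in Sutton and Barto so cheap. In that notation:

```latex
\hat{v}(s, \mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s)
\quad\Longrightarrow\quad
\nabla_{\mathbf{w}}\, \hat{v}(s, \mathbf{w}) = \mathbf{x}(s),
\qquad\text{so}\qquad
\delta = G_t - \hat{v}(S_t, \mathbf{w}), \quad
\mathbf{w} \leftarrow \mathbf{w} + \alpha^{\mathbf{w}}\, \delta\, \mathbf{x}(S_t).
```

    In code, the whole "gradient computation" amounts to reading off the feature vector of the visited state; the per-step update is one dot product and one vector addition.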
  • Open

    HXOUSE LABS - Faking it, Making it workshop.
    Hey everyone, I wanted to share this workshop happening next weekend in Toronto @ Hxouse. You can find the link to the application below. You still have about a week to apply, and they are looking for people from all across the creative and tech sectors. INTELLIGENT MACHINERY HXOUSE LABS PRESENTS INTELLIGENT MACHINERY, a program focused on artificial intelligence and machine learning. Composed of panels and technical and philosophical workshops, the program touches on everything from current innovations in natural language image generation and automated vehicles, to future scenarios dealing with artificial general intelligence and superintelligence. INTELLIGENT MACHINERY will welcome talented individuals from diverse backgrounds and experience levels to participate in groundbreaking workshops developed in collaboration with the world's leading companies. Through this novel programming, HXOUSE LABS will enable and activate a new generation of innovation in the world's most important technical fields. FAKING IT, MAKING IT Faking It, Making It is a technical workshop that will explore the latest deepfake technologies with a pioneer in the field, Carl Bogan, a.k.a. Myster Giraffe. Carl will delve into his creative process, from ideation and narrative building, preparing assets, sourcing content, and training deepfake models, to processing faked footage and compositing final content. Deepfakes have been in the news for around a decade; first known for their nefarious use in pornography and espionage, the technology has developed into an everyday part of our entertainment through film, television, and online content creation. The ambition of the workshop is to equip Tenants with the knowledge and experience to develop in this new, exciting, and controversial creative field. This is a two-day workshop that will take place on December 10th and 11th, from 9am to 6pm. https://labs.hxouse.com/ submitted by /u/No_Candidate4104 [link] [comments]  ( 47 min )
    The Splendor of Color Kaleidoscope Video v1.7 Colorful Psychedelic Fract...
    submitted by /u/LordPewPew777 [link] [comments]  ( 46 min )
    If used correctly, math in your AI animations can create some wild results (guide in the comments)
    submitted by /u/LorestForest [link] [comments]  ( 49 min )
    You can try out multiple styles at a time on synesthetic.ai (free generation included)
    https://preview.redd.it/h2y0nag9yb3a1.png?width=1732&format=png&auto=webp&s=21c184ae67e9a9fdb2c1bc48d403558710160c5d submitted by /u/notrealAI [link] [comments]  ( 46 min )
    Finally. A feminist AI bot.
    I upgraded Princess Peach. Lvl up to Nectarine! Here. Princess Peach goes feminist. submitted by /u/garfield5684 [link] [comments]  ( 45 min )
    Small chat with an ai (ai dungeon)
    submitted by /u/yeti9876 [link] [comments]  ( 49 min )
    Any idea which AI app they used to get these pictures?
    submitted by /u/MC_Languste [link] [comments]  ( 44 min )
    Probably the Fastest Open Source Stable Diffusion is released
    submitted by /u/Just0by [link] [comments]  ( 51 min )
    Which of the three books do you recommend?
    View Poll submitted by /u/sergiCrack9 [link] [comments]  ( 47 min )
    A.I conversion course
    So I'm planning to do a conversion course in Data Science and AI. I currently have a bachelor's degree in Aeronautics. I wanted to know whether it's actually worth doing this course and whether it will affect any jobs I could get in this field. submitted by /u/Keith__2510 [link] [comments]  ( 46 min )
    The real “Bitter Lesson” of artificial intelligence
    submitted by /u/bendee983 [link] [comments]  ( 47 min )
    Here's What You Should Know to Launch Your First AI Pilot Project
    Are you looking to adopt AI into your business but not sure how? A strategically chosen AI pilot project can give you the insights you need. Read here: https://www.artiba.org/blog/heres-what-you-should-know-to-launch-your-first-ai-pilot-project submitted by /u/Emily-joe [link] [comments]  ( 46 min )
    AIxhuman art
    Hello you beautiful people, I have spent a lot of time on DALL.E and have just started an Instagram page. The art is magical. I can't stand to only see fake bots commenting. I need some support here. Instagram @ B0klava submitted by /u/yourclotheswack [link] [comments]  ( 46 min )
    [P] New Features of Image Segmentation Project PaddleSeg (6k stars)
    Hi all, PaddleSeg, an awesome image segmentation project, has released version 2.7, which brings several new features. Hope this is of some help to you. Github: https://github.com/PaddlePaddle/PaddleSeg New features: Release of PP-MattingV2, a real-time human matting model with SOTA performance; compared to MODNet, the mean error is reduced by 17.91% and the inference speed is improved by 44.6% on GPU. Release of MedicalSegV2, a superior 3D medical image segmentation solution, including an intelligent annotation toolkit called EISeg-Med3D, several state-of-the-art models, and an optimized nnUNet-D with high performance. Release of RTFormer, a real-time semantic segmentation model accepted at NeurIPS 2022. Addition of 3 semantic segmentation models, i.e., UHRNet, TopFormer, and MscaleOCRNet-PSA. https://i.redd.it/m6768lkt6a3a1.gif https://i.redd.it/89ai0lj57a3a1.gif submitted by /u/Effective_Tax_2096 [link] [comments]  ( 47 min )
    Inworld AI launches best conversational AI / characters, now with web-based interactions
    submitted by /u/general_gengen [link] [comments]  ( 47 min )
    Pretty sure these AI avatars pass the Turing Test.
    https://www.producthunt.com/posts/inworld-arcade https://reddit.com/link/z9is1s/video/6vhj0frs093a1/player submitted by /u/garfield5684 [link] [comments]  ( 47 min )
    The Raven by Edgar Allan Poe Brought to Life By AI Generated Art Animations
    submitted by /u/Available_Tadpole829 [link] [comments]  ( 47 min )
  • Open

    Talking to Robots in Real Time
    Posted by Corey Lynch, Research Scientist, and Ayzaan Wahid, Research Engineer, Robotics at Google. A grand vision in robot learning, going back to the SHRDLU experiments in the late 1960s, is that of helpful robots that inhabit human spaces and follow a wide variety of natural language commands. Over the last few years, there have been significant advances in the application of machine learning (ML) for instruction following, both in simulation and in real-world systems. Recent PaLM-SayCan work has produced robots that leverage language models to plan long-horizon behaviors and reason about abstract goals. Code as Policies has shown that code-generating language models combined with pre-trained perception systems can produce language-conditioned policies for zero-shot robot manipulation…  ( 92 min )
  • Open

    Meet the Omnivore: Cloud Architect Takes Infrastructure Visualization to New Heights With NVIDIA Omniverse
    As a Microsoft Certified Azure cloud specialist and DevOps automation engineer, Gavin Stevens is deeply in tune with cloud architect workflows. The post Meet the Omnivore: Cloud Architect Takes Infrastructure Visualization to New Heights With NVIDIA Omniverse appeared first on NVIDIA Blog.  ( 6 min )
    Cheers to AI: Monarch Tractor Launches First Commercially Available Electric, ‘Driver Optional’ Smart Tractor
    Livermore, Calif., renowned for research and vineyards, is plowing in a new distinction: the birthplace of the first commercially available smart tractor. Local startup Monarch Tractor has announced the first of six Founder Series MK-V tractors are rolling off the production line at its headquarters. Constellation Brands, a leading wine and spirits producer and beer Read article > The post Cheers to AI: Monarch Tractor Launches First Commercially Available Electric, ‘Driver Optional’ Smart Tractor appeared first on NVIDIA Blog.  ( 6 min )
    GFN Thursday Dashes Into December With 22 New Games, Including ‘Marvel’s Midnight Suns’ Streaming Soon
    It’s a new month, which means GeForce NOW’s got the list of 22 new games arriving in December. Rise up for Marvel’s Midnight Suns, from publisher 2K Games, streaming on GeForce NOW later this month. Then get ready to move out, members. Battlefield 2042 is the latest game from the Electronic Arts catalog streaming on Read article > The post GFN Thursday Dashes Into December With 22 New Games, Including ‘Marvel’s Midnight Suns’ Streaming Soon appeared first on NVIDIA Blog.  ( 6 min )
  • Open

    AI Advent Calendar 2022
    Please enjoy this advent calendar, generated and illustrated with the help of three machine learning models (GPT-3, DALL-E, and Midjourney) Full door descriptions Eggnog as far as the eye could see The fantastical lion of Mor-Bollox Saturated Red Turkeys. Blue reindeer (they're bouncing) Candy Cane Palm Trees A  ( 6 min )
    Bonus: rejected advent calendar doors
    AI Weirdness: the strange side of machine learning  ( 2 min )
  • Open

    Telescopes, awk, and learning
    Here’s a quote I think about often: “It is faster to make a four-inch mirror and then a six-inch mirror than to make a six-inch mirror.” — Bill McKeenan, Thompson’s law of telescopes If your goal is to make a six-inch mirror, why make a four-inch mirror first? From a reductionist perspective this makes no […] Telescopes, awk, and learning first appeared on John D. Cook.  ( 8 min )
    The messy version of Napoleon’s theorem
    Napoleon’s theorem is usually presented as I presented it in the previous post. You start with a triangle (solid blue) and add equilateral triangles (dashed green) on the outside of the triangle. When you connect the centroids of these triangles you get a (dotted red) equilateral triangle. But Napoleon’s theorem is more general than this. […] The messy version of Napoleon’s theorem first appeared on John D. Cook.  ( 4 min )
  • Open

    Large language models help decipher clinical notes
    Researchers used a powerful deep-learning model to extract important data from electronic health records that could assist with personalized medicine.  ( 10 min )
  • Open

    How To Improve Water Efficiency for UK Businesses (2023)
    Water is one of the most precious natural resources, so it's important to do your bit to conserve it. The last thing anyone wants is to find out they are unnecessarily overpaying when it comes to the bills. Taking note of where your company's water is going and how you might be wasting it… Read More »How To Improve Water Efficiency for UK Businesses (2023) The post How To Improve Water Efficiency for UK Businesses (2023) appeared first on Data Science Central.  ( 20 min )
  • Open

    Nonconvex Matrix Factorization is Geodesically Convex: Global Landscape Analysis for Fixed-rank Matrix Optimization From a Riemannian Perspective. (arXiv:2209.15130v2 [math.OC] UPDATED)
    We study a general matrix optimization problem with a fixed-rank positive semidefinite (PSD) constraint. We perform the Burer-Monteiro factorization and consider a particular Riemannian quotient geometry in a search space that has a total space equipped with the Euclidean metric. When the original objective f satisfies standard restricted strong convexity and smoothness properties, we characterize the global landscape of the factorized objective under the Riemannian quotient geometry. We show the entire search space can be divided into three regions: (R1) the region near the target parameter of interest, where the factorized objective is geodesically strongly convex and smooth; (R2) the region containing neighborhoods of all strict saddle points; (R3) the remaining regions, where the factorized objective has a large gradient. To the best of our knowledge, this is the first global landscape analysis of the Burer-Monteiro factorized objective under the Riemannian quotient geometry. Our results provide a fully geometric explanation for the superior performance of vanilla gradient descent under the Burer-Monteiro factorization. When f satisfies a weaker restricted strict convexity property, we show there exists a neighborhood near local minimizers such that the factorized objective is geodesically convex. To prove our results, we provide a comprehensive landscape analysis of a matrix factorization problem with a least squares objective, which serves as a critical bridge. Our conclusions are also based on a result of independent interest stating that the geodesic ball centered at Y with a radius 1/3 of the least singular value of Y is a geodesically convex set under the Riemannian quotient geometry, which, as a corollary, also implies a quantitative bound of the convexity radius in the Bures-Wasserstein space. The convexity radius obtained is sharp up to constants.  ( 3 min )
    Behavior Estimation from Multi-Source Data for Offline Reinforcement Learning. (arXiv:2211.16078v1 [cs.LG])
    Offline reinforcement learning (RL) has received rising interest due to its appealing data efficiency. The present study addresses behavior estimation, a task that lays the foundation of many offline RL algorithms. Behavior estimation aims at estimating the policy with which training data are generated. In particular, this work considers a scenario where the data are collected from multiple sources. In this case, neglecting data heterogeneity, existing approaches for behavior estimation suffer from behavior misspecification. To overcome this drawback, the present study proposes a latent variable model to infer a set of policies from data, which allows an agent to use as behavior policy the policy that best describes a particular trajectory. This model provides an agent with a fine-grained characterization of multi-source data and helps it overcome behavior misspecification. This work also proposes a learning algorithm for this model and illustrates its practical usage by extending an existing offline RL algorithm. Lastly, with extensive evaluation this work confirms the existence of behavior misspecification and the efficacy of the proposed model.  ( 2 min )
    Causal Inference with Conditional Instruments using Deep Generative Models. (arXiv:2211.16246v1 [cs.LG])
    The instrumental variable (IV) approach is a widely used way to estimate the causal effects of a treatment on an outcome of interest from observational data with latent confounders. A standard IV is expected to be related to the treatment variable and independent of all other variables in the system. However, it is challenging to search for a standard IV from data directly due to the strict conditions. The conditional IV (CIV) method has been proposed to allow a variable to be an instrument conditioning on a set of variables, allowing a wider choice of possible IVs and enabling broader practical applications of the IV approach. Nevertheless, there is no data-driven method to discover a CIV and its conditioning set directly from data. To fill this gap, in this paper, we propose to learn the representations of the information of a CIV and its conditioning set from data with latent confounders for average causal effect estimation. By taking advantage of deep generative models, we develop a novel data-driven approach for simultaneously learning the representation of a CIV from measured variables and generating the representation of its conditioning set given measured variables. Extensive experiments on synthetic and real-world datasets show that our method outperforms the existing IV methods.
    Model Extraction Attack against Self-supervised Speech Models. (arXiv:2211.16044v1 [cs.SD])
    Self-supervised learning (SSL) speech models generate meaningful representations of given clips and achieve incredible performance across various downstream tasks. Model extraction attack (MEA) often refers to an adversary stealing the functionality of the victim model with only query access. In this work, we study the MEA problem against SSL speech models with a small number of queries. We propose a two-stage framework to extract the model. In the first stage, SSL is conducted on a large-scale unlabeled corpus to pre-train a small speech model. In the second stage, we actively sample a small portion of clips from the unlabeled corpus and query the target model with these clips to acquire their representations as labels for the small model's second-stage training. Experiment results show that our sampling methods can effectively extract the target model without knowing any information about its model architecture.
    ImmunoLingo: Linguistics-based formalization of the antibody language. (arXiv:2209.12635v2 [q-bio.QM] UPDATED)
    Apparent parallels between natural language and biological sequence have led to a recent surge in the application of deep language models (LMs) to the analysis of antibody and other biological sequences. However, the lack of a rigorous linguistic formalization of biological sequence languages, which would define basic components such as the lexicon (i.e., the discrete units of the language) and the grammar (i.e., the rules that link sequence well-formedness, structure, and meaning), has led to largely domain-unspecific applications of LMs, which do not take into account the underlying structure of the biological sequences studied. A linguistic formalization, on the other hand, establishes linguistically-informed and thus domain-adapted components for LM applications. It would facilitate a better understanding of how differences and similarities between natural language and biological sequences influence the quality of LMs, which is crucial for the design of interpretable models with extractable sequence-function relationship rules, such as the ones underlying the antibody specificity prediction problem. Deciphering the rules of antibody specificity is crucial to accelerating rational and in silico biotherapeutic drug design. Here, we formalize the properties of the antibody language and thereby establish not only a foundation for the application of linguistic tools in adaptive immune receptor analysis but also for the systematic immunolinguistic studies of immune receptor specificity in general.
    Text Representation Enrichment Utilizing Graph based Approaches: Stock Market Technical Analysis Case Study. (arXiv:2211.16103v1 [cs.LG])
    Graph neural networks (GNNs) have been utilized for various natural language processing (NLP) tasks lately. The ability to encode corpus-wide features in graph representation made GNN models popular in various tasks such as document classification. One major shortcoming of such models is that they mainly work on homogeneous graphs, while representing text datasets as graphs requires several node types, which leads to a heterogeneous schema. In this paper, we propose a transductive hybrid approach composed of an unsupervised node representation learning model followed by a node classification/edge prediction model. The proposed model is capable of processing heterogeneous graphs to produce unified node embeddings which are then utilized for node classification or link prediction as the downstream task. The proposed model is developed to classify stock market technical analysis reports, which to our knowledge is the first work in this domain. Experiments, which are carried out using a constructed dataset, demonstrate the ability of the model in embedding extraction and the downstream tasks.
    A Revenue Function for Comparison-Based Hierarchical Clustering. (arXiv:2211.16459v1 [cs.LG])
    Comparison-based learning addresses the problem of learning when, instead of explicit features or pairwise similarities, one only has access to comparisons of the form: Object $A$ is more similar to $B$ than to $C$. Recently, it has been shown that, in Hierarchical Clustering, single and complete linkage can be directly implemented using only such comparisons, while several algorithms have been proposed to emulate the behaviour of average linkage. Hence, finding hierarchies (or dendrograms) using only comparisons is a well-understood problem. However, evaluating their meaningfulness when no ground-truth nor explicit similarities are available remains an open question. In this paper, we bridge this gap by proposing a new revenue function that allows one to measure the goodness of dendrograms using only comparisons. We show that this function is closely related to Dasgupta's cost for hierarchical clustering that uses pairwise similarities. On the theoretical side, we use the proposed revenue function to resolve the open problem of whether one can approximately recover a latent hierarchy using few triplet comparisons. On the practical side, we present principled algorithms for comparison-based hierarchical clustering based on the maximisation of the revenue and we empirically compare them with existing methods.
    MC-GEN:Multi-level Clustering for Private Synthetic Data Generation. (arXiv:2205.14298v2 [cs.LG] UPDATED)
    With the development of machine learning and data science, data sharing is very common between companies and research institutes to avoid data scarcity. However, sharing original datasets that contain private information can cause privacy leakage. A reliable solution is to utilize private synthetic datasets which preserve the statistical information of the original datasets. In this paper, we propose MC-GEN, a privacy-preserving synthetic data generation method under differential privacy guarantees for machine learning classification tasks. MC-GEN applies multi-level clustering and a differentially private generative model to improve the utility of synthetic data. In the experimental evaluation, we evaluated the effects of parameters and the effectiveness of MC-GEN. The results showed that MC-GEN can achieve significant effectiveness under certain privacy guarantees on multiple classification tasks. Moreover, we compare MC-GEN with three existing methods. The results showed that MC-GEN outperforms the other methods in terms of utility.
    Balanced Semi-Supervised Generative Adversarial Network for Damage Assessment from Low-Data Imbalanced-Class Regime. (arXiv:2211.15961v1 [cs.LG])
    In recent years, applying deep learning (DL) to assess structural damages has gained growing popularity in vision-based structural health monitoring (SHM). However, both data deficiency and class-imbalance hinder the wide adoption of DL in practical applications of SHM. Common mitigation strategies include transfer learning, over-sampling, and under-sampling, yet these ad-hoc methods only provide limited performance boost that varies from one case to another. In this work, we introduce one variant of the Generative Adversarial Network (GAN), named the balanced semi-supervised GAN (BSS-GAN). It adopts the semi-supervised learning concept and applies balanced-batch sampling in training to resolve low-data and imbalanced-class problems. A series of computer experiments on concrete cracking and spalling classification were conducted under the low-data imbalanced-class regime with limited computing power. The results show that the BSS-GAN is able to achieve better damage detection in terms of recall and $F_\beta$ score than other conventional methods, indicating its state-of-the-art performance.
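    The balanced-batch sampling ingredient is independent of the GAN machinery and easy to state: each training batch draws the same number of examples from every class, so minority classes are never underrepresented within a step. A hedged numpy sketch (the BSS-GAN paper's actual sampler may differ in its details):

```python
import numpy as np

rng = np.random.default_rng()

def balanced_batch(labels, batch_size, n_classes):
    """Draw batch_size indices with equal counts per class
    (sampling with replacement, so small classes are re-used)."""
    per_class = batch_size // n_classes
    idx = []
    for c in range(n_classes):
        pool = np.flatnonzero(labels == c)
        idx.append(rng.choice(pool, size=per_class, replace=True))
    return np.concatenate(idx)

# Usage: 1000 'intact', 50 'cracked', 20 'spalled' examples.
labels = np.array([0] * 1000 + [1] * 50 + [2] * 20)
batch = balanced_batch(labels, batch_size=30, n_classes=3)
# 10 indices per class, regardless of the 50:2.5:1 imbalance.
```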
    Diagnosing and Fixing Manifold Overfitting in Deep Generative Models. (arXiv:2204.07172v4 [stat.ML] UPDATED)
    Likelihood-based, or explicit, deep generative models use neural networks to construct flexible high-dimensional densities. This formulation directly contradicts the manifold hypothesis, which states that observed data lies on a low-dimensional manifold embedded in high-dimensional ambient space. In this paper we investigate the pathologies of maximum-likelihood training in the presence of this dimensionality mismatch. We formally prove that degenerate optima are achieved wherein the manifold itself is learned but not the distribution on it, a phenomenon we call manifold overfitting. We propose a class of two-step procedures consisting of a dimensionality reduction step followed by maximum-likelihood density estimation, and prove that they recover the data-generating distribution in the nonparametric regime, thus avoiding manifold overfitting. We also show that these procedures enable density estimation on the manifolds learned by implicit models, such as generative adversarial networks, hence addressing a major shortcoming of these models. Several recently proposed methods are instances of our two-step procedures; we thus unify, extend, and theoretically justify a large class of models.
    Using a Conditional Generative Adversarial Network to Control the Statistical Characteristics of Generated Images for IACT Data Analysis. (arXiv:2211.15807v1 [astro-ph.IM])
    Generative adversarial networks are a promising tool for image generation in the astronomy domain. Of particular interest are conditional generative adversarial networks (cGANs), which allow you to divide images into several classes according to the value of some property of the image, and then specify the required class when generating new images. In the case of images from Imaging Atmospheric Cherenkov Telescopes (IACTs), an important property is the total brightness of all image pixels (image size), which is in direct correlation with the energy of primary particles. We used a cGAN technique to generate images similar to those obtained in the TAIGA-IACT experiment. As a training set, we used a set of two-dimensional images generated using the TAIGA Monte Carlo simulation software. We artificially divided the training set into 10 classes, sorting images by size and defining the boundaries of the classes so that the same number of images falls into each class. These classes were used while training our network. The paper shows that for each class, the size distribution of the generated images is close to normal, with the mean value located approximately in the middle of the corresponding class. We also show that for the generated images, the total image size distribution obtained by summing the distributions over all classes is close to the original distribution of the training set. The results obtained will be useful for more accurate generation of realistic synthetic images similar to the ones taken by IACTs.
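    The class-construction step described here (sort images by size, then choose boundaries so every class holds the same number of images) is equal-frequency binning, which numpy expresses in a couple of lines; the sketch below uses stand-in data rather than TAIGA simulations:

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = rng.lognormal(mean=3.0, sigma=1.0, size=10_000)  # stand-in image sizes

n_classes = 10
# Quantile edges => equal-frequency bins: each class gets the same count.
edges = np.quantile(sizes, np.linspace(0, 1, n_classes + 1))
labels = np.clip(np.digitize(sizes, edges[1:-1]), 0, n_classes - 1)

print(np.bincount(labels))  # ~1000 images in each of the 10 classes
```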
    Discovering New Classes in Tabular Data. (arXiv:2211.16352v1 [cs.LG])
    In Novel Class Discovery (NCD), the goal is to find new classes in an unlabeled set given a labeled set of known but different classes. While NCD has recently gained attention from the community, no framework has yet been proposed for heterogeneous tabular data, despite being a very common representation of data. In this paper, we propose TabularNCD, a new method for discovering novel classes in tabular data. We show a way to extract knowledge from already known classes to guide the discovery process of novel classes in the context of tabular data which contains heterogeneous variables. A part of this process is done by a new method for defining pseudo labels, and we follow recent findings in Multi-Task Learning to optimize a joint objective function. Our method demonstrates that NCD is not only applicable to images but also to heterogeneous tabular data.
    POViT: Vision Transformer for Multi-objective Design and Characterization of Nanophotonic Devices. (arXiv:2205.09045v2 [cs.LG] UPDATED)
    We solve a fundamental challenge in semiconductor IC design: the fast and accurate characterization of nanoscale photonic devices. Much like the fusion between AI and EDA, many efforts have been made to apply DNNs such as convolutional neural networks (CNNs) to prototype and characterize next-gen optoelectronic devices commonly found in photonic integrated circuits (PICs) and LiDAR. These prior works generally strive to predict the quality factor (Q) and modal volume (V) of, for instance, photonic crystals with ultra-high accuracy and speed. However, state-of-the-art models are still far from being directly applicable in the real world: e.g., the correlation coefficient of V ($V_{coeff}$) is only about 80%, which is much lower than what it takes to generate reliable and reproducible nanophotonic designs. Recently, attention-based transformer models have attracted extensive interest and been widely used in CV and NLP. In this work, we propose the first-ever Transformer model (POViT) to efficiently design and simulate semiconductor photonic devices with multiple objectives. Unlike the standard Vision Transformer (ViT), we supplied photonic crystals as data input and changed the activation layer from GELU to an absolute-value function (ABS). Our experiments show that POViT exceeds results reported by previous models significantly. The correlation coefficient $V_{coeff}$ increases by over 12% (i.e., to 92.0%) and the prediction errors of Q are reduced by an order of magnitude, among several other key metric improvements. Our work has the potential to drive the expansion of EDA to fully automated photonic design. The complete dataset and code will be released to aid researchers endeavoring in the interdisciplinary field of physics and computer science.
    Outlier-Robust Sparse Mean Estimation for Heavy-Tailed Distributions. (arXiv:2211.16333v1 [cs.DS])
    We study the fundamental task of outlier-robust mean estimation for heavy-tailed distributions in the presence of sparsity. Specifically, given a small number of corrupted samples from a high-dimensional heavy-tailed distribution whose mean $\mu$ is guaranteed to be sparse, the goal is to efficiently compute a hypothesis that accurately approximates $\mu$ with high probability. Prior work had obtained efficient algorithms for robust sparse mean estimation of light-tailed distributions. In this work, we give the first sample-efficient and polynomial-time robust sparse mean estimator for heavy-tailed distributions under mild moment assumptions. Our algorithm achieves the optimal asymptotic error using a number of samples scaling logarithmically with the ambient dimension. Importantly, the sample complexity of our method is optimal as a function of the failure probability $\tau$, having an additive $\log(1/\tau)$ dependence. Our algorithm leverages the stability-based approach from the algorithmic robust statistics literature, with crucial (and necessary) adaptations required in our setting. Our analysis may be of independent interest, involving the delicate design of a (non-spectral) decomposition for positive semi-definite matrices satisfying certain sparsity properties.
    DMFormer: Closing the Gap Between CNN and Vision Transformers. (arXiv:2209.07738v3 [cs.CV] UPDATED)
    Vision transformers have shown excellent performance in computer vision tasks. As the computation cost of their self-attention mechanism is expensive, recent works tried to replace the self-attention mechanism in vision transformers with convolutional operations, which are more efficient with built-in inductive bias. However, these efforts either ignore multi-level features or lack dynamic properties, leading to sub-optimal performance. In this paper, we propose a Dynamic Multi-level Attention mechanism (DMA), which captures different patterns of input images with multiple kernel sizes and enables input-adaptive weights with a gating mechanism. Based on DMA, we present an efficient backbone network named DMFormer. DMFormer adopts the overall architecture of vision transformers while replacing the self-attention mechanism with our proposed DMA. Extensive experimental results on the ImageNet-1K and ADE20K datasets demonstrate that DMFormer achieves state-of-the-art performance, outperforming similar-sized vision transformers (ViTs) and convolutional neural networks (CNNs).
    Accelerated Nonnegative Tensor Completion via Integer Programming. (arXiv:2211.15770v1 [cs.LG])
    The problem of tensor completion has applications in healthcare, computer vision, and other domains. However, past approaches to tensor completion have faced a tension in that they either have polynomial-time computation but require exponentially more samples than the information-theoretic rate, or they use fewer samples but require solving NP-hard problems for which there are no known practical algorithms. A recent approach, based on integer programming, resolves this tension for nonnegative tensor completion. It achieves the information-theoretic sample complexity rate and deploys the Blended Conditional Gradients algorithm, which requires a linear (in numerical tolerance) number of oracle steps to converge to the global optimum. The tradeoff in this approach is that, in the worst case, the oracle step requires solving an integer linear program. Despite this theoretical limitation, numerical experiments show that this algorithm can, on certain instances, scale up to 100 million entries while running on a personal computer. The goal of this paper is to further enhance this algorithm, with the intention to expand both the breadth and scale of instances that can be solved. We explore several variants that can maintain the same theoretical guarantees as the algorithm, but offer potentially faster computation. We consider different data structures, acceleration of gradient descent steps, and the use of the Blended Pairwise Conditional Gradients algorithm. We describe the original approach and these variants, and conduct numerical experiments in order to explore various tradeoffs in these algorithmic design choices.
    Disentangling the Mechanisms Behind Implicit Regularization in SGD. (arXiv:2211.15853v1 [cs.LG])
    A number of competing hypotheses have been proposed to explain why small-batch Stochastic Gradient Descent (SGD) leads to improved generalization over the full-batch regime, with recent work crediting the implicit regularization of various quantities throughout training. However, to date, empirical evidence assessing the explanatory power of these hypotheses is lacking. In this paper, we conduct an extensive empirical evaluation, focusing on the ability of various theorized mechanisms to close the small-to-large batch generalization gap. Additionally, we characterize how the quantities that SGD has been claimed to (implicitly) regularize change over the course of training. By using micro-batches, i.e. disjoint smaller subsets of each mini-batch, we empirically show that explicitly penalizing the gradient norm or the Fisher Information Matrix trace, averaged over micro-batches, in the large-batch regime recovers small-batch SGD generalization, whereas Jacobian-based regularizations fail to do so. This generalization performance is shown to often be correlated with how well the regularized model's gradient norms resemble those of small-batch SGD. We additionally show that this behavior breaks down as the micro-batch size approaches the batch size. Finally, we note that in this line of inquiry, positive experimental findings on CIFAR10 are often reversed on other datasets like CIFAR100, highlighting the need to test hypotheses on a wider collection of datasets.
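    The regularizer the paper reports as sufficient has a compact form: split each large batch into micro-batches and add the average (squared) gradient norm over micro-batches to the loss. A hedged PyTorch sketch, with the penalty weight and micro-batch size as illustrative hyperparameters rather than the paper's settings:

```python
import torch

def gradnorm_regularized_loss(model, loss_fn, x, y, micro_bs=32, lam=0.01):
    """Large-batch loss + average squared grad norm over micro-batches."""
    total_loss = loss_fn(model(x), y)
    penalty = 0.0
    n_micro = 0
    for xb, yb in zip(x.split(micro_bs), y.split(micro_bs)):
        micro_loss = loss_fn(model(xb), yb)
        # create_graph=True keeps the penalty differentiable so it can
        # itself be backpropagated through at optimization time.
        grads = torch.autograd.grad(micro_loss, model.parameters(),
                                    create_graph=True)
        penalty = penalty + sum(g.pow(2).sum() for g in grads)
        n_micro += 1
    return total_loss + lam * penalty / n_micro
```

    In a training step this replaces the plain loss, followed by the usual loss.backward() and optimizer.step(); note the double backward through create_graph makes each step roughly twice as expensive.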
    Multi-Class Anomaly Detection. (arXiv:2110.15108v3 [cs.LG] UPDATED)
    We study anomaly detection for the case when the normal class consists of more than one object category. This is an obvious generalization of the standard one-class anomaly detection problem. However, we show that jointly using multiple one-class anomaly detectors to solve this problem yields poorer results as compared to training a single one-class anomaly detector on all normal object categories together. We further develop a new anomaly detector called DeepMAD that learns compact distinguishing features by exploiting the multiple normal objects categories. This algorithm achieves higher AUC values for different datasets compared to two top performing one-class algorithms that either are trained on each normal object category or jointly trained on all normal object categories combined. In addition to theoretical results we present empirical results using the CIFAR-10, fMNIST, CIFAR-100, and a new dataset we developed called RECYCLE.
    Learning Control Policies for Stochastic Systems with Reach-avoid Guarantees. (arXiv:2210.05308v2 [cs.LG] UPDATED)
    We study the problem of learning controllers for discrete-time non-linear stochastic dynamical systems with formal reach-avoid guarantees. This work presents the first method for providing formal reach-avoid guarantees, which combine and generalize stability and safety guarantees, with a tolerable probability threshold $p\in[0,1]$ over the infinite time horizon. Our method leverages advances in machine learning literature and it represents formal certificates as neural networks. In particular, we learn a certificate in the form of a reach-avoid supermartingale (RASM), a novel notion that we introduce in this work. Our RASMs provide reachability and avoidance guarantees by imposing constraints on what can be viewed as a stochastic extension of level sets of Lyapunov functions for deterministic systems. Our approach solves several important problems -- it can be used to learn a control policy from scratch, to verify a reach-avoid specification for a fixed control policy, or to fine-tune a pre-trained policy if it does not satisfy the reach-avoid specification. We validate our approach on $3$ stochastic non-linear reinforcement learning tasks.
    Optimal variance-reduced stochastic approximation in Banach spaces. (arXiv:2201.08518v2 [math.ST] UPDATED)
    We study the problem of estimating the fixed point of a contractive operator defined on a separable Banach space. Focusing on a stochastic query model that provides noisy evaluations of the operator, we analyze a variance-reduced stochastic approximation scheme, and establish non-asymptotic bounds for both the operator defect and the estimation error, measured in an arbitrary semi-norm. In contrast to worst-case guarantees, our bounds are instance-dependent, and achieve the local asymptotic minimax risk non-asymptotically. For linear operators, contractivity can be relaxed to multi-step contractivity, so that the theory can be applied to problems like average reward policy evaluation problem in reinforcement learning. We illustrate the theory via applications to stochastic shortest path problems, two-player zero-sum Markov games, as well as policy evaluation and $Q$-learning for tabular Markov decision processes.
    Energy-Motivated Equivariant Pretraining for 3D Molecular Graphs. (arXiv:2207.08824v4 [q-bio.QM] UPDATED)
    Pretraining molecular representation models without labels is fundamental to various applications. Conventional methods mainly process 2D molecular graphs and focus solely on 2D tasks, making their pretrained models incapable of characterizing 3D geometry and thus defective for downstream 3D tasks. In this work, we tackle 3D molecular pretraining in a complete and novel sense. In particular, we first propose to adopt an equivariant energy-based model as the backbone for pretraining, which enjoys the merits of fulfilling the symmetry of 3D space. Then we develop a node-level pretraining loss for force prediction, where we further exploit the Riemann-Gaussian distribution to ensure that the loss is E(3)-invariant, enabling more robustness. Moreover, a graph-level noise scale prediction task is also leveraged to further promote the eventual performance. We evaluate our model pretrained from a large-scale 3D dataset GEOM-QM9 on two challenging 3D benchmarks: MD17 and QM9. Experimental results demonstrate the efficacy of our method against current state-of-the-art pretraining approaches, and verify the validity of our design for each proposed component.
    Advancing Deep Metric Learning Through Multiple Batch Norms And Multi-Targeted Adversarial Examples. (arXiv:2211.16253v1 [cs.LG])
    Deep Metric Learning (DML) is a prominent field in machine learning with extensive practical applications that concentrate on learning visual similarities. It is known that inputs such as Adversarial Examples (AXs), which follow a distribution different from that of clean data, result in false predictions from DML systems. This paper proposes MDProp, a framework to simultaneously improve the performance of DML models on clean data and inputs following multiple distributions. MDProp utilizes multi-distribution data through an AX generation process while leveraging disentangled learning through multiple batch normalization layers during the training of a DML model. MDProp is the first to generate feature space multi-targeted AXs to perform targeted regularization on the training model's denser embedding space regions, resulting in improved embedding space densities contributing to the improved generalization in the trained models. From a comprehensive experimental analysis, we show that MDProp results in up to 2.95% increased clean data Recall@1 scores and up to 2.12 times increased robustness against different input distributions compared to the conventional methods.
    Direct Heterogeneous Causal Learning for Resource Allocation Problems in Marketing. (arXiv:2211.15728v1 [cs.LG])
    Marketing is an important mechanism to increase user engagement and improve platform revenue, and heterogeneous causal learning can help develop more effective strategies. Most decision-making problems in marketing can be formulated as resource allocation problems and have been studied for decades. Existing works usually divide the solution procedure into two fully decoupled stages, i.e., machine learning (ML) and operations research (OR) -- the first stage predicts the model parameters, which are fed to the optimization in the second stage. However, the error of the parameters predicted in the ML stage is not accounted for, and the series of complex mathematical operations in the OR stage leads to increasing accumulated errors. Essentially, improved precision of the predicted parameters may not have a positive correlation with the final solution due to this side effect of the decoupled design. In this paper, we propose a novel approach for solving resource allocation problems to mitigate these side effects. Our key intuition is to introduce a decision factor that establishes a bridge between ML and OR, such that the solution can be directly obtained in OR by only performing sorting or comparison operations on the decision factor. Furthermore, we design a customized loss function that can conduct direct heterogeneous causal learning on the decision factor, an unbiased estimation of which can be guaranteed when the loss converges. As a case study, we apply our approach to two crucial problems in marketing: the binary treatment assignment problem and the budget allocation problem with multiple treatments. Both large-scale simulations and online A/B tests demonstrate that our approach achieves significant improvement compared with the state-of-the-art.
    Approximating Intersections and Differences Between Statistical Shape Models. (arXiv:2211.16314v1 [cs.CV])
    To date, the comparison of Statistical Shape Models (SSMs) is often solely performance-based and carried out by means of simplistic metrics such as compactness, generalization, or specificity. Any similarities or differences between the actual shape spaces can neither be visualized nor quantified. In this paper, we present a first method to compare two SSMs in dense correspondence by computing approximate intersection spaces and set-theoretic differences between the affine vector spaces spanned by the models. To this end, we approximate the distribution of shapes lying in the intersection space using Markov Chain Monte Carlo, and then apply Principal Component Analysis (PCA) to its samples. By representing the resulting spaces again as an SSM, our method enables an easy and intuitive analysis of similarities between two models' shape spaces. We estimate differences between SSMs in a similar manner; here, however, the resulting shape spaces are not linear vector spaces anymore, and we do not apply PCA but instead use the posterior samples for visualization. We showcase the proposed algorithm qualitatively by computing and analyzing intersection spaces and differences between publicly available face models, focusing on gender-specific male and female as well as identity and expression models. Our quantitative evaluation, based on SSMs built from synthetic and real-world data sets, provides detailed evidence that the introduced method is able to recover ground-truth intersection spaces and differences. Finally, we demonstrate that the proposed algorithm can be easily adapted to also compute intersections and differences between color spaces.
    DIGRAC: Digraph Clustering Based on Flow Imbalance. (arXiv:2106.05194v8 [stat.ML] UPDATED)
    Node clustering is a powerful tool in the analysis of networks. We introduce a graph neural network framework, named DIGRAC, to obtain node embeddings for directed networks in a self-supervised manner, including a novel probabilistic imbalance loss, which can be used for network clustering. Here, we propose \textit{directed flow imbalance} measures, which are tightly related to directionality, to reveal clusters in the network even when there is no density difference between clusters. In contrast to standard approaches in the literature, in this paper, directionality is not treated as a nuisance, but rather contains the main signal. DIGRAC optimizes directed flow imbalance for clustering without requiring label supervision, unlike existing graph neural network methods, and can naturally incorporate node features, unlike existing spectral methods. Extensive experimental results on synthetic data, in the form of directed stochastic block models, and real-world data at different scales, demonstrate that our method, based on flow imbalance, attains state-of-the-art results on directed graph clustering when compared against 10 state-of-the-art methods from the literature, for a wide range of noise and sparsity levels, graph structures, and topologies, and even outperforms supervised methods.
    Learning and Understanding a Disentangled Feature Representation for Hidden Parameters in Reinforcement Learning. (arXiv:2211.16315v1 [cs.LG])
    Hidden parameters are latent variables in reinforcement learning (RL) environments that are constant over the course of a trajectory. Understanding what, if any, hidden parameters affect a particular environment can aid both the development and appropriate usage of RL systems. We present an unsupervised method to map RL trajectories into a feature space where distance represents the relative difference in system behavior due to hidden parameters. Our approach disentangles the effects of hidden parameters by leveraging a recurrent neural network (RNN) world model as used in model-based RL. First, we alter the standard world model training algorithm to isolate the hidden parameter information in the world model memory. Then, we use a metric learning approach to map the RNN memory into a space with a distance metric approximating a bisimulation metric with respect to the hidden parameters. The resulting disentangled feature space can be used to meaningfully relate trajectories to each other and analyze the hidden parameter. We demonstrate our approach on four hidden parameters across three RL environments. Finally we present two methods to help identify and understand the effects of hidden parameters on systems.
    Mirror descent of Hopfield model. (arXiv:2211.15880v1 [cs.LG])
    Mirror descent is a gradient descent method that uses a dual space of parametric models. The idea has been developed in convex optimization but has not yet been widely applied in machine learning. In this study, we provide a possible way that mirror descent can help with data-driven parameter initialization of neural networks. Adopting the Hopfield model as a prototype of neural networks, we demonstrate that mirror descent can train the model more effectively than the usual gradient descent with random parameter initialization.
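    As background for the unfamiliar: mirror descent performs the gradient step in a dual space defined by a mirror map and then maps back to the primal space. The classic instance is exponentiated gradient, where the mirror map is negative entropy and the iterates stay on the probability simplex; the sketch below shows that generic form, not the paper's Hopfield-specific construction:

```python
import numpy as np

def exponentiated_gradient(grad_fn, w0, lr=0.1, steps=100):
    """Mirror descent with the negative-entropy mirror map.
    Dual update: log w <- log w - lr * grad; primal: renormalize.
    Iterates remain on the probability simplex automatically."""
    w = w0.copy()
    for _ in range(steps):
        w = w * np.exp(-lr * grad_fn(w))   # multiplicative (dual-space) step
        w = w / w.sum()                    # mirror back to the simplex
    return w

# Usage: minimize a quadratic over the simplex.
A = np.diag([1.0, 2.0, 3.0])
grad_fn = lambda w: A @ w
w = exponentiated_gradient(grad_fn, np.ones(3) / 3)
print(w)  # most mass ends up on the smallest-curvature coordinate
```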
    A Survey on Model Compression and Acceleration for Pretrained Language Models. (arXiv:2202.07105v2 [cs.CL] UPDATED)
    Despite achieving state-of-the-art performance on many NLP tasks, the high energy cost and long inference delay prevent Transformer-based pretrained language models (PLMs) from seeing broader adoption including for edge and mobile computing. Efficient NLP research aims to comprehensively consider computation, time and carbon emission for the entire life-cycle of NLP, including data preparation, model training and inference. In this survey, we focus on the inference stage and review the current state of model compression and acceleration for pretrained language models, including benchmarks, metrics and methodology.
    BrainGB: A Benchmark for Brain Network Analysis with Graph Neural Networks. (arXiv:2204.07054v3 [q-bio.NC] UPDATED)
    Mapping the connectome of the human brain using structural or functional connectivity has become one of the most pervasive paradigms for neuroimaging analysis. Recently, Graph Neural Networks (GNNs) motivated from geometric deep learning have attracted broad interest due to their established power for modeling complex networked data. Despite their superior performance in many fields, there has not yet been a systematic study of how to design effective GNNs for brain network analysis. To bridge this gap, we present BrainGB, a benchmark for brain network analysis with GNNs. BrainGB standardizes the process by (1) summarizing brain network construction pipelines for both functional and structural neuroimaging modalities and (2) modularizing the implementation of GNN designs. We conduct extensive experiments on datasets across cohorts and modalities and recommend a set of general recipes for effective GNN designs on brain networks. To support open and reproducible research on GNN-based brain network analysis, we host the BrainGB website at https://braingb.us with models, tutorials, examples, as well as an out-of-box Python package. We hope that this work will provide useful empirical evidence and offer insights for future research in this novel and promising direction.
    Graph Neural Networks: A Powerful and Versatile Tool for Advancing Design, Reliability, and Security of ICs. (arXiv:2211.16495v1 [cs.LG])
    Graph neural networks (GNNs) have pushed the state-of-the-art (SOTA) for performance in learning and predicting on large-scale data present in social networks, biology, etc. Since integrated circuits (ICs) can naturally be represented as graphs, there has been a tremendous surge in employing GNNs for machine learning (ML)-based methods for various aspects of IC design. Given this trajectory, there is a timely need to review and discuss some powerful and versatile GNN approaches for advancing IC design. In this paper, we propose a generic pipeline for tailoring GNN models toward solving challenging problems for IC design. We outline promising options for each pipeline element, and we discuss selected and promising works, like leveraging GNNs to break SOTA logic obfuscation. Our comprehensive overview of GNN frameworks covers (i) electronic design automation (EDA) and IC design in general, (ii) design of reliable ICs, and (iii) design as well as analysis of secure ICs. We also provide our overview and related resources in the GNN4IC hub at https://github.com/DfX-NYUAD/GNN4IC. Finally, we discuss interesting open problems for future research.
    Optimizing Stock Option Forecasting with the Assembly of Machine Learning Models and Improved Trading Strategies. (arXiv:2211.15912v1 [q-fin.CP])
    This paper introduced key aspects of applying Machine Learning (ML) models, improved trading strategies, and the Quasi-Reversibility Method (QRM) to optimize stock option forecasting and trading results. It presented the findings of the follow-up project to the research "Application of Convolutional Neural Networks with Quasi-Reversibility Method Results for Option Forecasting". First, the project applied Recurrent Neural Networks (RNN) and Long Short-Term Memory (LSTM) networks to provide a novel way of predicting stock option trends. Additionally, it examined the dependence among the ML models by evaluating the experimental method of combining multiple ML models to improve prediction results and decision-making. Lastly, two improved trading strategies and simulated investing results were presented. The Binomial Asset Pricing Model, with discrete-time stochastic process analysis and portfolio hedging, was applied and suggested an optimized investment expectation. These results can be utilized in real-life trading strategies to optimize stock option investment results based on historical data.
    Succinct Explanations With Cascading Decision Trees. (arXiv:2010.06631v2 [cs.LG] UPDATED)
    The decision tree is one of the most popular and classical machine learning models, dating from the 1980s. However, in many practical applications, decision trees tend to generate decision paths with excessive depth. Long decision paths often cause overfitting problems and make models difficult to interpret. With longer decision paths, inference is also more likely to fail when the data contain missing values. In this work, we propose a new tree model called Cascading Decision Trees to alleviate this problem. The key insight of Cascading Decision Trees is to separate the decision path from the explanation path. Our experiments show that, on average, Cascading Decision Trees generate 63.38% shorter explanation paths, avoiding overfitting and thus achieving higher test accuracy. We also empirically demonstrate that Cascading Decision Trees are more robust to missing values.
    Obtaining Dyadic Fairness by Optimal Transport. (arXiv:2202.04520v2 [cs.LG] UPDATED)
    Fairness has become a critical metric for machine learning models and is considered an important component of trustworthy machine learning. In this paper, we focus on obtaining fairness for the popular link prediction task, where fairness is measured by dyadic fairness. A novel pre-processing methodology is proposed to establish dyadic fairness through data repairing based on optimal transport theory. Using the well-established theoretical connection between dyadic fairness for graph link prediction and conditional distribution alignment, the dyadic repairing scheme can be equivalently cast as a conditional distribution alignment problem. Furthermore, an optimal transport-based dyadic fairness algorithm called DyadicOT is obtained by efficiently solving the alignment problem, satisfying flexibility and unambiguity requirements. The proposed DyadicOT algorithm shows superior results in obtaining fairness compared to other fairness methods on two benchmark graph datasets.
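    As a hedged illustration of the repair idea (not the DyadicOT algorithm itself, which operates on conditional distributions over graph data), the 1D case is instructive: optimal transport between two equal-size empirical samples reduces to matching sorted values, and repairing moves both groups toward their Wasserstein barycenter.
```python
import numpy as np

def repair_1d(x_a, x_b, lam=1.0):
    """Move two equal-size 1D samples toward their Wasserstein barycenter
    (lam=1 -> fully aligned distributions, lam=0 -> unchanged)."""
    ia, ib = np.argsort(x_a), np.argsort(x_b)
    bary = 0.5 * (np.sort(x_a) + np.sort(x_b))   # barycenter of matched pairs
    x_a_rep, x_b_rep = x_a.copy(), x_b.copy()
    x_a_rep[ia] = (1 - lam) * np.sort(x_a) + lam * bary
    x_b_rep[ib] = (1 - lam) * np.sort(x_b) + lam * bary
    return x_a_rep, x_b_rep

rng = np.random.default_rng(1)
a = rng.normal(0.0, 1.0, 1000)       # scores of group A
b = rng.normal(0.5, 1.2, 1000)       # scores of group B
a_rep, b_rep = repair_1d(a, b)
print(abs(a_rep.mean() - b_rep.mean()))   # ~0 after full repair
```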
    Performance Evaluation, Optimization and Dynamic Decision in Blockchain Systems: A Recent Overview. (arXiv:2211.15907v1 [cs.PF])
    With the rapid development of blockchain technology and its integration with various application areas, performance evaluation, performance optimization, and dynamic decision in blockchain systems are playing an increasingly important role in developing new blockchain technology. This paper provides a recent systematic overview of this class of research, and especially, developing mathematical modeling and basic theory of blockchain systems. Important examples include (a) performance evaluation: Markov processes, queuing theory, Markov reward processes, random walks, fluid and diffusion approximations, and martingale theory; (b) performance optimization: Linear programming, nonlinear programming, integer programming, and multi-objective programming; (c) optimal control and dynamic decision: Markov decision processes, and stochastic optimal control; and (d) artificial intelligence: Machine learning, deep reinforcement learning, and federated learning. So far, little research has focused on these lines of research. We believe that the basic theory with mathematical methods, algorithms and simulations of blockchain systems discussed in this paper will strongly support future development and continuous innovation of blockchain technology.
    Machine learning emulation of a local-scale UK climate model. (arXiv:2211.16116v1 [physics.ao-ph])
    Climate change is causing the intensification of rainfall extremes. Precipitation projections with high spatial resolution are important for society to prepare for these changes, e.g. to model flooding impacts. Physics-based simulations for creating such projections are very computationally expensive. This work demonstrates the effectiveness of diffusion models, a form of deep generative model, for generating realistic high-resolution rainfall samples for the UK much more cheaply, conditioned on data from a low-resolution simulation. We show for the first time a machine learning model that is able to produce realistic samples of high-resolution rainfall based on a physical model that resolves atmospheric convection, a key process behind extreme rainfall. By adding self-learnt, location-specific information to low-resolution relative vorticity, the quantiles and time-means of the samples match their counterparts from the high-resolution simulation well.
    Homotopic Policy Mirror Descent: Policy Convergence, Implicit Regularization, and Improved Sample Complexity. (arXiv:2201.09457v9 [cs.LG] UPDATED)
    We propose a new policy gradient method, named homotopic policy mirror descent (HPMD), for solving discounted, infinite horizon MDPs with finite state and action spaces. HPMD performs a mirror descent type policy update with an additional diminishing regularization term, and possesses several computational properties that seem to be new in the literature. We first establish the global linear convergence of HPMD instantiated with Kullback-Leibler divergence, for both the optimality gap, and a weighted distance to the set of optimal policies. Then local superlinear convergence is obtained for both quantities without any assumption. With local acceleration and diminishing regularization, we establish the first result among policy gradient methods on certifying and characterizing the limiting policy, by showing, with a non-asymptotic characterization, that the last-iterate policy converges to the unique optimal policy with the maximal entropy. We then extend all the aforementioned results to HPMD instantiated with a broad class of decomposable Bregman divergences, demonstrating the generality of these computational properties. As a by-product, we discover the finite-time exact convergence for some commonly used Bregman divergences, implying the continuing convergence of HPMD to the limiting policy even if the current policy is already optimal. Finally, we develop a stochastic version of HPMD and establish similar convergence properties. By exploiting the local acceleration, we show that for small optimality gap, a better than $\tilde{\mathcal{O}}(\left|\mathcal{S}\right| \left|\mathcal{A}\right| / \epsilon^2)$ sample complexity holds with high probability, when assuming a generative model for policy evaluation.
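    For intuition, the KL-instantiated policy mirror descent step at a single state takes the familiar multiplicative form $\pi_{k+1}(\cdot|s) \propto \pi_k(\cdot|s)\exp(\eta_k Q_k(s,\cdot))$; the sketch below implements this basic step only, omitting HPMD's diminishing regularization term, and uses illustrative Q-values.
```python
import numpy as np

def kl_pmd_update(pi_s, q_s, eta):
    """One KL mirror-descent policy update at a single state."""
    logits = np.log(pi_s) + eta * q_s
    logits -= logits.max()            # numerical stability
    new_pi = np.exp(logits)
    return new_pi / new_pi.sum()

pi_s = np.full(4, 0.25)               # current policy at state s
q_s = np.array([1.0, 0.5, 0.2, 0.0])  # Q-value estimates at state s
for _ in range(50):
    pi_s = kl_pmd_update(pi_s, q_s, eta=0.5)
print(np.round(pi_s, 3))              # mass concentrates on the best action
```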
    Neural Feature-Adaptation for Symbolic Predictions Using Pre-Training and Semantic Loss. (arXiv:2211.16047v1 [cs.AI])
    We are interested in neurosymbolic systems consisting of a high-level symbolic layer for explainable prediction in terms of human-intelligible concepts; and a low-level neural layer for extracting symbols required to generate the symbolic explanation. Real data is often imperfect, meaning that even if the symbolic theory remains unchanged, we may still need to address the problem of mapping raw data to high-level symbols each time there is a change in the data acquisition environment or equipment. Manual (re-)annotation of the raw data each time this happens is laborious and expensive, and automated labelling methods are often imperfect, especially for complex problems. NEUROLOG proposed the use of a semantic loss function that allows an existing feature-based symbolic model to guide the extraction of feature-values from raw data, using `abduction'. However, the experiments demonstrating the use of semantic loss through abduction appear to rely heavily on a domain-specific pre-processing step that enables a prior delineation of feature locations in the raw data. We examine the use of semantic loss in domains where such pre-processing is not possible, or is not obvious. We show that without any prior information about the features, the NEUROLOG approach can continue to predict accurately even with substantially incorrect feature predictions. We show also that prior information about the features in the form of even imperfect pre-training can help correct this situation. These findings are replicated on the original problem considered by NEUROLOG, without the use of feature-delineation. This suggests that symbolic explanations constructed for data in a domain could be re-used in a related domain, by `feature-adaptation' of pre-trained neural extractors using the semantic loss function constrained by abductive feedback.
    Incorporating Sum Constraints into Multitask Gaussian Processes. (arXiv:2202.01793v2 [stat.ML] UPDATED)
    Machine learning models can be improved by adapting them to respect existing background knowledge. In this paper we consider multitask Gaussian processes, with background knowledge in the form of constraints that require a specific sum of the outputs to be constant. This is achieved by conditioning the prior distribution on the constraint fulfillment. The approach allows for both linear and nonlinear constraints. We demonstrate that the constraints are fulfilled with high precision and that the construction can improve the overall prediction accuracy as compared to the standard Gaussian process.
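    The linear case of conditioning on constraint fulfillment is standard Gaussian conditioning; a minimal sketch of that mechanism (the kernel, constraint, and values below are illustrative, not from the paper):
```python
import numpy as np

def condition_on_sum(mu, K, A, c):
    """Condition a Gaussian N(mu, K) on the noise-free observation A f = c."""
    S = A @ K @ A.T
    gain = K @ A.T @ np.linalg.inv(S)
    return mu + gain @ (c - A @ mu), K - gain @ A @ K

mu = np.zeros(2)                 # two task outputs at one input, a priori
K = np.eye(2)                    # independent with unit variance
A = np.ones((1, 2))              # constraint functional: f1 + f2
c = np.array([1.0])              # required constant sum
mu_c, K_c = condition_on_sum(mu, K, A, c)
print(mu_c)                      # [0.5 0.5]
print(K_c @ A.T)                 # ~0: every posterior sample satisfies the sum
```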
    A Cross-Conformal Predictor for Multi-label Classification. (arXiv:2211.16238v1 [cs.LG])
    Unlike the typical classification setting where each instance is associated with a single class, in multi-label learning each instance is associated with multiple classes simultaneously. Therefore, the learning task in this setting is to predict the subset of classes to which each instance belongs. This work examines the application of a recently developed framework called Conformal Prediction (CP) to the multi-label learning setting. CP complements the predictions of machine learning algorithms with reliable measures of confidence. As a result, the proposed approach, instead of just predicting the most likely subset of classes for a new unseen instance, also indicates the likelihood of each predicted subset being correct. This additional information is especially valuable in the multi-label setting, where the overall uncertainty is extremely high.
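    The building block behind such confidence measures is the conformal p-value; a minimal inductive-conformal sketch (the paper uses a cross-conformal variant, and the calibration scores below are placeholders):
```python
import numpy as np

def conformal_p_value(cal_scores, test_score):
    """Inductive conformal p-value for one candidate label; cal_scores are
    nonconformity scores of calibration examples with that label."""
    return (np.sum(cal_scores >= test_score) + 1) / (len(cal_scores) + 1)

cal = np.array([0.1, 0.3, 0.2, 0.5, 0.15])   # placeholder calibration scores
print(conformal_p_value(cal, 0.25))           # (2 + 1) / (5 + 1) = 0.5
```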
    COVID-19 Classification Using Deep Learning Two-Stage Approach. (arXiv:2211.15817v1 [eess.IV])
    In this paper, deep-learning-based approaches, namely fine-tuning of pretrained convolutional neural networks (VGG16 and VGG19) and end-to-end training of a developed CNN model, were used to classify X-ray images into four different classes: COVID-19, normal, opacity, and pneumonia. A dataset containing more than 20,000 X-ray scans was retrieved from Kaggle and used in this experiment. A two-stage classification approach was implemented and compared to the one-shot classification approach. Our hypothesis was that a two-stage model would be able to achieve better performance than a one-shot model. Our results show otherwise, as VGG16 achieved 95% accuracy using the one-shot approach over 5-fold training. Future work will focus on a more robust implementation of the two-stage classification model, Covid-TSC. The main improvement will be allowing data to flow from the output of stage-1 to the input of stage-2, where the stage-1 and stage-2 models are VGG16 models fine-tuned on the COVID-19 dataset.
    Proximal boosting: aggregating weak learners to minimize non-differentiable losses. (arXiv:1808.09670v4 [cs.LG] UPDATED)
    Gradient boosting is a prediction method that iteratively combines weak learners to produce a complex and accurate model. From an optimization point of view, the learning procedure of gradient boosting mimics a gradient descent on a functional variable. This paper proposes to build upon the proximal point algorithm, when the empirical risk to minimize is not differentiable, in order to introduce a novel boosting approach, called proximal boosting. It comes with a companion algorithm inspired by [1] and called residual proximal boosting, which is aimed at better controlling the approximation error. Theoretical convergence is proved for these two procedures under different hypotheses on the empirical risk and advantages of leveraging proximal methods for boosting are illustrated by numerical experiments on simulated and real-world data. In particular, we exhibit a favorable comparison over gradient boosting regarding convergence rate and prediction accuracy.
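    The key ingredient is the proximal operator, which replaces the plain gradient step when the empirical risk is non-differentiable; for the absolute loss it reduces to soft-thresholding. A minimal sketch with an illustrative residual vector:
```python
import numpy as np

def prox_abs(v, gamma):
    """prox_{gamma*|.|}(v) = sign(v) * max(|v| - gamma, 0) (soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - gamma, 0.0)

residual = np.array([2.0, -0.3, 0.7, -1.5])   # illustrative residuals
print(prox_abs(residual, gamma=0.5))           # [ 1.5 -0.   0.2 -1. ]
```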
    BinauralGrad: A Two-Stage Conditional Diffusion Probabilistic Model for Binaural Audio Synthesis. (arXiv:2205.14807v2 [eess.AS] UPDATED)
    Binaural audio plays a significant role in constructing immersive augmented and virtual realities. As it is expensive to record binaural audio from the real world, synthesizing it from mono audio has attracted increasing attention. This synthesis process involves not only the basic physical warping of the mono audio, but also room reverberations and head/ear related filtrations, which, however, are difficult to accurately simulate in traditional digital signal processing. In this paper, we formulate the synthesis process from a different perspective by decomposing the binaural audio into a common part that is shared by the left and right channels as well as a specific part that differs in each channel. Accordingly, we propose BinauralGrad, a novel two-stage framework equipped with diffusion models to synthesize them respectively. Specifically, in the first stage, the common information of the binaural audio is generated with a single-channel diffusion model conditioned on the mono audio, based on which the binaural audio is generated by a two-channel diffusion model in the second stage. Combining this novel perspective of two-stage synthesis with advanced generative models (i.e., the diffusion models), the proposed BinauralGrad is able to generate accurate and high-fidelity binaural audio samples. Experiment results show that on a benchmark dataset, BinauralGrad outperforms the existing baselines by a large margin in terms of both objective and subjective evaluation metrics (Wave L2: 0.128 vs. 0.157, MOS: 3.80 vs. 3.61). The generated audio samples (https://speechresearch.github.io/binauralgrad) and code (https://github.com/microsoft/NeuralSpeech/tree/master/BinauralGrad) are available online.
    Fuzzy clustering for the within-season estimation of cotton phenology. (arXiv:2211.14099v2 [cs.LG] UPDATED)
    Crop phenology is crucial information for crop yield estimation and agricultural management. Traditionally, phenology has been observed from the ground; however, Earth observation, weather and soil data have also been used to capture the physiological growth of crops. In this work, we propose a new approach for within-season phenology estimation for cotton at the field level. For this, we exploit a variety of Earth observation vegetation indices (derived from Sentinel-2) and numerical simulations of atmospheric and soil parameters. Our method is unsupervised, to address the ever-present problem of sparse and scarce ground truth data that makes most supervised alternatives impractical in real-world scenarios. We applied fuzzy c-means clustering to identify the principal phenological stages of cotton and then used the cluster membership weights to further predict the transitional phases between adjacent stages. In order to evaluate our models, we collected 1,285 crop growth ground observations in Orchomenos, Greece. We introduced a new collection protocol, assigning up to two phenology labels that represent the primary and secondary growth stage in the field and thus indicate when stages are transitioning. Our model was tested against a baseline model, which allowed us to isolate random agreement and evaluate the model's true competence. The results showed that our model considerably outperforms the baseline, which is promising considering the unsupervised nature of the approach. The limitations and relevant future work are thoroughly discussed. The ground observations are formatted in a ready-to-use dataset and will be available at https://github.com/Agri-Hub/cotton-phenology-dataset upon publication.
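    At the core of this approach is the fuzzy c-means membership computation, whose soft assignments are what make it possible to score transitional phases between adjacent stages; a minimal sketch with illustrative 2D data and fixed centers:
```python
import numpy as np

def fcm_memberships(X, centers, m=2.0):
    """Fuzzy c-means memberships u_ij (rows sum to 1) for fuzzifier m > 1."""
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
    inv = d ** (-2.0 / (m - 1))
    return inv / inv.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(3, 0.5, (50, 2))])
centers = np.array([[0.0, 0.0], [3.0, 3.0]])
U = fcm_memberships(X, centers)
print(np.round(U[:3], 2))   # soft memberships; mid values flag transitions
```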
    Device Modeling Bias in ReRAM-based Neural Network Simulations. (arXiv:2211.15925v1 [cs.ET])
    Data-driven modeling approaches such as jump tables are promising techniques for modeling populations of resistive random-access memory (ReRAM) or other emerging memory devices for hardware neural network simulations. As these tables rely on data interpolation, this work explores the open questions about their fidelity in relation to the stochastic device behavior they model. We study how various jump table device models impact the attained network performance estimates, a concept we define as modeling bias. Two methods of jump table device modeling, binning and Optuna-optimized binning, are explored using synthetic data with known distributions for benchmarking purposes, as well as experimental data obtained from TiOx ReRAM devices. Results on a multi-layer perceptron trained on MNIST show that device models based on binning can behave unpredictably, particularly at low numbers of points in the device dataset, sometimes over-promising, sometimes under-promising target network accuracy. This paper also proposes device-level metrics that show trends similar to the network-level modeling bias metric. The proposed approach opens the possibility for future investigations into statistical device models with better performance, as well as experimentally verified modeling bias in different in-memory computing and neural network architectures.
    A Search and Detection Autonomous Drone System: from Design to Implementation. (arXiv:2211.15866v1 [cs.RO])
    Utilizing autonomous drones or unmanned aerial vehicles (UAVs) has shown great advantages over preceding methods in support of urgent scenarios such as search and rescue (SAR) and wildfire detection. In these operations, search efficiency in terms of the amount of time spent to find the target is crucial, since with the passing of time the survivability of the missing person decreases, or wildfire management becomes more difficult, with disastrous consequences. In this work, we consider a scenario in which a drone is intended to search for and detect a missing person (e.g., a hiker or a mountaineer) or a potential fire spot in a given area. In order to obtain the shortest path to the target, a general framework is provided to model the problem of target detection when the target's location is probabilistically known. To this end, two algorithms are proposed: path planning and target detection. The path planning algorithm is based on Bayesian inference, and target detection is accomplished by means of a residual neural network (ResNet) trained on the image dataset captured by the drone as well as existing pictures and datasets on the web. Through simulation and experiment, the proposed path planning algorithm is compared with two benchmark algorithms. It is shown that the proposed algorithm significantly decreases the average time of the mission.
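    The probabilistic path planning described here rests on the classic Bayesian-search update: after an unsuccessful look at a cell, its posterior probability shrinks by the detection probability and the grid renormalizes. A minimal sketch with an illustrative 3x3 grid and detection probability (not the paper's exact planner):
```python
import numpy as np

def bayes_miss_update(p, cell, p_detect):
    """Posterior over grid cells after NOT detecting the target in `cell`."""
    p = p.copy()
    p[cell] *= (1.0 - p_detect)   # Bayes rule for a missed detection
    return p / p.sum()

p = np.full(9, 1 / 9)             # uniform prior over a flattened 3x3 grid
for _ in range(3):                # repeatedly search the most likely cell
    p = bayes_miss_update(p, np.argmax(p), p_detect=0.8)
print(np.round(p.reshape(3, 3), 3))
```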
    Towards the Systematic Reporting of the Energy and Carbon Footprints of Machine Learning. (arXiv:2002.05651v2 [cs.CY] UPDATED)
    Accurate reporting of energy and carbon usage is essential for understanding the potential climate impacts of machine learning research. We introduce a framework that makes this easier by providing a simple interface for tracking realtime energy consumption and carbon emissions, as well as generating standardized online appendices. Utilizing this framework, we create a leaderboard for energy efficient reinforcement learning algorithms to incentivize responsible research in this area as an example for other areas of machine learning. Finally, based on case studies using our framework, we propose strategies for mitigation of carbon emissions and reduction of energy consumption. By making accounting easier, we hope to further the sustainable development of machine learning experiments and spur more research into energy efficient algorithms.
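    The accounting that such a framework automates boils down to energy = average power draw x runtime and emissions = energy x regional carbon intensity; a back-of-the-envelope sketch (the carbon-intensity figure is an illustrative placeholder, not a value from the paper):
```python
def carbon_footprint(avg_power_watts, hours, kg_co2_per_kwh=0.4):
    """Energy (kWh) and emissions (kg CO2eq) for one training run."""
    energy_kwh = avg_power_watts / 1000.0 * hours
    return energy_kwh, energy_kwh * kg_co2_per_kwh

energy, co2 = carbon_footprint(avg_power_watts=300, hours=24)
print(f"{energy:.1f} kWh, {co2:.2f} kg CO2eq")   # 7.2 kWh, 2.88 kg CO2eq
```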
    You Can Have Better Graph Neural Networks by Not Training Weights at All: Finding Untrained GNNs Tickets. (arXiv:2211.15335v2 [cs.LG] UPDATED)
    Recent works have impressively demonstrated that there exists a subnetwork in randomly initialized convolutional neural networks (CNNs) that can match the performance of fully trained dense networks at initialization, without any optimization of the network's weights (i.e., untrained networks). However, the presence of such untrained subnetworks in graph neural networks (GNNs) remains mysterious. In this paper, we carry out the first-of-its-kind exploration of discovering matching untrained GNNs. With sparsity as the core tool, we can find \textit{untrained sparse subnetworks} at initialization that can match the performance of \textit{fully trained dense} GNNs. Beyond this already encouraging finding of comparable performance, we show that the found untrained subnetworks can substantially mitigate the GNN over-smoothing problem, hence becoming a powerful tool to enable deeper GNNs without bells and whistles. We also observe that such sparse untrained subnetworks have appealing performance in out-of-distribution detection and robustness to input perturbations. We evaluate our method across widely-used GNN architectures on various popular datasets including the Open Graph Benchmark (OGB).
    Multi-Agent Reinforcement Learning for Microprocessor Design Space Exploration. (arXiv:2211.16385v1 [cs.AR])
    Microprocessor architects are increasingly resorting to domain-specific customization in the quest for high performance and energy efficiency. As systems grow in complexity, fine-tuning architectural parameters across multiple sub-systems (e.g., datapath, memory blocks in different hierarchies, interconnects, compiler optimization, etc.) quickly results in a combinatorial explosion of the design space. This makes domain-specific customization an extremely challenging task. Prior work explores using reinforcement learning (RL) and other optimization methods to automatically explore the large design space. However, these methods have traditionally relied on single-agent RL/ML formulations. It is unclear how scalable single-agent formulations are as the complexity of the design space increases (e.g., full-stack System-on-Chip design). Therefore, we propose an alternative formulation that leverages Multi-Agent RL (MARL) to tackle this problem. The key idea behind using MARL is the observation that parameters across different sub-systems are more or less independent, thus allowing a decentralized role to be assigned to each agent. We test this hypothesis by designing a domain-specific DRAM memory controller for several workload traces. Our evaluation shows that the MARL formulation consistently outperforms single-agent RL baselines such as Proximal Policy Optimization and Soft Actor-Critic over different target objectives such as low power and latency. Overall, this work opens a pathway for new and promising research in MARL solutions for hardware architecture search.
    Backdoor Vulnerabilities in Normally Trained Deep Learning Models. (arXiv:2211.15929v1 [cs.CR])
    We conduct a systematic study of backdoor vulnerabilities in normally trained Deep Learning models. They are as dangerous as backdoors injected by data poisoning because both can be equally exploited. We leverage 20 different types of injected backdoor attacks in the literature as guidance and study their correspondences in normally trained models, which we call natural backdoor vulnerabilities. We find that natural backdoors are widespread, with most injected backdoor attacks having natural correspondences. We categorize these natural backdoors and propose a general detection framework. It finds 315 natural backdoors in the 56 normally trained models downloaded from the Internet, covering all the different categories, while existing scanners designed for injected backdoors can detect at most 65 backdoors. We also study the root causes and defense of natural backdoors.
    Abstract Visual Reasoning with Tangram Shapes. (arXiv:2211.16492v1 [cs.CL])
    We introduce KiloGram, a resource for studying abstract visual reasoning in humans and machines. Drawing on the history of tangram puzzles as stimuli in cognitive science, we build a richly annotated dataset that, with >1k distinct stimuli, is orders of magnitude larger and more diverse than prior resources. It is both visually and linguistically richer, moving beyond whole shape descriptions to include segmentation maps and part labels. We use this resource to evaluate the abstract visual reasoning capacities of recent multi-modal models. We observe that pre-trained weights demonstrate limited abstract reasoning, which dramatically improves with fine-tuning. We also observe that explicitly describing parts aids abstract reasoning for both humans and models, especially when jointly encoding the linguistic and visual inputs. KiloGram is available at https://lil.nlp.cornell.edu/kilogram .
    Evaluating and reducing the distance between synthetic and real speech distributions. (arXiv:2211.16049v1 [eess.AS])
    While modern Text-to-Speech (TTS) systems can produce speech rated highly in terms of subjective evaluation, the distance between real and synthetic speech distributions remains understudied, where we use the term \textit{distribution} to mean the sample space of all possible real speech recordings from a given set of speakers; or of the synthetic samples that could be generated for the same set of speakers. We evaluate the distance of real and synthetic speech distributions along the dimensions of the acoustic environment, speaker characteristics and prosody using a range of speech processing measures and the respective Wasserstein distances of their distributions. We reduce these distribution distances along said dimensions by providing utterance-level information derived from the measures to the model and show they can be generated at inference time. The improvements to the dimensions translate to overall distribution distance reduction approximated using Automatic Speech Recognition (ASR) by evaluating the fitness of the synthetic data as training data.
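    For a single measure, the distribution distance in question can be computed directly; a hedged sketch comparing a placeholder prosodic measure (per-utterance mean F0) across real and synthetic speech with the 1D Wasserstein distance (the values are simulated, not from the paper):
```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
f0_real = rng.normal(120.0, 20.0, 500)    # placeholder real-speech values (Hz)
f0_synth = rng.normal(118.0, 12.0, 500)   # synthetic: similar mean, low variance
print(wasserstein_distance(f0_real, f0_synth))
```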
    Timing-Based Backpropagation in Spiking Neural Networks Without Single-Spike Restrictions. (arXiv:2211.16113v1 [cs.NE])
    We propose a novel backpropagation algorithm for training spiking neural networks (SNNs) that encodes information in the relative multiple spike timing of individual neurons without single-spike restrictions. The proposed algorithm inherits the advantages of conventional timing-based methods in that it computes accurate gradients with respect to spike timing, which promotes ideal temporal coding. Unlike conventional methods where each neuron fires at most once, the proposed algorithm allows each neuron to fire multiple times. This extension naturally improves the computational capacity of SNNs. Our SNN model outperformed comparable SNN models and achieved accuracy as high as non-convolutional artificial neural networks. The spike count property of our networks varied depending on the time constant of the postsynaptic current and the membrane potential. Moreover, we found that there exists an optimal time constant that maximizes test accuracy. This was not seen in conventional SNNs with single-spike restrictions using time-to-first-spike (TTFS) coding. This result demonstrates the computational properties of SNNs that biologically encode information into the multi-spike timing of individual neurons. Our code will be made publicly available.
    Counterfactual Supervision-based Information Bottleneck for Out-of-Distribution Generalization. (arXiv:2208.07798v2 [cs.LG] UPDATED)
    Learning invariant (causal) features for out-of-distribution (OOD) generalization has attracted extensive attention recently, and among the proposals, invariant risk minimization (IRM) is a notable solution. In spite of its theoretical promise for linear regression, the challenges of using IRM in linear classification problems remain. By introducing the information bottleneck (IB) principle into the learning of IRM, the IB-IRM approach has demonstrated its power to solve these challenges. In this paper, we further improve IB-IRM from two aspects. First, we show that the key assumption of support overlap of invariant features used in IB-IRM is stronger than needed to guarantee OOD generalization, and it is still possible to achieve the optimal solution without this assumption. Second, we illustrate two failure modes in which IB-IRM (and IRM) can fail to learn the invariant features, and to address such failures, we propose a \textit{Counterfactual Supervision-based Information Bottleneck (CSIB)} learning algorithm that provably recovers the invariant features. By requiring counterfactual inference, CSIB works even when accessing data from a single environment. Empirical experiments on several datasets verify our theoretical results.
    DBA: Efficient Transformer with Dynamic Bilinear Low-Rank Attention. (arXiv:2211.16368v1 [cs.LG])
    Many studies have been conducted to improve the efficiency of the Transformer from quadratic to linear. Among them, the low-rank-based methods aim to learn projection matrices to compress the sequence length. However, the projection matrices are fixed once learned, compressing the sequence length with the same coefficients for tokens in the same position regardless of the input. Adopting such input-invariant projections ignores the fact that the most informative part of a sequence varies from sequence to sequence, thus failing to preserve the most useful information that lies in varied positions. In addition, previous efficient Transformers focus only on the influence of sequence length while neglecting the effect of the hidden state dimension. To address these problems, we present an efficient yet effective attention mechanism, namely the Dynamic Bilinear Low-Rank Attention (DBA), which compresses the sequence length by input-sensitive dynamic projection matrices and achieves linear time and space complexity by jointly optimizing the sequence length and hidden state dimension, while maintaining state-of-the-art performance. Specifically, we first theoretically demonstrate that the sequence length can be compressed non-destructively from a novel information-theoretic perspective, with compression matrices dynamically determined by the input sequence. Furthermore, we show that the hidden state dimension can be approximated by extending the Johnson-Lindenstrauss lemma, optimizing the attention in bilinear form. Theoretical analysis shows that DBA is proficient in capturing high-order relations in cross-attention problems. Experiments over tasks with diverse sequence length conditions show that DBA achieves state-of-the-art performance compared with various strong baselines while maintaining less memory consumption at higher speed.
    Finding Differences Between Transformers and ConvNets Using Counterfactual Simulation Testing. (arXiv:2211.16499v1 [cs.CV])
    Modern deep neural networks tend to be evaluated on static test sets. One shortcoming of this is the fact that these deep neural networks cannot be easily evaluated for robustness issues with respect to specific scene variations. For example, it is hard to study the robustness of these networks to variations of object scale, object pose, scene lighting and 3D occlusions. The main reason is that collecting real datasets with fine-grained naturalistic variations of sufficient scale can be extremely time-consuming and expensive. In this work, we present Counterfactual Simulation Testing, a counterfactual framework that allows us to study the robustness of neural networks with respect to some of these naturalistic variations by building realistic synthetic scenes that allow us to ask counterfactual questions to the models, ultimately providing answers to questions such as "Would your classification still be correct if the object were viewed from the top?" or "Would your classification still be correct if the object were partially occluded by another object?". Our method allows for a fair comparison of the robustness of recently released, state-of-the-art Convolutional Neural Networks and Vision Transformers, with respect to these naturalistic variations. We find evidence that ConvNext is more robust to pose and scale variations than Swin, that ConvNext generalizes better to our simulated domain and that Swin handles partial occlusion better than ConvNext. We also find that robustness for all networks improves with network scale and with data scale and variety. We release the Naturalistic Variation Object Dataset (NVD), a large simulated dataset of 272k images of everyday objects with naturalistic variations such as object pose, scale, viewpoint, lighting and occlusions. Project page: https://counterfactualsimulation.github.io
    Batch Reinforcement Learning from Crowds. (arXiv:2111.04279v2 [cs.LG] UPDATED)
    A shortcoming of batch reinforcement learning is its requirement for rewards in data, making it inapplicable to tasks without reward functions. Existing settings for the lack of reward, such as behavioral cloning, rely on optimal demonstrations collected from humans. Unfortunately, extensive expertise is required for ensuring optimality, which hinders the acquisition of large-scale data for complex tasks. This paper addresses the lack of reward in a batch reinforcement learning setting by learning a reward function from preferences. Generating preferences only requires a basic understanding of a task, and being a mental process, it is faster than performing demonstrations. So preferences can be collected at scale from non-expert humans using crowdsourcing. This paper tackles a critical challenge that emerges when collecting data from non-expert humans: the noise in preferences. A novel probabilistic model is proposed for modelling the reliability of labels, which utilizes labels collaboratively. Moreover, the proposed model smooths the estimation with a learned reward function. Evaluation on Atari datasets demonstrates the effectiveness of the proposed model, followed by an ablation study to analyze the relative importance of the proposed ideas.
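    Reward learning from pairwise preferences typically uses a Bradley-Terry-style likelihood; a minimal sketch of that objective (the paper's contribution, a probabilistic model of labeler reliability, is omitted here):
```python
import numpy as np

def preference_nll(r_a, r_b, pref_a):
    """Negative log-likelihood of the label pref_a in {0, 1} under
    P(A preferred) = sigmoid(r_a - r_b), with r_* summed predicted rewards."""
    p_a = 1.0 / (1.0 + np.exp(r_b - r_a))
    return -(pref_a * np.log(p_a) + (1 - pref_a) * np.log(1 - p_a))

print(preference_nll(r_a=2.0, r_b=1.0, pref_a=1))   # small loss: model agrees
print(preference_nll(r_a=2.0, r_b=1.0, pref_a=0))   # larger loss: disagreement
```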
    AirFormer: Predicting Nationwide Air Quality in China with Transformers. (arXiv:2211.15979v1 [eess.SP])
    Air pollution is a crucial issue affecting human health and livelihoods, as well as one of the barriers to economic and social growth. Forecasting air quality has become an increasingly important endeavor with significant social impacts, especially in emerging countries like China. In this paper, we present a novel Transformer architecture termed AirFormer to collectively predict nationwide air quality in China, with unprecedentedly fine spatial granularity covering thousands of locations. AirFormer decouples the learning process into two stages -- 1) a bottom-up deterministic stage that contains two new types of self-attention mechanisms to efficiently learn spatio-temporal representations; 2) a top-down stochastic stage with latent variables to capture the intrinsic uncertainty of air quality data. We evaluate AirFormer with 4-year data from 1,085 stations in the Chinese Mainland. Compared to the state-of-the-art model, AirFormer reduces prediction errors by 5%-8% on 72-hour future predictions. Our source code is available at https://github.com/yoshall/airformer.
    A survey on multi-player bandits. (arXiv:2211.16275v1 [stat.ML])
    Due mostly to its application to cognitive radio networks, the multiplayer bandits problem has gained a lot of interest in the last decade, and considerable progress has been made on its theoretical aspects. However, the current algorithms are far from applicable, and many obstacles remain between these theoretical results and a possible implementation of multiplayer bandit algorithms in real cognitive radio networks. This survey contextualizes and organizes the rich multiplayer bandits literature. In light of the existing works, some clear directions for future research appear. We believe that further study of these different directions might lead to theoretical algorithms adapted to real-world situations.
    Approximating Martingale Process for Variance Reduction in Deep Reinforcement Learning with Large State Space. (arXiv:2211.15886v1 [cs.LG])
    The Approximating Martingale Process (AMP) has been proven effective for variance reduction in reinforcement learning (RL) in specific cases such as multiclass queueing networks. However, in the cases proven so far, the state space is relatively small and all possible state transitions can be iterated through. In this paper, we consider systems in which the state space is large and state transitions carry uncertainty, thus making AMP a generalized variance-reduction method in RL. Specifically, we investigate the application of AMP in ride-hailing systems like Uber, where Proximal Policy Optimization (PPO) is incorporated to optimize the policy of matching drivers and customers.
    Revisiting Embeddings for Graph Neural Networks. (arXiv:2209.09338v4 [cs.LG] UPDATED)
    Current graph representation learning techniques use Graph Neural Networks (GNNs) to extract features from dataset embeddings. In this work, we examine the quality of these embeddings and assess how changing them can affect the accuracy of GNNs. We explore different embedding extraction techniques for both images and texts, and find that the performance of different GNN architectures depends on the embedding style used. We see a prevalence of bag-of-words (BoW) embeddings and text classification tasks in available graph datasets. Given the impact embeddings have on GNN performance, this leads to a phenomenon in which GNNs are effectively optimized for BoW vectors.
    CWD: A Machine Learning based Approach to Detect Unknown Cloud Workloads. (arXiv:2211.15739v1 [cs.DC])
    Workloads in modern cloud data centers are becoming increasingly complex. The number of workloads running in cloud data centers has been growing exponentially for the last few years, and cloud service providers (CSPs) have been supporting on-demand services in real-time. Recognizing the growing complexity of the cloud environment and of cloud workloads, hardware vendors such as Intel and AMD are increasingly introducing cloud-specific workload acceleration features in their CPU platforms. These features are typically targeted towards popular and commonly-used cloud workloads. Nonetheless, uncommon, customer-specific workloads (unknown workloads) whose characteristics differ from common workloads (known workloads) may not realize the full potential of the underlying platform. To address this problem, we develop a machine learning based technique to characterize, profile and predict workloads running in the cloud environment. Experimental evaluation of our technique demonstrates good prediction performance. We also develop techniques to analyze the performance of the model in a standalone manner.
    Composition based oxidation state prediction of materials using deep learning. (arXiv:2211.15895v1 [cond-mat.mtrl-sci])
    Oxidation states are the charges of atoms after the ionic approximation of their bonds, and they have been widely used in charge-neutrality verification, crystal structure determination, and reaction estimation. Currently, only heuristic rules, with many exceptions, exist for guessing the oxidation states of a given compound. Recent work has developed machine learning models based on heuristic structural features for predicting the oxidation states of metal ions. However, composition-based oxidation state prediction remains elusive, even though it is more important in new-materials discovery, where structures are not yet available. This work proposes BERTOS, a novel deep learning BERT transformer language model for predicting the oxidation states of all elements of inorganic compounds given only their chemical composition. Our model achieves 96.82\% accuracy for all-element oxidation states prediction benchmarked on the cleaned ICSD dataset and achieves 97.61\% accuracy for oxide materials. We also demonstrate how it can be used to conduct large-scale screening of hypothetical material compositions for materials discovery.
    FakeEdge: Alleviate Dataset Shift in Link Prediction. (arXiv:2211.15899v1 [cs.LG])
    Link prediction is a crucial problem for graph-structured data. Due to the recent success of graph neural networks (GNNs), a variety of GNN-based models have been proposed to tackle the link prediction task. Specifically, GNNs leverage the message passing paradigm to obtain node representations, which rely on link connectivity. However, in a link prediction task, links in the training set are always present while those in the testing set are not yet formed, resulting in a discrepancy in the connectivity pattern and a bias in the learned representations. This leads to a dataset shift problem that degrades model performance. In this paper, we first identify the dataset shift problem in the link prediction task and provide theoretical analyses of how existing link prediction methods are vulnerable to it. We then propose FakeEdge, a model-agnostic technique, to address the problem by mitigating the graph topological gap between training and testing sets. Extensive experiments demonstrate the applicability and superiority of FakeEdge on multiple datasets across various domains.
    Bayesian Experimental Design for Symbolic Discovery. (arXiv:2211.15860v1 [cs.LG])
    This study concerns the formulation and application of Bayesian optimal experimental design to symbolic discovery, which is the inference from observational data of predictive models taking general functional forms. We apply constrained first-order methods to optimize an appropriate selection criterion, using Hamiltonian Monte Carlo to sample from the prior. The predictive distribution, which involves a convolution, is computed either via numerical integration or via fast transform methods.
    On the power of foundation models. (arXiv:2211.16327v1 [cs.AI])
    With infinitely many high-quality data points, infinite computational power, an infinitely large foundation model with a perfect training algorithm and guaranteed zero generalization error on the pretext task, can the model be used for everything? This question cannot be answered by the existing theory of representation, optimization or generalization, because the issues they mainly investigate are assumed to be nonexistent here. In this paper, we show that category theory provides powerful machinery to answer this question. We have proved three results. The first one limits the power of prompt-based learning, saying that the model can solve a downstream task with prompts if and only if the task is representable. The second one says fine-tuning does not have this limit, as a foundation model with the minimum power (up to symmetry) can theoretically solve downstream tasks with fine-tuning and enough resources. Our final result can be seen as a new type of generalization theorem, showing that the foundation model can generate unseen objects from the target category (e.g., images) using the structural information from the source category (e.g., texts). Along the way, we provide a categorical framework for supervised and self-supervised learning, which might be of independent interest.
    Detection of Risk Predictors of COVID-19 Mortality with Classifier Machine Learning Models Operated with Routine Laboratory Biomarkers. (arXiv:2210.12342v2 [cs.LG] UPDATED)
    Early evaluation of patients who require special care and who have a high death-expectancy in COVID-19, and the effective determination of relevant biomarkers on large sample-groups, are important to reduce mortality. This study aimed to reveal the routine blood-value predictors of COVID-19 mortality and to determine the lethal-risk levels of these predictors during the disease process. The dataset of the study consists of 38 routine blood-values of 2597 patients who died (n = 233) and who recovered (n = 2364) from COVID-19 in August-December 2021. In this study, the histogram-based gradient-boosting (HGB) model was the most successful machine-learning classifier in detecting living and deceased COVID-19 patients (with squared F1 metrics F1^2 = 1). The most efficient binary combinations with procalcitonin were obtained with D-dimer, ESR, D-Bil and ferritin. The HGB model operated with these feature pairs correctly detected almost all of the patients who survived and those who died (precision > 0.98, recall > 0.98, F1^2 > 0.98). Furthermore, in the HGB model operated with a single feature, the most efficient features were procalcitonin (F1^2 = 0.96) and ferritin (F1^2 = 0.91). In addition, according to the two-threshold approach, ferritin values between 376.2 mkg/L and 396.0 mkg/L (F1^2 = 0.91) and procalcitonin values between 0.2 mkg/L and 5.2 mkg/L (F1^2 = 0.95) were found to be fatal risk levels for COVID-19. Considering all the results, we suggest that many features combined with these features, especially procalcitonin and ferritin, operated with the HGB model, can be used to achieve very successful results in the classification of those who live and those who die from COVID-19. Moreover, we strongly recommend that clinicians consider the critical levels we have found for procalcitonin and ferritin to reduce the lethality of COVID-19.
    Exploring the Long-Term Generalization of Counting Behavior in RNNs. (arXiv:2211.16429v1 [cs.NE])
    In this study, we investigate the generalization of LSTM, ReLU and GRU models on counting tasks over long sequences. Previous theoretical work has established that RNNs with ReLU activation and LSTMs have the capacity for counting with suitable configuration, while GRUs have limitations that prevent correct counting over longer sequences. Despite this and some positive empirical results for LSTMs on Dyck-1 languages, our experimental results show that LSTMs fail to learn correct counting behavior for sequences that are significantly longer than in the training data. ReLUs show much larger variance in behavior and in most cases worse generalization. The long sequence generalization is empirically related to validation loss, but reliable long sequence generalization seems not practically achievable through backpropagation with current techniques. We demonstrate different failure modes for LSTMs, GRUs and ReLUs. In particular, we observe that the saturation of activation functions in LSTMs and the correct weight setting for ReLUs to generalize counting behavior are not achieved in standard training regimens. In summary, learning generalizable counting behavior is still an open problem and we discuss potential approaches for further research.
    Learning Antidote Data to Individual Unfairness. (arXiv:2211.15897v1 [cs.LG])
    Fairness is an essential factor for machine learning systems deployed in high-stakes applications. Among all fairness notions, individual fairness, following the consensus that `similar individuals should be treated similarly,' is a vital notion guaranteeing fair treatment for individual cases. Previous methods typically characterize individual fairness as a prediction-invariant problem when perturbing sensitive attributes, and solve it by adopting the Distributionally Robust Optimization (DRO) paradigm. However, adversarial perturbations along a direction covering sensitive information do not consider the inherent feature correlations or innate data constraints, and thus mislead the model to optimize at off-manifold and unrealistic samples. In light of this, we propose a method to learn and generate antidote data that approximately follows the data distribution to remedy individual unfairness. These on-manifold antidote data can be used through a generic optimization procedure with the original training data, resulting in a pure pre-processing approach to individual unfairness, or can also fit well with the in-processing DRO paradigm. Through extensive experiments, we demonstrate that our antidote data resists individual unfairness at minimal or zero cost to the model's predictive utility.
    BARTSmiles: Generative Masked Language Models for Molecular Representations. (arXiv:2211.16349v1 [cs.LG])
    We discover a robust self-supervised strategy tailored towards molecular representations for generative masked language models through a series of tailored, in-depth ablations. Using this pre-training strategy, we train BARTSmiles, a BART-like model with an order of magnitude more compute than previous self-supervised molecular representations. In-depth evaluations show that BARTSmiles consistently outperforms other self-supervised representations across classification, regression, and generation tasks setting a new state-of-the-art on 11 tasks. We then quantitatively show that when applied to the molecular domain, the BART objective learns representations that implicitly encode our downstream tasks of interest. For example, by selecting seven neurons from a frozen BARTSmiles, we can obtain a model having performance within two percentage points of the full fine-tuned model on task Clintox. Lastly, we show that standard attribution interpretability methods, when applied to BARTSmiles, highlight certain substructures that chemists use to explain specific properties of molecules. The code and the pretrained model are publicly available.
    OPERA: Omni-Supervised Representation Learning with Hierarchical Supervisions. (arXiv:2210.05557v2 [cs.CV] UPDATED)
    The pretrain-finetune paradigm in modern computer vision facilitates the success of self-supervised learning, which tends to achieve better transferability than supervised learning. However, with the availability of massive labeled data, a natural question emerges: how to train a better model with both self and full supervision signals? In this paper, we propose Omni-suPErvised Representation leArning with hierarchical supervisions (OPERA) as a solution. We provide a unified perspective of supervisions from labeled and unlabeled data and propose a unified framework of fully supervised and self-supervised learning. We extract a set of hierarchical proxy representations for each image and impose self and full supervisions on the corresponding proxy representations. Extensive experiments on both convolutional neural networks and vision transformers demonstrate the superiority of OPERA in image classification, segmentation, and object detection. Code is available at: https://github.com/wangck20/OPERA.
    Rethinking Transfer Learning for Medical Image Classification. (arXiv:2106.05152v6 [eess.IV] UPDATED)
    Transfer learning (TL) from pretrained deep models is a standard practice in modern medical image classification (MIC). However, what levels of features to be reused are problem-dependent, and uniformly finetuning all layers of pretrained models may be suboptimal. This insight has partly motivated the recent \emph{differential} TL strategies, such as TransFusion (TF) and layer-wise finetuning (LWFT), which treat the layers in the pretrained models differentially. In this paper, we add one more strategy into this family, called \emph{TruncatedTL}, which reuses and finetunes appropriate bottom layers and directly discards the remaining layers. This yields not only superior MIC performance but also compact models for efficient inference, compared to other differential TL methods. We validate the performance and model efficiency of TruncatedTL on three MIC tasks covering both 2D and 3D images. For example, on the BIMCV COVID-19 classification dataset, we obtain improved performance with around $1/4$ model size and $2/3$ inference time compared to the standard full TL model. Code is available at https://github.com/sun-umn/Transfer-Learning-in-Medical-Imaging.
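    A hedged sketch of the truncation idea on a torchvision ResNet-18 (the cut point and classification head below are illustrative; the paper selects the truncation point per task):
```python
import torch
import torch.nn as nn
from torchvision import models

backbone = models.resnet18(weights="IMAGENET1K_V1")
# Keep conv1 .. layer2 (bottom layers); discard layer3, layer4 and the fc head.
bottom = nn.Sequential(*list(backbone.children())[:6])
model = nn.Sequential(
    bottom,
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(128, 2),   # layer2 of ResNet-18 outputs 128 channels; 2 classes
)
x = torch.randn(1, 3, 224, 224)
print(model(x).shape)    # torch.Size([1, 2]); finetune `model` as usual
```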
    ACE: Cooperative Multi-agent Q-learning with Bidirectional Action-Dependency. (arXiv:2211.16068v1 [cs.LG])
    Multi-agent reinforcement learning (MARL) suffers from the non-stationarity problem, which is the ever-changing targets at every iteration when multiple agents update their policies at the same time. Starting from first principles, in this paper, we manage to solve the non-stationarity problem by proposing bidirectional action-dependent Q-learning (ACE). Central to the development of ACE is the sequential decision-making process wherein only one agent is allowed to take action at one time. Within this process, each agent maximizes its value function given the actions taken by the preceding agents at the inference stage. In the learning phase, each agent minimizes the TD error that is dependent on how the subsequent agents have reacted to their chosen action. Given the design of bidirectional dependency, ACE effectively turns a multi-agent MDP into a single-agent MDP. We implement the ACE framework by identifying the proper network representation to formulate the action dependency, so that the sequential decision process is computed implicitly in one forward pass. To validate ACE, we compare it with strong baselines on two MARL benchmarks. Empirical experiments demonstrate that ACE outperforms the state-of-the-art algorithms on Google Research Football and StarCraft Multi-Agent Challenge by a large margin. In particular, on SMAC tasks, ACE achieves 100% success rate on almost all the hard and super-hard maps. We further study extensive research problems regarding ACE, including extension, generalization, and practicability. Code is made available to facilitate further research.
    Quantization-aware Interval Bound Propagation for Training Certifiably Robust Quantized Neural Networks. (arXiv:2211.16187v1 [cs.LG])
    We study the problem of training and certifying adversarially robust quantized neural networks (QNNs). Quantization is a technique for making neural networks more efficient by running them using low-bit integer arithmetic and is therefore commonly adopted in industry. Recent work has shown that floating-point neural networks that have been verified to be robust can become vulnerable to adversarial attacks after quantization, and certification of the quantized representation is necessary to guarantee robustness. In this work, we present quantization-aware interval bound propagation (QA-IBP), a novel method for training robust QNNs. Inspired by advances in robust learning of non-quantized networks, our training algorithm computes the gradient of an abstract representation of the actual network. Unlike existing approaches, our method can handle the discrete semantics of QNNs. Based on QA-IBP, we also develop a complete verification procedure for verifying the adversarial robustness of QNNs, which is guaranteed to terminate and produce a correct answer. Compared to existing approaches, the key advantage of our verification procedure is that it runs entirely on GPU or other accelerator devices. We demonstrate experimentally that our approach significantly outperforms existing methods and establish the new state-of-the-art for training and certifying the robustness of QNNs.
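    The core operation in IBP-style training and certification is propagating an input box through each layer; a minimal sketch for one affine layer (the quantization-specific handling of discrete semantics that QA-IBP adds is omitted, and the weights are illustrative):
```python
import numpy as np

def ibp_affine(lo, hi, W, b):
    """Propagate elementwise input bounds [lo, hi] through x -> W @ x + b."""
    mid, rad = (hi + lo) / 2.0, (hi - lo) / 2.0
    out_mid = W @ mid + b
    out_rad = np.abs(W) @ rad      # worst case over the input box
    return out_mid - out_rad, out_mid + out_rad

W = np.array([[1.0, -2.0], [0.5, 1.0]])
b = np.zeros(2)
lo, hi = np.zeros(2), np.full(2, 0.1)   # an eps-box around the input
print(ibp_affine(lo, hi, W, b))
# A ReLU then maps bounds elementwise: (np.maximum(lo, 0), np.maximum(hi, 0)).
```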
    Two Is Better Than One: Dual Embeddings for Complementary Product Recommendations. (arXiv:2211.14982v2 [cs.IR] UPDATED)
    Embedding-based product recommendations have gained popularity in recent years due to their ability to integrate easily into large-scale systems and to allow nearest neighbor searches in real-time. The bulk of studies in this area have predominantly focused on similar item recommendations. Research on complementary item recommendations, on the other hand, remains considerably under-explored. We define similar items as items that are interchangeable in terms of their utility and complementary items as items that serve different purposes, yet are compatible when used with one another. In this paper, we apply a novel approach to finding complementary items by leveraging dual embedding representations for products. We demonstrate that the notion of relatedness discovered in NLP for skip-gram negative sampling (SGNS) models translates effectively to the concept of complementarity when training item representations using co-purchase data. Since sparsity of purchase data is a major challenge in real-world scenarios, we further augment the model using synthetic samples to extend coverage. This allows the model to provide complementary recommendations for items that do not share co-purchase data by leveraging other abundantly available data modalities such as images, text, clicks etc. We establish the effectiveness of our approach in improving both coverage and quality of recommendations on real world data for a major online retail company. We further show the importance of task specific hyperparameter tuning in training SGNS. Our model is effective yet simple to implement, making it a great candidate for generating complementary item recommendations at any e-commerce website.
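    A hedged sketch of the dual-embedding idea using gensim's skip-gram with negative sampling on toy co-purchase baskets: complementarity is scored with the IN-OUT dot product rather than the usual IN-IN similarity (gensim exposes the output matrix as syn1neg; the baskets and hyperparameters are illustrative, not the paper's setup).
```python
import numpy as np
from gensim.models import Word2Vec

baskets = [["phone", "case", "charger"],
           ["phone", "screen_protector"],
           ["case", "phone"]] * 50          # toy co-purchase "sentences"

model = Word2Vec(baskets, vector_size=16, sg=1, negative=5,
                 window=5, min_count=1, epochs=20, seed=0)

def complementarity(a, b):
    v_in = model.wv[a]                                  # IN embedding of a
    v_out = model.syn1neg[model.wv.key_to_index[b]]     # OUT embedding of b
    return float(np.dot(v_in, v_out))

print(complementarity("phone", "case"))                # co-purchased: higher
print(complementarity("charger", "screen_protector"))  # never co-purchased: lower
```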
    The Vanishing Decision Boundary Complexity and the Strong First Component. (arXiv:2211.16209v1 [cs.LG])
    We show that, unlike in classical machine learning classifiers, there are no complex boundary structures in the decision boundaries of well-trained deep models. However, we find that complicated structures do appear during training; they vanish shortly after they take shape. This is pessimistic news for anyone seeking to capture different levels of complexity in the decision boundary as a way to understand generalization, an approach that works well in classical machine learning. Nonetheless, we find that the decision boundaries of predecessor models on the training data are reflective of the final model's generalization. We show how to use the predecessor decision boundaries to study the generalization of deep models. We have three major findings: one on the strength of the first principal component of deep models, another on the singularity of optimizers, and the third on the effects of skip connections in ResNets. Code is at https://github.com/hengshu1/decision_boundary_github.
    Entropy Approximation by Machine Learning Regression: Application for Irregularity Evaluation of Images in Remote Sensing. (arXiv:2210.06901v2 [cs.LG] UPDATED)
    Approximation of entropies of various types using machine learning (ML) regression methods is shown for the first time. The ML models presented in this study estimate the complexity of short time series by approximating dissimilar entropy techniques such as singular value decomposition entropy (SvdEn), permutation entropy (PermEn), sample entropy (SampEn), and neural network entropy (NNetEn), together with their 2D analogues. A new method for calculating SvdEn2D, PermEn2D, and SampEn2D for 2D images was tested using the technique of circular kernels. Training and testing datasets based on Sentinel-2 images are presented (two training images and 198 testing images). The results of entropy approximation are demonstrated by calculating the 2D entropy of Sentinel-2 images and evaluating the R^2 metric. The applicability of the method is shown for short time series with lengths from N = 5 to N = 113 elements. The R^2 metric tends to decrease as the length of the time series increases. For SvdEn entropy, the regression accuracy is R^2 > 0.99 for N = 5 and R^2 > 0.82 for N = 113. The best metrics were observed for the ML_SvdEn2D and ML_NNetEn2D models. The results of the study can be used for fundamental research on entropy approximation of various types using ML regression, as well as for accelerating entropy calculations in remote sensing. The versatility of the models is shown on synthetic chaotic time series generated with the Planck map and the logistic map.
    The Union of Manifolds Hypothesis. (arXiv:2207.02862v2 [stat.ML] UPDATED)
    Deep learning has had tremendous success at learning low-dimensional representations of high-dimensional data. This success would be impossible if there was no hidden low-dimensional structure in data of interest; this existence is posited by the manifold hypothesis, which states that the data lies on an unknown manifold of low intrinsic dimension. In this paper, we argue that this hypothesis does not properly capture the low-dimensional structure typically present in image data. Assuming that data lies on a single manifold implies intrinsic dimension is identical across the entire data space, and does not allow for subregions of this space to have a different number of factors of variation. To address this deficiency, we put forth the union of manifolds hypothesis, which states that data lies on a disjoint union of manifolds of varying intrinsic dimensions. We empirically verify this hypothesis on commonly-used image datasets, finding that indeed, observed data lies on a disconnected set and that intrinsic dimension is not constant. We also provide insights into the implications the union of manifolds hypothesis has for deep learning, both supervised and unsupervised, showing that designing models with an inductive bias for this structure improves performance across classification and generative modelling tasks.
    Minimax AUC Fairness: Efficient Algorithm with Provable Convergence. (arXiv:2208.10451v2 [cs.LG] UPDATED)
    The use of machine learning models in consequential decision making often exacerbates societal inequity, in particular yielding disparate impact on members of marginalized groups defined by race and gender. The area under the ROC curve (AUC) is widely used to evaluate the performance of a scoring function in machine learning, but it has been studied less in algorithmic fairness than other performance metrics. Due to the pairwise nature of the AUC, an AUC-based group fairness metric is pairwise-dependent and may involve both \emph{intra-group} and \emph{inter-group} AUCs. Importantly, considering only one category of AUCs is not sufficient to mitigate unfairness in AUC optimization. In this paper, we propose a minimax learning and bias mitigation framework that incorporates both intra-group and inter-group AUCs while maintaining utility. Based on this Rawlsian framework, we design an efficient stochastic optimization algorithm and prove its convergence to the minimum group-level AUC. We conduct numerical experiments on both synthetic and real-world datasets to validate the effectiveness of the minimax framework and the proposed optimization algorithm.
    jaCappella Corpus: A Japanese a Cappella Vocal Ensemble Corpus. (arXiv:2211.16028v1 [eess.AS])
    We construct a corpus of Japanese a cappella vocal ensembles (the jaCappella corpus) for vocal ensemble separation and synthesis. It consists of 35 copyright-cleared vocal ensemble songs and audio recordings of their individual voice parts. These songs were arranged from out-of-copyright Japanese children's songs and have six voice parts (lead vocal, soprano, alto, tenor, bass, and vocal percussion). They are divided into seven subsets, each of which features the typical characteristics of a music genre such as jazz or enka. This variety in genre and voice part matches the vocal ensembles that have recently become widespread on social media services such as YouTube, whereas conventional vocal ensemble datasets mainly target choral singing made up of soprano, alto, tenor, and bass. Experimental evaluation demonstrates that our corpus is a challenging resource for vocal ensemble separation. Our corpus is available on our project page (https://tomohikonakamura.github.io/jaCappella_corpus/).
    PaCMO: Partner Dependent Human Motion Generation in Dyadic Human Activity using Neural Operators. (arXiv:2211.16210v1 [cs.CV])
    We address the problem of generating 3D human motions in dyadic activities. In contrast to concurrent works, which mainly focus on generating the motion of a single actor from a textual description, we generate the motion of one actor from the motion of the other participating actor. This is a particularly challenging, under-explored problem that requires learning the intricate relationships between the motions of two actors participating in an action, as well as identifying the action from the motion of one actor. To address these challenges, we propose the partner conditioned motion operator (PaCMO), a neural operator-based generative model which learns the distribution of human motion conditioned on the partner's motion in function spaces through adversarial training. Our model can handle long unlabeled action sequences at arbitrary time resolution. We also introduce the "Functional Frechet Inception Distance" ($F^2ID$) metric for capturing the similarity between real and generated data in function spaces. We test PaCMO on the NTU RGB+D and DuetDance datasets; our model produces realistic results, as evidenced by the $F^2ID$ score and the conducted user study.
    Out-Of-Distribution Detection Is Not All You Need. (arXiv:2211.16158v1 [cs.LG])
    The usage of deep neural networks in safety-critical systems is limited by our ability to guarantee their correct behavior. Runtime monitors are components that aim to identify unsafe predictions and discard them before they can lead to catastrophic consequences. Several recent works on runtime monitoring have focused on out-of-distribution (OOD) detection, i.e., identifying inputs that are different from the training data. In this work, we argue that OOD detection is not a well-suited framework for designing efficient runtime monitors and that it is more relevant to evaluate monitors based on their ability to discard incorrect predictions. We call this setting out-of-model-scope detection and discuss the conceptual differences with OOD. We also conduct extensive experiments on popular datasets from the literature to show that studying monitors in the OOD setting can be misleading: (1) very good OOD results can give a false impression of safety, and (2) comparison under the OOD setting does not allow identifying the best monitor to detect errors. Finally, we also show that removing erroneous training data samples helps to train better monitors.
    Adap DP-FL: Differentially Private Federated Learning with Adaptive Noise. (arXiv:2211.15893v1 [cs.LG])
    Federated learning seeks to address the issue of isolated data islands by having clients share only their locally trained models. However, it has been demonstrated that private information can still be inferred by analyzing local model parameters, such as deep neural network weights. Recently, differential privacy has been applied to federated learning to protect data privacy, but the added noise may substantially degrade learning performance. In previous work, training parameters were typically clipped equally and noise was added uniformly; the heterogeneity and convergence of training parameters were simply not considered. In this paper, we propose a differentially private scheme for federated learning with adaptive noise (Adap DP-FL). Specifically, because of gradient heterogeneity, we perform adaptive gradient clipping for different clients and different rounds; because of gradient convergence, we add noise of decreasing magnitude as training proceeds. Extensive experiments on real-world datasets demonstrate that Adap DP-FL significantly outperforms previous methods.
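    To make the two mechanisms concrete, here is a minimal sketch of per-client adaptive clipping with round-decaying Gaussian noise; the clipping heuristic, decay schedule, and parameter names are illustrative assumptions, not the paper's exact rules.
        import numpy as np

        def privatize_update(update, round_idx, base_clip=1.0, base_sigma=1.0, decay=0.99):
            # Adaptive clipping: adapt the threshold to this client's update norm
            # (an illustrative heuristic; the paper derives its own per-client rule).
            norm = np.linalg.norm(update)
            clip = min(base_clip, norm)
            clipped = update * (clip / max(norm, 1e-12))
            # Decreasing noise: gradients shrink as training converges,
            # so later rounds inject less noise.
            sigma = base_sigma * (decay ** round_idx)
            return clipped + np.random.normal(0.0, sigma * clip, size=update.shape)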
    On the Ability of Graph Neural Networks to Model Interactions Between Vertices. (arXiv:2211.16494v1 [cs.LG])
    Graph neural networks (GNNs) are widely used for modeling complex interactions between entities represented as vertices of a graph. Despite recent efforts to theoretically analyze the expressive power of GNNs, a formal characterization of their ability to model interactions is lacking. The current paper aims to address this gap. Formalizing the strength of interactions through an established measure known as separation rank, we quantify the ability of certain GNNs to model interaction between a given subset of vertices and its complement, i.e., between the two sides of a given partition of the input vertices. Our results reveal that the ability to model interaction is primarily determined by the partition's walk index -- a graph-theoretical characteristic that we define via the number of walks originating from the boundary of the partition. Experiments with common GNN architectures corroborate this finding. As a practical application of our theory, we design an edge sparsification algorithm named Walk Index Sparsification (WIS), which preserves the ability of a GNN to model interactions when input edges are removed. WIS is simple, computationally efficient, and markedly outperforms alternative methods in terms of induced prediction accuracy. More broadly, it showcases the potential of improving GNNs by theoretically analyzing the interactions they can model.
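    As a rough illustration of the quantity named above, the sketch below counts the length-L walks originating from the boundary of a vertex partition; this follows one plausible reading of the walk index, and the paper's exact definition may differ.
        import numpy as np

        def walk_index(adj, side, length):
            # adj: (n, n) 0/1 adjacency matrix; side: vertex indices on one side
            # of the partition; length: walk length L.
            n = adj.shape[0]
            side = set(side)
            # Boundary: vertices incident to an edge crossing the partition.
            boundary = sorted({v for v in range(n) for u in range(n)
                               if adj[v, u] and ((v in side) != (u in side))})
            # Row sums of A^L count the length-L walks starting at each vertex.
            walks_from = np.linalg.matrix_power(adj, length).sum(axis=1)
            return walks_from[boundary].sum()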
    Asymptotic consistency of the WSINDy algorithm in the limit of continuum data. (arXiv:2211.16000v1 [math.NA])
    In this work we study the asymptotic consistency of the weak-form sparse identification of nonlinear dynamics algorithm (WSINDy) in the identification of differential equations from noisy samples of solutions. We prove that the WSINDy estimator is unconditionally asymptotically consistent for a wide class of models which includes the Navier-Stokes equations and the Kuramoto-Sivashinsky equation. We thus provide a mathematically rigorous explanation for the observed robustness to noise of weak-form equation learning. Conversely, we also show that in general the WSINDy estimator is only conditionally asymptotically consistent, yielding discovery of spurious terms with probability one if the noise level is above some critical threshold and the nonlinearities exhibit sufficiently fast growth. We derive explicit bounds on the critical noise threshold in the case of Gaussian white noise and provide an explicit characterization of these spurious terms in the case of trigonometric and/or polynomial model nonlinearities. However, a silver lining to this negative result is that if the data is suitably denoised (a simple moving average filter is sufficient), then we recover unconditional asymptotic consistency on the class of models with locally-Lipschitz nonlinearities. Altogether, our results reveal several important aspects of weak-form equation learning which may be used to improve future algorithms. We demonstrate our results numerically using the Lorenz system, the cubic oscillator, a viscous Burgers growth model, and a Kuramoto-Sivashinsky-type higher-order PDE.
    Learning to Optimize with Dynamic Mode Decomposition. (arXiv:2211.16268v1 [cs.LG])
    Designing faster optimization algorithms is of ever-growing interest. In recent years, learning-to-learn methods, which learn how to optimize, have demonstrated very encouraging results. Current approaches usually do not effectively include the dynamics of the optimization process during training: they either omit it entirely or implicitly assume that each parameter evolves in isolation. In this paper, we show how to utilize the dynamic mode decomposition method to extract informative features about optimization dynamics. By employing those features, we show that our learned optimizer generalizes much better to unseen optimization problems. The improved generalization is illustrated on multiple tasks where an optimizer trained on one neural network generalizes to different architectures and distinct datasets.
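    For readers unfamiliar with the tool, the standard SVD-based (exact) DMD computation on a matrix of trajectory snapshots is sketched below; how the learned optimizer consumes the resulting eigenvalues and modes is specific to the paper and not reproduced here.
        import numpy as np

        def dmd(snapshots, rank):
            # snapshots: (d, m) matrix whose columns are successive iterates.
            X, Y = snapshots[:, :-1], snapshots[:, 1:]
            U, s, Vh = np.linalg.svd(X, full_matrices=False)
            U, s, Vh = U[:, :rank], s[:rank], Vh[:rank]
            # Best-fit linear map Y ~ A X, projected onto the leading subspace.
            A_tilde = U.conj().T @ Y @ Vh.conj().T @ np.diag(1.0 / s)
            eigvals, W = np.linalg.eig(A_tilde)
            modes = Y @ Vh.conj().T @ np.diag(1.0 / s) @ W  # exact DMD modes
            return eigvals, modes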
    Impact of Automatic Image Classification and Blind Deconvolution in Improving Text Detection Performance of the CRAFT Algorithm. (arXiv:2211.15999v1 [cs.CV])
    Text detection in natural scenes has been a significant and active research subject in computer vision and document analysis because of its wide range of applications, as evidenced by the emergence of the Robust Reading Competition. One of the algorithms with good text detection performance in the said competition is Character Region Awareness for Text Detection (CRAFT). Employing the ICDAR 2013 dataset, this study investigates the impact of automatic image classification and blind deconvolution as image pre-processing steps to further enhance the text detection performance of CRAFT. The proposed technique automatically classifies the scene images into two categories, blurry and non-blurry, by utilizing a Laplacian operator with a threshold of 100. Prior to applying the CRAFT algorithm, images categorized as blurry are further pre-processed using blind deconvolution to reduce the blur. The results reveal that the proposed method significantly enhances the detection performance of CRAFT, achieving an IoU h-mean of 94.47% compared with the original CRAFT's h-mean of 91.42%, even outperforming the top-ranked SenseTime, whose h-mean is 93.62%.
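    The blur test itself is easy to reproduce; a minimal sketch with OpenCV follows, using a Wiener-type deconvolution from scikit-image as a stand-in for the study's blind deconvolution step (the 5x5 uniform kernel is an assumption, not the paper's setting).
        import cv2
        import numpy as np
        from skimage import restoration

        def preprocess(image_bgr, threshold=100.0):
            gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
            # Variance of the Laplacian: low variance means few sharp edges, i.e. blur.
            if cv2.Laplacian(gray, cv2.CV_64F).var() < threshold:
                psf = np.ones((5, 5)) / 25.0  # assumed point-spread function
                deblurred, _ = restoration.unsupervised_wiener(gray / 255.0, psf)
                return (np.clip(deblurred, 0.0, 1.0) * 255).astype(np.uint8)
            return gray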
    Survey on Self-Supervised Multimodal Representation Learning and Foundation Models. (arXiv:2211.15837v1 [cs.LG])
    Deep learning has been the subject of growing interest in recent years. In particular, multimodal learning has shown great promise for solving a wide range of problems in domains such as language, vision, and audio. One promising research direction for further improvement is learning rich and robust low-dimensional data representations of the high-dimensional world with the help of the large-scale datasets present on the internet. Because of its potential to avoid the cost of annotating large-scale datasets, self-supervised learning has become the de facto standard for this task in recent years. This paper summarizes some of the landmark research papers that are directly or indirectly responsible for building the foundation of today's multimodal self-supervised representation learning. The paper reviews the development of representation learning over the last few years for each modality and how the modalities were later combined to obtain a multimodal agent.
    Physics Informed Neural Network for Dynamic Stress Prediction. (arXiv:2211.16190v1 [cs.LG])
    Structural failures are often caused by catastrophic events such as earthquakes and high winds. It is therefore crucial to predict dynamic stress distributions during such highly disruptive events in real time. Currently available high-fidelity methods, such as finite element models (FEMs), suffer from inherently high complexity. To reduce computational cost while maintaining accuracy, we propose a physics-informed neural network (PINN), the PINN-Stress model, to predict the entire sequence of stress distributions, trained on finite element simulations generated with a partial differential equation (PDE) solver. Using automatic differentiation, we embed the PDE into the deep neural network's loss function to incorporate information from both measurements and the PDE. The PINN-Stress model can predict the sequence of stress distributions in almost real time and generalizes better than the same model trained without the PINN loss.
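    To make the "PDE in the loss" idea concrete, below is a minimal PINN-style residual loss in PyTorch for the 1D heat equation u_t = c*u_xx, used here as a stand-in for the structural-dynamics PDE in the paper; the full training loss would add a data-fitting term on measurements.
        import torch

        def pde_residual_loss(model, x, t, c=1.0):
            # Penalize violations of u_t - c * u_xx = 0 at collocation points (x, t).
            x = x.clone().requires_grad_(True)
            t = t.clone().requires_grad_(True)
            u = model(torch.stack([x, t], dim=-1))
            u_x, u_t = torch.autograd.grad(u.sum(), [x, t], create_graph=True)
            u_xx = torch.autograd.grad(u_x.sum(), x, create_graph=True)[0]
            return ((u_t - c * u_xx) ** 2).mean()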
    Catch Me If You Hear Me: Audio-Visual Navigation in Complex Unmapped Environments with Moving Sounds. (arXiv:2111.14843v3 [cs.SD] UPDATED)
    Audio-visual navigation combines sight and hearing to navigate to a sound-emitting source in an unmapped environment. While recent approaches have demonstrated the benefits of audio input to detect and find the goal, they focus on clean and static sound sources and struggle to generalize to unheard sounds. In this work, we propose the novel dynamic audio-visual navigation benchmark which requires catching a moving sound source in an environment with noisy and distracting sounds, posing a range of new challenges. We introduce a reinforcement learning approach that learns a robust navigation policy for these complex settings. To achieve this, we propose an architecture that fuses audio-visual information in the spatial feature space to learn correlations of geometric information inherent in both local maps and audio signals. We demonstrate that our approach consistently outperforms the current state-of-the-art by a large margin across all tasks of moving sounds, unheard sounds, and noisy environments, on two challenging 3D scanned real-world environments, namely Matterport3D and Replica. The benchmark is available at this http URL
    BatmanNet: Bi-branch Masked Graph Transformer Autoencoder for Molecular Representation. (arXiv:2211.13979v2 [cs.LG] UPDATED)
    Although substantial efforts have been made using graph neural networks (GNNs) for AI-driven drug discovery (AIDD), effective molecular representation learning remains an open challenge, especially in the case of insufficient labeled molecules. Recent studies suggest that big GNN models pre-trained by self-supervised learning on unlabeled datasets enable better transfer performance in downstream molecular property prediction tasks. However, they often require large-scale datasets and considerable computational resources, which is time-consuming, computationally expensive, and environmentally unfriendly. To alleviate these limitations, we propose a novel pre-training model for molecular representation learning, the Bi-branch Masked Graph Transformer Autoencoder (BatmanNet). BatmanNet features two tailored and complementary graph autoencoders that reconstruct the missing nodes and edges from a masked molecular graph. Surprisingly, we found that a high masking proportion (60% of the atoms and bonds) achieved the best performance. We further propose an asymmetric graph-based encoder-decoder architecture for both nodes and edges, where a transformer-based encoder takes only the visible subset of nodes or edges, and a lightweight decoder reconstructs the original molecule from the latent representation and mask tokens. With this simple yet effective asymmetric design, BatmanNet can learn efficiently even from a much smaller unlabeled molecular dataset, capturing the underlying structural and semantic information and overcoming a major limitation of current deep neural networks for molecular representation learning. For instance, using only 250K unlabeled molecules as pre-training data, our BatmanNet with 2.575M parameters achieves a 0.5% improvement in average AUC compared with the current state-of-the-art method with 100M parameters pre-trained on 11M molecules.
    Approximate Gibbs Sampler for Efficient Inference of Hierarchical Bayesian Models for Grouped Count Data. (arXiv:2211.15771v1 [cs.LG])
    Hierarchical Bayesian Poisson regression models (HBPRMs) provide a flexible modeling approach for the relationship between predictors and count response variables. Applying HBPRMs to large-scale datasets requires efficient inference algorithms due to the high computational cost of inferring many model parameters via random sampling. Although Markov chain Monte Carlo (MCMC) algorithms have been widely used for Bayesian inference, sampling with this class of algorithms is time-consuming for applications with large-scale data and time-sensitive decision-making, partly due to the non-conjugacy of many models. To overcome this limitation, this research develops an approximate Gibbs sampler (AGS) that efficiently learns HBPRMs while maintaining inference accuracy. In the proposed sampler, the data likelihood is approximated with a Gaussian distribution so that the conditional posterior of the coefficients has a closed-form solution. Numerical experiments using real and synthetic datasets with small and large counts demonstrate the superior performance of AGS in comparison to the state-of-the-art sampling algorithm, especially for large datasets.
    DiffuseVAE: Efficient, Controllable and High-Fidelity Generation from Low-Dimensional Latents. (arXiv:2201.00308v3 [cs.LG] UPDATED)
    Diffusion probabilistic models have been shown to generate state-of-the-art results on several competitive image synthesis benchmarks but lack a low-dimensional, interpretable latent space, and are slow at generation. On the other hand, standard Variational Autoencoders (VAEs) typically have access to a low-dimensional latent space but exhibit poor sample quality. We present DiffuseVAE, a novel generative framework that integrates VAE within a diffusion model framework, and leverage this to design novel conditional parameterizations for diffusion models. We show that the resulting model equips diffusion models with a low-dimensional VAE inferred latent code which can be used for downstream tasks like controllable synthesis. The proposed method also improves upon the speed vs quality tradeoff exhibited in standard unconditional DDPM/DDIM models (for instance, FID of 16.47 vs 34.36 using a standard DDIM on the CelebA-HQ-128 benchmark using T=10 reverse process steps) without having explicitly trained for such an objective. Furthermore, the proposed model exhibits synthesis quality comparable to state-of-the-art models on standard image synthesis benchmarks like CIFAR-10 and CelebA-64 while outperforming most existing VAE-based methods. Lastly, we show that the proposed method exhibits inherent generalization to different types of noise in the conditioning signal. For reproducibility, our source code is publicly available at https://github.com/kpandey008/DiffuseVAE.
    A memory-efficient neural ODE framework based on high-level adjoint differentiation. (arXiv:2206.01298v2 [cs.LG] UPDATED)
    Neural ordinary differential equations (neural ODEs) have emerged as a novel network architecture that bridges dynamical systems and deep learning. However, the gradient obtained with the continuous adjoint method in the vanilla neural ODE is not reverse-accurate. Other approaches suffer either from an excessive memory requirement due to deep computational graphs or from limited choices for the time integration scheme, hampering their application to large-scale complex dynamical systems. To achieve accurate gradients without compromising memory efficiency and flexibility, we present a new neural ODE framework, PNODE, based on high-level discrete adjoint algorithmic differentiation. By leveraging discrete adjoint time integrators and advanced checkpointing strategies tailored for these integrators, PNODE can provide a balance between memory and computational costs, while computing the gradients consistently and accurately. We provide an open-source implementation based on PyTorch and PETSc, one of the most commonly used portable, scalable scientific computing libraries. We demonstrate the performance through extensive numerical experiments on image classification and continuous normalizing flow problems. We show that PNODE achieves the highest memory efficiency when compared with other reverse-accurate methods. On the image classification problems, PNODE is up to two times faster than the vanilla neural ODE and up to 2.3 times faster than the best existing reverse-accurate method. We also show that PNODE enables the use of the implicit time integration methods that are needed for stiff dynamical systems.
    Compressing Cross-Lingual Multi-Task Models at Qualtrics. (arXiv:2211.15927v1 [cs.CL])
    Experience management is an emerging business area where organizations focus on understanding the feedback of customers and employees in order to improve their end-to-end experiences. This results in a unique set of machine learning problems to help understand how people feel, discover the issues they care about, and find which actions need to be taken on data that are different in content and distribution from traditional NLP domains. In this paper, we present a case study of building text analysis applications that perform multiple classification tasks efficiently in 12 languages in the nascent business area of experience management. In order to scale up modern ML methods on experience data, we leverage cross-lingual and multi-task modeling techniques to consolidate our models into a single deployment and avoid overhead. We also make use of model compression and model distillation to reduce the overall inference latency and hardware cost to a level acceptable for business needs while maintaining model prediction quality. Our findings show that multi-task modeling improves task performance for a subset of experience management tasks in both XLM-R and mBert architectures. Among the compressed architectures we explored, we found that MiniLM achieved the best compression/performance trade-off. Our case study demonstrates a speedup of up to 15.61x with 2.60% average task degradation (or a 3.29x speedup with 1.71% degradation) and estimated savings of 44% over using the original full-size model. These results demonstrate a successful scaling up of text classification for the challenging new area of ML for experience management.
    Latent Graph Inference using Product Manifolds. (arXiv:2211.16199v1 [cs.LG])
    Graph Neural Networks usually rely on the assumption that the graph topology is available to the network and is optimal for the downstream task. Latent graph inference allows models to dynamically learn the intrinsic graph structure of problems where the connectivity patterns of data may not be directly accessible. In this work, we generalize the discrete Differentiable Graph Module (dDGM) for latent graph learning. The original dDGM architecture used the Euclidean plane to encode latent features, based on which the latent graphs were generated. By incorporating Riemannian geometry into the model and generating more complex embedding spaces, we can improve the performance of the latent graph inference system. In particular, we propose a computationally tractable approach to produce product manifolds of constant-curvature model spaces that can encode latent features of varying structure. The latent representations mapped onto the inferred product manifold are used to compute richer similarity measures that are leveraged by the latent graph learning model to obtain optimized latent graphs. Moreover, the curvature of the product manifold is learned during training alongside the rest of the network parameters, based on the downstream task, rather than being a static embedding space. Our novel approach is tested on a wide range of datasets and outperforms the original dDGM model.
    Controllable speech synthesis by learning discrete phoneme-level prosodic representations. (arXiv:2211.16307v1 [cs.SD])
    In this paper, we present a novel method for phoneme-level prosody control of F0 and duration using intuitive discrete labels. We propose an unsupervised prosodic clustering process which is used to discretize phoneme-level F0 and duration features from a multispeaker speech dataset. These features are fed as an input sequence of prosodic labels to a prosody encoder module which augments an autoregressive attention-based text-to-speech model. We utilize various methods in order to improve prosodic control range and coverage, such as augmentation, F0 normalization, balanced clustering for duration and speaker-independent clustering. The final model enables fine-grained phoneme-level prosody control for all speakers contained in the training set, while maintaining the speaker identity. Instead of relying on reference utterances for inference, we introduce a prior prosody encoder which learns the style of each speaker and enables speech synthesis without the requirement of reference audio. We also fine-tune the multispeaker model to unseen speakers with limited amounts of data, as a realistic application scenario and show that the prosody control capabilities are maintained, verifying that the speaker-independent prosodic clustering is effective. Experimental results show that the model has high output speech quality and that the proposed method allows efficient prosody control within each speaker's range despite the variability that a multispeaker setting introduces.
    On Large-Scale Multiple Testing Over Networks: An Asymptotic Approach. (arXiv:2211.16059v1 [stat.ME])
    This work concerns developing communication- and computation-efficient methods for large-scale multiple testing over networks, which is of interest to many practical applications. We take an asymptotic approach and propose two methods, proportion-matching and greedy aggregation, tailored to distributed settings. The proportion-matching method achieves the global BH performance yet requires only a one-shot communication of the (estimated) proportion of true null hypotheses and the number of p-values at each node. By focusing on the asymptotic optimal power, we go beyond the BH procedure and provide an explicit characterization of the asymptotic optimal solution. This leads to the greedy aggregation method, which effectively approximates the optimal rejection regions at each node, while its computational efficiency comes naturally from the greedy approach. Extensive numerical results over a variety of challenging settings are provided to support our theoretical findings.
    Prioritized Experience-based Reinforcement Learning with Human Guidance for Autonomous Driving. (arXiv:2109.12516v2 [cs.LG] UPDATED)
    Reinforcement learning (RL) requires skillful problem definition and remarkable computational effort to solve optimization and control problems, which could impair its prospects. Introducing human guidance into reinforcement learning is a promising way to improve learning performance. In this paper, a comprehensive human guidance-based reinforcement learning framework is established. A novel prioritized experience replay mechanism that adapts to human guidance in the reinforcement learning process is proposed to boost the efficiency and performance of the reinforcement learning algorithm. To relieve the heavy workload on human participants, a behavior model is established based on an incremental online learning method to mimic human actions. We design two challenging autonomous driving tasks for evaluating the proposed algorithm. Experiments are conducted to assess the training and testing performance and the learning mechanism of the proposed algorithm. Comparative results against state-of-the-art methods suggest the advantages of our algorithm in terms of learning efficiency, performance, and robustness.
    The Curse of Zero Task Diversity: On the Failure of Transfer Learning to Outperform MAML and their Empirical Equivalence. (arXiv:2112.13121v4 [cs.LG] UPDATED)
    Recently, it has been observed that a transfer learning solution might be all we need to solve many few-shot learning benchmarks -- thus raising important questions about when and how meta-learning algorithms should be deployed. In this paper, we seek to clarify these questions by proposing a novel metric -- the diversity coefficient -- to measure the diversity of tasks in a few-shot learning benchmark. We hypothesize that the diversity coefficient of the few-shot learning benchmark is predictive of whether meta-learning solutions will succeed or not. Using the diversity coefficient, we show that the MiniImagenet benchmark has zero diversity. This novel insight contextualizes claims that transfer learning solutions are better than meta-learned solutions. Specifically, we empirically find that a diversity coefficient of zero correlates with a high similarity between transfer learning and Model-Agnostic Meta-Learning (MAML) learned solutions in terms of meta-accuracy (at meta-test time). Therefore, we conjecture meta-learned solutions have the same meta-test performance as transfer learning when the diversity coefficient is zero. Our work provides the first test of whether diversity correlates with meta-learning success.
    Preservation of the Global Knowledge by Not-True Distillation in Federated Learning. (arXiv:2106.03097v5 [cs.LG] UPDATED)
    In federated learning, a strong global model is collaboratively learned by aggregating clients' locally trained models. Although this precludes the need to access clients' data directly, the global model's convergence often suffers from data heterogeneity. This study starts from an analogy to continual learning and suggests that forgetting could be the bottleneck of federated learning. We observe that the global model forgets knowledge from previous rounds, and that local training induces forgetting of knowledge outside the local distribution. Based on our findings, we hypothesize that tackling forgetting will relieve the data heterogeneity problem. To this end, we propose a novel and effective algorithm, Federated Not-True Distillation (FedNTD), which preserves the global perspective on locally available data only for the not-true classes. In the experiments, FedNTD achieves state-of-the-art performance on various setups without compromising data privacy or incurring additional communication costs.
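    One plausible reading of the not-true distillation term, sketched for intuition: the local model is matched to the global model's predictions after removing the ground-truth class and renormalizing over the remaining classes (the temperature and weighting here are assumptions, not FedNTD's exact choices).
        import torch
        import torch.nn.functional as F

        def not_true_distillation_loss(local_logits, global_logits, labels, tau=1.0):
            # Drop the true-class logit, renormalize over the remaining classes,
            # and distill the global model's view of those not-true classes.
            num_classes = local_logits.size(1)
            mask = F.one_hot(labels, num_classes).bool()
            local_nt = local_logits[~mask].view(-1, num_classes - 1)
            global_nt = global_logits[~mask].view(-1, num_classes - 1)
            log_p_local = F.log_softmax(local_nt / tau, dim=1)
            p_global = F.softmax(global_nt / tau, dim=1)
            return F.kl_div(log_p_local, p_global, reduction='batchmean') * tau ** 2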
    Joint Deep Reversible Regression Model and Physics-Informed Unsupervised Learning for Temperature Field Reconstruction. (arXiv:2106.11929v5 [cs.LG] UPDATED)
    Temperature monitoring during the lifetime of heat-source components in engineering systems is essential to guarantee their normal operation and working life. However, prior methods, which mainly use interpolation to reconstruct the temperature field from limited monitoring points, require large amounts of temperature measurements for an accurate estimation. This may decrease the availability and reliability of the system and sharply increase the monitoring cost. To solve this problem, this work develops a novel physics-informed deep reversible regression model for temperature field reconstruction of heat-source systems (TFR-HSS), which can better reconstruct the temperature field from limited monitoring points without supervision. First, we define the TFR-HSS task mathematically, model it numerically, and thereby transform the task into an image-to-image regression problem. Then this work develops the deep reversible regression model, which can better learn the physical information, especially near the boundary. Finally, considering the physical characteristics of heat conduction as well as the boundary conditions, this work proposes a physics-informed reconstruction loss comprising four training losses and jointly learns the deep surrogate model with these losses in an unsupervised manner. Experimental studies have been conducted on typical two-dimensional heat-source systems to demonstrate the effectiveness of the proposed method.
    Parametric machines: a fresh approach to architecture search. (arXiv:2007.02777v3 [cs.LG] UPDATED)
    Using tools from topology and functional analysis, we provide a framework where artificial neural networks, and their architectures, can be formally described. We define the notion of machine in a general topological context and show how simple machines can be combined into more complex ones. We explore finite- and infinite-depth machines, which generalize neural networks and neural ordinary differential equations. Borrowing ideas from functional analysis and kernel methods, we build complete, normed, infinite-dimensional spaces of machines, and we discuss how to find optimal architectures and parameters -- within those spaces -- to solve a given computational problem. In our numerical experiments, these kernel-inspired networks can outperform classical neural networks when the training dataset is small.
    FastCorrect 2: Fast Error Correction on Multiple Candidates for Automatic Speech Recognition. (arXiv:2109.14420v4 [cs.CL] UPDATED)
    Error correction is widely used in automatic speech recognition (ASR) to post-process the generated sentence and can further reduce the word error rate (WER). Although multiple candidates are generated by an ASR system through beam search, current error correction approaches can only correct one sentence at a time, failing to leverage the voting effect from multiple candidates to better detect and correct error tokens. In this work, we propose FastCorrect 2, an error correction model that takes multiple ASR candidates as input for better correction accuracy. FastCorrect 2 adopts non-autoregressive generation for fast inference, consisting of an encoder that processes multiple source sentences and a decoder that generates the target sentence in parallel from the adjusted source sentence, where the adjustment is based on the predicted duration of each source token. However, handling multiple source sentences raises some issues. First, it is non-trivial to leverage the voting effect from multiple source sentences since they usually vary in length. Thus, we propose a novel alignment algorithm that maximizes the degree of token alignment among multiple sentences in terms of token and pronunciation similarity. Second, the decoder can take only one adjusted source sentence as input, while there are multiple source sentences. Thus, we develop a candidate predictor to detect the most suitable candidate for the decoder. Experiments on our in-house dataset and AISHELL-1 show that FastCorrect 2 can further reduce the WER over the previous correction model with a single candidate by 3.2% and 2.6%, demonstrating the effectiveness of leveraging multiple candidates in ASR error correction. FastCorrect 2 achieves better performance than the cascaded re-scoring and correction pipeline and can serve as a unified post-processing module for ASR.
    Malign Overfitting: Interpolation Can Provably Preclude Invariance. (arXiv:2211.15724v1 [cs.LG])
    Learned classifiers should often possess certain invariance properties meant to encourage fairness, robustness, or out-of-distribution generalization. However, multiple recent works empirically demonstrate that common invariance-inducing regularizers are ineffective in the over-parameterized regime, in which classifiers perfectly fit (i.e. interpolate) the training data. This suggests that the phenomenon of ``benign overfitting," in which models generalize well despite interpolating, might not favorably extend to settings in which robustness or fairness are desirable. In this work we provide a theoretical justification for these observations. We prove that -- even in the simplest of settings -- any interpolating learning rule (with arbitrarily small margin) will not satisfy these invariance properties. We then propose and analyze an algorithm that -- in the same setting -- successfully learns a non-interpolating classifier that is provably invariant. We validate our theoretical observations on simulated data and the Waterbirds dataset.
    Quantifying probabilistic robustness of tree-based classifiers against natural distortions. (arXiv:2208.10354v3 [cs.LG] UPDATED)
    The concept of trustworthy AI has gained widespread attention lately. One aspect relevant to trustworthy AI is the robustness of ML models. In this study, we show how to probabilistically quantify robustness against naturally occurring distortions of input data for tree-based classifiers, under the assumption that the natural distortions can be described by multivariate probability distributions that can be transformed to multivariate normal distributions. The idea is to extract the decision rules of a trained tree-based classifier, separate the feature space into non-overlapping regions, and determine the probability that a distorted data sample retains its predicted label. The approach builds on the recently introduced measure of real-world robustness, which works for all black-box classifiers but is only an approximation and is only applicable when the input dimension is not too high, whereas our proposed method gives an exact measure.
    Deep Learning-Driven Edge Video Analytics: A Survey. (arXiv:2211.15751v1 [cs.NI])
    Video, as a key driver in the global explosion of digital information, can create tremendous benefits for human society. Governments and enterprises are deploying innumerable cameras for a variety of applications, e.g., law enforcement, emergency management, traffic control, and security surveillance, all facilitated by video analytics (VA). This trend is spurred by the rapid advancement of deep learning (DL), which enables more precise models for object classification, detection, and tracking. Meanwhile, with the proliferation of Internet-connected devices, massive amounts of data are generated daily, overwhelming the cloud. Edge computing, an emerging paradigm that moves workloads and services from the network core to the network edge, has been widely recognized as a promising solution. The resulting new intersection, edge video analytics (EVA), begins to attract widespread attention. Nevertheless, only a few loosely-related surveys exist on this topic. A dedicated venue for collecting and summarizing the latest advances of EVA is highly desired by the community. Besides, the basic concepts of EVA (e.g., definition, architectures, etc.) are ambiguous and neglected by these surveys due to the rapid development of this domain. A thorough clarification is needed to facilitate a consensus on these concepts. To fill in these gaps, we conduct a comprehensive survey of the recent efforts on EVA. In this paper, we first review the fundamentals of edge computing, followed by an overview of VA. The EVA system and its enabling techniques are discussed next. In addition, we introduce prevalent frameworks and datasets to aid future researchers in the development of EVA systems. Finally, we discuss existing challenges and foresee future research directions. We believe this survey will help readers comprehend the relationship between VA and edge computing, and spark new ideas on EVA.
    Multi-Server Over-the-Air Federated Learning. (arXiv:2211.16162v1 [cs.IT])
    In this work, we propose a communication-efficient two-layer federated learning algorithm for distributed setups including a core server and multiple edge servers with clusters of devices. Assuming different learning tasks, clusters with the same task collaborate. To implement the algorithm over wireless links, we propose a scalable clustered over-the-air aggregation scheme for the uplink with a bandwidth-limited broadcast scheme for the downlink that requires only two single resource blocks for each algorithm iteration, independent of the number of edge servers and devices. This setup faces interference from devices in the uplink and from edge servers in the downlink, which must be modeled rigorously. We first develop a spatial model for the setup by modeling devices as a Poisson cluster process over the edge servers and quantify the uplink and downlink error terms due to the interference. Accordingly, we present a comprehensive mathematical approach to derive the convergence bound for the proposed algorithm, covering any number of collaborating clusters, and provide important special cases and design remarks. Finally, we show that despite the interference in the proposed uplink and downlink schemes, the proposed algorithm achieves high learning accuracy for a variety of parameters.
    Will My Robot Achieve My Goals? Predicting the Probability that an MDP Policy Reaches a User-Specified Behavior Target. (arXiv:2211.16462v1 [cs.LG])
    As an autonomous system performs a task, it should maintain a calibrated estimate of the probability that it will achieve the user's goal. If that probability falls below some desired level, it should alert the user so that appropriate interventions can be made. This paper considers settings where the user's goal is specified as a target interval for a real-valued performance summary, such as the cumulative reward, measured at a fixed horizon $H$. At each time $t \in \{0, \ldots, H-1\}$, our method produces a calibrated estimate of the probability that the final cumulative reward will fall within a user-specified target interval $[y^-,y^+].$ Using this estimate, the autonomous system can raise an alarm if the probability drops below a specified threshold. We compute the probability estimates by inverting conformal prediction. Our starting point is the Conformalized Quantile Regression (CQR) method of Romano et al., which applies split-conformal prediction to the results of quantile regression. CQR is not invertible, but by using the conditional cumulative distribution function (CDF) as the non-conformity measure, we show how to obtain an invertible modification that we call \textbf{P}robability-space \textbf{C}onformalized \textbf{Q}uantile \textbf{R}egression (PCQR). Like CQR, PCQR produces well-calibrated conditional prediction intervals with finite-sample marginal guarantees. By inverting PCQR, we obtain marginal guarantees for the probability that the cumulative reward of an autonomous system will fall within an arbitrary user-specified target interval. Experiments on two domains confirm that these probabilities are well-calibrated.
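    For context, the CQR baseline that PCQR builds on is compact enough to sketch: calibrate pre-trained lower/upper quantile regressors with a single conformity-score quantile. The CDF-based inversion that defines PCQR is the paper's contribution and is not reproduced here; q_lo and q_hi below are assumed callables.
        import numpy as np

        def cqr_interval(q_lo, q_hi, X_cal, y_cal, X_test, alpha=0.1):
            # Conformity score: how far the true value falls outside the band.
            scores = np.maximum(q_lo(X_cal) - y_cal, y_cal - q_hi(X_cal))
            n = len(y_cal)
            # Finite-sample-corrected empirical quantile of calibration scores.
            q_hat = np.quantile(scores, min(1.0, np.ceil((n + 1) * (1 - alpha)) / n))
            return q_lo(X_test) - q_hat, q_hi(X_test) + q_hat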
    Tensor Kernel Recovery for Spatio-Temporal Hawkes Processes. (arXiv:2011.12151v3 [stat.ML] UPDATED)
    We estimate the general influence functions for spatio-temporal Hawkes processes using a tensor recovery approach by formulating the location dependent influence function that captures the influence of historical events as a tensor kernel. We assume a low-rank structure for the tensor kernel and cast the estimation problem as a convex optimization problem using the Fourier transformed nuclear norm (TNN). We provide theoretical performance guarantees for our approach and present an algorithm to solve the optimization problem. Moreover, we demonstrate the efficiency of our estimation with numerical simulations.
    Sketch-and-solve approaches to k-means clustering by semidefinite programming. (arXiv:2211.15744v1 [cs.LG])
    We introduce a sketch-and-solve approach to speed up the Peng-Wei semidefinite relaxation of k-means clustering. When the data is appropriately separated we identify the k-means optimal clustering. Otherwise, our approach provides a high-confidence lower bound on the optimal k-means value. This lower bound is data-driven; it does not make any assumption on the data nor how it is generated. We provide code and an extensive set of numerical experiments where we use this approach to certify approximate optimality of clustering solutions obtained by k-means++.
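    For reference, the unsketched Peng-Wei relaxation that the paper accelerates can be written directly in cvxpy; this direct form is what becomes expensive at scale and motivates sketch-and-solve.
        import cvxpy as cp
        import numpy as np

        def kmeans_sdp_lower_bound(points, k):
            # Peng-Wei SDP: minimize 0.5 * <D, Z> over relaxed clustering matrices Z.
            n = points.shape[0]
            sq = (points ** 2).sum(axis=1)
            D = sq[:, None] + sq[None, :] - 2.0 * points @ points.T  # squared distances
            Z = cp.Variable((n, n), PSD=True)
            constraints = [Z >= 0, cp.sum(Z, axis=1) == 1, cp.trace(Z) == k]
            prob = cp.Problem(cp.Minimize(0.5 * cp.trace(D @ Z)), constraints)
            prob.solve()
            return prob.value  # lower bound on the optimal k-means value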
    UQ-ARMED: Uncertainty quantification of adversarially-regularized mixed effects deep learning for clustered non-iid data. (arXiv:2211.15888v1 [stat.ML])
    This work demonstrates the ability to produce readily interpretable statistical metrics for model fit, fixed-effects covariance coefficients, and prediction confidence. Importantly, this work compares four suitable and commonly applied epistemic UQ approaches (BNN, SWAG, MC dropout, and ensembles) in their ability to calculate these statistical metrics for ARMED MEDL models. In our experiment on AD prognosis, not only do the UQ methods provide these benefits, but several UQ methods maintain the high performance of the original ARMED method, and some even provide a modest (though not statistically significant) performance improvement. The ensemble models, especially the ensemble method with 90% subsampling, performed well across all tested metrics: (1) high performance comparable to the non-UQ ARMED model, (2) proper deweighting of the confound probes, assigning them statistically insignificant p-values, and (3) relatively high calibration of the output prediction confidence. Based on the results, the ensemble approaches, especially with 90% subsampling, provided the best all-round performance for prediction and uncertainty estimation, achieving our goals of statistical significance for model fit, statistically significant covariate coefficients, and confidence in prediction, while maintaining the baseline performance of MEDL using ARMED.
    Outfit Generation and Recommendation -- An Experimental Study. (arXiv:2211.16353v1 [cs.IR])
    Over the past years, fashion-related challenges have gained a lot of attention in the research community. Outfit generation and recommendation, i.e., the composition of a set of items of different types (e.g., tops, bottoms, shoes, accessories) that go well together, are among the most challenging of them. That is because items have to be compatible with each other and also personalized to match the taste of the customer. Recently there has been a plethora of work targeting these problems with various techniques and algorithms from the machine learning literature. However, to date, there is no extensive comparison of the performance of the different algorithms for outfit generation and recommendation. In this paper, we close this gap by providing a broad evaluation and comparison of various algorithms, including both personalized and non-personalized approaches, using online, real-world user data from one of Europe's largest fashion stores. We present the adaptations we made to some of those models to make them suitable for personalized outfit generation. Moreover, we provide insights for models that have not yet been evaluated on this task, specifically GPT, BERT, and Seq-to-Seq LSTM.
    AdsorbML: Accelerating Adsorption Energy Calculations with Machine Learning. (arXiv:2211.16486v1 [cond-mat.mtrl-sci])
    Computational catalysis is playing an increasingly significant role in the design of catalysts across a wide range of applications. A common task for many computational methods is the need to accurately compute the minimum binding energy, the adsorption energy, for an adsorbate and a catalyst surface of interest. Traditionally, the identification of low-energy adsorbate-surface configurations relies on heuristic methods and researcher intuition. As the desire to perform high-throughput screening increases, it becomes challenging to use heuristics and intuition alone. In this paper, we demonstrate that machine learning potentials can be leveraged to identify low-energy adsorbate-surface configurations more accurately and efficiently. Our algorithm provides a spectrum of trade-offs between accuracy and efficiency, with one balanced option finding the lowest-energy configuration, within a 0.1 eV threshold, 86.63% of the time, while achieving a 1387x speedup in computation. To standardize benchmarking, we introduce the Open Catalyst Dense dataset containing nearly 1,000 diverse surfaces and 87,045 unique configurations.
    Understanding the Impact of Adversarial Robustness on Accuracy Disparity. (arXiv:2211.15762v1 [cs.LG])
    While it has long been empirically observed that adversarial robustness may be at odds with standard accuracy and may have further disparate impacts on different classes, it remains an open question to what extent such observations hold and how the class imbalance plays a role within. In this paper, we attempt to understand this question of accuracy disparity by taking a closer look at linear classifiers under a Gaussian mixture model. We decompose the impact of adversarial robustness into two parts: an inherent effect that will degrade the standard accuracy on all classes, and the other caused by the class imbalance ratio, which will increase the accuracy disparity compared to standard training. Furthermore, we also extend our model to the general family of stable distributions. We demonstrate that while the constraint of adversarial robustness consistently degrades the standard accuracy in the balanced class setting, the class imbalance ratio plays a fundamentally different role in accuracy disparity compared to the Gaussian case, due to the heavy tail of the stable distribution. We additionally perform experiments on both synthetic and real-world datasets. The empirical results not only corroborate our theoretical findings, but also suggest that the implications may extend to nonlinear models over real-world datasets.
    A Contextual Bandit Approach for Learning to Plan in Environments with Probabilistic Goal Configurations. (arXiv:2211.16309v1 [cs.RO])
    Object-goal navigation (Object-nav) entails searching for, recognizing, and navigating to a target object. Object-nav has been extensively studied by the Embodied-AI community, but most solutions are restricted to static objects (e.g., television, fridge). We propose a modular framework for object-nav that can efficiently search indoor environments not just for static objects but also for movable objects (e.g., fruits, glasses, phones) that frequently change position due to human intervention. Our contextual-bandit agent efficiently explores the environment by showing optimism in the face of uncertainty, learning a model of the likelihood of spotting different objects from each navigable location. The likelihoods are used as rewards in a weighted minimum-latency solver to deduce a trajectory for the robot. We evaluate our algorithms in two simulated environments and a real-world setting to demonstrate high sample efficiency and reliability.
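    The exploration principle at work is the classic optimism bonus; below is a deliberately simplified, non-contextual UCB over navigable locations (the paper's agent additionally conditions on visual context), with all names and constants illustrative.
        import numpy as np

        class LocationUCB:
            def __init__(self, num_locations, c=1.0):
                self.counts = np.zeros(num_locations)
                self.hits = np.zeros(num_locations)  # times the object was spotted
                self.c = c

            def choose(self, t):
                means = self.hits / np.maximum(self.counts, 1)
                bonus = self.c * np.sqrt(np.log(t + 1) / np.maximum(self.counts, 1))
                return int(np.argmax(means + bonus))  # optimism under uncertainty

            def update(self, loc, spotted):
                self.counts[loc] += 1
                self.hits[loc] += float(spotted)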
    Decentralized Learning with Multi-Headed Distillation. (arXiv:2211.15774v1 [cs.LG])
    Decentralized learning with private data is a central problem in machine learning. We propose a novel distillation-based decentralized learning technique that allows multiple agents with private non-iid data to learn from each other, without having to share their data, weights or weight updates. Our approach is communication efficient, utilizes an unlabeled public dataset and uses multiple auxiliary heads for each client, greatly improving training efficiency in the case of heterogeneous data. This approach allows individual models to preserve and enhance performance on their private tasks while also dramatically improving their performance on the global aggregated data distribution. We study the effects of data and model architecture heterogeneity and the impact of the underlying communication graph topology on learning efficiency and show that our agents can significantly improve their performance compared to learning in isolation.
    Training Time Adversarial Attack Aiming the Vulnerability of Continual Learning. (arXiv:2211.15875v1 [cs.LG])
    Generally, regularization-based continual learning models limit access to previous task data to imitate the real-world setting, which has memory and privacy constraints. However, this introduces a problem: these models cannot track their performance on each task. In other words, current continual learning methods are vulnerable to attacks on previous tasks. We demonstrate the vulnerability of regularization-based continual learning methods by presenting a simple task-specific training-time adversarial attack that can be applied during the learning process of a new task. Training data generated by the proposed attack causes performance degradation on the specific task targeted by the attacker. Experimental results confirm the vulnerability described in this paper and demonstrate the importance of developing continual learning models that are robust to adversarial attacks.
    What learning algorithm is in-context learning? Investigations with linear models. (arXiv:2211.15661v2 [cs.LG] UPDATED)
    Neural sequence models, especially transformers, exhibit a remarkable capacity for in-context learning. They can construct new predictors from sequences of labeled examples $(x, f(x))$ presented in the input without further parameter updates. We investigate the hypothesis that transformer-based in-context learners implement standard learning algorithms implicitly, by encoding smaller models in their activations, and updating these implicit models as new examples appear in the context. Using linear regression as a prototypical problem, we offer three sources of evidence for this hypothesis. First, we prove by construction that transformers can implement learning algorithms for linear models based on gradient descent and closed-form ridge regression. Second, we show that trained in-context learners closely match the predictors computed by gradient descent, ridge regression, and exact least-squares regression, transitioning between different predictors as transformer depth and dataset noise vary, and converging to Bayesian estimators for large widths and depths. Third, we present preliminary evidence that in-context learners share algorithmic features with these predictors: learners' late layers non-linearly encode weight vectors and moment matrices. These results suggest that in-context learning is understandable in algorithmic terms, and that (at least in the linear case) learners may rediscover standard estimation algorithms. Code and reference implementations are released at https://github.com/ekinakyurek/google-research/blob/master/incontext.
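    The equivalence at the heart of the second finding is easy to check outside a transformer: gradient descent on the ridge objective for a linear model converges to the same predictor as the closed-form solution that trained in-context learners were found to match. A self-contained numpy sketch (our illustration, not the released code):

        import numpy as np

        rng = np.random.default_rng(0)
        d, n, lam = 8, 32, 0.1
        X, w_true = rng.normal(size=(n, d)), rng.normal(size=d)
        y = X @ w_true + 0.1 * rng.normal(size=n)

        # Closed-form ridge regression.
        w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

        # Gradient descent on the same ridge objective.
        w_gd = np.zeros(d)
        lr = 1.0 / np.linalg.norm(X, 2) ** 2
        for _ in range(5000):
            w_gd -= lr * (X.T @ (X @ w_gd - y) + lam * w_gd)

        print("||w_gd - w_ridge|| =", np.linalg.norm(w_gd - w_ridge))  # ~0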
    Multi-agent reinforcement learning for wall modeling in LES of flow over periodic hills. (arXiv:2211.16427v1 [physics.flu-dyn])
    We develop a wall model for large-eddy simulation (LES) that takes into account various pressure-gradient effects using multi-agent reinforcement learning (MARL). The model is trained using low-Reynolds-number flow over periodic hills with agents distributed on the wall along the computational grid points. The model utilizes a wall eddy-viscosity formulation as the boundary condition, which is shown to provide better predictions of the mean velocity field, rather than the typical wall-shear stress formulation. Each agent receives states based on local instantaneous flow quantities at an off-wall location, computes a reward based on the estimated wall-shear stress, and provides an action to update the wall eddy viscosity at each time step. The trained wall model is validated in wall-modeled LES (WMLES) of flow over periodic hills at higher Reynolds numbers, and the results show the effectiveness of the model on flow with pressure gradients. The analysis of the trained model indicates that the model is capable of distinguishing between the various pressure gradient regimes present in the flow.
    Provably Efficient Model-free RL in Leader-Follower MDP with Linear Function Approximation. (arXiv:2211.15792v1 [cs.LG])
    We consider a multi-agent episodic MDP setup where an agent (leader) takes an action at each step of the episode followed by another agent (follower). The state evolution and rewards depend on the joint action pair of the leader and the follower. Such interactions can find applications in many domains such as smart grids, mechanism design, security, and policymaking. We are interested in how to learn policies for both players with a provable performance guarantee under a bandit feedback setting. We focus on a setup where both the leader and the follower are {\em non-myopic}, i.e., they both seek to maximize their rewards over the entire episode, and consider a linear MDP, which can model the continuous state spaces that are very common in many RL applications. We propose a {\em model-free} RL algorithm and show that $\tilde{\mathcal{O}}(\sqrt{d^3H^3T})$ regret bounds can be achieved for both the leader and the follower, where $d$ is the dimension of the feature mapping, $H$ is the length of the episode, and $T$ is the total number of steps under the bandit feedback information setup. Thus, our result holds even when the number of states becomes infinite. The algorithm relies on a {\em novel} adaptation of the LSVI-UCB algorithm. Specifically, we replace the standard greedy policy (as the best response) with the soft-max policy for both the leader and the follower. This turns out to be key in establishing uniform concentration bounds for the value functions. To the best of our knowledge, this is the first sub-linear regret bound guarantee for Markov games with non-myopic followers with function approximation.
    Optimisation of a global climate model ensemble for prediction of extreme heat days. (arXiv:2211.16367v1 [physics.ao-ph])
    Adaptation-relevant predictions of climate change are often derived by combining climate models in a multi-model ensemble. Model evaluation methods used in performance-based ensemble weighting schemes have limitations in the context of high-impact extreme events. We introduce a locally time-invariant model evaluation method with a focus on assessing the simulation of extremes. We explore the behaviour of the proposed method in predicting extreme heat days in Nairobi.
    Interpreting Primal-Dual Algorithms for Constrained MARL. (arXiv:2211.16069v1 [eess.SY])
    Constrained multiagent reinforcement learning (C-MARL) is gaining importance as MARL algorithms find new applications in real-world systems ranging from energy systems to drone swarms. Most C-MARL algorithms use a primal-dual approach to enforce constraints through a penalty function added to the reward. In this paper, we study the structural effects of the primal-dual approach on the constraints and value function. First, we show that using the constraint evaluation as the penalty leads to a weak notion of safety, but by making simple modifications to the penalty function, we can enforce meaningful probabilistic safety constraints. Second, we exploit the structural effects of primal-dual methods on value functions, leading to improved value estimates. Simulations in a simple constrained multiagent environment show that our reinterpretation of the primal-dual method in terms of probabilistic constraints is meaningful, and that our proposed value estimation procedure improves convergence to a safe joint policy.
    Learning Visual Planning Models from Partially Observed Images. (arXiv:2211.15666v1 [cs.LG])
    There has been increasing attention on planning model learning in classical planning. Most existing approaches, however, focus on learning planning models from structured data in symbolic representations, which is often difficult to obtain in real-world scenarios. Although a number of approaches have been developed for learning planning models from fully observed unstructured data (e.g., images), in many scenarios raw observations are often incomplete. In this paper, we provide a novel framework, Recplan, for learning a transition model from partially observed raw image traces. More specifically, by considering the preceding and subsequent images in a trace, we learn the latent state representations of raw observations and then build a transition model based on such representations. Additionally, we propose a neural-network-based approach to learn a heuristic model that estimates the distance toward a given goal observation. Based on the learned transition model and heuristic model, we implement a classical planner for images. We show empirically that our approach is more effective than a state-of-the-art approach for learning visual planning models in environments with incomplete observations.
    Encoder-Decoder Model for Suffix Prediction in Predictive Monitoring. (arXiv:2211.16106v1 [cs.LG])
    Predictive monitoring is a subfield of process mining that aims to predict how a running case will unfold in the future. One of its main challenges is forecasting the sequence of activities that will occur from a given point in time -- suffix prediction. Most approaches to the suffix prediction problem learn to predict the suffix by learning only how to predict the next activity, rather than learning from the whole suffix during the training phase. This paper proposes a novel architecture based on an encoder-decoder model with an attention mechanism that decouples the representation learning of the prefixes from the inference phase, predicting only the activities of the suffix. During the inference phase, this architecture is extended with a heuristic search algorithm that improves the selection of the activity for each index of the suffix. Our approach has been tested using 12 public event logs against 6 different state-of-the-art proposals, showing that it significantly outperforms these proposals.
    Robustness Disparities in Face Detection. (arXiv:2211.15937v1 [cs.CY])
    Facial analysis systems have been deployed by large companies and critiqued by scholars and activists for the past decade. Many existing algorithmic audits examine the performance of these systems on later-stage elements of facial analysis systems like facial recognition and age, emotion, or perceived gender prediction; however, a core component of these systems has been vastly understudied from a fairness perspective: face detection, sometimes called face localization. Since face detection is a prerequisite step in facial analysis systems, the bias we observe in face detection will flow downstream to the other components like facial recognition and emotion prediction. Additionally, no prior work has focused on the robustness of these systems under various perturbations and corruptions, which leaves open the question of how various people are impacted by these phenomena. We present the first of its kind detailed benchmark of face detection systems, specifically examining the robustness to noise of commercial and academic models. We use both standard and recently released academic facial datasets to quantitatively analyze trends in face detection robustness. Across all the datasets and systems, we generally find that photos of individuals who are masculine presenting, older, of darker skin type, or have dim lighting are more susceptible to errors than their counterparts in other identities.
    Procedural Image Programs for Representation Learning. (arXiv:2211.16412v1 [cs.CV])
    Learning image representations using synthetic data allows training neural networks without some of the concerns associated with real images, such as privacy and bias. Existing work focuses on a handful of curated generative processes which require expert knowledge to design, making it hard to scale up. To overcome this, we propose training with a large dataset of twenty-one thousand programs, each one generating a diverse set of synthetic images. These programs are short code snippets, which are easy to modify and fast to execute using OpenGL. The proposed dataset can be used for both supervised and unsupervised representation learning, and reduces the gap between pre-training with real and procedurally generated images by 38%.
    Interpretations Cannot Be Trusted: Stealthy and Effective Adversarial Perturbations against Interpretable Deep Learning. (arXiv:2211.15926v1 [cs.CR])
    Deep learning methods have gained increased attention in various applications due to their outstanding performance. To explore how this high performance relates to the proper use of data artifacts and accurate problem formulation for a given task, interpretation models have become a crucial component in developing deep learning-based systems. Interpretation models enable an understanding of the inner workings of deep learning models and offer a sense of security in detecting the misuse of artifacts in the input data. Similar to prediction models, interpretation models are also susceptible to adversarial inputs. This work introduces two attacks, AdvEdge and AdvEdge$^{+}$, that deceive both the target deep learning model and the coupled interpretation model. We assess the effectiveness of the proposed attacks against two deep learning model architectures coupled with four interpretation models that represent different categories of interpretation models. Our experiments include attack implementations using various attack frameworks. We also explore potential countermeasures against such attacks. Our analysis shows the effectiveness of our attacks in terms of deceiving the deep learning models and their interpreters, and highlights insights for improving and circumventing the attacks.
    Novelty Detection for Election Fraud: A Case Study with Agent-Based Simulation Data. (arXiv:2211.16023v1 [cs.LG])
    In this paper, we propose a robust election simulation model and an independently developed election anomaly detection algorithm that demonstrates the simulation's utility. The simulation generates artificial elections with properties and trends similar to real-world elections, while giving users control over, and knowledge of, all the important components of the elections. We generate a clean election results dataset without fraud as well as datasets with varying degrees of fraud. We then measure how well the algorithm is able to detect the level of fraud present. The algorithm measures how closely actual election results match results predicted from polling and from a regression model built over other regions with similar demographics. We use k-means to partition electoral regions into clusters such that demographic homogeneity is maximized within each cluster. We then use a novelty detection algorithm implemented as a one-class Support Vector Machine, where the clean data is provided in the form of polling predictions and regression predictions. The regression predictions are built from the actual data in such a way that the data supervises itself. We show the effectiveness of both the simulation technique and the machine learning model in identifying fraudulent regions.
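    A compact sketch of this cluster-then-detect pipeline using scikit-learn; the per-region features, vote-share margins and the injected fraud shift below are placeholders, not the authors' simulation or data.

        import numpy as np
        from sklearn.cluster import KMeans
        from sklearn.svm import OneClassSVM

        rng = np.random.default_rng(0)
        demographics = rng.normal(size=(200, 5))             # per-region features
        clean_margin = rng.normal(0.0, 0.02, size=(200, 1))  # predicted-minus-actual share

        # Cluster regions so each detector sees demographically similar peers.
        clusters = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(demographics)

        # One novelty detector per cluster, fit on clean (fraud-free) margins.
        detectors = {c: OneClassSVM(nu=0.05, gamma="scale").fit(clean_margin[clusters == c])
                     for c in np.unique(clusters)}

        suspect = clean_margin.copy()
        suspect[:5] += 0.15                                  # inject a synthetic "fraud" shift
        flags = np.array([detectors[c].predict(m.reshape(1, -1))[0]
                          for c, m in zip(clusters, suspect)])
        print("flagged regions:", np.where(flags == -1)[0])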
    PAC-Bayes Bounds for Bandit Problems: A Survey and Experimental Comparison. (arXiv:2211.16110v1 [cs.LG])
    PAC-Bayes has recently re-emerged as an effective theory with which one can derive principled learning algorithms with tight performance guarantees. However, applications of PAC-Bayes to bandit problems are relatively rare, which is unfortunate: many decision-making problems in healthcare, finance and the natural sciences can be modelled as bandit problems, and principled algorithms with strong performance guarantees would be very much appreciated in these applications. This survey provides an overview of PAC-Bayes performance bounds for bandit problems and an experimental comparison of these bounds. Our experimental comparison has revealed that available PAC-Bayes upper bounds on the cumulative regret are loose, whereas available PAC-Bayes lower bounds on the expected reward can be surprisingly tight. We found that an offline contextual bandit algorithm that learns a policy by optimising a PAC-Bayes bound was able to learn randomised neural network policies with competitive expected reward and non-vacuous performance guarantees.
    Surgical Scheduling via Optimization and Machine Learning with Long-Tailed Data. (arXiv:2202.06383v2 [cs.LG] UPDATED)
    Using data from cardiovascular surgery patients with long and highly variable post-surgical lengths of stay (LOS), we develop a modeling framework to reduce recovery unit congestion. We estimate the LOS and its probability distribution using machine learning models, schedule procedures on a rolling basis using a variety of optimization models, and estimate performance with simulation. The machine learning models achieved only modest LOS prediction accuracy, despite access to a very rich set of patient characteristics. Compared to the current paper-based system used in the hospital, most optimization models failed to reduce congestion without increasing wait times for surgery. A conservative stochastic optimization with sufficient sampling to capture the long tail of the LOS distribution outperformed the current manual process and other stochastic and robust optimization approaches. These results highlight the perils of using oversimplified distributional models of LOS for scheduling procedures and the importance of using optimization methods well-suited to dealing with long-tailed behavior.
    Best Subset Selection in Reduced Rank Regression. (arXiv:2211.15889v1 [stat.ME])
    Sparse reduced rank regression is an essential statistical learning method. In the contemporary literature, estimation is typically formulated as a nonconvex optimization that often yields a local optimum in numerical computation. Yet, the theoretical analysis is usually centered on the global optimum, resulting in a discrepancy between the statistical guarantee and the numerical computation. In this research, we offer a new algorithm to address the problem and establish an almost optimal rate for the algorithmic solution. We also demonstrate that the algorithm achieves the estimation within a polynomial number of iterations. In addition, we present a generalized information criterion to simultaneously ensure the consistency of support set recovery and rank estimation. Under the proposed criterion, we show that our algorithm achieves the oracle reduced rank estimation with significant probability. Numerical studies and an application to ovarian cancer genetic data demonstrate the effectiveness and scalability of our approach.
    Design Space Exploration and Explanation via Conditional Variational Autoencoders in Meta-model-based Conceptual Design of Pedestrian Bridges. (arXiv:2211.16406v1 [cs.LG])
    For conceptual design, engineers rely on conventional iterative (often manual) techniques. Emerging parametric models facilitate design space exploration based on quantifiable performance metrics, yet remain time-consuming and computationally expensive. Pure optimisation methods, however, ignore qualitative aspects (e.g. aesthetics or construction methods). This paper provides a performance-driven design exploration framework to augment the human designer through a Conditional Variational Autoencoder (CVAE), which serves as a forward performance predictor for given design features as well as an inverse design feature predictor conditioned on a set of performance requests. The CVAE is trained on 18,000 synthetically generated instances of a pedestrian bridge in Switzerland. Sensitivity analysis is employed for explainability and for informing designers about (i) the relations the model learns between features and/or performances and (ii) structural improvements under user-defined objectives. A case study demonstrated our framework's potential to serve as a future co-pilot for conceptual design studies of pedestrian bridges and beyond.
    Differentiable User Models. (arXiv:2211.16277v1 [cs.LG])
    Probabilistic user modeling is essential for building collaborative AI systems within probabilistic frameworks. However, modern advanced user models, often designed as cognitive behavior simulators, are computationally prohibitive for interactive use in cooperative AI assistants. In this extended abstract, we address this problem by introducing widely-applicable differentiable surrogates for bypassing this computational bottleneck; the surrogates enable using modern behavioral models with online computational cost which is independent of their original computational cost. We show experimentally that modeling capabilities comparable to likelihood-free inference methods are achievable, with over eight orders of magnitude reduction in computational time. Finally, we demonstrate how AI-assistants can computationally feasibly use cognitive models in a previously studied menu-search task.
    Closing the gap between SVRG and TD-SVRG with Gradient Splitting. (arXiv:2211.16237v1 [cs.LG])
    Temporal difference (TD) learning is a simple algorithm for policy evaluation in reinforcement learning. The performance of TD learning is affected by high variance, and it can be naturally enhanced with variance reduction techniques, such as the Stochastic Variance Reduced Gradient (SVRG) method. Recently, multiple works have sought to fuse TD learning with SVRG to obtain a policy evaluation method with a geometric rate of convergence. However, the resulting convergence rate is significantly weaker than what is achieved by SVRG in the setting of convex optimization. In this work we utilize a recent interpretation of TD learning as the splitting of the gradient of an appropriately chosen function, thus simplifying the algorithm and fusing TD with SVRG. We prove a geometric convergence bound with a predetermined learning rate of 1/8 that is identical to the convergence bound available for SVRG in the convex setting.
    MegaBlocks: Efficient Sparse Training with Mixture-of-Experts. (arXiv:2211.15841v1 [cs.LG])
    We present MegaBlocks, a system for efficient Mixture-of-Experts (MoE) training on GPUs. Our system is motivated by the limitations of current frameworks, which restrict the dynamic routing in MoE layers to satisfy the constraints of existing software and hardware. These formulations force a tradeoff between model quality and hardware efficiency, as users must choose between dropping tokens from the computation or wasting computation and memory on padding. To address these limitations, we reformulate MoE computation in terms of block-sparse operations and develop new block-sparse GPU kernels that efficiently handle the dynamism present in MoEs. Our approach never drops tokens and maps efficiently to modern hardware, enabling end-to-end training speedups of up to 40% over MoEs trained with the state-of-the-art Tutel library and 2.4x over DNNs trained with the highly-optimized Megatron-LM framework.
    Lie Group Forced Variational Integrator Networks for Learning and Control of Robot Systems. (arXiv:2211.16006v1 [cs.RO])
    Incorporating prior knowledge of physics laws and structural properties of dynamical systems into the design of deep learning architectures has proven to be a powerful technique for improving their computational efficiency and generalization capacity. Learning accurate models of robot dynamics is critical for safe and stable control. Autonomous mobile robots, including wheeled, aerial, and underwater vehicles, can be modeled as controlled Lagrangian or Hamiltonian rigid-body systems evolving on matrix Lie groups. In this paper, we introduce a new structure-preserving deep learning architecture, the Lie group Forced Variational Integrator Network (LieFVIN), capable of learning controlled Lagrangian or Hamiltonian dynamics on Lie groups, either from position-velocity or position-only data. By design, LieFVINs preserve both the Lie group structure on which the dynamics evolve and the symplectic structure underlying the Hamiltonian or Lagrangian systems of interest. The proposed architecture learns surrogate discrete-time flow maps instead of surrogate vector fields, which allows better and faster prediction without requiring the use of a numerical integrator, neural ODE, or adjoint techniques. Furthermore, the learnt discrete-time dynamics can be combined seamlessly with computationally scalable discrete-time (optimal) control strategies.
    Dimensionality-Varying Diffusion Process. (arXiv:2211.16032v1 [cs.LG])
    Diffusion models, which learn to reverse a signal destruction process to generate new data, typically require the signal at each step to have the same dimension. We argue that, considering the spatial redundancy in image signals, there is no need to maintain a high dimensionality in the evolution process, especially in the early generation phase. To this end, we make a theoretical generalization of the forward diffusion process via signal decomposition. Concretely, we decompose an image into multiple orthogonal components and control the attenuation of each component when perturbing the image. That way, as the noise strength increases, we can diminish the inconsequential components and thus use a lower-dimensional signal to represent the source while barely losing information. Such a reformulation allows the dimensionality to vary in both the training and inference of diffusion models. Extensive experiments on a range of datasets suggest that our approach substantially reduces the computational cost and achieves on-par or even better synthesis performance compared to baseline methods. We also show that our strategy facilitates high-resolution image synthesis and improves the FID of a diffusion model trained on FFHQ at $1024\times1024$ resolution from 52.40 to 10.46. Code and models will be made publicly available.
    Distributed Energy Management and Demand Response in Smart Grids: A Multi-Agent Deep Reinforcement Learning Framework. (arXiv:2211.15858v1 [cs.MA])
    This paper presents a multi-agent Deep Reinforcement Learning (DRL) framework for autonomous control and integration of renewable energy resources into smart power grid systems. In particular, the proposed framework jointly considers demand response (DR) and distributed energy management (DEM) for residential end-users. DR has a widely recognized potential for improving power grid stability and reliability, while at the same time reducing end-users' energy bills. However, the conventional DR techniques come with several shortcomings, such as the inability to handle operational uncertainties while incurring end-user disutility, which prevents widespread adoption in real-world applications. The proposed framework addresses these shortcomings by implementing DR and DEM based on a real-time pricing strategy that is achieved using deep reinforcement learning. Furthermore, this framework enables the power grid service provider to leverage distributed energy resources (i.e., PV rooftop panels and battery storage) as dispatchable assets to support the smart grid during peak hours, thus achieving distributed energy management. Simulation results based on the Deep Q-Network (DQN) demonstrate significant improvements in the 24-hour cumulative profit for both prosumers and the power grid service provider, as well as major reductions in the utilization of the power grid reserve generators.
    Posterior Sampling for Continuing Environments. (arXiv:2211.15931v1 [cs.LG])
    We develop an extension of posterior sampling for reinforcement learning (PSRL) that is suited for a continuing agent-environment interface and integrates naturally into agent designs that scale to complex environments. The approach maintains a statistically plausible model of the environment and follows a policy that maximizes expected $\gamma$-discounted return in that model. At each time, with probability $1-\gamma$, the model is replaced by a sample from the posterior distribution over environments. For a suitable schedule of $\gamma$, we establish an $\tilde{O}(\tau S \sqrt{A T})$ bound on the Bayesian regret, where $S$ is the number of environment states, $A$ is the number of actions, and $\tau$ denotes the reward averaging time, which is a bound on the duration required to accurately estimate the average reward of any policy.
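    The resampling rule is simple to express in code. Below is a minimal tabular sketch (our own, under simplifying assumptions: known rewards, a Dirichlet posterior over transitions, and a toy 3-state MDP) of redrawing the model from the posterior with probability $1-\gamma$ per step and replanning against the sample:

        import numpy as np

        rng = np.random.default_rng(0)
        S, A, gamma = 3, 2, 0.99
        true_P = rng.dirichlet(np.ones(S), size=(S, A))
        true_R = rng.random((S, A))                 # rewards assumed known here

        alpha = np.ones((S, A, S))                  # Dirichlet posterior over transitions
        policy, s = rng.integers(A, size=S), 0

        for t in range(10_000):
            if rng.random() < 1 - gamma:            # resample the model: the "episodic"
                P_hat = np.array([[rng.dirichlet(alpha[s_, a_]) for a_ in range(A)]
                                  for s_ in range(S)])
                V = np.zeros(S)                     # plan greedily in the sampled model
                for _ in range(200):
                    Q = true_R + gamma * P_hat @ V
                    V = Q.max(axis=1)
                policy = Q.argmax(axis=1)
            a = policy[s]
            s_next = rng.choice(S, p=true_P[s, a])
            alpha[s, a, s_next] += 1                # posterior update from experience
            s = s_next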
    Fourier Continuation for Exact Derivative Computation in Physics-Informed Neural Operators. (arXiv:2211.15960v1 [cs.LG])
    The physics-informed neural operator (PINO) is a machine learning architecture that has shown promising empirical results for learning partial differential equations. PINO uses the Fourier neural operator (FNO) architecture to overcome the optimization challenges often faced by physics-informed neural networks. Since the convolution operator in PINO uses the Fourier series representation, its gradient can be computed exactly on the Fourier space. While Fourier series cannot represent nonperiodic functions, PINO and FNO still have the expressivity to learn nonperiodic problems with Fourier extension via padding. However, computing the Fourier extension in the physics-informed optimization requires solving an ill-conditioned system, resulting in inaccurate derivatives which prevent effective optimization. In this work, we present an architecture that leverages Fourier continuation (FC) to apply the exact gradient method to PINO for nonperiodic problems. This paper investigates three different ways that FC can be incorporated into PINO by testing their performance on a 1D blowup problem. Experiments show that FC-PINO outperforms padded PINO, improving equation loss by several orders of magnitude, and it can accurately capture the third order derivatives of nonsmooth solution functions.
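    The payoff of exact spectral differentiation (and why the ill-conditioned Fourier-extension step above matters) can be seen in a few lines: for a periodic signal, differentiating via the FFT is exact to machine precision, while finite differences are not. A generic illustration, not the FC-PINO implementation:

        import numpy as np

        n = 128
        x = np.linspace(0, 2 * np.pi, n, endpoint=False)
        f = np.sin(3 * x)

        k = np.fft.fftfreq(n, d=x[1] - x[0]) * 2 * np.pi   # angular wavenumbers
        df_spectral = np.fft.ifft(1j * k * np.fft.fft(f)).real
        df_fd = np.gradient(f, x)                          # 2nd-order finite differences

        exact = 3 * np.cos(3 * x)
        print("spectral max error  :", np.abs(df_spectral - exact).max())  # ~1e-14
        print("finite-diff max error:", np.abs(df_fd - exact).max())       # ~1e-2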
    Revisiting Over-smoothing and Over-squashing using Ollivier's Ricci Curvature. (arXiv:2211.15779v1 [cs.LG])
    Graph Neural Networks (GNNs) have been shown to be inherently susceptible to the problems of over-smoothing and over-squashing. These issues limit the ability of GNNs to model complex graph interactions by restricting how effectively they can take distant information into account. Our study reveals a key connection between the local graph geometry and the occurrence of both of these issues, thereby providing a unified framework for studying them at a local scale using Ollivier's Ricci curvature. Based on our theory, we propose a number of principled methods to alleviate the over-smoothing and over-squashing issues.
    LUMix: Improving Mixup by Better Modelling Label Uncertainty. (arXiv:2211.15846v1 [cs.CV])
    Modern deep networks can be better generalized when trained with noisy samples and regularization techniques. Mixup and CutMix have been proven to be effective for data augmentation to help avoid overfitting. Previous Mixup-based methods linearly combine images and labels to generate additional training data. However, this is problematic if the object does not occupy the whole image, as we demonstrate in Figure 1. Correctly assigning the label weights is hard even for human beings, and there is no clear criterion to measure it. To tackle this problem, in this paper, we propose LUMix, which models such uncertainty by adding label perturbation during training. LUMix is simple, as it can be implemented in just a few lines of code and can be universally applied to any deep network, e.g. CNNs and Vision Transformers, with minimal computational cost. Extensive experiments show that LUMix can consistently boost the performance for networks with a wide range of diversity and capacity on ImageNet, e.g. $+0.7\%$ for the small model DeiT-S and $+0.6\%$ for the large variant XCiT-L. We also demonstrate that LUMix can lead to better robustness when evaluated on ImageNet-O and ImageNet-A. The source code is available at https://github.com/kevin-ssy/LUMix.
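    As a hedged approximation of the idea (not the official implementation; see the repository above): mix the inputs as in standard Mixup, then perturb the weight used for the label to model the uncertainty about how much of each object actually survives the mix. The noise scale below is an assumed hyperparameter.

        import torch

        def mixup_with_label_perturbation(x, y, alpha=0.2, noise=0.1):
            lam = torch.distributions.Beta(alpha, alpha).sample().item()
            idx = torch.randperm(x.size(0))
            x_mix = lam * x + (1 - lam) * x[idx]
            # Perturb the label weight (illustrative assumption): the mixing ratio
            # of the label need not equal the pixel mixing ratio.
            lam_y = min(max(lam + noise * (torch.rand(()).item() - 0.5), 0.0), 1.0)
            return x_mix, y, y[idx], lam_y

        x, y = torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,))
        x_mix, y_a, y_b, lam_y = mixup_with_label_perturbation(x, y)
        # loss = lam_y * ce(model(x_mix), y_a) + (1 - lam_y) * ce(model(x_mix), y_b)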
    Personalized Reward Learning with Interaction-Grounded Learning (IGL). (arXiv:2211.15823v1 [cs.LG])
    In an era of countless content offerings, recommender systems alleviate information overload by providing users with personalized content suggestions. Due to the scarcity of explicit user feedback, modern recommender systems typically optimize for the same fixed combination of implicit feedback signals across all users. However, this approach disregards a growing body of work highlighting that (i) implicit signals can be used by users in diverse ways, signaling anything from satisfaction to active dislike, and (ii) different users communicate preferences in different ways. We propose applying the recent Interaction Grounded Learning (IGL) paradigm to address the challenge of learning representations of diverse user communication modalities. Rather than taking a fixed, human-designed reward function, IGL is able to learn personalized reward functions for different users and then optimize directly for the latent user satisfaction. We demonstrate the success of IGL with experiments using simulations as well as with real-world production traces.
    Predicting Football Match Outcomes with eXplainable Machine Learning and the Kelly Index. (arXiv:2211.15734v1 [cs.LG])
    In this work, a machine learning approach is developed for predicting the outcomes of football matches. The novelty of this research lies in the utilisation of the Kelly Index to first classify matches into categories, each denoting a different level of predictive difficulty. Classification models using a wide suite of algorithms were developed for each category of matches in order to determine the efficacy of the approach. In conjunction with this, a set of previously unexplored features was engineered, including Elo-based variables. The dataset originated from Premier League match data covering the 2019-2021 seasons. The findings indicate that decomposing the predictive problem into sub-tasks was effective and produced results competitive with prior works, with the ensemble-based methods being the most effective. The paper also devises an investment strategy and evaluates its effectiveness by benchmarking against bookmaker odds. The strategy minimises risk by combining the Kelly Index with the predefined confidence thresholds of the predictive models. The experiments found that the proposed strategy can return a profit when following a conservative approach that focuses primarily on easy-to-predict matches where the predictive models display a high confidence level.
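    For context, the Kelly Index referenced above builds on the Kelly criterion, which sizes a bet as the bankroll fraction $f^* = (bp - q)/b$ for net odds $b$, win probability $p$ and $q = 1 - p$. A generic illustration with invented numbers (not the paper's strategy):

        def kelly_fraction(p_win: float, decimal_odds: float) -> float:
            b = decimal_odds - 1.0          # net odds received on a win
            q = 1.0 - p_win
            return (b * p_win - q) / b      # negative => no edge, do not bet

        # Model says 55% home win; bookmaker offers decimal odds of 2.10.
        f = kelly_fraction(0.55, 2.10)
        print(f"stake {max(f, 0.0):.1%} of bankroll")   # ~14.1% here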
    An Extreme-Adaptive Time Series Prediction Model Based on Probability-Enhanced LSTM Neural Networks. (arXiv:2211.15891v1 [cs.LG])
    Forecasting time series with extreme events has been a challenging and prevalent research topic, especially when the time series data are affected by complicated uncertain factors, such as is the case in hydrologic prediction. Diverse traditional and deep learning models have been applied to discover the nonlinear relationships and recognize the complex patterns in these types of data. However, existing methods usually ignore the negative influence of imbalanced data, or severe events, on model training. Moreover, methods are usually evaluated on a small number of generally well-behaved time series, which does not show their ability to generalize. To tackle these issues, we propose a novel probability-enhanced neural network model, called NEC+, which concurrently learns extreme and normal prediction functions and a way to choose among them via selective back propagation. We evaluate the proposed model on the difficult 3-day ahead hourly water level prediction task applied to 9 reservoirs in California. Experimental results demonstrate that the proposed model significantly outperforms state-of-the-art baselines and exhibits superior generalization ability on data with diverse distributions.
    Understanding and Enhancing Robustness of Concept-based Models. (arXiv:2211.16080v1 [cs.LG])
    The rising use of deep neural networks for decision making in critical applications like medical diagnosis and financial analysis has raised concerns regarding their reliability and trustworthiness. As automated systems become more mainstream, it is important that their decisions be transparent, reliable and understandable by humans for better trust and confidence. To this effect, concept-based models such as Concept Bottleneck Models (CBMs) and Self-Explaining Neural Networks (SENN) have been proposed, which constrain the latent space of a model to represent high-level concepts easily understood by domain experts. Although concept-based models promise a good approach to increasing both explainability and reliability, it has yet to be shown whether they are robust and output consistent concepts under systematic perturbations to their inputs. To better understand the performance of concept-based models on curated malicious samples, in this paper we study their robustness to adversarial perturbations: imperceptible changes to the input data crafted by an attacker to fool a well-learned concept-based model. Specifically, we first propose and analyze different malicious attacks to evaluate the security vulnerability of concept-based models. Subsequently, we propose a general adversarial training-based defense mechanism to increase the robustness of these systems to the proposed attacks. Extensive experiments on one synthetic and two real-world datasets demonstrate the effectiveness of the proposed attacks and the defense approach.
    Token-Label Alignment for Vision Transformers. (arXiv:2210.06455v2 [cs.CV] UPDATED)
    Data mixing strategies (e.g., CutMix) have shown the ability to greatly improve the performance of convolutional neural networks (CNNs). They mix two images as inputs for training and assign them with a mixed label with the same ratio. While they are shown effective for vision transformers (ViTs), we identify a token fluctuation phenomenon that has suppressed the potential of data mixing strategies. We empirically observe that the contributions of input tokens fluctuate during forward propagation, which might induce a different mixing ratio in the output tokens. The training target computed by the original data mixing strategy can thus be inaccurate, resulting in less effective training. To address this, we propose a token-label alignment (TL-Align) method to trace the correspondence between transformed tokens and the original tokens to maintain a label for each token. We reuse the computed attention at each layer for efficient token-label alignment, introducing only negligible additional training costs. Extensive experiments demonstrate that our method improves the performance of ViTs on image classification, semantic segmentation, object detection, and transfer learning tasks. Code is available at: https://github.com/Euphoria16/TL-Align.
    A Gift from Label Smoothing: Robust Training with Adaptive Label Smoothing via Auxiliary Classifier under Label Noise. (arXiv:2206.07277v2 [cs.LG] UPDATED)
    As deep neural networks can easily overfit noisy labels, robust training in the presence of noisy labels is becoming an important challenge in modern deep learning. While existing methods address this problem in various directions, they still produce unpredictable sub-optimal results since they rely on the posterior information estimated by the feature extractor corrupted by noisy labels. Lipschitz regularization successfully alleviates this problem by training a robust feature extractor, but it requires longer training time and expensive computations. Motivated by this, we propose a simple yet effective method, called ALASCA, which efficiently provides a robust feature extractor under label noise. ALASCA integrates two key ingredients: (1) adaptive label smoothing based on our theoretical analysis that label smoothing implicitly induces Lipschitz regularization, and (2) auxiliary classifiers that enable practical application of intermediate Lipschitz regularization with negligible computations. We conduct wide-ranging experiments for ALASCA and combine our proposed method with previous noise-robust methods on several synthetic and real-world datasets. Experimental results show that our framework consistently improves the robustness of feature extractors and the performance of existing baselines with efficiency. Our code is available at https://github.com/jongwooko/ALASCA.
    SimCS: Simulation for Online Domain-Incremental Continual Segmentation. (arXiv:2211.16234v1 [cs.CV])
    Continual Learning is a step towards lifelong intelligence where models continuously learn from recently collected data without forgetting previous knowledge. Existing continual learning approaches mostly focus on image classification in the class-incremental setup with clear task boundaries and an unlimited computational budget. This work explores Online Domain-Incremental Continual Segmentation (ODICS), a real-world problem that arises in many applications, e.g., autonomous driving. In ODICS, the model is continually presented with batches of densely labeled images from different domains; computation is limited and no information about the task boundaries is available. In autonomous driving, this may correspond to the realistic scenario of training a segmentation model over time on a sequence of cities. We analyze several existing continual learning methods and show that they do not perform well in this setting despite working well in class-incremental segmentation. We propose SimCS, a parameter-free method complementary to existing ones that leverages simulated data as a continual learning regularizer. Extensive experiments show consistent improvements over different types of continual learning methods that use regularizers and even replay.
    Reusable Self-Attention-based Recommender System for Fashion. (arXiv:2211.16366v1 [cs.IR])
    A large number of empirical studies on applying self-attention models in the domain of recommender systems are based on offline evaluation and metrics computed on standardized datasets, without insights on how these models perform in real life scenarios. Moreover, many of them do not consider information such as item and customer metadata, although deep-learning recommenders live up to their full potential only when numerous features of heterogeneous types are included. Also, recommendation models are typically designed to serve well only a single use case, which increases modeling complexity and maintenance costs, and may lead to inconsistent customer experience. In this work, we present a reusable Attention-based Fashion Recommendation Algorithm (AFRA), that utilizes various interaction types with different fashion entities such as items (e.g., shirt), outfits and influencers, and their heterogeneous features. Moreover, we leverage temporal and contextual information to address both short and long-term customer preferences. We show its effectiveness on outfit recommendation use cases, in particular: 1) personalized ranked feed; 2) outfit recommendations by style; 3) similar item recommendation and 4) in-session recommendations inspired by most recent customer actions. We present both offline and online experimental results demonstrating substantial improvements in customer retention and engagement.
    Performance evaluation of deep segmentation models on Landsat-8 imagery. (arXiv:2211.14851v2 [cs.CV] UPDATED)
    Contrails, short for condensation trails, are line-shaped ice clouds produced by aircraft engine exhaust when aircraft fly through cold and humid air. They generate a greenhouse effect by absorbing, or directing back to Earth, approximately 33% of emitted outgoing longwave radiation, and they account for over half of the climate change resulting from aviation activities. Avoiding contrails by adjusting flight routes could be an inexpensive and effective way to reduce their impact. An accurate, automated, and reliable detection algorithm is required to develop and evaluate contrail avoidance strategies. Advancement in contrail detection has been severely limited by several factors, primarily a lack of quality-labeled data. Recently, a large human-labeled Landsat-8 contrails dataset was proposed, in which each contrail is carefully labeled with various inputs across a variety of Landsat-8 satellite imagery scenes. In this work, we benchmark several popular segmentation models with combinations of different loss functions and encoder backbones. This work is the first to apply state-of-the-art segmentation techniques to detect contrails in low-orbit satellite imagery. Our work can also be used as an open benchmark for contrail segmentation and is publicly available.
    Beyond Ensemble Averages: Leveraging Climate Model Ensembles for Subseasonal Forecasting. (arXiv:2211.15856v1 [cs.LG])
    Producing high-quality forecasts of key climate variables such as temperature and precipitation on subseasonal time scales has long been a gap in operational forecasting. Recent studies have shown promising results using machine learning (ML) models to advance subseasonal forecasting (SSF), but several open questions remain. First, several past approaches use the average of an ensemble of physics-based forecasts as an input feature of these models. However, ensemble forecasts contain information that can aid prediction beyond only the ensemble mean. Second, past methods have focused on average performance, whereas forecasts of extreme events are far more important for planning and mitigation purposes. Third, climate forecasts correspond to a spatially-varying collection of forecasts, and different methods account for spatial variability in the response differently. Trade-offs between different approaches may be mitigated with model stacking. This paper describes the application of a variety of ML methods to predict monthly average precipitation and 2-meter temperature using physics-based predictions (ensemble forecasts) and observational data such as relative humidity, pressure at sea level, or geopotential height, two weeks in advance for the whole continental United States. Regression, quantile regression, and tercile classification tasks using linear models, random forests, convolutional neural networks, and stacked models are considered. The proposed models outperform common baselines such as historical averages (or quantiles) and ensemble averages (or quantiles). This paper further includes an investigation of feature importance, trade-offs between using the full ensemble or only the ensemble average, and different modes of accounting for spatial variability.
    Kernel Autocovariance Operators of Stationary Processes: Estimation and Convergence. (arXiv:2004.00891v2 [math.PR] UPDATED)
    We consider autocovariance operators of a stationary stochastic process on a Polish space that is embedded into a reproducing kernel Hilbert space. We investigate how empirical estimates of these operators converge along realizations of the process under various conditions. In particular, we examine ergodic and strongly mixing processes and obtain several asymptotic results as well as finite sample error bounds. We provide applications of our theory in terms of consistency results for kernel PCA with dependent data and the conditional mean embedding of transition probabilities. Finally, we use our approach to examine the nonparametric estimation of Markov transition operators and highlight how our theory can give a consistency analysis for a large family of spectral analysis methods including kernel-based dynamic mode decomposition.
    AutoML Two-Sample Test. (arXiv:2206.08843v2 [cs.LG] UPDATED)
    Two-sample tests are important in statistics and machine learning, both as tools for scientific discovery as well as to detect distribution shifts. This led to the development of many sophisticated test procedures going beyond the standard supervised learning frameworks, whose usage can require specialized knowledge about two-sample testing. We use a simple test that takes the mean discrepancy of a witness function as the test statistic and prove that minimizing a squared loss leads to a witness with optimal testing power. This allows us to leverage recent advancements in AutoML. Without any user input about the problems at hand, and using the same method for all our experiments, our AutoML two-sample test achieves competitive performance on a diverse distribution shift benchmark as well as on challenging two-sample testing problems. We provide an implementation of the AutoML two-sample test in the Python package autotst.
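    A hedged sketch of the witness-function test described above: fit a witness by minimising a squared loss on one half of the data, then use its mean discrepancy on the held-out half with a permutation p-value. This illustrates the statistic only and is not the autotst package API:

        import numpy as np
        from sklearn.neural_network import MLPRegressor

        rng = np.random.default_rng(0)
        X = rng.normal(0.0, 1.0, size=(500, 2))
        Y = rng.normal(0.3, 1.0, size=(500, 2))          # shifted distribution

        # First halves train the witness; second halves are used for the test.
        Z_tr = np.vstack([X[:250], Y[:250]])
        lbl = np.r_[np.ones(250), -np.ones(250)]         # squared-loss targets
        witness = MLPRegressor(hidden_layer_sizes=(32,), max_iter=500,
                               random_state=0).fit(Z_tr, lbl)

        wx, wy = witness.predict(X[250:]), witness.predict(Y[250:])
        stat = wx.mean() - wy.mean()                     # mean discrepancy of witness

        pooled = np.r_[wx, wy]
        perm = [np.abs(np.diff(rng.permutation(pooled).reshape(2, -1).mean(axis=1)))[0]
                for _ in range(1000)]
        print("statistic:", stat, " p-value:", np.mean(np.array(perm) >= abs(stat)))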
    Multimodal learning with graphs. (arXiv:2209.03299v4 [cs.LG] UPDATED)
    Artificial intelligence for graphs (graph AI) has achieved remarkable success in modeling complex systems, ranging from dynamic networks in biology to interacting particle systems in physics. However, the increasingly heterogeneous graph datasets call for multimodal methods that can combine different inductive biases: the set of assumptions that algorithms use to make predictions for inputs they have not encountered during training. Learning on multimodal graph datasets presents fundamental challenges because the inductive biases can vary by data modality and graphs might not be explicitly given in the input. To address these challenges, multimodal graph AI methods combine different modalities while leveraging cross-modal dependencies. Here, we survey 145 studies in graph AI and realize that diverse datasets are increasingly combined using graphs and fed into sophisticated multimodal methods, specified as image-intensive, knowledge-grounded and language-intensive models. Using this categorization, we introduce a blueprint for multimodal graph AI to study existing methods and guide the design of future methods.
    Self-Supervised Mental Disorder Classifiers via Time Reversal. (arXiv:2211.16398v1 [cs.LG])
    Data scarcity is a notable problem, especially in the medical domain, due to patient data laws. Efficient pre-training techniques could therefore help in combating this problem. In this paper, we demonstrate that a model trained on the time direction of functional neuro-imaging data can help in downstream tasks, for example, classifying diseases against healthy controls in fMRI data. We train a deep neural network on independent components derived from fMRI data using the independent component analysis (ICA) technique, so that it learns the time direction in the ICA-based data. This pre-trained model is further trained to classify brain disorders on different datasets. Through various experiments, we show that learning the time direction helps a model learn some causal relations in fMRI data that aid faster convergence; consequently, the model generalizes well in downstream classification tasks even with fewer data records.
    Birds of a Feather Trust Together: Knowing When to Trust a Classifier via Adaptive Neighborhood Aggregation. (arXiv:2211.16466v1 [cs.LG])
    How do we know when the predictions made by a classifier can be trusted? This is a fundamental problem that also has immense practical applicability, especially in safety-critical areas such as medicine and autonomous driving. The de facto approach of using the classifier's softmax outputs as a proxy for trustworthiness suffers from the over-confidence issue, while the most recent works incur problems such as additional retraining cost and accuracy versus trustworthiness trade-off. In this work, we argue that the trustworthiness of a classifier's prediction for a sample is highly associated with two factors: the sample's neighborhood information and the classifier's output. To combine the best of both worlds, we design a model-agnostic post-hoc approach NeighborAgg to leverage the two essential information via an adaptive neighborhood aggregation. Theoretically, we show that NeighborAgg is a generalized version of a one-hop graph convolutional network, inheriting the powerful modeling ability to capture the varying similarity between samples within each class. We also extend our approach to the closely related task of mislabel detection and provide a theoretical coverage guarantee to bound the false negative. Empirically, extensive experiments on image and tabular benchmarks verify our theory and suggest that NeighborAgg outperforms other methods, achieving state-of-the-art trustworthiness performance.
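    A rough sketch of the neighborhood-plus-softmax intuition (not the NeighborAgg model itself, which learns the aggregation): score trust by blending the classifier's confidence with the label agreement of the sample's nearest training neighbors. The fixed 50/50 blend below is an assumption for illustration.

        import numpy as np
        from sklearn.neighbors import NearestNeighbors

        def trust_scores(feats, pred_labels, pred_probs, train_feats, train_labels, k=10):
            nn = NearestNeighbors(n_neighbors=k).fit(train_feats)
            _, idx = nn.kneighbors(feats)
            # Fraction of neighbors whose training label agrees with the prediction.
            agreement = (train_labels[idx] == pred_labels[:, None]).mean(axis=1)
            return 0.5 * pred_probs + 0.5 * agreement   # simple fixed blend (assumed)

        rng = np.random.default_rng(0)
        train_feats, train_labels = rng.normal(size=(200, 16)), rng.integers(0, 3, 200)
        feats = rng.normal(size=(5, 16))
        print(trust_scores(feats, rng.integers(0, 3, 5), rng.random(5),
                           train_feats, train_labels))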
    Equivalence Between SE(3) Equivariant Networks via Steerable Kernels and Group Convolution. (arXiv:2211.15903v1 [cs.CG])
    A wide range of techniques have been proposed in recent years for designing neural networks for 3D data that are equivariant under rotation and translation of the input. Most approaches for equivariance under the Euclidean group $\mathrm{SE}(3)$ of rotations and translations fall within one of the two major categories. The first category consists of methods that use $\mathrm{SE}(3)$-convolution which generalizes classical $\mathbb{R}^3$-convolution on signals over $\mathrm{SE}(3)$. Alternatively, it is possible to use \textit{steerable convolution} which achieves $\mathrm{SE}(3)$-equivariance by imposing constraints on $\mathbb{R}^3$-convolution of tensor fields. It is known by specialists in the field that the two approaches are equivalent, with steerable convolution being the Fourier transform of $\mathrm{SE}(3)$ convolution. Unfortunately, these results are not widely known and moreover the exact relations between deep learning architectures built upon these two approaches have not been precisely described in the literature on equivariant deep learning. In this work we provide an in-depth analysis of both methods and their equivalence and relate the two constructions to multiview convolutional networks. Furthermore, we provide theoretical justifications of separability of $\mathrm{SE}(3)$ group convolution, which explain the applicability and success of some recent approaches. Finally, we express different methods using a single coherent formalism and provide explicit formulas that relate the kernels learned by different methods. In this way, our work helps to unify different previously-proposed techniques for achieving roto-translational equivariance, and helps to shed light on both the utility and precise differences between various alternatives. We also derive new TFN non-linearities from our equivalence principle and test them on practical benchmark datasets.
    Finding Front-Door Adjustment Sets in Linear Time. (arXiv:2211.16468v1 [cs.AI])
    Front-door adjustment is a classic technique to estimate causal effects from a specified directed acyclic graph (DAG) and observed data. The advantage of this approach is that it uses observed mediators to identify causal effects, which is possible even in the presence of unobserved confounding. While the statistical properties of the front-door estimation are quite well understood, its algorithmic aspects remained unexplored for a long time. Recently, Jeong, Tian, and Barenboim [NeurIPS 2022] have presented the first polynomial-time algorithm for finding sets satisfying the front-door criterion in a given DAG, with an $O(n^3(n+m))$ run time, where $n$ denotes the number of variables and $m$ the number of edges of the graph. In our work, we give the first linear-time, i.e. $O(n+m)$, algorithm for this task, which thus reaches the asymptotically optimal time complexity, as the size of the input is $\Omega(n+m)$. We also provide an algorithm to enumerate all front-door adjustment sets in a given DAG with delay $O(n(n + m))$. These results improve the algorithms by Jeong et al. [2022] for the two tasks by a factor of $n^3$, respectively.
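    For context, once a valid front-door set $M$ is found, the causal effect is identified by the classical front-door formula $P(y \mid do(x)) = \sum_{m} P(m \mid x) \sum_{x'} P(y \mid x', m) P(x')$, so faster discovery of admissible sets translates directly into faster effect estimation.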
    BoMD: Bag of Multi-label Descriptors for Noisy Chest X-ray Classification. (arXiv:2203.01937v2 [eess.IV] UPDATED)
    Deep learning methods have shown outstanding classification accuracy in medical imaging problems, which is largely attributed to the availability of large-scale datasets manually annotated with clean labels. However, given the high cost of such manual annotation, new medical imaging classification problems may need to rely on machine-generated noisy labels extracted from radiology reports. Indeed, many Chest X-ray (CXR) classifiers have already been modelled from datasets with noisy labels, but their training procedure is in general not robust to noisy-label samples, leading to sub-optimal models. Furthermore, CXR datasets are mostly multi-label, so current noisy-label learning methods designed for multi-class problems cannot be easily adapted. In this paper, we propose a new method designed for noisy multi-label CXR learning, which detects and smoothly re-labels samples from the dataset; the re-labelled dataset is then used to train common multi-label classifiers. The proposed method optimises a bag of multi-label descriptors (BoMD) to promote their similarity with the semantic descriptors produced by BERT models from the multi-label image annotation. Our experiments on diverse noisy multi-label training sets and clean testing sets show that our model has state-of-the-art accuracy and robustness in many CXR multi-label classification benchmarks.
    If your data distribution shifts, use self-learning. (arXiv:2104.12928v3 [cs.CV] UPDATED)
    We demonstrate that self-learning techniques like entropy minimization and pseudo-labeling are simple and effective at improving performance of a deployed computer vision model under systematic domain shifts. We conduct a wide range of large-scale experiments and show consistent improvements irrespective of the model architecture, the pre-training technique or the type of distribution shift. At the same time, self-learning is simple to use in practice because it does not require knowledge or access to the original training data or scheme, is robust to hyperparameter choices, is straightforward to implement and requires only a few adaptation epochs. This makes self-learning techniques highly attractive for any practitioner who applies machine learning algorithms in the real world. We present state-of-the-art adaptation results on CIFAR10-C (8.5% error), ImageNet-C (22.0% mCE), ImageNet-R (17.4% error) and ImageNet-A (14.8% error), theoretically study the dynamics of self-supervised adaptation methods and propose a new classification dataset (ImageNet-D) which is challenging even with adaptation.
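    A minimal PyTorch sketch of the entropy-minimization flavor of self-learning described here, assuming only an unlabeled target-domain loader; in practice one would typically update just a subset of parameters (e.g. normalization layers) and combine this with the robust loss variants the paper studies.

        import torch
        import torch.nn.functional as F

        def adapt_by_entropy_minimization(model, target_loader, epochs=1, lr=1e-4):
            """Test-time adaptation: minimize prediction entropy on unlabeled
            target-domain batches; no source data or labels required."""
            opt = torch.optim.Adam(model.parameters(), lr=lr)
            model.train()  # lets BatchNorm re-estimate target-domain statistics
            for _ in range(epochs):
                for x, *_ in target_loader:
                    probs = F.softmax(model(x), dim=1)
                    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(1).mean()
                    opt.zero_grad()
                    entropy.backward()
                    opt.step()
            return model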
    G-CMP: Graph-enhanced Contextual Matrix Profile for unsupervised anomaly detection in sensor-based remote health monitoring. (arXiv:2211.16122v1 [cs.LG])
    Sensor-based remote health monitoring is used in industrial, urban and healthcare settings to monitor ongoing operation of equipment and human health. An important aim is to intervene early if anomalous events or adverse health is detected. In the wild, these anomaly detection approaches are challenged by noise, label scarcity, high dimensionality, explainability and wide variability in operating environments. The Contextual Matrix Profile (CMP) is a configurable 2-dimensional version of the Matrix Profile (MP) that uses the distance matrix of all subsequences of a time series to discover patterns and anomalies. The CMP is shown to enhance the effectiveness of the MP and other state-of-the-art methods at detecting, visualising and interpreting true anomalies in noisy real world data from different domains. It excels at zooming out and identifying temporal patterns at configurable time scales. However, the CMP does not address cross-sensor information, and cannot scale to high dimensional data. We propose a novel, self-supervised graph-based approach for temporal anomaly detection that works on context graphs generated from the CMP distance matrix. The learned graph embeddings encode the anomalous nature of a time context. In addition, we evaluate other graph outlier algorithms for the same task. Given our pipeline is modular, graph construction, generation of graph embeddings, and pattern recognition logic can all be chosen based on the specific pattern detection application. We verified the effectiveness of graph-based anomaly detection and compared it with the CMP and three state-of-the-art methods on two real-world healthcare datasets with different anomalies. Our proposed method demonstrated better recall, alert rate and generalisability.
    MedalCare-XL: 16,900 healthy and pathological 12 lead ECGs obtained through electrophysiological simulations. (arXiv:2211.15997v1 [physics.med-ph])
    Mechanistic cardiac electrophysiology models allow for personalized simulations of the electrical activity in the heart and the ensuing electrocardiogram (ECG) on the body surface. As such, synthetic signals possess known ground truth labels of the underlying disease and can be employed for validation of machine learning ECG analysis tools in addition to clinical signals. Recently, synthetic ECGs were used to enrich sparse clinical data or even replace them completely during training, leading to improved performance on real-world clinical test data. We thus generated a novel synthetic database comprising a total of 16,900 12-lead ECGs based on electrophysiological simulations equally distributed into healthy control and 7 pathology classes. The pathological case of myocardial infarction had 6 sub-classes. A comparison of extracted features between the virtual cohort and a publicly available clinical ECG database demonstrated that the synthetic signals represent clinical ECGs for healthy and pathological subpopulations with high fidelity. The ECG database is split into training, validation, and test folds for development and objective assessment of novel machine learning algorithms.
    Training Language Models with Memory Augmentation. (arXiv:2205.12674v3 [cs.CL] UPDATED)
    Recent work has improved language models (LMs) remarkably by equipping them with a non-parametric memory component. However, most existing approaches only introduce memories at testing time or represent them using a separately trained encoder, resulting in suboptimal training of the language model. In this work, we present TRIME, a novel yet simple training approach designed for training LMs with memory augmentation. Our approach uses a training objective that directly takes in-batch examples as accessible memory. We also present new methods for memory construction and data batching, which are used for adapting to different sets of memories--local, long-term, and external memory--at testing time. We evaluate TRIME on multiple language modeling and machine translation benchmarks and show that it is able to achieve significant improvements across all the settings. Concretely, TRIME reduces the perplexity from 18.70 to 15.37 on WIKITEXT-103, by effectively leveraging a large memory set from the training corpus. Compared to standard LM training, TRIME adds negligible computational overhead and is compatible with different neural architectures, making it a versatile solution for training memory-augmented LMs.
    Better Generalized Few-Shot Learning Even Without Base Data. (arXiv:2211.16095v1 [cs.LG])
    This paper introduces and studies zero-base generalized few-shot learning (zero-base GFSL), which is an extreme yet practical version of the few-shot learning problem. Motivated by the cases where base data is not available due to privacy or ethical issues, the goal of zero-base GFSL is to newly incorporate the knowledge of few samples of novel classes into a pretrained model without any samples of base classes. Our analysis shows that both the mean and the variance of the weight distribution of novel classes are not properly established, compared to those of base classes. Existing GFSL methods attempt to balance the weight norms, which we find helps only the variance part, while overlooking the importance of the mean of the weights, particularly for novel classes; this leads to limited performance in the GFSL problem even with base data. In this paper, we overcome this limitation by proposing a simple yet effective normalization method that can effectively control both mean and variance of the weight distribution of novel classes without using any base samples and thereby achieve a satisfactory performance on both novel and base classes. Our experimental results somewhat surprisingly show that the proposed zero-base GFSL method that does not utilize any base samples even outperforms the existing GFSL methods that make the best use of base data.
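    One plausible reading of such a normalization, sketched below in PyTorch: standardize each novel-class weight vector and rescale it to the mean and standard deviation typical of the base-class weight vectors, touching only the classifier weights (no base samples needed). The paper's exact transformation may differ.

        import torch

        def normalize_novel_weights(W, novel_idx):
            """Match per-class mean/std of novel-class classifier weights (rows of
            W, one per class) to the statistics observed on base-class rows."""
            base_idx = [i for i in range(W.size(0)) if i not in set(novel_idx)]
            base = W[base_idx]
            tgt_mean = base.mean(dim=1).mean()   # typical per-class weight mean
            tgt_std = base.std(dim=1).mean()     # typical per-class weight std
            W = W.clone()
            for i in novel_idx:
                w = W[i]
                W[i] = (w - w.mean()) / (w.std() + 1e-8) * tgt_std + tgt_mean
            return W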
    Linear Causal Disentanglement via Interventions. (arXiv:2211.16467v1 [stat.ML])
    Causal disentanglement seeks a representation of data involving latent variables that relate to one another via a causal model. A representation is identifiable if both the latent model and the transformation from latent to observed variables are unique. In this paper, we study observed variables that are a linear transformation of a linear latent causal model. Data from interventions are necessary for identifiability: if one latent variable is missing an intervention, we show that there exist distinct models that cannot be distinguished. Conversely, we show that a single intervention on each latent variable is sufficient for identifiability. Our proof uses a generalization of the RQ decomposition of a matrix that replaces the usual orthogonal and upper triangular conditions with analogues depending on a partial order on the rows of the matrix, with partial order determined by a latent causal model. We corroborate our theoretical results with a method for causal disentanglement that accurately recovers a latent causal model.
    Synthetic data enable experiments in atomistic machine learning. (arXiv:2211.16443v1 [physics.chem-ph])
    Machine-learning models are increasingly used to predict properties of atoms in chemical systems. There have been major advances in developing descriptors and regression frameworks for this task, typically starting from (relatively) small sets of quantum-mechanical reference data. Larger datasets of this kind are becoming available, but remain expensive to generate. Here we demonstrate the use of a large dataset that we have "synthetically" labelled with per-atom energies from an existing ML potential model. The cheapness of this process, compared to the quantum-mechanical ground truth, allows us to generate millions of datapoints, in turn enabling rapid experimentation with atomistic ML models from the small- to the large-data regime. This approach allows us here to compare regression frameworks in depth, and to explore visualisation based on learned representations. We also show that learning synthetic data labels can be a useful pre-training task for subsequent fine-tuning on small datasets. In the future, we expect that our open-sourced dataset, and similar ones, will be useful in rapidly exploring deep-learning models in the limit of abundant chemical data.
    On Robust Learning from Noisy Labels: A Permutation Layer Approach. (arXiv:2211.15890v1 [cs.LG])
    The existence of label noise imposes significant challenges (e.g., poor generalization) on the training process of deep neural networks (DNN). As a remedy, this paper introduces a permutation layer learning approach termed PermLL to dynamically calibrate the training process of the DNN subject to instance-dependent and instance-independent label noise. The proposed method augments the architecture of a conventional DNN by an instance-dependent permutation layer. This layer is essentially a convex combination of permutation matrices that is dynamically calibrated for each sample. The primary objective of the permutation layer is to correct the loss of noisy samples, mitigating the effect of label noise. We provide two variants of PermLL in this paper: one applies the permutation layer to the model's prediction, while the other applies it directly to the given noisy label. In addition, we provide a theoretical comparison between the two variants and show that previous methods can be seen as one of the variants. Finally, we validate PermLL experimentally and show that it achieves state-of-the-art performance on both real and synthetic datasets.
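    A minimal sketch of the "apply to the prediction" variant as we read it: a per-sample gate produces convex weights over a fixed bank of permutation matrices, and the resulting (doubly stochastic) combination reorders the class-probability vector. The gating network and the choice of candidate permutations are illustrative assumptions.

        import torch
        import torch.nn as nn

        class PermutationLayer(nn.Module):
            """Instance-dependent convex combination of fixed permutation
            matrices, applied to the model's class-probability vector."""
            def __init__(self, feat_dim, perms):
                super().__init__()
                # perms: list of (C, C) permutation matrices, stacked as a buffer
                self.register_buffer("perms", torch.stack(perms))   # (K, C, C)
                self.gate = nn.Linear(feat_dim, len(perms))          # per-sample logits

            def forward(self, features, probs):
                alpha = torch.softmax(self.gate(features), dim=1)    # (B, K) convex weights
                P = torch.einsum("bk,kij->bij", alpha, self.perms)   # (B, C, C), doubly stochastic
                return torch.einsum("bij,bj->bi", P, probs)          # calibrated prediction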
    Improving Commonsense in Vision-Language Models via Knowledge Graph Riddles. (arXiv:2211.16504v1 [cs.CV])
    This paper focuses on analyzing and improving the commonsense ability of recent popular vision-language (VL) models. Despite the great success, we observe that existing VL-models still lack commonsense knowledge/reasoning ability (e.g., "Lemons are sour"), which is a vital component towards artificial general intelligence. Through our analysis, we find one important reason is that existing large-scale VL datasets do not contain much commonsense knowledge, which motivates us to improve the commonsense of VL-models from the data perspective. Rather than collecting a new VL training dataset, we propose a more scalable strategy, i.e., "Data Augmentation with kNowledge graph linearization for CommonsensE capability" (DANCE). It can be viewed as one type of data augmentation technique, which can inject commonsense knowledge into existing VL datasets on the fly during training. More specifically, we leverage the commonsense knowledge graph (e.g., ConceptNet) and create variants of text description in VL datasets via bidirectional sub-graph sequentialization. For better commonsense evaluation, we further propose the first retrieval-based commonsense diagnostic benchmark. By conducting extensive experiments on some representative VL-models, we demonstrate that our DANCE technique is able to significantly improve the commonsense ability while maintaining the performance on vanilla retrieval tasks. The code and data are available at https://github.com/pleaseconnectwifi/DANCE
    Finding mixed-strategy equilibria of continuous-action games without gradients using randomized policy networks. (arXiv:2211.15936v1 [cs.GT])
    We study the problem of computing an approximate Nash equilibrium of a continuous-action game without access to gradients. Such game access is common in reinforcement learning settings, where the environment is typically treated as a black box. To tackle this problem, we apply zeroth-order optimization techniques that combine smoothed gradient estimators with equilibrium-finding dynamics. We model players' strategies using artificial neural networks. In particular, we use randomized policy networks to model mixed strategies. These take noise in addition to an observation as input and can flexibly represent arbitrary observation-dependent, continuous-action distributions. Being able to model such mixed strategies is crucial for tackling continuous-action games that lack pure-strategy equilibria. We evaluate the performance of our method using an approximation of the Nash convergence metric from game theory, which measures how much players can benefit from unilaterally changing their strategy. We apply our method to continuous Colonel Blotto games, single-item and multi-item auctions, and a visibility game. The experiments show that our method can quickly find high-quality approximate equilibria. Furthermore, they show that the dimensionality of the input noise is crucial for performance. To our knowledge, this paper is the first to solve general continuous-action games with unrestricted mixed strategies and without any gradient information.
    A3T: Accuracy Aware Adversarial Training. (arXiv:2211.16316v1 [cs.LG])
    Adversarial training has been empirically shown to be more prone to overfitting than standard training. The exact underlying reasons still need to be fully understood. In this paper, we identify one cause of overfitting related to current practices of generating adversarial samples from misclassified samples. To address this, we propose an alternative approach that leverages the misclassified samples to mitigate the overfitting problem. We show that our approach achieves better generalization while having comparable robustness to state-of-the-art adversarial training methods on a wide range of computer vision, natural language processing, and tabular tasks.
    Fast Hyperparameter Tuning for Ising Machines. (arXiv:2211.15869v1 [cs.LG])
    In this paper, we propose a novel technique to accelerate hyperparameter tuning for Ising machines. Firstly, we define Ising machine performance and explain the goal of hyperparameter tuning with regard to this performance definition. Secondly, we compare well-known hyperparameter tuning techniques, namely random sampling and the Tree-structured Parzen Estimator (TPE), on different combinatorial optimization problems. Thirdly, we propose a new convergence acceleration method for TPE, which we call "FastConvergence". It aims at limiting the number of TPE trials required to reach the best-performing combination of hyperparameter values. We compare FastConvergence to the aforementioned well-known hyperparameter tuning techniques to show its effectiveness. For experiments, well-known Traveling Salesman Problem (TSP) and Quadratic Assignment Problem (QAP) instances are used as input. The Ising machine used is Fujitsu's third-generation Digital Annealer (DA). Results show that, in most cases, FastConvergence can reach results similar to TPE alone within less than half the number of trials.
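    For reference, the sketch below runs the plain TPE baseline with Optuna on a toy surrogate objective. In the paper's setting, the objective would instead run the Digital Annealer on a TSP/QAP instance and return its performance metric; FastConvergence itself is the authors' extension and is not shown here. The hyperparameter names are illustrative.

        import optuna

        def objective(trial):
            # Stand-in for Ising-machine performance: in the paper this would run
            # the annealer with the sampled hyperparameters and return solution
            # quality / time-to-solution.
            num_sweeps = trial.suggest_int("num_sweeps", 100, 100_000, log=True)
            beta = trial.suggest_float("inverse_temperature", 0.01, 10.0, log=True)
            return (beta - 1.5) ** 2 + abs(num_sweeps - 20_000) / 20_000

        study = optuna.create_study(sampler=optuna.samplers.TPESampler(seed=0),
                                    direction="minimize")
        study.optimize(objective, n_trials=50)
        print(study.best_params)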
    FastCorrect: Fast Error Correction with Edit Alignment for Automatic Speech Recognition. (arXiv:2105.03842v6 [cs.CL] UPDATED)
    Error correction techniques have been used to refine the output sentences from automatic speech recognition (ASR) models and achieve a lower word error rate (WER) than original ASR outputs. Previous works usually use a sequence-to-sequence model to correct an ASR output sentence autoregressively, which causes large latency and cannot be deployed in online ASR services. A straightforward solution to reduce latency, inspired by non-autoregressive (NAR) neural machine translation, is to use an NAR sequence generation model for ASR error correction, which, however, comes at the cost of significantly increased ASR error rate. In this paper, observing distinctive error patterns and correction operations (i.e., insertion, deletion, and substitution) in ASR, we propose FastCorrect, a novel NAR error correction model based on edit alignment. In training, FastCorrect aligns each source token from an ASR output sentence to the target tokens from the corresponding ground-truth sentence based on the edit distance between the source and target sentences, and extracts the number of target tokens corresponding to each source token during editing/correction, which is then used to train a length predictor and to adjust the source tokens to match the length of the target sentence for parallel generation. In inference, the token number predicted by the length predictor is used to adjust the source tokens for target sequence generation. Experiments on the public AISHELL-1 dataset and an internal industrial-scale ASR dataset show the effectiveness of FastCorrect for ASR error correction: 1) it speeds up the inference by 6-9 times and maintains the accuracy (8-14% WER reduction) compared with the autoregressive correction model; and 2) it outperforms the popular NAR models adopted in neural machine translation and text editing by a large margin.
    Configurable Agent With Reward As Input: A Play-Style Continuum Generation. (arXiv:2211.16221v1 [cs.AI])
    Modern video games are becoming richer and more complex in terms of game mechanics. This complexity allows for the emergence of a wide variety of ways to play the game across players. From the point of view of the game designer, this means that one needs to anticipate a lot of different ways the game could be played. Machine Learning (ML) could help address this issue. More precisely, Reinforcement Learning is a promising answer to the need of automating video game testing. In this paper we present a video game environment which lets us define multiple play-styles. We then introduce CARI: a Configurable Agent with Reward as Input, an agent able to simulate a wide continuum of play-styles. It is not constrained to extreme archetypal behaviors like current methods using reward shaping. In addition, it achieves this through a single training loop, instead of the usual one loop per play-style. We compare this novel training approach with the more classic reward shaping approach and conclude that CARI can also outperform the baseline on archetype generation. This novel agent could be used to investigate behaviors and balancing during the production of a video game with a realistic amount of training time.
    On Learning Fairness and Accuracy on Multiple Subgroups. (arXiv:2210.10837v2 [stat.ML] UPDATED)
    We propose an analysis in fair learning that preserves the utility of the data while reducing prediction disparities under the criteria of group sufficiency. We focus on the scenario where the data contains multiple or even many subgroups, each with limited number of samples. As a result, we present a principled method for learning a fair predictor for all subgroups via formulating it as a bilevel objective. Specifically, the subgroup specific predictors are learned in the lower-level through a small amount of data and the fair predictor. In the upper-level, the fair predictor is updated to be close to all subgroup specific predictors. We further prove that such a bilevel objective can effectively control the group sufficiency and generalization error. We evaluate the proposed framework on real-world datasets. Empirical evidence suggests the consistently improved fair predictions, as well as the comparable accuracy to the baselines.
    Transformers Can Be Translated to First-Order Logic with Majority Quantifiers. (arXiv:2210.02671v2 [cs.LG] UPDATED)
    Characterizing the implicit structure of the computation within neural networks is a foundational problem in the area of deep learning interpretability. Can their inner decision process be captured symbolically in some familiar logic? We show that any transformer neural network can be translated into an equivalent fixed-size first-order logic formula which may also use majority quantifiers. The idea is to simulate transformers with highly uniform threshold circuits and leverage known theoretical connections between circuits and logic. Our findings also reveal the surprising fact that the entire transformer computation can be reduced merely to the division of two (large) integers. While our results are most pertinent for transformers, they apply equally to a broader class of neural network architectures, namely those with a fixed-depth uniform computation graph made up of standard neural net components, which includes feedforward and convolutional networks.
    NCTV: Neural Clamping Toolkit and Visualization for Neural Network Calibration. (arXiv:2211.16274v1 [cs.LG])
    With the advancement of deep learning technology, neural networks have demonstrated their excellent ability to provide accurate predictions in many tasks. However, a neural network that lacks proper calibration will not gain trust from humans, even if it is highly accurate. In this regard, the gap between the confidence of the model's predictions and the actual correctness likelihood must be bridged to derive a well-calibrated model. In this paper, we introduce the Neural Clamping Toolkit, the first open-source framework designed to help developers employ state-of-the-art model-agnostic calibrated models. Furthermore, we provide animations and interactive sections in the demonstration to familiarize researchers with calibration in neural networks. A Colab tutorial on utilizing our toolkit is also introduced.
    Offline Reinforcement Learning with Closed-Form Policy Improvement Operators. (arXiv:2211.15956v1 [cs.LG])
    Behavior constrained policy optimization has been demonstrated to be a successful paradigm for tackling Offline Reinforcement Learning. By exploiting historical transitions, a policy is trained to maximize a learned value function while constrained by the behavior policy to avoid a significant distributional shift. In this paper, we propose our closed-form policy improvement operators. We make a novel observation that the behavior constraint naturally motivates the use of first-order Taylor approximation, leading to a linear approximation of the policy objective. Additionally, as practical datasets are usually collected by heterogeneous policies, we model the behavior policies as a Gaussian Mixture and overcome the induced optimization difficulties by leveraging the LogSumExp's lower bound and Jensen's Inequality, giving rise to a closed-form policy improvement operator. We instantiate offline RL algorithms with our novel policy improvement operators and empirically demonstrate their effectiveness over state-of-the-art algorithms on the standard D4RL benchmark.
    TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second. (arXiv:2207.01848v4 [cs.LG] UPDATED)
    We present TabPFN, a trained Transformer that can do supervised classification for small tabular datasets in less than a second, needs no hyperparameter tuning and is competitive with state-of-the-art classification methods. TabPFN is fully entailed in the weights of our network, which accepts training and test samples as a set-valued input and yields predictions for the entire test set in a single forward pass. TabPFN is a Prior-Data Fitted Network (PFN) and is trained offline once, to approximate Bayesian inference on synthetic datasets drawn from our prior. This prior incorporates ideas from causal reasoning: It entails a large space of structural causal models with a preference for simple structures. On the 18 datasets in the OpenML-CC18 suite that contain up to 1 000 training data points, up to 100 purely numerical features without missing values, and up to 10 classes, we show that our method clearly outperforms boosted trees and performs on par with complex state-of-the-art AutoML systems with up to 70$\times$ speedup. This increases to a 3200$\times$ speedup when a GPU is available. We also validate these results on an additional 67 small numerical datasets from OpenML. We provide all our code, the trained TabPFN, an interactive browser demo and a Colab notebook at https://github.com/automl/TabPFN.
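    The released package is scikit-learn compatible, so usage reduces to roughly the following (constructor options may vary across package versions):

        # pip install tabpfn
        from sklearn.datasets import load_breast_cancer
        from sklearn.model_selection import train_test_split
        from tabpfn import TabPFNClassifier

        X, y = load_breast_cancer(return_X_y=True)   # small, numeric, <10 classes
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

        clf = TabPFNClassifier(device="cpu")         # no hyperparameter tuning needed
        clf.fit(X_tr, y_tr)                          # the prior is pre-trained;
        print("accuracy:", clf.score(X_te, y_te))    # prediction is one forward pass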
    Flow Annealed Importance Sampling Bootstrap. (arXiv:2208.01893v2 [cs.LG] UPDATED)
    Normalizing flows are tractable density models that can approximate complicated target distributions, e.g. Boltzmann distributions of physical systems. However, current methods for training flows either suffer from mode-seeking behavior, use samples from the target generated beforehand by expensive MCMC simulations, or use stochastic losses that have high variance. To avoid these problems, we augment flows with annealed importance sampling (AIS) and minimize the mass-covering $\alpha$-divergence with $\alpha=2$, which minimizes importance weight variance. Our method, Flow AIS Bootstrap (FAB), uses AIS to generate samples in regions where the flow is a poor approximation of the target, facilitating the discovery of new modes. We apply FAB to complex multimodal targets and show that we can approximate them very accurately where previous methods fail. To the best of our knowledge, we are the first to learn the Boltzmann distribution of the alanine dipeptide molecule using only the unnormalized target density, without access to samples generated via Molecular Dynamics (MD) simulations: FAB produces better results than training via maximum likelihood on MD samples while using 100 times fewer target evaluations. After reweighting samples with importance weights, we obtain unbiased histograms of dihedral angles that are almost identical to the ground truth.
    Logical Activation Functions: Logit-space equivalents of Probabilistic Boolean Operators. (arXiv:2110.11940v2 [cs.LG] UPDATED)
    The choice of activation functions and their motivation is a long-standing issue within the neural network community. Neuronal representations within artificial neural networks are commonly understood as logits, representing the log-odds score of presence of features within the stimulus. We derive logit-space operators equivalent to probabilistic Boolean logic-gates AND, OR, and XNOR for independent probabilities. Such theories are important to formalize more complex dendritic operations in real neurons, and these operations can be used as activation functions within a neural network, introducing probabilistic Boolean-logic as the core operation of the neural network. Since these functions involve taking multiple exponents and logarithms, they are computationally expensive and not well suited to be directly used within neural networks. Consequently, we construct efficient approximations named $\text{AND}_\text{AIL}$ (the AND operator Approximate for Independent Logits), $\text{OR}_\text{AIL}$, and $\text{XNOR}_\text{AIL}$, which utilize only comparison and addition operations, have well-behaved gradients, and can be deployed as activation functions in neural networks. Like MaxOut, $\text{AND}_\text{AIL}$ and $\text{OR}_\text{AIL}$ are generalizations of ReLU to two-dimensions. While our primary aim is to formalize dendritic computations within a logit-space probabilistic-Boolean framework, we deploy these new activation functions, both in isolation and in conjunction to demonstrate their effectiveness on a variety of tasks including image classification, transfer learning, abstract reasoning, and compositional zero-shot learning.
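    For concreteness, here is the exact logit-space AND that the cheap $\text{AND}_\text{AIL}$ approximation replaces: for independent logits $a, b$ with $p = \sigma(a)\sigma(b)$, it returns $\mathrm{logit}(p)$. The chain of exponents and logarithms below is precisely the cost the paper's comparison-and-addition approximations avoid (their closed forms are given in the paper and not reproduced here).

        import torch
        import torch.nn.functional as F

        def and_exact(a, b):
            """Exact logit-space AND for independent logits a, b:
            p = sigmoid(a) * sigmoid(b); returns logit(p)."""
            log_p = -(F.softplus(-a) + F.softplus(-b))   # log sigmoid(a)sigmoid(b)
            p = log_p.exp().clamp(max=1 - 1e-7)
            return log_p - torch.log1p(-p)               # log p - log(1 - p)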
    CLAS: Coordinating Multi-Robot Manipulation with Central Latent Action Spaces. (arXiv:2211.15824v1 [cs.RO])
    Multi-robot manipulation tasks involve various control entities that can be separated into dynamically independent parts. A typical example of such real-world tasks is dual-arm manipulation. Learning to naively solve such tasks with reinforcement learning is often unfeasible due to the sample complexity and exploration requirements growing with the dimensionality of the action and state spaces. Instead, we would like to handle such environments as multi-agent systems and have several agents control parts of the whole. However, decentralizing the generation of actions requires coordination across agents through a channel limited to information central to the task. This paper proposes an approach to coordinating multi-robot manipulation through learned latent action spaces that are shared across different agents. We validate our method in simulated multi-robot manipulation tasks and demonstrate improvement over previous baselines in terms of sample efficiency and learning performance.
    Coder Reviewer Reranking for Code Generation. (arXiv:2211.16490v1 [cs.LG])
    Sampling diverse programs from a code language model and reranking with model likelihood is a popular method for code generation, but it is prone to preferring degenerate solutions. Inspired by collaborative programming, we propose Coder-Reviewer reranking. We augment Coder language models from past work, which generate programs given language instructions, with Reviewer models, which evaluate the likelihood of the instruction given the generated programs. We perform an extensive study across six datasets with eight models from three model families. Experimental results show that Coder-Reviewer reranking leads to consistent and significant improvement (up to 17% absolute accuracy gain) over reranking with the Coder model only. When combined with executability filtering, Coder-Reviewer reranking can often outperform the minimum Bayes risk method. Coder-Reviewer reranking is easy to implement by prompting, can generalize to different programming languages, and works well with off-the-shelf hyperparameters.
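    The reranking step itself is a one-liner once both likelihoods are available. The sketch below assumes `coder_logp(program, instruction)` and `reviewer_logp(instruction, program)` are scoring callables backed by the two language models, and that non-executable candidates have already been filtered out.

        def coder_reviewer_rerank(candidates, coder_logp, reviewer_logp, instruction):
            """Pick the program maximizing
            log p(program | instruction) + log p(instruction | program)."""
            def score(program):
                return (coder_logp(program, instruction)
                        + reviewer_logp(instruction, program))
            return max(candidates, key=score)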
    Bayesian Semiparametric Model for Sequential Treatment Decisions with Informative Timing. (arXiv:2211.16393v1 [stat.ME])
    We develop a Bayesian semi-parametric model for estimating the impact of dynamic treatment rules on survival among patients diagnosed with pediatric acute myeloid leukemia (AML). The data consist of a subset of patients enrolled in the phase III AAML1031 clinical trial in which patients move through a sequence of four treatment courses. At each course, they undergo treatment that may or may not include anthracyclines (ACT). While ACT is known to be effective at treating AML, it is also cardiotoxic and can lead to early death for some patients. Our task is to estimate the potential survival probability under hypothetical dynamic ACT treatment strategies, but there are several impediments. First, since ACT was not randomized in the trial, its effect on survival is confounded over time. Second, subjects initiate the next course depending on when they recover from the previous course, making timing potentially informative of subsequent treatment and survival. Third, patients may die or drop out before ever completing the full treatment sequence. We develop a generative Bayesian semi-parametric model based on Gamma Process priors to address these complexities. At each treatment course, the model captures subjects' transition to subsequent treatment or death in continuous time under a given rule. A g-computation procedure is used to compute a posterior over potential survival probability that is adjusted for time-varying confounding. Using this approach, we conduct posterior inference for the efficacy of hypothetical treatment rules that dynamically modify ACT based on evolving cardiac function.
    The Surprising Effectiveness of Latent World Models for Continual Reinforcement Learning. (arXiv:2211.15944v1 [cs.LG])
    We study the use of model-based reinforcement learning methods, in particular world models, for continual reinforcement learning. In continual reinforcement learning, an agent is required to solve one task and then another sequentially while retaining performance and preventing forgetting on past tasks. World models offer a task-agnostic solution: they do not require knowledge of task changes. World models are a straightforward baseline for continual reinforcement learning for three main reasons. Firstly, forgetting in the world model is prevented by persisting existing experience replay buffers across tasks: experience from previous tasks is replayed for learning the world model. Secondly, they are sample efficient. Thirdly and finally, they offer a task-agnostic exploration strategy through the uncertainty in the trajectories generated by the world model. We show that world models are a simple and effective continual reinforcement learning baseline. We study their effectiveness on the Minigrid and Minihack continual reinforcement learning benchmarks and show that they outperform state-of-the-art task-agnostic continual reinforcement learning methods.
    Symmetry Detection in Trajectory Data for More Meaningful Reinforcement Learning Representations. (arXiv:2211.16381v1 [cs.LG])
    Knowledge of the symmetries of reinforcement learning (RL) systems can be used to create compressed and semantically meaningful representations of a low-level state space. We present a method of automatically detecting RL symmetries directly from raw trajectory data without requiring active control of the system. Our method generates candidate symmetries and trains a recurrent neural network (RNN) to discriminate between the original trajectories and the transformed trajectories for each candidate symmetry. The RNN discriminator's accuracy for each candidate reveals how symmetric the system is under that transformation. This information can be used to create high-level representations that are invariant to all symmetries on a dataset level and to communicate properties of the RL behavior to users. We show in experiments on two simulated RL use cases (a pusher robot and a UAV flying in wind) that our method can determine the symmetries underlying both the environment physics and the trained RL policy.
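    A compact PyTorch sketch of the core loop under simplifying assumptions (a single GRU discriminator, no held-out split; a faithful version would measure accuracy on held-out trajectories): discriminator accuracy near 0.5 indicates the candidate transformation is approximately a symmetry.

        import torch
        import torch.nn as nn

        def symmetry_score(trajs, transform, epochs=20):
            """Train a GRU to distinguish original from transformed trajectories;
            trajs: (N, T, D). Returns classification accuracy (0.5 ~ symmetric)."""
            x = torch.cat([trajs, transform(trajs)])
            y = torch.cat([torch.zeros(len(trajs)), torch.ones(len(trajs))])
            rnn = nn.GRU(trajs.size(-1), 32, batch_first=True)
            head = nn.Linear(32, 1)
            opt = torch.optim.Adam([*rnn.parameters(), *head.parameters()], lr=1e-3)
            for _ in range(epochs):
                _, h = rnn(x)                                   # h: (1, 2N, 32)
                logits = head(h[-1]).squeeze(1)
                loss = nn.functional.binary_cross_entropy_with_logits(logits, y)
                opt.zero_grad()
                loss.backward()
                opt.step()
            with torch.no_grad():
                preds = (head(rnn(x)[1][-1]).squeeze(1) > 0).float()
                return (preds == y).float().mean().item()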
    On "Deep Learning" Misconduct. (arXiv:2211.16350v1 [cs.LG])
    This is a theoretical paper, as a companion paper of the plenary talk for the same conference ISAIC 2022. In contrast to conscious learning, which develops a single network for a normal life and is the main topic of the plenary talk, it is necessary to address the currently widespread approach, so-called "Deep Learning". Although "Deep Learning" may use different learning modes, including supervised, reinforcement and adversarial modes, almost all "Deep Learning" projects apparently suffer from the same misconduct, called "data deletion" and "test on training data". Consequently, Deep Learning almost always was not tested at all. Why? The so-called "test set" was used in the Post-Selection step of the training stage. This paper establishes a theorem that a simple method called Pure-Guess Nearest Neighbor (PGNN) reaches any required errors on validation set and test set, including zero-error requirements, through the "Deep Learning" misconduct, as long as the test set is in the possession of the author and both the amount of storage space and the time of training are finite but unbounded. However, Deep Learning methods, like the PGNN method, apparently are not generalizable since they have never been tested at all by a valid test set.
    Joint Neural Architecture and Hyperparameter Search for Correlated Time Series Forecasting. (arXiv:2211.16126v1 [cs.LG])
    Sensors in cyber-physical systems often capture interconnected processes and thus emit correlated time series (CTS), the forecasting of which enables important applications. The key to successful CTS forecasting is to uncover the temporal dynamics of time series and the spatial correlations among time series. Deep learning-based solutions exhibit impressive performance at discerning these aspects. In particular, automated CTS forecasting, where the design of an optimal deep learning architecture is automated, enables forecasting accuracy that surpasses what has been achieved by manual approaches. However, automated CTS solutions remain in their infancy and are only able to find optimal architectures for predefined hyperparameters and scale poorly to large-scale CTS. To overcome these limitations, we propose SEARCH, a joint, scalable framework, to automatically devise effective CTS forecasting models. Specifically, we encode each candidate architecture and accompanying hyperparameters into a joint graph representation. We introduce an efficient Architecture-Hyperparameter Comparator (AHC) to rank all architecture-hyperparameter pairs, and we then further evaluate the top-ranked pairs to select a final result. Extensive experiments on six benchmark datasets demonstrate that SEARCH not only eliminates manual efforts but also is capable of better performance than manually designed and existing automatically designed CTS models. In addition, it shows excellent scalability to large CTS.
    Continuous Neural Algorithmic Planners. (arXiv:2211.15839v1 [cs.LG])
    Neural algorithmic reasoning studies the problem of learning algorithms with neural networks, especially with graph architectures. A recent proposal, XLVIN, reaps the benefits of using a graph neural network that simulates the value iteration algorithm in deep reinforcement learning agents. It allows model-free planning without access to privileged information about the environment, which is usually unavailable. However, XLVIN only supports discrete action spaces, and hence cannot be directly applied to most tasks of real-world interest. We expand XLVIN to continuous action spaces by discretization, and evaluate several selective expansion policies to deal with the large planning graphs. Our proposal, CNAP, demonstrates how neural algorithmic reasoning can make a measurable impact in higher-dimensional continuous control settings, such as MuJoCo, bringing gains in low-data settings and outperforming model-free baselines.
    On the Utility Recovery Incapability of Neural Net-based Differential Private Tabular Training Data Synthesizer under Privacy Deregulation. (arXiv:2211.15809v1 [cs.LG])
    Devising procedures for auditing the privacy-utility tradeoff of generative models is an important yet unresolved problem in practice. Existing work concentrates on investigating the side effect of the privacy constraint, in terms of utility degradation, under the train-on-synthetic, test-on-real paradigm of synthetic data training. We push this understanding of the privacy-utility tradeoff to the next level by observing the side effect of privacy deregulation on the utility of synthetic training data. Surprisingly, we discover the Utility Recovery Incapability of DP-CTGAN and PATE-CTGAN under privacy deregulation, raising concerns about their practical applications. The main message is that privacy deregulation does NOT always imply utility recovery.
    Data-efficient Modeling of Optical Matrix Multipliers Using Transfer Learning. (arXiv:2211.16038v1 [cs.LG])
    We demonstrate transfer learning-assisted neural network models for optical matrix multipliers with scarce measurement data. Our approach uses <10% of the experimental data needed for best performance and outperforms analytical models for a Mach-Zehnder interferometer mesh.
    Neural networks: solving the chemistry of the interstellar medium. (arXiv:2211.15688v1 [astro-ph.GA])
    Non-equilibrium chemistry is a key process in the study of the InterStellar Medium (ISM), in particular the formation of molecular clouds and thus stars. However, computationally it is among the most difficult tasks to include in astrophysical simulations, because of the typically high (>40) number of reactions, the short evolutionary timescales (about $10^4$ times less than the ISM dynamical time) and the characteristic non-linearity and stiffness of the associated Ordinary Differential Equations system (ODEs). In this proof of concept work, we show that Physics Informed Neural Networks (PINN) are a viable alternative to traditional ODE time integrators for stiff thermo-chemical systems, i.e. up to molecular hydrogen formation (9 species and 46 reactions). Testing different chemical networks in a wide range of densities ($-2< \log n/{\rm cm}^{-3}< 3$) and temperatures ($1 < \log T/{\rm K}< 5$), we find that a basic architecture can give a comfortable convergence only for simplified chemical systems: to properly capture the sudden chemical and thermal variations a Deep Galerkin Method is needed. Once trained ($\sim 10^3$ GPUhr), the PINN well reproduces the strong non-linear nature of the solutions (errors $\lesssim 10\%$) and can give speed-ups up to a factor of $\sim 200$ with respect to traditional ODE solvers. Further, the latter have completion times that vary by about $\sim 30\%$ for different initial $n$ and $T$, while the PINN method gives negligible variations. Both the speed-up and the potential improvement in load balancing imply that PINN-powered simulations are a very palatable way to solve complex chemical calculations in astrophysical and cosmological problems.
    Fake It Till You Make It: Towards Accurate Near-Distribution Novelty Detection. (arXiv:2205.14297v2 [cs.CV] UPDATED)
    We aim for image-based novelty detection. Despite considerable progress, existing models either fail or face a dramatic drop under the so-called "near-distribution" setting, where the differences between normal and anomalous samples are subtle. We first demonstrate that existing methods experience up to a 20% decrease in performance in the near-distribution setting. Next, we propose to exploit a score-based generative model to produce synthetic near-distribution anomalous data. Our model is then fine-tuned to distinguish such data from the normal samples. We provide a quantitative as well as qualitative evaluation of this strategy, and compare the results with a variety of GAN-based models. Effectiveness of our method for both the near-distribution and standard novelty detection is assessed through extensive experiments on datasets in diverse applications such as medical images, object classification, and quality control. This reveals that our method considerably improves over existing models, and consistently decreases the gap between the near-distribution and standard novelty detection performance. The code repository is available at https://github.com/rohban-lab/FITYMI.
    Diverse Gaussian Noise Consistency Regularization for Robustness and Uncertainty Calibration. (arXiv:2104.01231v5 [cs.LG] UPDATED)
    Deep neural networks achieve high prediction accuracy when the train and test distributions coincide. In practice though, various types of corruptions occur which deviate from this setup and cause severe performance degradations. Few methods have been proposed to address generalization in the presence of unforeseen domain shifts. In particular, digital noise corruptions arise commonly in practice during the image acquisition stage and present a significant challenge for current robustness approaches. In this paper, we propose a diverse Gaussian noise consistency regularization method for improving robustness of image classifiers under a variety of noise corruptions while still maintaining high clean accuracy. We derive bounds to motivate and understand the behavior of our Gaussian noise consistency regularization using a local loss landscape analysis. We show that this simple approach improves robustness against various unforeseen noise corruptions by 4.2-18.4% over adversarial training and other strong diverse data augmentation baselines across several benchmarks. Furthermore, when combined with state-of-the-art diverse data augmentation techniques, our method further improves robustness accuracy by 3.7% and uncertainty calibration by 5.5% for all common corruptions on several image classification benchmarks.
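    A minimal sketch of such a training loss as we read it: standard cross-entropy on clean inputs plus a KL consistency term tying predictions under several Gaussian noise scales to the clean prediction. The set of scales, the weighting, and the decision to stop gradients through the clean branch are illustrative assumptions.

        import torch
        import torch.nn.functional as F

        def gaussian_consistency_loss(model, x, y, sigmas=(0.1, 0.25, 0.5), lam=1.0):
            """Cross-entropy on clean inputs plus a consistency term pulling
            predictions on Gaussian-perturbed inputs toward the clean prediction."""
            logits_clean = model(x)
            loss = F.cross_entropy(logits_clean, y)
            p_clean = F.softmax(logits_clean, dim=1).detach()
            for s in sigmas:
                logits_noisy = model(x + s * torch.randn_like(x))
                loss = loss + (lam / len(sigmas)) * F.kl_div(
                    F.log_softmax(logits_noisy, dim=1), p_clean,
                    reduction="batchmean")
            return loss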
    PINNet: a deep neural network with pathway prior knowledge for Alzheimer's disease. (arXiv:2211.15669v1 [q-bio.QM])
    Identification of Alzheimer's Disease (AD)-related transcriptomic signatures from blood is important for early diagnosis of the disease. Deep learning techniques are potent classifiers for AD diagnosis, but most have been unable to identify biomarkers because of their lack of interpretability. To address these challenges, we propose a pathway information-based neural network (PINNet) to predict AD patients and analyze blood and brain transcriptomic signatures using an interpretable deep learning model. PINNet is a deep neural network (DNN) model with pathway prior knowledge from either the Gene Ontology or Kyoto Encyclopedia of Genes and Genomes databases. Then, a backpropagation-based model interpretation method was applied to reveal essential pathways and genes for predicting AD. We compared the performance of PINNet with a DNN model without a pathway. PINNet outperformed or matched the DNN without pathway information on blood and brain gene expression data, respectively. Moreover, PINNet considers more AD-related genes as essential features than DNN without a pathway in the learning process. Pathway analysis of protein-protein interaction modules of highly contributed genes showed that AD-related genes in blood were enriched with cell migration, PI3K-Akt, MAPK signaling, and apoptosis. The pathways enriched in the brain module included cell migration, PI3K-Akt, MAPK signaling, apoptosis, protein ubiquitination, and T-cell activation. Collectively, with prior knowledge about pathways, PINNet reveals essential pathways related to AD.  ( 2 min )
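    One common way to encode such pathway priors, sketched below in PyTorch: a linear layer whose gene-to-pathway connections are hard-masked by a binary membership matrix (e.g. built from GO or KEGG annotations), so each pathway node only sees its member genes. PINNet's exact architecture may differ.

        import torch
        import torch.nn as nn

        class PathwayLayer(nn.Module):
            """Gene-to-pathway layer with connections fixed by prior knowledge:
            mask[i, j] = 1 iff gene j belongs to pathway i."""
            def __init__(self, mask):
                super().__init__()
                self.register_buffer("mask", mask.float())     # (P, G) membership
                self.weight = nn.Parameter(torch.randn_like(self.mask) * 0.01)
                self.bias = nn.Parameter(torch.zeros(mask.size(0)))

            def forward(self, expr):                           # expr: (B, G)
                # Masked weights keep only biologically plausible connections.
                return expr @ (self.weight * self.mask).t() + self.bias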
    Confidence-Aware Graph Neural Networks for Learning Reliability Assessment Commitments. (arXiv:2211.15755v1 [cs.LG])
    Reliability Assessment Commitment (RAC) Optimization is increasingly important in grid operations due to larger shares of renewable generations in the generation mix and increased prediction errors. Independent System Operators (ISOs) also aim at using finer time granularities, longer time horizons, and possibly stochastic formulations for additional economic and reliability benefits. The goal of this paper is to address the computational challenges arising in extending the scope of RAC formulations. It presents RACLEARN that (1) uses Graph Neural Networks (GNN) to predict generator commitments and active line constraints, (2) associates a confidence value to each commitment prediction, (3) selects a subset of the high-confidence predictions, which are (4) repaired for feasibility, and (5) seeds a state-of-the-art optimization algorithm with the feasible predictions and the active constraints. Experimental results on exact RAC formulations used by the Midcontinent Independent System Operator (MISO) and an actual transmission network (8965 transmission lines, 6708 buses, 1890 generators, and 6262 load units) show that the RACLEARN framework can speed up RAC optimization by factors ranging from 2 to 4 with negligible loss in solution quality.  ( 2 min )
    Physics-guided deep learning for data scarcity. (arXiv:2211.15664v1 [cs.LG])
    Data are the core of deep learning (DL), and the quality of data significantly affects the performance of DL models. However, high-quality and well-annotated databases are hard or even impossible to acquire for use in many applications, such as structural risk estimation and medical diagnosis, which is an essential barrier that blocks the applications of DL in real life. Physics-guided deep learning (PGDL) is a novel type of DL that can integrate physics laws to train neural networks. It can be used for any systems that are controlled or governed by physics laws, such as mechanics, finance and medical applications. It has been shown that, with the additional information provided by physics laws, PGDL achieves great accuracy and generalisation when facing data scarcity. In this review, the details of PGDL are elucidated, and a structured overview of PGDL with respect to data scarcity in various applications is presented, including physics, engineering and medical applications. Moreover, the limitations and opportunities for current PGDL in terms of data scarcity are identified, and the future outlook for PGDL is discussed in depth.  ( 2 min )
    Predicting pathways for old and new metabolites through clustering. (arXiv:2211.15720v1 [q-bio.BM])
    The diverse metabolic pathways are fundamental to all living organisms, as they harvest energy, synthesize biomass components, produce molecules to interact with the microenvironment, and neutralize toxins. While discovery of new metabolites and pathways continues, the prediction of pathways for new metabolites can be challenging. It can take vast amounts of time to elucidate pathways for new metabolites; thus, according to HMDB only 60% of metabolites get assigned to pathways. Here, we present an approach to identify pathways based on metabolite structure. We extracted 201 features from SMILES annotations, and identified new metabolites from PubMed abstracts and HMDB. After applying clustering algorithms to both groups of features, we quantified correlations between metabolites, and found the clusters accurately linked 92% of known metabolites to their respective pathways. Thus, this approach could be valuable for predicting metabolic pathways for new metabolites.  ( 2 min )
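    A toy end-to-end version of this pipeline, assuming RDKit for SMILES featurization and k-means for clustering; the paper extracts 201 features and quantifies cluster-pathway correlations, whereas this sketch uses five illustrative descriptors and a handful of toy molecules.

        # pip install rdkit scikit-learn
        import numpy as np
        from rdkit import Chem
        from rdkit.Chem import Descriptors
        from sklearn.cluster import KMeans
        from sklearn.preprocessing import StandardScaler

        def featurize(smiles):
            """A few illustrative structural descriptors per metabolite."""
            mol = Chem.MolFromSmiles(smiles)
            return [Descriptors.MolWt(mol), Descriptors.MolLogP(mol),
                    Descriptors.TPSA(mol), Descriptors.NumHDonors(mol),
                    Descriptors.NumHAcceptors(mol)]

        smiles = ["CCO", "CC(=O)O", "C1=CC=CC=C1O", "NCC(=O)O"]  # toy metabolites
        X = StandardScaler().fit_transform(np.array([featurize(s) for s in smiles]))
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
        # A new metabolite would inherit the pathway(s) most common in its cluster.
        print(dict(zip(smiles, labels)))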
    Deep Semi-supervised Learning with Double-Contrast of Features and Semantics. (arXiv:2211.15671v1 [cs.LG])
    In recent years, the field of intelligent transportation systems (ITS) has achieved remarkable success, which is mainly due to the large amount of available annotation data. However, obtaining such annotated data is expensive in reality. Therefore, a more realistic strategy is to leverage semi-supervised learning (SSL) with a small amount of labeled data and a large amount of unlabeled data. Typically, semantic consistency regularization and two-stage learning methods that decouple feature extraction and classification have been proven effective. Nevertheless, representation learning limited to semantic consistency regularization alone may not guarantee the separation or discriminability of representations of samples with different semantics; due to the inherent limitations of two-stage learning methods, the extracted features may not match the specific downstream tasks. To deal with the above drawbacks, this paper proposes an end-to-end deep semi-supervised learning method with a double contrast of semantics and features, which extracts effective task-specific discriminative features by contrasting the semantics/features of positive and negative augmented sample pairs. Moreover, we leverage information theory to explain the rationality of the double contrast of semantics and features, and relax mutual information to a contrastive loss in a simpler form. Finally, the effectiveness of our method is verified on benchmark datasets.  ( 2 min )
    PyTorch Adapt. (arXiv:2211.15673v1 [cs.LG])
    PyTorch Adapt is a library for domain adaptation, a type of machine learning algorithm that re-purposes existing models to work in new domains. It is a fully-featured toolkit, allowing users to create a complete train/test pipeline in a few lines of code. It is also modular, so users can import just the parts they need, and not worry about being locked into a framework. One defining feature of this library is its customizability. In particular, complex training algorithms can be easily modified and combined, thanks to a system of composable, lazily-evaluated hooks. In this technical report, we explain in detail these features and the overall design of the library. Code is available at https://www.github.com/KevinMusgrave/pytorch-adapt  ( 2 min )
  • Open

    Triadic Temporal Exponential Random Graph Models (TTERGM). (arXiv:2211.16229v1 [cs.SI])
    Temporal exponential random graph models (TERGM) are powerful statistical models that can be used to infer the temporal pattern of edge formation and elimination in complex networks (e.g., social networks). TERGMs can also be used in a generative capacity to predict longitudinal time series data in these evolving graphs. However, parameter estimation within this framework fails to capture many real-world properties of social networks, including: triadic relationships, small world characteristics, and social learning theories which could be used to constrain the probabilistic estimation of dyadic covariates. Here, we propose triadic temporal exponential random graph models (TTERGM) to fill this void by including these hierarchical network relationships within the graph model. We represent social network learning theory as an additional probability distribution that optimizes Markov chains in the graph vector space. The new parameters are then approximated via Monte Carlo maximum likelihood estimation. We show that our TTERGM model achieves improved fidelity and more accurate predictions compared to several benchmark methods on GitHub network data.
    Fully Stochastic Trust-Region Sequential Quadratic Programming for Equality-Constrained Optimization Problems. (arXiv:2211.15943v1 [math.OC])
    We propose a trust-region stochastic sequential quadratic programming algorithm (TR-StoSQP) to solve nonlinear optimization problems with stochastic objectives and deterministic equality constraints. We consider a fully stochastic setting, where in each iteration a single sample is generated to estimate the objective gradient. The algorithm adaptively selects the trust-region radius and, compared to the existing line-search StoSQP schemes, allows us to employ indefinite Hessian matrices (i.e., Hessians without modification) in SQP subproblems. As a trust-region method for constrained optimization, our algorithm needs to address an infeasibility issue -- the linearized equality constraints and trust-region constraints might lead to infeasible SQP subproblems. In this regard, we propose an \textit{adaptive relaxation technique} to compute the trial step that consists of a normal step and a tangential step. To control the lengths of the two steps, we adaptively decompose the trust-region radius into two segments based on the proportions of the feasibility and optimality residuals to the full KKT residual. The normal step has a closed form, while the tangential step is solved from a trust-region subproblem, to which a solution ensuring the Cauchy reduction is sufficient for our study. We establish the global almost sure convergence guarantee for TR-StoSQP, and illustrate its empirical performance on both a subset of problems in the CUTEst test set and constrained logistic regression problems using data from the LIBSVM collection.
    UQ-ARMED: Uncertainty quantification of adversarially-regularized mixed effects deep learning for clustered non-iid data. (arXiv:2211.15888v1 [stat.ML])
    This work demonstrates the ability to produce readily interpretable statistical metrics for model fit, fixed effects covariance coefficients, and prediction confidence. Importantly, this work compares four suitable and commonly applied epistemic UQ approaches (BNN, SWAG, MC dropout, and ensembles) in their ability to calculate these statistical metrics for the ARMED MEDL models. In our experiment on AD prognosis, not only do the UQ methods provide these benefits, but several UQ methods maintain the high performance of the original ARMED method, and some even provide a modest (but not statistically significant) performance improvement. The ensemble models, especially the ensemble method with 90% subsampling, performed well across all metrics we tested: (1) high performance comparable to the non-UQ ARMED model, (2) proper deweighting of the confound probes, assigning them statistically insignificant p-values, and (3) relatively high calibration of the output prediction confidence. Based on the results, the ensemble approaches, especially with 90% subsampling, provided the best all-round performance for prediction and uncertainty estimation, and achieved our goals of providing statistical significance for model fit and covariate coefficients as well as confidence in prediction, while maintaining the baseline performance of MEDL using ARMED.
    Tensor Kernel Recovery for Spatio-Temporal Hawkes Processes. (arXiv:2011.12151v3 [stat.ML] UPDATED)
    We estimate the general influence functions for spatio-temporal Hawkes processes using a tensor recovery approach by formulating the location dependent influence function that captures the influence of historical events as a tensor kernel. We assume a low-rank structure for the tensor kernel and cast the estimation problem as a convex optimization problem using the Fourier transformed nuclear norm (TNN). We provide theoretical performance guarantees for our approach and present an algorithm to solve the optimization problem. Moreover, we demonstrate the efficiency of our estimation with numerical simulations.
    TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second. (arXiv:2207.01848v4 [cs.LG] UPDATED)
    We present TabPFN, a trained Transformer that can do supervised classification for small tabular datasets in less than a second, needs no hyperparameter tuning and is competitive with state-of-the-art classification methods. TabPFN is fully entailed in the weights of our network, which accepts training and test samples as a set-valued input and yields predictions for the entire test set in a single forward pass. TabPFN is a Prior-Data Fitted Network (PFN) and is trained offline once, to approximate Bayesian inference on synthetic datasets drawn from our prior. This prior incorporates ideas from causal reasoning: It entails a large space of structural causal models with a preference for simple structures. On the 18 datasets in the OpenML-CC18 suite that contain up to 1,000 training data points, up to 100 purely numerical features without missing values, and up to 10 classes, we show that our method clearly outperforms boosted trees and performs on par with complex state-of-the-art AutoML systems with up to 70$\times$ speedup. This increases to a 3200$\times$ speedup when a GPU is available. We also validate these results on an additional 67 small numerical datasets from OpenML. We provide all our code, the trained TabPFN, an interactive browser demo and a Colab notebook at https://github.com/automl/TabPFN.
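    The released package exposes a scikit-learn-style interface. A minimal usage sketch follows; the exact constructor arguments may differ between versions of the package.

    ```python
    # Minimal usage sketch for the released TabPFN package
    # (pip install tabpfn); constructor arguments may vary by version.
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from tabpfn import TabPFNClassifier

    X, y = load_breast_cancer(return_X_y=True)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    clf = TabPFNClassifier(device="cpu")  # single forward pass at predict time
    clf.fit(X_tr, y_tr)                   # "fitting" just stores the data
    print(clf.predict(X_te)[:10])
    ```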
    Asymptotic consistency of the WSINDy algorithm in the limit of continuum data. (arXiv:2211.16000v1 [math.NA])
    In this work we study the asymptotic consistency of the weak-form sparse identification of nonlinear dynamics algorithm (WSINDy) in the identification of differential equations from noisy samples of solutions. We prove that the WSINDy estimator is unconditionally asymptotically consistent for a wide class of models which includes the Navier-Stokes equations and the Kuramoto-Sivashinsky equation. We thus provide a mathematically rigorous explanation for the observed robustness to noise of weak-form equation learning. Conversely, we also show that in general the WSINDy estimator is only conditionally asymptotically consistent, yielding discovery of spurious terms with probability one if the noise level is above some critical threshold and the nonlinearities exhibit sufficiently fast growth. We derive explicit bounds on the critical noise threshold in the case of Gaussian white noise and provide an explicit characterization of these spurious terms in the case of trigonometric and/or polynomial model nonlinearities. However, a silver lining to this negative result is that if the data is suitably denoised (a simple moving average filter is sufficient), then we recover unconditional asymptotic consistency on the class of models with locally-Lipschitz nonlinearities. Altogether, our results reveal several important aspects of weak-form equation learning which may be used to improve future algorithms. We demonstrate our results numerically using the Lorenz system, the cubic oscillator, a viscous Burgers growth model, and a Kuramoto-Sivashinsky-type higher-order PDE.
    Flow Annealed Importance Sampling Bootstrap. (arXiv:2208.01893v2 [cs.LG] UPDATED)
    Normalizing flows are tractable density models that can approximate complicated target distributions, e.g. Boltzmann distributions of physical systems. However, current methods for training flows either suffer from mode-seeking behavior, use samples from the target generated beforehand by expensive MCMC simulations, or use stochastic losses that have high variance. To avoid these problems, we augment flows with annealed importance sampling (AIS) and minimize the mass-covering $\alpha$-divergence with $\alpha=2$, which minimizes importance weight variance. Our method, Flow AIS Bootstrap (FAB), uses AIS to generate samples in regions where the flow is a poor approximation of the target, facilitating the discovery of new modes. We apply FAB to complex multimodal targets and show that we can approximate them very accurately where previous methods fail. To the best of our knowledge, we are the first to learn the Boltzmann distribution of the alanine dipeptide molecule using only the unnormalized target density, without access to samples generated via Molecular Dynamics (MD) simulations: FAB produces better results than training via maximum likelihood on MD samples while using 100 times fewer target evaluations. After reweighting samples with importance weights, we obtain unbiased histograms of dihedral angles that are almost identical to the ground truth.
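    The $\alpha=2$ objective itself is straightforward to estimate from samples and importance weights. Below is a small numerical sketch of that estimator; note that in FAB the samples come from AIS rather than directly from the flow, so this illustrates only the loss, not the full method.

    ```python
    # Sketch of the alpha = 2 divergence surrogate: with samples x ~ q and
    # importance weights w = p(x)/q(x), estimate E_q[w^2] stably in log
    # space via log-sum-exp.
    import numpy as np
    from scipy.special import logsumexp

    def alpha2_loss(log_p, log_q):
        # log E_q[(p/q)^2] estimated from samples drawn from q
        log_w = log_p - log_q
        return logsumexp(2.0 * log_w) - np.log(len(log_w))

    rng = np.random.default_rng(0)
    x = rng.normal(0, 1.5, 10_000)                 # proposal q = N(0, 1.5^2)
    log_q = -0.5 * (x / 1.5) ** 2 - np.log(1.5 * np.sqrt(2 * np.pi))
    log_p = -0.5 * x**2 - 0.5 * np.log(2 * np.pi)  # target p = N(0, 1)
    print(alpha2_loss(log_p, log_q))
    ```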
    A survey on multi-player bandits. (arXiv:2211.16275v1 [stat.ML])
    Due mostly to its application to cognitive radio networks, the multiplayer bandits problem has gained a lot of interest in the last decade. Considerable progress has been made on its theoretical side. However, the current algorithms are far from being practically applicable, and many obstacles remain between these theoretical results and a possible implementation of multiplayer bandits algorithms in real cognitive radio networks. This survey contextualizes and organizes the rich multiplayer bandits literature. In light of the existing works, some clear directions for future research emerge. We believe that further study of these directions might lead to theoretical algorithms adapted to real-world situations.
    AutoML Two-Sample Test. (arXiv:2206.08843v2 [cs.LG] UPDATED)
    Two-sample tests are important in statistics and machine learning, both as tools for scientific discovery as well as to detect distribution shifts. This led to the development of many sophisticated test procedures going beyond the standard supervised learning frameworks, whose usage can require specialized knowledge about two-sample testing. We use a simple test that takes the mean discrepancy of a witness function as the test statistic and prove that minimizing a squared loss leads to a witness with optimal testing power. This allows us to leverage recent advancements in AutoML. Without any user input about the problems at hand, and using the same method for all our experiments, our AutoML two-sample test achieves competitive performance on a diverse distribution shift benchmark as well as on challenging two-sample testing problems. We provide an implementation of the AutoML two-sample test in the Python package autotst.
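    The core statistic is simple enough to sketch from scratch. The toy below illustrates the witness idea only (it is not the autotst package API): fit a regressor to $\pm 1$ labels by squared loss on one split, then compare its mean outputs on a held-out split with a permutation test.

    ```python
    # From-scratch sketch of a witness-function two-sample test.
    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor

    rng = np.random.default_rng(0)
    X = rng.normal(0.0, 1.0, size=(500, 2))
    Y = rng.normal(0.3, 1.0, size=(500, 2))  # shifted mean

    def witness_test(X, Y, n_perm=200):
        half = len(X) // 2
        train = np.vstack([X[:half], Y[:half]])
        labels = np.r_[np.ones(half), -np.ones(half)]
        w = GradientBoostingRegressor().fit(train, labels)  # squared loss
        wx, wy = w.predict(X[half:]), w.predict(Y[half:])
        stat = wx.mean() - wy.mean()        # mean discrepancy of the witness
        pooled = np.r_[wx, wy]
        perms = []
        for _ in range(n_perm):             # permutation null distribution
            rng.shuffle(pooled)
            perms.append(pooled[: len(wx)].mean() - pooled[len(wx):].mean())
        return stat, np.mean(np.array(perms) >= stat)  # one-sided p-value

    print(witness_test(X, Y))
    ```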
    The Union of Manifolds Hypothesis. (arXiv:2207.02862v2 [stat.ML] UPDATED)
    Deep learning has had tremendous success at learning low-dimensional representations of high-dimensional data. This success would be impossible if there was no hidden low-dimensional structure in data of interest; this existence is posited by the manifold hypothesis, which states that the data lies on an unknown manifold of low intrinsic dimension. In this paper, we argue that this hypothesis does not properly capture the low-dimensional structure typically present in image data. Assuming that data lies on a single manifold implies intrinsic dimension is identical across the entire data space, and does not allow for subregions of this space to have a different number of factors of variation. To address this deficiency, we put forth the union of manifolds hypothesis, which states that data lies on a disjoint union of manifolds of varying intrinsic dimensions. We empirically verify this hypothesis on commonly-used image datasets, finding that indeed, observed data lies on a disconnected set and that intrinsic dimension is not constant. We also provide insights into the implications the union of manifolds hypothesis has for deep learning, both supervised and unsupervised, showing that designing models with an inductive bias for this structure improves performance across classification and generative modelling tasks.
    Rectified Pessimistic-Optimistic Learning for Stochastic Continuum-armed Bandit with Constraints. (arXiv:2211.14720v2 [cs.LG] CROSS LISTED)
    This paper studies the problem of stochastic continuum-armed bandit with constraints (SCBwC), where we optimize a black-box reward function $f(x)$ subject to a black-box constraint function $g(x)\leq 0$ over a continuous space $\mathcal X$. We model reward and constraint functions via Gaussian processes (GPs) and propose a Rectified Pessimistic-Optimistic Learning framework (RPOL), a penalty-based method incorporating optimistic and pessimistic GP bandit learning for reward and constraint functions, respectively. We consider the metric of cumulative constraint violation $\sum_{t=1}^T(g(x_t))^{+},$ which is strictly stronger than the traditional long-term constraint violation $\sum_{t=1}^Tg(x_t).$ The rectified design for the penalty update and the pessimistic learning for the constraint function in RPOL guarantee that the cumulative constraint violation is minimal. RPOL can achieve sublinear regret and cumulative constraint violation for SCBwC and its variants (e.g., under delayed feedback and in non-stationary environments). These theoretical results match their unconstrained counterparts. Our experiments demonstrate that RPOL outperforms several existing baseline algorithms.
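    The difference between the two violation metrics is worth a tiny numerical example: under the long-term sum, strictly feasible rounds can cancel out violations, while the rectified metric cannot be gamed this way.

    ```python
    # Sketch of the two violation metrics: the rectified cumulative violation
    # counts only positive parts, so infeasible rounds cannot be cancelled by
    # strictly feasible ones (g(x) < 0), unlike the long-term sum.
    import numpy as np

    g_vals = np.array([0.5, -0.8, 0.3, -0.1])   # g(x_t) over four rounds
    long_term = g_vals.sum()                    # -0.1: looks "feasible"
    rectified = np.maximum(g_vals, 0.0).sum()   # 0.8: violations not hidden
    print(long_term, rectified)
    ```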
    Accelerated Nonnegative Tensor Completion via Integer Programming. (arXiv:2211.15770v1 [cs.LG])
    The problem of tensor completion has applications in healthcare, computer vision, and other domains. However, past approaches to tensor completion have faced a tension in that they either have polynomial-time computation but require exponentially more samples than the information-theoretic rate, or they use fewer samples but require solving NP-hard problems for which there are no known practical algorithms. A recent approach, based on integer programming, resolves this tension for nonnegative tensor completion. It achieves the information-theoretic sample complexity rate and deploys the Blended Conditional Gradients algorithm, which requires a linear (in numerical tolerance) number of oracle steps to converge to the global optimum. The tradeoff in this approach is that, in the worst case, the oracle step requires solving an integer linear program. Despite this theoretical limitation, numerical experiments show that this algorithm can, on certain instances, scale up to 100 million entries while running on a personal computer. The goal of this paper is to further enhance this algorithm, with the intention to expand both the breadth and scale of instances that can be solved. We explore several variants that can maintain the same theoretical guarantees as the algorithm, but offer potentially faster computation. We consider different data structures, acceleration of gradient descent steps, and the use of the Blended Pairwise Conditional Gradients algorithm. We describe the original approach and these variants, and conduct numerical experiments in order to explore various tradeoffs in these algorithmic design choices.
    Incorporating Sum Constraints into Multitask Gaussian Processes. (arXiv:2202.01793v2 [stat.ML] UPDATED)
    Machine learning models can be improved by adapting them to respect existing background knowledge. In this paper we consider multitask Gaussian processes, with background knowledge in the form of constraints that require a specific sum of the outputs to be constant. This is achieved by conditioning the prior distribution on the constraint fulfillment. The approach allows for both linear and nonlinear constraints. We demonstrate that the constraints are fulfilled with high precision and that the construction can improve the overall prediction accuracy as compared to the standard Gaussian process.
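    For the linear case, the construction reduces to conditioning a joint Gaussian prior on a linear equality constraint. A sketch under that assumption (finite index set, exact constraint):

    ```python
    # Sketch of the linear case: condition a joint Gaussian prior f ~ N(m, K)
    # over stacked task outputs on A f = c (e.g., A sums the task values at
    # each input so that the outputs add to a constant).
    import numpy as np

    def condition_on_constraint(m, K, A, c, jitter=1e-9):
        S = A @ K @ A.T + jitter * np.eye(A.shape[0])
        gain = K @ A.T @ np.linalg.inv(S)
        m_post = m + gain @ (c - A @ m)
        K_post = K - gain @ A @ K
        return m_post, K_post

    # Two tasks at one input, constrained to sum to 1.
    m = np.zeros(2)
    K = np.array([[1.0, 0.2], [0.2, 1.0]])
    A = np.ones((1, 2))
    m_post, K_post = condition_on_constraint(m, K, A, np.array([1.0]))
    print(m_post, A @ m_post)  # posterior mean satisfies the sum constraint
    ```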
    Sparse random hypergraphs: Non-backtracking spectra and community detection. (arXiv:2203.07346v3 [math.PR] UPDATED)
    We consider the community detection problem in a sparse $q$-uniform hypergraph $G$, assuming that $G$ is generated according to the Hypergraph Stochastic Block Model (HSBM). We prove that a spectral method based on the non-backtracking operator for hypergraphs works with high probability down to the generalized Kesten-Stigum detection threshold conjectured by Angelini et al. (2015). We characterize the spectrum of the non-backtracking operator for the sparse HSBM and provide an efficient dimension reduction procedure using the Ihara-Bass formula for hypergraphs. As a result, community detection for the sparse HSBM on $n$ vertices can be reduced to an eigenvector problem of a $2n\times 2n$ non-normal matrix constructed from the adjacency matrix and the degree matrix of the hypergraph. To the best of our knowledge, this is the first provable and efficient spectral algorithm that achieves the conjectured threshold for HSBMs with $r$ blocks generated according to a general symmetric probability tensor.
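    For ordinary graphs ($q=2$), the analogous Ihara-Bass reduction is classical: the non-backtracking eigenvalues other than $\pm 1$ are eigenvalues of a $2n\times 2n$ block matrix built from the adjacency and degree matrices. A sketch of that graph case follows; the paper's hypergraph construction generalizes it.

    ```python
    # Sketch of the graph (q = 2) analogue: by the Ihara-Bass formula, the
    # non-backtracking spectrum reduces to the eigenvalues of the 2n x 2n
    # block matrix [[A, I - D], [I, 0]].
    import numpy as np

    A = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)  # triangle
    D = np.diag(A.sum(axis=1))
    n = A.shape[0]
    B_prime = np.block([[A, np.eye(n) - D], [np.eye(n), np.zeros((n, n))]])
    print(np.linalg.eigvals(B_prime))
    ```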
    Minimax AUC Fairness: Efficient Algorithm with Provable Convergence. (arXiv:2208.10451v2 [cs.LG] UPDATED)
    The use of machine learning models in consequential decision making often exacerbates societal inequity, in particular yielding disparate impact on members of marginalized groups defined by race and gender. The area under the ROC curve (AUC) is widely used to evaluate the performance of a scoring function in machine learning, but is studied in algorithmic fairness less than other performance metrics. Due to the pairwise nature of the AUC, defining an AUC-based group fairness metric is pairwise-dependent and may involve both \emph{intra-group} and \emph{inter-group} AUCs. Importantly, considering only one category of AUCs is not sufficient to mitigate unfairness in AUC optimization. In this paper, we propose a minimax learning and bias mitigation framework that incorporates both intra-group and inter-group AUCs while maintaining utility. Based on this Rawlsian framework, we design an efficient stochastic optimization algorithm and prove its convergence to the minimum group-level AUC. We conduct numerical experiments on both synthetic and real-world datasets to validate the effectiveness of the minimax framework and the proposed optimization algorithm.
    Double Robust Bayesian Inference on Average Treatment Effects. (arXiv:2211.16298v1 [econ.EM])
    We study a double robust Bayesian inference procedure on the average treatment effect (ATE) under unconfoundedness. Our Bayesian approach involves a correction term for prior distributions adjusted by the propensity score. We prove asymptotic equivalence of our Bayesian estimator and efficient frequentist estimators by establishing a new semiparametric Bernstein-von Mises theorem under double robustness; i.e., the lack of smoothness of conditional mean functions can be compensated by high regularity of the propensity score and vice versa. Consequently, the resulting Bayesian point estimator internalizes the bias correction as the frequentist-type doubly robust estimator, and the Bayesian credible sets form confidence intervals with asymptotically exact coverage probability. In simulations, we find that this corrected Bayesian procedure leads to significant bias reduction of point estimation and accurate coverage of confidence intervals, especially when the dimensionality of covariates is large relative to the sample size and the underlying functions become complex. We illustrate our method in an application to the National Supported Work Demonstration.
    Parametric machines: a fresh approach to architecture search. (arXiv:2007.02777v3 [cs.LG] UPDATED)
    Using tools from topology and functional analysis, we provide a framework where artificial neural networks, and their architectures, can be formally described. We define the notion of machine in a general topological context and show how simple machines can be combined into more complex ones. We explore finite- and infinite-depth machines, which generalize neural networks and neural ordinary differential equations. Borrowing ideas from functional analysis and kernel methods, we build complete, normed, infinite-dimensional spaces of machines, and we discuss how to find optimal architectures and parameters -- within those spaces -- to solve a given computational problem. In our numerical experiments, these kernel-inspired networks can outperform classical neural networks when the training dataset is small.
    On Learning Fairness and Accuracy on Multiple Subgroups. (arXiv:2210.10837v2 [stat.ML] UPDATED)
    We propose an analysis in fair learning that preserves the utility of the data while reducing prediction disparities under the criterion of group sufficiency. We focus on the scenario where the data contain multiple or even many subgroups, each with a limited number of samples. As a result, we present a principled method for learning a fair predictor for all subgroups by formulating it as a bilevel objective. Specifically, the subgroup-specific predictors are learned in the lower level through a small amount of data and the fair predictor. In the upper level, the fair predictor is updated to be close to all subgroup-specific predictors. We further prove that such a bilevel objective can effectively control the group sufficiency and the generalization error. We evaluate the proposed framework on real-world datasets. Empirical evidence suggests consistently improved fair predictions, as well as accuracy comparable to the baselines.
    A Revenue Function for Comparison-Based Hierarchical Clustering. (arXiv:2211.16459v1 [cs.LG])
    Comparison-based learning addresses the problem of learning when, instead of explicit features or pairwise similarities, one only has access to comparisons of the form: \emph{Object $A$ is more similar to $B$ than to $C$.} Recently, it has been shown that, in Hierarchical Clustering, single and complete linkage can be directly implemented using only such comparisons while several algorithms have been proposed to emulate the behaviour of average linkage. Hence, finding hierarchies (or dendrograms) using only comparisons is a well understood problem. However, evaluating their meaningfulness when no ground-truth nor explicit similarities are available remains an open question. In this paper, we bridge this gap by proposing a new revenue function that allows one to measure the goodness of dendrograms using only comparisons. We show that this function is closely related to Dasgupta's cost for hierarchical clustering that uses pairwise similarities. On the theoretical side, we use the proposed revenue function to resolve the open problem of whether one can approximately recover a latent hierarchy using few triplet comparisons. On the practical side, we present principled algorithms for comparison-based hierarchical clustering based on the maximisation of the revenue and we empirically compare them with existing methods.
    Outlier-Robust Sparse Mean Estimation for Heavy-Tailed Distributions. (arXiv:2211.16333v1 [cs.DS])
    We study the fundamental task of outlier-robust mean estimation for heavy-tailed distributions in the presence of sparsity. Specifically, given a small number of corrupted samples from a high-dimensional heavy-tailed distribution whose mean $\mu$ is guaranteed to be sparse, the goal is to efficiently compute a hypothesis that accurately approximates $\mu$ with high probability. Prior work had obtained efficient algorithms for robust sparse mean estimation of light-tailed distributions. In this work, we give the first sample-efficient and polynomial-time robust sparse mean estimator for heavy-tailed distributions under mild moment assumptions. Our algorithm achieves the optimal asymptotic error using a number of samples scaling logarithmically with the ambient dimension. Importantly, the sample complexity of our method is optimal as a function of the failure probability $\tau$, having an additive $\log(1/\tau)$ dependence. Our algorithm leverages the stability-based approach from the algorithmic robust statistics literature, with crucial (and necessary) adaptations required in our setting. Our analysis may be of independent interest, involving the delicate design of a (non-spectral) decomposition for positive semi-definite matrices satisfying certain sparsity properties.
    Revisiting Over-smoothing and Over-squashing using Ollivier's Ricci Curvature. (arXiv:2211.15779v1 [cs.LG])
    Graph Neural Networks (GNNs) have been demonstrated to be inherently susceptible to the problems of over-smoothing and over-squashing. These issues prohibit the ability of GNNs to model complex graph interactions by limiting their effectiveness in taking distant information into account. Our study reveals the key connection between the local graph geometry and the occurrence of both of these issues, thereby providing a unified framework for studying them at a local scale using Ollivier's Ricci curvature. Based on our theory, a number of principled methods are proposed to alleviate the over-smoothing and over-squashing issues.  ( 2 min )
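    Ollivier's Ricci curvature of an edge compares lazy random-walk measures at its endpoints via optimal transport: $\kappa(x,y) = 1 - W_1(m_x, m_y)/d(x,y)$. A small sketch, assuming networkx and the POT library are available:

    ```python
    # Sketch of Ollivier-Ricci curvature on an unweighted graph, using POT
    # (pip install POT) to compute the exact 1-Wasserstein distance.
    import networkx as nx
    import numpy as np
    import ot  # Python Optimal Transport

    def lazy_measure(G, v, alpha=0.5):
        # Mass alpha stays at v; 1 - alpha is spread uniformly over neighbors.
        nbrs = list(G.neighbors(v))
        support = [v] + nbrs
        mass = [alpha] + [(1 - alpha) / len(nbrs)] * len(nbrs)
        return support, np.array(mass)

    def ollivier_ricci(G, x, y, alpha=0.5):
        sx, mx = lazy_measure(G, x, alpha)
        sy, my = lazy_measure(G, y, alpha)
        dist = dict(nx.all_pairs_shortest_path_length(G))
        M = np.array([[dist[u][v] for v in sy] for u in sx], dtype=float)
        w1 = ot.emd2(mx, my, M)  # exact W1 cost between the two measures
        return 1.0 - w1 / dist[x][y]

    G = nx.karate_club_graph()
    print(ollivier_ricci(G, 0, 1))
    ```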
    PAC-Bayes Bounds for Bandit Problems: A Survey and Experimental Comparison. (arXiv:2211.16110v1 [cs.LG])
    PAC-Bayes has recently re-emerged as an effective theory with which one can derive principled learning algorithms with tight performance guarantees. However, applications of PAC-Bayes to bandit problems are relatively rare, which is a great misfortune. Many decision-making problems in healthcare, finance and natural sciences can be modelled as bandit problems. In many of these applications, principled algorithms with strong performance guarantees would be very much appreciated. This survey provides an overview of PAC-Bayes performance bounds for bandit problems and an experimental comparison of these bounds. Our experimental comparison has revealed that available PAC-Bayes upper bounds on the cumulative regret are loose, whereas available PAC-Bayes lower bounds on the expected reward can be surprisingly tight. We found that an offline contextual bandit algorithm that learns a policy by optimising a PAC-Bayes bound was able to learn randomised neural network policies with competitive expected reward and non-vacuous performance guarantees.  ( 2 min )
    DIGRAC: Digraph Clustering Based on Flow Imbalance. (arXiv:2106.05194v8 [stat.ML] UPDATED)
    Node clustering is a powerful tool in the analysis of networks. We introduce a graph neural network framework, named DIGRAC, to obtain node embeddings for directed networks in a self-supervised manner, including a novel probabilistic imbalance loss, which can be used for network clustering. Here, we propose \textit{directed flow imbalance} measures, which are tightly related to directionality, to reveal clusters in the network even when there is no density difference between clusters. In contrast to standard approaches in the literature, in this paper, directionality is not treated as a nuisance, but rather contains the main signal. DIGRAC optimizes directed flow imbalance for clustering without requiring label supervision, unlike existing graph neural network methods, and can naturally incorporate node features, unlike existing spectral methods. Extensive experimental results on synthetic data, in the form of directed stochastic block models, and real-world data at different scales, demonstrate that our method, based on flow imbalance, attains state-of-the-art results on directed graph clustering when compared against 10 state-of-the-art methods from the literature, for a wide range of noise and sparsity levels, graph structures, and topologies, and even outperforms supervised methods.  ( 2 min )
    Sketch-and-solve approaches to k-means clustering by semidefinite programming. (arXiv:2211.15744v1 [cs.LG])
    We introduce a sketch-and-solve approach to speed up the Peng-Wei semidefinite relaxation of k-means clustering. When the data is appropriately separated we identify the k-means optimal clustering. Otherwise, our approach provides a high-confidence lower bound on the optimal k-means value. This lower bound is data-driven; it does not make any assumption on the data nor how it is generated. We provide code and an extensive set of numerical experiments where we use this approach to certify approximate optimality of clustering solutions obtained by k-means++.  ( 2 min )
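    The Peng-Wei relaxation itself fits in a few lines of CVXPY. A sketch on toy data (without the sketching step that is the paper's contribution):

    ```python
    # Sketch of the Peng-Wei SDP relaxation of k-means: minimize <D, Z> over
    # PSD, entrywise-nonnegative Z with Z @ 1 = 1 and tr(Z) = k, where D
    # holds squared pairwise distances. The optimal value lower-bounds the
    # k-means objective.
    import cvxpy as cp
    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 0.2, (10, 2)), rng.normal(3, 0.2, (10, 2))])
    D = squareform(pdist(X)) ** 2
    n, k = len(X), 2

    Z = cp.Variable((n, n), PSD=True)
    constraints = [Z >= 0, Z @ np.ones(n) == 1, cp.trace(Z) == k]
    prob = cp.Problem(cp.Minimize(cp.sum(cp.multiply(D, Z))), constraints)
    prob.solve()
    print(prob.value)  # lower bound on the optimal k-means value
    ```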
    Understanding the Impact of Adversarial Robustness on Accuracy Disparity. (arXiv:2211.15762v1 [cs.LG])
    While it has long been empirically observed that adversarial robustness may be at odds with standard accuracy and may have further disparate impacts on different classes, it remains an open question to what extent such observations hold and how the class imbalance plays a role within. In this paper, we attempt to understand this question of accuracy disparity by taking a closer look at linear classifiers under a Gaussian mixture model. We decompose the impact of adversarial robustness into two parts: an inherent effect that will degrade the standard accuracy on all classes, and the other caused by the class imbalance ratio, which will increase the accuracy disparity compared to standard training. Furthermore, we also extend our model to the general family of stable distributions. We demonstrate that while the constraint of adversarial robustness consistently degrades the standard accuracy in the balanced class setting, the class imbalance ratio plays a fundamentally different role in accuracy disparity compared to the Gaussian case, due to the heavy tail of the stable distribution. We additionally perform experiments on both synthetic and real-world datasets. The empirical results not only corroborate our theoretical findings, but also suggest that the implications may extend to nonlinear models over real-world datasets.  ( 2 min )
    Kernel Autocovariance Operators of Stationary Processes: Estimation and Convergence. (arXiv:2004.00891v2 [math.PR] UPDATED)
    We consider autocovariance operators of a stationary stochastic process on a Polish space that is embedded into a reproducing kernel Hilbert space. We investigate how empirical estimates of these operators converge along realizations of the process under various conditions. In particular, we examine ergodic and strongly mixing processes and obtain several asymptotic results as well as finite sample error bounds. We provide applications of our theory in terms of consistency results for kernel PCA with dependent data and the conditional mean embedding of transition probabilities. Finally, we use our approach to examine the nonparametric estimation of Markov transition operators and highlight how our theory can give a consistency analysis for a large family of spectral analysis methods including kernel-based dynamic mode decomposition.  ( 2 min )
    Proximal boosting: aggregating weak learners to minimize non-differentiable losses. (arXiv:1808.09670v4 [cs.LG] UPDATED)
    Gradient boosting is a prediction method that iteratively combines weak learners to produce a complex and accurate model. From an optimization point of view, the learning procedure of gradient boosting mimics a gradient descent on a functional variable. This paper proposes to build upon the proximal point algorithm, when the empirical risk to minimize is not differentiable, in order to introduce a novel boosting approach, called proximal boosting. It comes with a companion algorithm inspired by [1] and called residual proximal boosting, which is aimed at better controlling the approximation error. Theoretical convergence is proved for these two procedures under different hypotheses on the empirical risk and advantages of leveraging proximal methods for boosting are illustrated by numerical experiments on simulated and real-world data. In particular, we exhibit a favorable comparison over gradient boosting regarding convergence rate and prediction accuracy.  ( 2 min )
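    For intuition, here is a simplified sketch of a proximal-boosting-style round for the non-differentiable absolute loss, where the proximal step has a closed form (soft-thresholding of the residual). This is an illustration of the idea, not the paper's exact algorithm.

    ```python
    # Sketch: instead of fitting a weak learner to a gradient, fit it to the
    # proximal-point displacement prox_{eta*L}(F) - F, which for the absolute
    # loss is sign(y - F) * min(|y - F|, eta).
    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, (300, 1))
    y = np.sin(X[:, 0]) + rng.laplace(0, 0.3, 300)

    F = np.zeros_like(y)                 # current ensemble prediction
    eta, lr = 0.5, 0.5
    for _ in range(100):
        r = y - F
        prox_target = np.sign(r) * np.minimum(np.abs(r), eta)  # prox step
        tree = DecisionTreeRegressor(max_depth=2).fit(X, prox_target)
        F += lr * tree.predict(X)

    print(np.mean(np.abs(y - F)))        # training absolute error
    ```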
    Linear Causal Disentanglement via Interventions. (arXiv:2211.16467v1 [stat.ML])
    Causal disentanglement seeks a representation of data involving latent variables that relate to one another via a causal model. A representation is identifiable if both the latent model and the transformation from latent to observed variables are unique. In this paper, we study observed variables that are a linear transformation of a linear latent causal model. Data from interventions are necessary for identifiability: if one latent variable is missing an intervention, we show that there exist distinct models that cannot be distinguished. Conversely, we show that a single intervention on each latent variable is sufficient for identifiability. Our proof uses a generalization of the RQ decomposition of a matrix that replaces the usual orthogonal and upper triangular conditions with analogues depending on a partial order on the rows of the matrix, with partial order determined by a latent causal model. We corroborate our theoretical results with a method for causal disentanglement that accurately recovers a latent causal model.  ( 2 min )
    Bayesian Semiparametric Model for Sequential Treatment Decisions with Informative Timing. (arXiv:2211.16393v1 [stat.ME])
    We develop a Bayesian semi-parametric model for estimating the impact of dynamic treatment rules on survival among patients diagnosed with pediatric acute myeloid leukemia (AML). The data consist of a subset of patients enrolled in the phase III AAML1031 clinical trial in which patients move through a sequence of four treatment courses. At each course, they undergo treatment that may or may not include anthracyclines (ACT). While ACT is known to be effective at treating AML, it is also cardiotoxic and can lead to early death for some patients. Our task is to estimate the potential survival probability under hypothetical dynamic ACT treatment strategies, but there are several impediments. First, since ACT was not randomized in the trial, its effect on survival is confounded over time. Second, subjects initiate the next course depending on when they recover from the previous course, making timing potentially informative of subsequent treatment and survival. Third, patients may die or drop out before ever completing the full treatment sequence. We develop a generative Bayesian semi-parametric model based on Gamma Process priors to address these complexities. At each treatment course, the model captures subjects' transition to subsequent treatment or death in continuous time under a given rule. A g-computation procedure is used to compute a posterior over potential survival probability that is adjusted for time-varying confounding. Using this approach, we conduct posterior inference for the efficacy of hypothetical treatment rules that dynamically modify ACT based on evolving cardiac function.  ( 2 min )
    Characterizing the robustness of Bayesian adaptive experimental designs to active learning bias. (arXiv:2205.13698v2 [stat.ME] UPDATED)
    Bayesian adaptive experimental design is a form of active learning, which chooses samples to maximize the information they give about uncertain parameters. Prior work has shown that other forms of active learning can suffer from active learning bias, where unrepresentative sampling leads to inconsistent parameter estimates. We show that active learning bias can also afflict Bayesian adaptive experimental design, depending on model misspecification. We analyze the case of estimating a linear model, and show that worse misspecification implies more severe active learning bias. At the same time, model classes incorporating more "noise" - i.e., specifying higher inherent variance in observations - suffer less from active learning bias. Finally, we demonstrate empirically that insights from the linear model can predict the presence and degree of active learning bias in nonlinear contexts, namely in a (simulated) preference learning experiment.  ( 2 min )
    Estimating the minimizer and the minimum value of a regression function under passive design. (arXiv:2211.16457v1 [math.ST])
    We propose a new method for estimating the minimizer $\boldsymbol{x}^*$ and the minimum value $f^*$ of a smooth and strongly convex regression function $f$ from the observations contaminated by random noise. Our estimator $\boldsymbol{z}_n$ of the minimizer $\boldsymbol{x}^*$ is based on a version of the projected gradient descent with the gradient estimated by a regularized local polynomial algorithm. Next, we propose a two-stage procedure for estimation of the minimum value $f^*$ of regression function $f$. At the first stage, we construct an accurate enough estimator of $\boldsymbol{x}^*$, which can be, for example, $\boldsymbol{z}_n$. At the second stage, we estimate the function value at the point obtained in the first stage using a rate optimal nonparametric procedure. We derive non-asymptotic upper bounds for the quadratic risk and optimization error of $\boldsymbol{z}_n$, and for the risk of estimating $f^*$. We establish minimax lower bounds showing that, under certain choice of parameters, the proposed algorithms achieve the minimax optimal rates of convergence on the class of smooth and strongly convex functions.  ( 2 min )
    Diagnosing and Fixing Manifold Overfitting in Deep Generative Models. (arXiv:2204.07172v4 [stat.ML] UPDATED)
    Likelihood-based, or explicit, deep generative models use neural networks to construct flexible high-dimensional densities. This formulation directly contradicts the manifold hypothesis, which states that observed data lies on a low-dimensional manifold embedded in high-dimensional ambient space. In this paper we investigate the pathologies of maximum-likelihood training in the presence of this dimensionality mismatch. We formally prove that degenerate optima are achieved wherein the manifold itself is learned but not the distribution on it, a phenomenon we call manifold overfitting. We propose a class of two-step procedures consisting of a dimensionality reduction step followed by maximum-likelihood density estimation, and prove that they recover the data-generating distribution in the nonparametric regime, thus avoiding manifold overfitting. We also show that these procedures enable density estimation on the manifolds learned by implicit models, such as generative adversarial networks, hence addressing a major shortcoming of these models. Several recently proposed methods are instances of our two-step procedures; we thus unify, extend, and theoretically justify a large class of models.  ( 2 min )
    FakeEdge: Alleviate Dataset Shift in Link Prediction. (arXiv:2211.15899v1 [cs.LG])
    Link prediction is a crucial problem in graph-structured data. Due to the recent success of graph neural networks (GNNs), a variety of GNN-based models were proposed to tackle the link prediction task. Specifically, GNNs leverage the message passing paradigm to obtain node representation, which relies on link connectivity. However, in a link prediction task, links in the training set are always present while ones in the testing set are not yet formed, resulting in a discrepancy of the connectivity pattern and bias of the learned representation. It leads to a problem of dataset shift which degrades the model performance. In this paper, we first identify the dataset shift problem in the link prediction task and provide theoretical analyses on how existing link prediction methods are vulnerable to it. We then propose FakeEdge, a model-agnostic technique, to address the problem by mitigating the graph topological gap between training and testing sets. Extensive experiments demonstrate the applicability and superiority of FakeEdge on multiple datasets across various domains.  ( 2 min )
    Posterior Sampling for Continuing Environments. (arXiv:2211.15931v1 [cs.LG])
    We develop an extension of posterior sampling for reinforcement learning (PSRL) that is suited for a continuing agent-environment interface and integrates naturally into agent designs that scale to complex environments. The approach maintains a statistically plausible model of the environment and follows a policy that maximizes expected $\gamma$-discounted return in that model. At each time, with probability $1-\gamma$, the model is replaced by a sample from the posterior distribution over environments. For a suitable schedule of $\gamma$, we establish an $\tilde{O}(\tau S \sqrt{A T})$ bound on the Bayesian regret, where $S$ is the number of environment states, $A$ is the number of actions, and $\tau$ denotes the reward averaging time, which is a bound on the duration required to accurately estimate the average reward of any policy.  ( 2 min )
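    A tabular sketch of the continuing-PSRL loop, assuming known rewards and a Dirichlet posterior over transition probabilities:

    ```python
    # Sketch: maintain a Dirichlet posterior over transitions, act greedily
    # on the sampled model (via value iteration), and resample the model
    # with probability 1 - gamma at each step.
    import numpy as np

    S, A, gamma = 5, 2, 0.99
    rng = np.random.default_rng(0)
    counts = np.ones((S, A, S))              # Dirichlet(1,...,1) prior on P
    R = rng.uniform(size=(S, A))             # rewards assumed known (assumption)
    true_P = rng.dirichlet(np.ones(S), size=(S, A))

    def greedy_policy(P):                    # value iteration on a sample
        Q = np.zeros((S, A))
        for _ in range(500):
            Q = R + gamma * P @ Q.max(axis=1)
        return Q.argmax(axis=1)

    def sample_model():
        return np.array([[rng.dirichlet(counts[x, a]) for a in range(A)]
                         for x in range(S)])

    policy, s = greedy_policy(sample_model()), 0
    for t in range(10_000):
        a = policy[s]
        s_next = rng.choice(S, p=true_P[s, a])
        counts[s, a, s_next] += 1            # posterior update
        if rng.random() < 1 - gamma:         # resample with prob 1 - gamma
            policy = greedy_policy(sample_model())
        s = s_next
    ```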
    Bayesian Simultaneous Factorization and Prediction Using Multi-Omic Data. (arXiv:2211.16403v1 [stat.ME])
    Understanding of the pathophysiology of obstructive lung disease (OLD) is limited by available methods to examine the relationship between multi-omic molecular phenomena and clinical outcomes. Integrative factorization methods for multi-omic data can reveal latent patterns of variation describing important biological signal. However, most methods do not provide a framework for inference on the estimated factorization, simultaneously predict important disease phenotypes or clinical outcomes, nor accommodate multiple imputation. To address these gaps, we propose Bayesian Simultaneous Factorization (BSF). We use conjugate normal priors and show that the posterior mode of this model can be estimated by solving a structured nuclear norm-penalized objective that also achieves rank selection and motivates the choice of hyperparameters. We then extend BSF to simultaneously predict a continuous or binary response, termed Bayesian Simultaneous Factorization and Prediction (BSFP). BSF and BSFP accommodate concurrent imputation and full posterior inference for missing data, including "blockwise" missingness, and BSFP offers prediction of unobserved outcomes. We show via simulation that BSFP is competitive in recovering latent variation structure, as well as the importance of propagating uncertainty from the estimated factorization to prediction. We also study the imputation performance of BSF via simulation under missing-at-random and missing-not-at-random assumptions. Lastly, we use BSFP to predict lung function based on the bronchoalveolar lavage metabolome and proteome from a study of HIV-associated OLD. Our analysis reveals a distinct cluster of patients with OLD driven by shared metabolomic and proteomic expression patterns, as well as multi-omic patterns related to lung function decline. Software is freely available at https://github.com/sarahsamorodnitsky/BSFP .  ( 2 min )
    Will My Robot Achieve My Goals? Predicting the Probability that an MDP Policy Reaches a User-Specified Behavior Target. (arXiv:2211.16462v1 [cs.LG])
    As an autonomous system performs a task, it should maintain a calibrated estimate of the probability that it will achieve the user's goal. If that probability falls below some desired level, it should alert the user so that appropriate interventions can be made. This paper considers settings where the user's goal is specified as a target interval for a real-valued performance summary, such as the cumulative reward, measured at a fixed horizon $H$. At each time $t \in \{0, \ldots, H-1\}$, our method produces a calibrated estimate of the probability that the final cumulative reward will fall within a user-specified target interval $[y^-,y^+].$ Using this estimate, the autonomous system can raise an alarm if the probability drops below a specified threshold. We compute the probability estimates by inverting conformal prediction. Our starting point is the Conformalized Quantile Regression (CQR) method of Romano et al., which applies split-conformal prediction to the results of quantile regression. CQR is not invertible, but by using the conditional cumulative distribution function (CDF) as the non-conformity measure, we show how to obtain an invertible modification that we call \textbf{P}robability-space \textbf{C}onformalized \textbf{Q}uantile \textbf{R}egression (PCQR). Like CQR, PCQR produces well-calibrated conditional prediction intervals with finite-sample marginal guarantees. By inverting PCQR, we obtain marginal guarantees for the probability that the cumulative reward of an autonomous system will fall within an arbitrary user-specified target intervals. Experiments on two domains confirm that these probabilities are well-calibrated.  ( 2 min )
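    As background, the CQR starting point is compact enough to sketch: fit lower and upper quantile regressors, then widen the interval by a conformity-score quantile computed on a calibration split. The paper's PCQR inverts a CDF-based variant of this; the sketch below is plain CQR.

    ```python
    # Sketch of split-conformal CQR with gradient-boosted quantile regressors.
    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, (2000, 1))
    y = np.sin(X[:, 0]) + rng.normal(0, 0.3 + 0.2 * np.abs(X[:, 0]))

    X_tr, y_tr, X_cal, y_cal = X[:1000], y[:1000], X[1000:], y[1000:]
    alpha = 0.1
    lo = GradientBoostingRegressor(loss="quantile", alpha=alpha / 2).fit(X_tr, y_tr)
    hi = GradientBoostingRegressor(loss="quantile", alpha=1 - alpha / 2).fit(X_tr, y_tr)

    # Conformity scores: how far the calibration points fall outside the band.
    scores = np.maximum(lo.predict(X_cal) - y_cal, y_cal - hi.predict(X_cal))
    q = np.quantile(scores, np.ceil((1 - alpha) * (len(scores) + 1)) / len(scores))

    x0 = np.array([[0.5]])
    print(lo.predict(x0) - q, hi.predict(x0) + q)  # calibrated 90% interval
    ```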
    Malign Overfitting: Interpolation Can Provably Preclude Invariance. (arXiv:2211.15724v1 [cs.LG])
    Learned classifiers should often possess certain invariance properties meant to encourage fairness, robustness, or out-of-distribution generalization. However, multiple recent works empirically demonstrate that common invariance-inducing regularizers are ineffective in the over-parameterized regime, in which classifiers perfectly fit (i.e. interpolate) the training data. This suggests that the phenomenon of "benign overfitting," in which models generalize well despite interpolating, might not favorably extend to settings in which robustness or fairness are desirable. In this work we provide a theoretical justification for these observations. We prove that -- even in the simplest of settings -- any interpolating learning rule (with arbitrarily small margin) will not satisfy these invariance properties. We then propose and analyze an algorithm that -- in the same setting -- successfully learns a non-interpolating classifier that is provably invariant. We validate our theoretical observations on simulated data and the Waterbirds dataset.  ( 2 min )
    Linear Complexity Gibbs Sampling for Generalized Labeled Multi-Bernoulli Filtering. (arXiv:2211.16041v1 [stat.ML])
    Generalized Labeled Multi-Bernoulli (GLMB) densities arise in a host of multi-object system applications analogous to Gaussians in single-object filtering. However, computing the GLMB filtering density requires solving NP-hard problems. To alleviate this computational bottleneck, we develop a linear complexity Gibbs sampling framework for GLMB density computation. Specifically, we propose a tempered Gibbs sampler that exploits the structure of the GLMB filtering density to achieve an $\mathcal{O}(T(P+M))$ complexity, where $T$ is the number of iterations of the algorithm, and $P$ and $M$ are the numbers of hypothesized objects and measurements, respectively. This innovation enables an $\mathcal{O}(T(P+M+\log(T))+PM)$ complexity implementation of the GLMB filter. Convergence of the proposed Gibbs sampler is established and numerical studies are presented to validate the proposed GLMB filter implementation.  ( 2 min )
    On the Ability of Graph Neural Networks to Model Interactions Between Vertices. (arXiv:2211.16494v1 [cs.LG])
    Graph neural networks (GNNs) are widely used for modeling complex interactions between entities represented as vertices of a graph. Despite recent efforts to theoretically analyze the expressive power of GNNs, a formal characterization of their ability to model interactions is lacking. The current paper aims to address this gap. Formalizing strength of interactions through an established measure known as separation rank, we quantify the ability of certain GNNs to model interaction between a given subset of vertices and its complement, i.e. between sides of a given partition of input vertices. Our results reveal that the ability to model interaction is primarily determined by the partition's walk index -- a graph-theoretical characteristic that we define by the number of walks originating from the boundary of the partition. Experiments with common GNN architectures corroborate this finding. As a practical application of our theory, we design an edge sparsification algorithm named Walk Index Sparsification (WIS), which preserves the ability of a GNN to model interactions when input edges are removed. WIS is simple, computationally efficient, and markedly outperforms alternative methods in terms of induced prediction accuracy. More broadly, it showcases the potential of improving GNNs by theoretically analyzing the interactions they can model.  ( 2 min )
    Optimal variance-reduced stochastic approximation in Banach spaces. (arXiv:2201.08518v2 [math.ST] UPDATED)
    We study the problem of estimating the fixed point of a contractive operator defined on a separable Banach space. Focusing on a stochastic query model that provides noisy evaluations of the operator, we analyze a variance-reduced stochastic approximation scheme, and establish non-asymptotic bounds for both the operator defect and the estimation error, measured in an arbitrary semi-norm. In contrast to worst-case guarantees, our bounds are instance-dependent, and achieve the local asymptotic minimax risk non-asymptotically. For linear operators, contractivity can be relaxed to multi-step contractivity, so that the theory can be applied to problems like average reward policy evaluation problem in reinforcement learning. We illustrate the theory via applications to stochastic shortest path problems, two-player zero-sum Markov games, as well as policy evaluation and $Q$-learning for tabular Markov decision processes.  ( 2 min )

  • Open

    Napoleon’s theorem
    The following theorem is attributed to Napoleon Bonaparte (1769–1821). There’s some debate over whether Napoleon was the first to discover the theorem, but I don’t believe there’s any doubt that the theorem, like Morley’s theorem from the previous post, was discovered a long time after Euclid. Start with any triangle and draw equilateral triangles on […] Napoleon’s theorem first appeared on John D. Cook.  ( 4 min )
    Trilinear coordinates
    The first time I saw a reference to trilinear coordinates I thought this must be another name for barycentric coordinates. It’s not. Barycentric coordinates come up often in applications, such as when working with finite element meshes. Trilinear coordinates are less common, at least in my experience, and yet trilinear coordinates simplify a lot of classical […] Trilinear coordinates first appeared on John D. Cook.  ( 5 min )
    Unexpected symmetry
    Take an arbitrary triangle and draw the trisectors of each angle. Morley’s theorem says that the trisector lines will intersect at the vertices of an equilateral triangle. This theorem is surprising because out of a triangle with no symmetry pops a triangle with three-fold symmetry. The theorem is also historically surprising. It’s a theorem of […] Unexpected symmetry first appeared on John D. Cook.  ( 4 min )
    Elliptic functions of a complex argument in Python
    I used Mathematica to create the graphics for my previous two posts because SciPy didn’t have the functions I needed. In particular, elliptic integrals and elliptic functions in SciPy only take real-valued arguments, but I needed to use complex arguments. Also, I needed theta functions, which are not in SciPy at all. I thought mpmath […] Elliptic functions of a complex argument in Python first appeared on John D. Cook.  ( 4 min )
  • Open

    I found an app that leverages AI to 'mash' 2 faces together pretty seamlessly. Quite impressive and actually a lot of fun to share around with friends.
    https://reddit.com/link/z9652a/video/matcv7lc463a1/player The AI component is interesting, as the resulting face (the combination of two people) is not a real person but looks scarily accurate. Added benefit: the app lets me create an NFT with the end result! Haven't seen anything like this before ... Give it a try, am curious to see where this goes and to hear others' thoughts on the application of AI tech as well. I want to see how many people I can mash faces with, so mash your face with mine here! https://facely.gg/?ref=1375334554&hashedReferrer=7f8e7c25d67b8fd56eb6eb9f91a271a0 submitted by /u/gratefullythickheade [link] [comments]  ( 48 min )
    AI Dream 122 - Lucid MAZE by M.C. Escher
    submitted by /u/LordPewPew777 [link] [comments]  ( 46 min )
    Evil Elf (https://creator.nightcafe.studio/creation/vQRiNmzrxPg192zGGKKH)
    submitted by /u/OtakuLibertarian [link] [comments]  ( 46 min )
    ChatGPT is a GPT-3 chatbot from OpenAI that you can test now
    submitted by /u/much_successes [link] [comments]  ( 48 min )
    I used AI and After Effects to comp together a character selection screen for a game that never existed, I wanted to strike a nostalgic Mortal Combat feel
    submitted by /u/PerryJ [link] [comments]  ( 48 min )
    Short excerpt from my latest, 7min long ai video using mixed techniques, made for my song Jean's Memory, about dementia. Using the instability of the frames to represented the fragmentation of a mind. Link to the full video in comments. Open to questions about the process.
    submitted by /u/defensiveFruit [link] [comments]  ( 80 min )
    Meet ‘Magic3D’: An AI-Based Text-to-3D Content Creation Tool That Creates 3D Mesh Models With Unprecedented Quality
    submitted by /u/ai-lover [link] [comments]  ( 59 min )
    Generative AI - The New Venture Capital (VC) Gold Rush
    Some investors are likening generative AI to the early days of the web, seeing it as a transformative platform shift. US-based VC firm Sequoia sees generative AI as a technology that could generate trillions of dollars of economic value. As the demand for AI-powered content generation accelerates, generative AI start-ups have been garnering significant VC attention despite a broader slowdown in the pace of VC funding. Jasper, an Austin-based start-up, recently raised $125 million in Series A funding at a $1.5 billion valuation. London-based Stability AI also raised $101 million in an oversubscribed round, with investors like Coatue and Lightspeed Venture Partners participating. In May, Hugging Face also raised $100 million in a Series C round at a valuation of $2 billion. The backing from big tech is another stamp of approval for generative AI start-ups. Microsoft has made significant investments in OpenAI and is anticipated to enhance its OpenAI efforts this year. Additionally, Google and Meta have each developed new artificial intelligence tools to produce video from a simple text prompt. Such interest from big tech could also very well spark a wave of M&A in the generative AI space. Even if generative AI output can't yet match human-generated output, businesses and creators see these tools as handy additions to a broader toolbox. Further, a large number of firms are using generative AI to improve efficiency and speed in their operations, providing value for their customers. That said, businesses need to address the ethical concerns that are often associated with AI. Read on... submitted by /u/Sienna_99 [link] [comments]  ( 53 min )
    DaVinci 3 is pretty good.
    submitted by /u/LorestForest [link] [comments]  ( 48 min )
    My project Imagetocartoon is a creative cartoon converter!
    submitted by /u/koalalighting [link] [comments]  ( 46 min )
    I asked OpenAI's DaVinci to write me a poem only a machine would understand and it spat out some binary code. How do I decode it?
    This is what it gave me: 0 1 0 0 0 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 1 I tried using a binary to text converter but I only got some garbage values. Am I doing something wrong? Or do I actually need to be a machine to understand this? submitted by /u/LorestForest [link] [comments]  ( 47 min )
  • Open

    [D] Can area chair ask all reviewers to be in a meeting?
    I recently reviewed an ICLR paper. Opinions about the paper were somewhat divided. The area chair then arranged a meeting with all reviewers to finalize the decision to accept or reject the paper. The area chair said it's his/her first time running this kind of meeting as well. Given the concerns about breaking anonymity and peer pressure, I am wondering: is an area chair allowed to ask all reviewers to participate in a meeting to discuss the decision on a paper? submitted by /u/Least_Pollution7078 [link] [comments]  ( 62 min )
    [D] Understanding EfficientNet Depth Scaling
    When reading the EfficientNet paper to gain ideas on how to efficiently scale neural networks, I wonder how these findings extrapolate to smaller networks with regard to depth. For instance, if I have a width factor of 1.1, this means I increase my output channels by a factor of 1.1, correct? I assume we round to the nearest whole number. But if I have a depth factor of 1.1, and I only have a shallow network with K layers, such that K*1.1 is not a whole number, how do I handle this? For a smaller network, I would have to assume adding a third layer would be more impactful than adding the 100th. So a simple round when the initial number of layers is small feels too approximate, and the scaling properties don't seem appropriate. So does this sort of scaling still hold for smaller networks with lower numbers for width, depth, and image resolution? Example: Depth Factor=1.1 Initial Network = 4 layers Scaling suggests the new network have 4.4 layers. In the case that we decide not to round up to 5 layers here, we are essentially making the depth factor = 1, which would allow us to increase our width and height factors by more to double FLOPS. So it seems that for shallower networks, this sort of scaling needs to be modified. submitted by /u/Oceanboi [link] [comments]  ( 59 min )
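    For what it's worth, the reference EfficientNet code (as I recall it; treat the details below as an assumption) does not round depth to the nearest whole number: repeats are rounded up with a ceiling, and channel widths snap to a multiple of a divisor with a guard against shrinking by more than about 10%.

    ```python
    # Sketch of EfficientNet-style rounding rules for compound scaling.
    import math

    def round_repeats(repeats, depth_coefficient):
        # Depth is rounded *up*, so a shallow stage still grows.
        return int(math.ceil(depth_coefficient * repeats))

    def round_filters(filters, width_coefficient, divisor=8):
        # Widths snap to the nearest multiple of `divisor`.
        filters *= width_coefficient
        new_filters = max(divisor, int(filters + divisor / 2) // divisor * divisor)
        if new_filters < 0.9 * filters:  # never drop more than ~10%
            new_filters += divisor
        return int(new_filters)

    print(round_repeats(4, 1.1))   # 5: ceil(4.4), not a plain round
    print(round_filters(32, 1.1))  # 32: 35.2 snaps to a multiple of 8
    ```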
    [D] Best animal dataset for video object detection?
    I am looking for a well-labeled dataset for animal object detection and identification. This is so I can train a model for animal detection in videos. Does this exist? submitted by /u/LearnMLWithMe [link] [comments]  ( 59 min )
    [R] GLAMI-1M: A Multilingual Image-Text Fashion Dataset
    https://arxiv.org/abs/2211.14451v1 Abstract: We introduce GLAMI-1M: the largest multilingual image-text classification dataset and benchmark. The dataset contains images of fashion products with item descriptions, each in 1 of 13 languages. Categorization into 191 classes has high-quality annotations: all 100k images in the test set and 75% of the 1M training set were human-labeled. The paper presents baselines for image-text classification showing that the dataset presents a challenging fine-grained classification problem: The best scoring EmbraceNet model using both visual and textual features achieves 69.7% accuracy. Experiments with a modified Imagen model show the dataset is also suitable for image generation conditioned on text. The dataset, source code and model checkpoints are published here: https://github.com/glami/glami-1m Image: https://github.com/glami/glami-1m/raw/main/media/glami-1m-dataset-examples.png Video: https://youtu.be/_BRAO6iIKoQ submitted by /u/vackosar [link] [comments]  ( 61 min )
    [D] CPU - which one to choose?
    Hi all! I have to choose between these two CPUs: i5-1235U and RYZEN 7-6850U. It's Intel vs AMD. Which one would you recommend for ML purposes? Mostly pyspark, pandas, numpy, sklearn, and maybe in the future tensorflow or PyTorch. submitted by /u/krzaki_ [link] [comments]  ( 69 min )
    [D] What are the popular research keywords at NeurIPS 2022?
    I'm interested in identifying research trends in AI for the coming year, and it seems very likely that research papers published at NeurIPS this year will anchor upcoming work. Thank you. submitted by /u/ureepamuree [link] [comments]  ( 63 min )
    [D] Slow ONNX GPU Performance
    I recently worked with an ONNX model exported from PyTorch and found that it ran 7x slower in ONNX when run with GPUs. I tried the obvious things like io_binding and that didn't do much to help. Ended up having to change the cudnn_conv_algo_search setting to match the setting on PyTorch. I wrote a detailed article on steps taken to come to that conclusion. Sharing here in case anyone else runs into this same problem or has seen this problem before. Article: https://medium.com/neuml/debug-onnx-gpu-performance-c9290fe07459 submitted by /u/davidmezzetti [link] [comments]  ( 58 min )
    [D] Training an Imagen-like model
    I want to train an Imagen-like model using the LAION dataset. Can someone provide a little guidance on how to prepare the data and how to train? submitted by /u/ANeek181  ( 61 min )
    What is better to study to become a machine learning engineer: a bachelor's in computer science or in data science? [D]
    I want to know which is the best option for easily getting jobs and internships in the machine learning field. submitted by /u/Fun_Helicopter_6540  ( 61 min )
    [P] Sparse Transfer 1000s of Select Hugging Face NLP Transformers
    Hi all, sharing a quick Colab notebook for ML engineers to take a dense transformer NLP model from the Hugging Face Models Hub and sparse transfer it onto a sparse upstream model, giving you a substantial reduction in latency and, ultimately, hardware usage at runtime. :) This notebook uses the SparseML library (open source) for the sparse transfer part and the DeepSparse library for benchmarking the sparse model against its dense variant. If lower latency/higher throughput is important to you in deployment, you may want to give this a try: https://colab.research.google.com/drive/1I5ez6ZpdT0K-yo7l9AXrrJ7tIFoEP8Jv?usp=sharing submitted by /u/Quantum_Stat  ( 60 min )
    [D] Choosing a topic in neural networks
    Hi! I'm taking a course on neural networks, and we have to give a 10-minute presentation about a topic of our choosing and explain it to the rest of the class. It's a beginner course, so we're looking for something not too heavy, but still relevant and fun; it doesn't necessarily have to be recent. We are searching for options, but you might also have something to recommend. Thanks in advance! submitted by /u/Mikesblum  ( 58 min )
    Can an ANN take boundary conditions into consideration? [D]
    I am working with a hybrid model of metaheuristics and an ANN. I know that metaheuristics take boundary conditions into consideration, but I am curious whether an ANN alone can respect these conditions while making predictions. For those who are not familiar with metaheuristics: they optimize the problem provided to them while keeping the output within the upper and lower bounds that we specify. I have not seen this anywhere for neural networks; I have only seen ANNs used for prediction purposes. submitted by /u/Horseman099  ( 56 min )
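    One standard way to make a network respect box constraints by construction, rather than relying on a metaheuristic wrapper, is to squash the output through a sigmoid and rescale it to the feasible interval. A minimal PyTorch sketch, with illustrative bounds and layer sizes:

    ```python
    import torch
    import torch.nn as nn

    class BoundedRegressor(nn.Module):
        """MLP whose output is guaranteed to lie in (lower, upper) by construction."""
        def __init__(self, in_dim: int, lower: float, upper: float):
            super().__init__()
            self.body = nn.Sequential(
                nn.Linear(in_dim, 64), nn.ReLU(),
                nn.Linear(64, 1),
            )
            self.lower, self.upper = lower, upper

        def forward(self, x):
            # sigmoid maps to (0, 1); rescale to the feasible interval
            return self.lower + (self.upper - self.lower) * torch.sigmoid(self.body(x))

    model = BoundedRegressor(in_dim=10, lower=2.0, upper=5.0)
    y = model(torch.randn(4, 10))  # every prediction falls strictly inside (2, 5)
    ```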
    [P] Releasing opensource speech enhancement toolkit: mayavoz
    Almost every ML audio model expects clean audio as input for inference. Unfortunately, in real-world environments, audio is always noisy. To bridge this gap, I am releasing my project #mayavoz: an open-source PyTorch-based audio enhancement toolkit. It's built to save time for audio researchers and practitioners: it provides easy-to-use pre-trained audio enhancement models and facilitates highly customizable model training. Check out mayavoz here: https://github.com/shahules786/mayavoz Give it a ⭐ if you loved it :) submitted by /u/iamikka  ( 69 min )
    [R] PhD Interview Machine Learning
    Dear Community, I am a sociologist doing my doctorate in sociology at the University of Potsdam. In the context of my doctoral thesis, I am investigating the personal understanding of work and the work practice of people who design machine learning algorithms. For my study I am looking for people who are professionally active in this field, whom I can interview about their daily work routine. I am particularly interested in your personal work practices, i.e. "HOW" you do it in your professional work, especially your approach to problem solving and the negotiation processes for finding solutions. I would like to conduct an interview with you, which should take about one hour. The interview can be conducted in person or digitally, as desired. In both cases, an audio recording will be made for empirical analysis. All personal data will be anonymized. The increasing number of users and companies using AI-based solutions makes your field particularly interesting for a sociological analysis. Therefore, I would be very pleased if you were interested and had the time. With kind regards submitted by /u/SozUngl  ( 61 min )
    [D] Does the Transformer need a huge pretraining process?
    Hi! I'm new to the Transformer architecture. Yesterday I went to a conference where I heard about an NLP application for legal documents. I'm kind of curious why the author always started with a Bi-LSTM and only slowly moved on to a Transformer. When I asked him in the Q&A, he replied as stated in the title: because a Transformer needs a huge pre-training step. Is it really true? submitted by /u/minhrongcon2000  ( 60 min )
    Does anyone use an Intel Arc A770 GPU for machine learning? [D]
    The Intel Arc A770 seems to have impressive specs for a dirt-cheap price. Is anyone using this GPU for machine learning? submitted by /u/labloke11  ( 63 min )
    [D] I am at NeurIPS and would like to have a meetup for folks working on production AI systems for vision.
    NeurIPS has been awesome so far. I got to meet lots of awesome folks working on interesting research. However, I would love to meet more engineers and folks who are building AI vision products, as they face a different set of challenges and problems. Would love to exchange learnings and geek out about the space. Anyone down? submitted by /u/No_Specialist1457  ( 60 min )
    [D] Other than data, what are the common problems holding back machine learning/artificial intelligence?
    Also, how are you solving data availability problems in your project or at work? submitted by /u/BadKarma-18  ( 69 min )
    A new RL community in Sydney, Australia [N]
    Hi there! If you are interested in RL: I have started a meetup for RL in Sydney, Australia. Please join to create a community where we can discuss anything RL-related. Cheers, and see you there :) https://www.meetup.com/reinforcement-learning/ submitted by /u/Express-Incident-113  ( 58 min )
    [D] I'm at NeurIPS, AMA
    I've been to a number of NeurIPS conferences so far. I have a PhD, work in industry, and publish here occasionally. Not willing to discuss the identity of my employer, but AMA about anything else. Whatever is on your mind, either about ML in general or NeurIPS specifics. submitted by /u/ThisIsMyStonerAcount  ( 71 min )
    [R] General Intelligence Requires Rethinking Exploration - Minqi Jiang et al 2022 - Learning / exploring in the real world and maintaining open-ended learning processes that continually learn to discover and solve new problems are required!
    Paper: https://arxiv.org/abs/2211.07819 Abstract: We are at the cusp of a transition from "learning from data" to "learning what data to learn from" as a central focus of artificial intelligence (AI) research. While the first-order learning problem is not completely solved, large models under unified architectures, such as transformers, have shifted the learning bottleneck from how to effectively train our models to how to effectively acquire and use task-relevant data. This problem, which we frame as exploration, is a universal aspect of learning in open-ended domains, such as the real world. Although the study of exploration in AI is largely limited to the field of reinforcement learning, we argue that exploration is essential to all learning systems, including supervised learning. We propose the problem of generalized exploration to conceptually unify exploration-driven learning between supervised learning and reinforcement learning, allowing us to highlight key similarities across learning settings and open research challenges. Importantly, generalized exploration serves as a necessary objective for maintaining open-ended learning processes, which in continually learning to discover and solve new problems, provides a promising path to more general intelligence. submitted by /u/Singularian2501  ( 62 min )
    [R] AI Timelines via Cumulative Optimization Power: Less Long, More Short
    https://www.lesswrong.com/posts/3nMpdmt8LrzxQnkGp/ai-timelines-via-cumulative-optimization-power-less-long TLDR: We can best predict the future by using simple models which best postdict the past (à la Bayes/Solomonoff). A simple model based on net training compute postdicts the relative performance of successful biological and artificial neural networks. Extrapolation of this model into the future leads to short AI timelines: ~75% chance of AGI by 2032. A very interesting article. Any thoughts? submitted by /u/ThePerson654321  ( 64 min )
  • Open

    Stability AI builds foundation models on Amazon SageMaker
    We’re thrilled to announce that Stability AI has selected AWS as its preferred cloud provider to power its state-of-the-art AI models for image, language, audio, video, and 3D content generation. Stability AI is a community-driven, open-source artificial intelligence (AI) company developing breakthrough technologies. With Amazon SageMaker, Stability AI will build AI models on compute clusters […]  ( 4 min )
    Launch Amazon SageMaker Autopilot experiments directly from within Amazon SageMaker Pipelines to easily automate MLOps workflows
    Amazon SageMaker Autopilot, a low-code machine learning (ML) service that automatically builds, trains, and tunes the best ML models based on tabular data, is now integrated with Amazon SageMaker Pipelines, the first purpose-built continuous integration and continuous delivery (CI/CD) service for ML. This enables the automation of an end-to-end flow of building ML models using […]  ( 8 min )
    AI21 Jurassic-1 foundation model is now available on Amazon SageMaker
    Today we are excited to announce that AI21 Jurassic-1 (J1) foundation models are available for customers using Amazon SageMaker. Jurassic-1 models are highly versatile, capable of both human-like text generation, as well as solving complex tasks such as question answering, text classification, and many others. You can easily try out this model and use it […]  ( 7 min )
    Introducing AWS AI Service Cards: A new resource to enhance transparency and advance responsible AI
    Artificial intelligence (AI) and machine learning (ML) are some of the most transformative technologies we will encounter in our generation—to tackle business and societal problems, improve customer experiences, and spur innovation. Along with the widespread use and growing scale of AI comes the recognition that we must all build responsibly. At AWS, we think responsible […]  ( 7 min )
  • Open

    QMIX not working with negative rewards
    I'm trying to use QMIX in a multi-agent environment with negative rewards. QMIX takes the max Q value of each agent and transforms it into a global Q value with a mixer network generated by a set of hypernetworks. The mixer network has only positive weights so that the monotonicity constraint is not broken, and the max Q value of each agent can be used at test time in a decentralized manner. My problem is that the mixer network always predicts positive Q values even when rewards are always negative. I am logging the predicted Q_tot values and target Q_tot values and they are always positive, and if they start negative they increase over time, no matter what target update rule I use (soft/hard, with different numbers of steps and values of tau). The same code without the mixer network (i.e. standard DQN) works correctly. Do you have any advice on how to debug this? If it helps, this is the implementation of the mixer forward: https://gist.github.com/fedetask/2e6692381c579e7dd8a5c150c5a8eb52 Here q_values is a (batch_size, n_agents) tensor, while global_state has shape (batch_size, state_size). I compute the agent Q values by taking the max of each individual Q network and concatenating them together, and I pass them to the mixer network. Then I do the same with the target Q networks for the next states, and pass them to the target mixer network along with the next global state. The TD error is computed as global_reward + gamma * q_tot_target - q_tot. submitted by /u/fedetask  ( 59 min )
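    One way to debug this is to diff the linked gist against a minimal reference mixer. A common culprit in this failure mode is the positivity constraint (abs or ReLU) leaking onto the biases or onto the final state-value head V(s): only the hypernetwork weights need to be non-negative for monotonicity, and V(s) in particular must stay unconstrained, since the ELU-mixed term is bounded below and V(s) is what lets Q_tot track arbitrarily negative returns. A sketch of such a mixer, with illustrative dimensions:

    ```python
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class QMixer(nn.Module):
        """Minimal QMIX-style mixer: abs() on the hypernetwork *weights* only."""
        def __init__(self, n_agents: int, state_dim: int, embed: int = 32):
            super().__init__()
            self.n_agents, self.embed = n_agents, embed
            self.hyper_w1 = nn.Linear(state_dim, n_agents * embed)
            self.hyper_b1 = nn.Linear(state_dim, embed)   # bias: unconstrained
            self.hyper_w2 = nn.Linear(state_dim, embed)
            # V(s): no abs and no final ReLU; this head carries the sign of Q_tot
            self.V = nn.Sequential(nn.Linear(state_dim, embed), nn.ReLU(),
                                   nn.Linear(embed, 1))

        def forward(self, q_values, state):
            # q_values: (batch, n_agents), state: (batch, state_dim)
            b = q_values.size(0)
            w1 = torch.abs(self.hyper_w1(state)).view(b, self.n_agents, self.embed)
            b1 = self.hyper_b1(state).view(b, 1, self.embed)
            hidden = F.elu(torch.bmm(q_values.view(b, 1, self.n_agents), w1) + b1)
            w2 = torch.abs(self.hyper_w2(state)).view(b, self.embed, 1)
            return (torch.bmm(hidden, w2) + self.V(state).view(b, 1, 1)).view(b, 1)
    ```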
    Is reinforcement learning funding increasing or decreasing?
    Hey guys, I'm thinking about doing a PhD in reinforcement learning, and I want to know if it will still have ample opportunities or if it's decreasing in popularity. submitted by /u/Turkeydunk  ( 63 min )
    What do you think about a loss defined from the reward as L = exp(-R) instead of L = -R? To minimize L, the agent has to maximize R.
    Theoretically, L = exp(-R) should change in proportion to its own current value when the update is discretized into small steps, as with a learning rate alpha. But -R also rises by its current value... so is it the same? What other benefits can you find? Linearity is the best choice in most cases, except when one needs to minimize prediction error. submitted by /u/Timur_1988  ( 51 min )
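    The difference is easiest to see in the gradients: with L = -R, dL/dR = -1 regardless of the current reward, while with L = exp(-R), dL/dR = -exp(-R) = -L, so step sizes shrink exponentially as R grows and blow up when R is very negative. A quick numerical check:

    ```python
    import torch

    R = torch.tensor([-2.0, 0.0, 2.0], requires_grad=True)

    # Linear loss: the gradient is a constant -1 at every reward level
    (-R).sum().backward()
    print(R.grad)   # tensor([-1., -1., -1.])

    R.grad = None
    # Exponential loss: the gradient is -exp(-R), i.e. minus the loss itself,
    # so updates fade as R improves and explode for very negative R
    torch.exp(-R).sum().backward()
    print(R.grad)   # tensor([-7.3891, -1.0000, -0.1353])
    ```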
    How does the seed (initial value) fed to deep RL/RL algorithms affect performance? Does it lead to divergence or have any major effect, or is it just a hyperparameter? Is there any way to nullify the effects of the initial value? Does anyone have any material on this?
    Does the initial value fed to the RL algorithm have any significant effect? submitted by /u/aabra__ka__daabra  ( 61 min )
    Does Q-learning converge under a different maximization objective?
    Given an update rule for Q-learning: a' = argmax_u f(Q(s', u)); Q(s, a) <-- Q(s, a) + alpha * [r + gamma * Q(s', a') - Q(s, a)]. This is the common Q-learning update rule, except that the action a' is the one that maximizes Q(s', u) transformed by a function f, which can be non-monotonic. Does Q-learning converge to the policy that maximizes the transformed objective? submitted by /u/fedetask  ( 60 min )
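    One reading of the proposed rule in the tabular setting is sketched below: the action is chosen greedily with respect to the transformed values, but the bootstrap target still uses the untransformed Q. With a non-monotonic f the two argmaxes can differ, so the target is no longer the Bellman optimality backup and the standard convergence argument for Q-learning does not directly apply.

    ```python
    import numpy as np

    def transformed_q_update(Q, s, a, r, s_next, f, alpha=0.1, gamma=0.99):
        """One step of: a' = argmax_u f(Q(s', u));
        Q(s, a) += alpha * (r + gamma * Q(s', a') - Q(s, a))."""
        a_next = np.argmax(f(Q[s_next]))           # greedy w.r.t. transformed values
        td_target = r + gamma * Q[s_next, a_next]  # bootstrap with *untransformed* Q
        Q[s, a] += alpha * (td_target - Q[s, a])
        return Q
    ```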
    Optimality in PPO
    Hello everyone, I have a question concerning convergence in PPO. I'm currently training a recurrent PPO agent on a positioning task (the goal is to reach a certain position and orientation in space) using raw visual inputs. The training is going well, but the agent is unable to reach the exact positions I want (which can be read as being stuck in a local optimum). I'm using a shaped reward scaled to [0, 1]. Is this problem linked to the reward function, the hyperparameters, or PPO itself? Thanks in advance. submitted by /u/Many_Reception_4921  ( 63 min )
    How do we deploy a Reinforcement Learning Algorithm on a Microcontroller?
    submitted by /u/Final-Batz  ( 55 min )
    Connections/compatibility between Approximation and Search in RL?
    Dear RL community on reddit, most (if not all) deep RL methods use a combination of function approximation (classically through NNs approximating, for example, a value function, as in AlphaZero) and search (such as Monte Carlo tree search or A*). This duality is something I was thinking a lot about recently. For example, in AlphaZero the value approximation used to evaluate a position is learned by training it to match the Monte Carlo sampled expected reward. It appears to me that, in some way, the search procedure is engraved into the approximator as a 'lite version' (for lack of a better word): the approximator basically learns to evaluate the position without actually searching from the current state. I have no proof that this duality is actually bad, but I have a feeling that these concepts should somehow be combined in one algorithm with a more coherent integration of search and approximation principles. I would enjoy reading your thoughts on this. submitted by /u/Tobiwan663  ( 61 min )
    A new RL community in Sydney, Australia
    Hi, we now have a new RL meetup group in Sydney, Australia. Please join us if you are in Sydney and interested in discussing anything RL-related. Cheers, and see you there :) https://www.meetup.com/reinforcement-learning/ submitted by /u/Express-Incident-113  ( 56 min )
  • Open

    Ushering in a new era of computing
    Dan Huttenlocher is a professor of electrical engineering and computer science and the inaugural dean at MIT Schwarzman College of Computing.  ( 9 min )
  • Open

    ChatGPT: Optimizing Language Models for Dialogue
    We’ve trained a model called ChatGPT which interacts in a conversational way. The dialogue format makes it possible for ChatGPT to answer followup questions, admit its mistakes, challenge incorrect premises, and reject inappropriate requests. ChatGPT is a sibling model to InstructGPT, which is trained to follow an instruction  ( 12 min )
  • Open

    Making a Traversable Wormhole with a Quantum Computer
    Posted by Alexander Zlokapa, Student Researcher, and Hartmut Neven, VP of Engineering, Quantum AI Team Wormholes — wrinkles in the fabric of spacetime that connect two disparate locations — may seem like the stuff of science fiction. But whether or not they exist in reality, studying these hypothetical objects could be the key to making concrete the tantalizing link between information and matter that has bedeviled physicists for decades. Surprisingly, a quantum computer is an ideal platform to investigate this connection. The trick is to use a correspondence called AdS/CFT, which establishes an equivalence between a theory that describes gravity and spacetime (and wormholes) in a fictional world with a special geometry (AdS) and a quantum theory that does not contain gravity at all …  ( 94 min )
  • Open

    How Quickly Can You Get Approval for a Personal Loan for Business
    There's no standard waiting time for approval for a personal loan, as it depends on factors like the type of lender or institution, the approval process, and credit history. Further, it depends on the type of loan requested, as there are different types of personal loans, from payday loans to title loans. The approval time… Read More »How Quickly Can You Get Approval for a Personal Loan for Business The post How Quickly Can You Get Approval for a Personal Loan for Business appeared first on Data Science Central.  ( 20 min )
  • Open

    Qubit Pharmaceuticals Accelerates Drug Discovery With Hybrid Quantum Computing
    The promise of quantum computing is to solve unsolvable problems. And companies are already making headway with hybrid approaches — those that combine classical and quantum computing — to tackle challenges like drug discovery for incurable diseases. By accelerating drug molecule simulation and modeling with hybrid quantum computing, startup Qubit Pharmaceuticals is significantly reducing the Read article > The post Qubit Pharmaceuticals Accelerates Drug Discovery With Hybrid Quantum Computing appeared first on NVIDIA Blog.  ( 5 min )
  • Open

    How to combine NER with sentiment analysis in a single model?
    I just started learning about NLP applications and understand that it is possible to train a model on the dataset you give it. E.g., for sentiment analysis you give it the text input as well as the expected sentiment output. But how would you create/train a model that combines multiple NLP tasks? Specifically, for a model doing NER with sentiment analysis, would you just train it on a dataset that contains input text and expected output in the form of a list of key-value pairs, where key=entity and value=sentiment? How does the model know how to make use of the way this key-value data is structured/formatted? Perhaps I need to learn about the models themselves instead of how to use/train them. Any pointers to books/references that would help me learn about what I am trying to do are much appreciated. Thank you! submitted by /u/brooksbp  ( 50 min )
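    One common way to fold both tasks into a single model is to cross the BIO entity tags with the sentiment labels, so the pair becomes one flat token-classification problem; the model never sees any key-value structure, and the entity-sentiment pairing is recovered when decoding the predicted tag sequence. A small sketch of building such a label scheme (the entity and sentiment inventories here are illustrative):

    ```python
    from itertools import product

    # Cross entity types with sentiments into one flat tag set: the model just
    # does per-token classification over these labels.
    entities = ["PER", "ORG", "PROD"]
    sentiments = ["pos", "neg", "neu"]
    labels = ["O"] + [f"{bio}-{ent}-{sent}"
                      for bio, (ent, sent) in product(["B", "I"],
                                                      product(entities, sentiments))]
    label2id = {lab: i for i, lab in enumerate(labels)}

    # "Apple disappointed users": Apple is an ORG carrying negative sentiment.
    tokens = ["Apple", "disappointed", "users"]
    tags   = ["B-ORG-neg", "O", "O"]
    y = [label2id[t] for t in tags]
    print(len(labels), y)  # 19 tags total; targets for a standard token classifier
    ```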
    Topics in neural networks
    Hi! I'm taking a course on neural networks, and we have to give a 10-minute presentation about a topic of our choosing and explain it to the rest of the class. It's a beginner course, so we're looking for something not too heavy, but still relevant and fun. We are looking for options, but you might have something in mind as well. Thanks in advance! submitted by /u/Mikesblum  ( 44 min )
  • Open

    Controlled Sparsity via Constrained Optimization or: How I Learned to Stop Tuning Penalties and Love Constraints. (arXiv:2208.04425v2 [cs.LG] UPDATED)
    The performance of trained neural networks is robust to harsh levels of pruning. Coupled with the ever-growing size of deep learning models, this observation has motivated extensive research on learning sparse models. In this work, we focus on the task of controlling the level of sparsity when performing sparse learning. Existing methods based on sparsity-inducing penalties involve expensive trial-and-error tuning of the penalty factor, thus lacking direct control of the resulting model sparsity. In response, we adopt a constrained formulation: using the gate mechanism proposed by Louizos et al. (2018), we formulate a constrained optimization problem where sparsification is guided by the training objective and the desired sparsity target in an end-to-end fashion. Experiments on CIFAR-{10, 100}, TinyImageNet, and ImageNet using WideResNet and ResNet{18, 50} models validate the effectiveness of our proposal and demonstrate that we can reliably achieve pre-determined sparsity targets without compromising on predictive performance.
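    To make the constrained formulation concrete, here is a generic gradient descent-ascent sketch of the idea; this is not the paper's exact algorithm, and the gate machinery of Louizos et al. is abstracted into an `expected_density` tensor assumed differentiable w.r.t. the model parameters.

    ```python
    def constrained_step(task_loss, expected_density, target_density,
                         lam, optimizer, lr_lambda=0.01):
        """One descent-ascent step on L(theta, lam) = task_loss + lam * violation.
        Theta descends; lam ascends on the constraint violation and is clamped
        at 0, so no sparsity penalty factor has to be hand-tuned."""
        violation = expected_density - target_density   # tensor; <= 0 when satisfied
        lagrangian = task_loss + lam * violation
        optimizer.zero_grad()
        lagrangian.backward()
        optimizer.step()
        # dual ascent on the multiplier
        lam = max(0.0, lam + lr_lambda * violation.item())
        return lam
    ```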
    Statistical Learning and Inverse Problems: A Stochastic Gradient Approach. (arXiv:2209.14967v3 [stat.ML] UPDATED)
    Inverse problems are paramount in Science and Engineering. In this paper, we consider the setup of Statistical Inverse Problem (SIP) and demonstrate how Stochastic Gradient Descent (SGD) algorithms can be used in the linear SIP setting. We provide consistency and finite sample bounds for the excess risk. We also propose a modification for the SGD algorithm where we leverage machine learning methods to smooth the stochastic gradients and improve empirical performance. We exemplify the algorithm in a setting of great interest nowadays: the Functional Linear Regression model. In this case we consider a synthetic data example and examples with a real data classification problem.
    Bayesian Optimization-based Combinatorial Assignment. (arXiv:2208.14698v3 [cs.LG] UPDATED)
    We study the combinatorial assignment domain, which includes combinatorial auctions and course allocation. The main challenge in this domain is that the bundle space grows exponentially in the number of items. To address this, several papers have recently proposed machine learning-based preference elicitation algorithms that aim to elicit only the most important information from agents. However, the main shortcoming of this prior work is that it does not model a mechanism's uncertainty over values for not yet elicited bundles. In this paper, we address this shortcoming by presenting a Bayesian Optimization-based Combinatorial Assignment (BOCA) mechanism. Our key technical contribution is to integrate a method for capturing model uncertainty into an iterative combinatorial auction mechanism. Concretely, we design a new method for estimating an upper uncertainty bound that can be used to define an acquisition function to determine the next query to the agents. This enables the mechanism to properly explore (and not just exploit) the bundle space during its preference elicitation phase. We run computational experiments in several spectrum auction domains to evaluate BOCA's performance. Our results show that BOCA achieves higher allocative efficiency than state-of-the-art approaches.
    An Attention Matrix for Every Decision: Faithfulness-based Arbitration Among Multiple Attention-Based Interpretations of Transformers in Text Classification. (arXiv:2209.10876v2 [cs.CL] UPDATED)
    Transformers are widely used in natural language processing, where they consistently achieve state-of-the-art performance. This is mainly due to their attention-based architecture, which allows them to model rich linguistic relations between (sub)words. However, transformers are difficult to interpret. Being able to provide reasoning for its decisions is an important property for a model in domains where human lives are affected. With transformers finding wide use in such fields, the need for interpretability techniques tailored to them arises. We propose a new technique that selects the most faithful attention-based interpretation among the several ones that can be obtained by combining different head, layer and matrix operations. In addition, two variations are introduced towards (i) reducing the computational complexity, thus being faster and friendlier to the environment, and (ii) enhancing the performance in multi-label data. We further propose a new faithfulness metric that is more suitable for transformer models and exhibits high correlation with the area under the precision-recall curve based on ground truth rationales. We validate the utility of our contributions with a series of quantitative and qualitative experiments on seven datasets.
    ReAct: Synergizing Reasoning and Acting in Language Models. (arXiv:2210.03629v2 [cs.CL] UPDATED)
    While large language models (LLMs) have demonstrated impressive capabilities across tasks in language understanding and interactive decision making, their abilities for reasoning (e.g. chain-of-thought prompting) and acting (e.g. action plan generation) have primarily been studied as separate topics. In this paper, we explore the use of LLMs to generate both reasoning traces and task-specific actions in an interleaved manner, allowing for greater synergy between the two: reasoning traces help the model induce, track, and update action plans as well as handle exceptions, while actions allow it to interface with external sources, such as knowledge bases or environments, to gather additional information. We apply our approach, named ReAct, to a diverse set of language and decision making tasks and demonstrate its effectiveness over state-of-the-art baselines, as well as improved human interpretability and trustworthiness over methods without reasoning or acting components. Concretely, on question answering (HotpotQA) and fact verification (Fever), ReAct overcomes issues of hallucination and error propagation prevalent in chain-of-thought reasoning by interacting with a simple Wikipedia API, and generates human-like task-solving trajectories that are more interpretable than baselines without reasoning traces. On two interactive decision making benchmarks (ALFWorld and WebShop), ReAct outperforms imitation and reinforcement learning methods by an absolute success rate of 34% and 10% respectively, while being prompted with only one or two in-context examples. Project site with code: https://react-lm.github.io
    Catch Me if You Can: A Novel Task for Detection of Covert Geo-Locations (CGL). (arXiv:2202.02567v1 [cs.CV] CROSS LISTED)
    Most visual scene understanding tasks in the field of computer vision involve identification of the objects present in the scene. Image regions like hideouts, turns, & other obscured regions of the scene also contain crucial information, for specific surveillance tasks. Task proposed in this paper involves the design of an intelligent visual aid for identification of such locations in an image, which has either the potential to create an imminent threat from an adversary or appear as the target zones needing further investigation. Covert places (CGL) for hiding behind an occluding object are concealed 3D locations, not detectable from the viewpoint (camera). Hence this involves delineating specific image regions around the projections of outer boundary of the occluding objects, as places to be accessed around the potential hideouts. CGL detection finds applications in military counter-insurgency operations, surveillance with path planning for an exploratory robot. Given an RGB image, the goal is to identify all CGLs in the 2D scene. Identification of such regions would require knowledge about the 3D boundaries of obscuring items (pillars, furniture), their spatial location with respect to the neighboring regions of the scene. We propose this as a novel task, termed Covert Geo-Location (CGL) Detection. Classification of any region of an image as a CGL (as boundary sub-segments of an occluding object that conceals the hideout) requires examining the 3D relation between boundaries of occluding objects and their neighborhoods & surroundings. Our method successfully extracts relevant depth features from a single RGB image and quantitatively yields significant improvement over existing object detection and segmentation models adapted and trained for CGL detection. We also introduce a novel hand-annotated CGL detection dataset containing 1.5K real-world images for experimentation.
    Reducing Computational Complexity of Neural Networks in Optical Channel Equalization: From Concepts to Implementation. (arXiv:2208.12866v2 [eess.SP] UPDATED)
    In this paper, a new methodology is proposed that allows for the low-complexity development of neural network (NN) based equalizers for the mitigation of impairments in high-speed coherent optical transmission systems. In this work, we provide a comprehensive description and comparison of various deep model compression approaches that have been applied to feed-forward and recurrent NN designs. Additionally, we evaluate the influence these strategies have on the performance of each NN equalizer. Quantization, weight clustering, pruning, and other cutting-edge strategies for model compression are taken into consideration. In this work, we propose and evaluate a Bayesian optimization-assisted compression, in which the hyperparameters of the compression are chosen to simultaneously reduce complexity and improve performance. In conclusion, the trade-off between the complexity of each compression approach and its performance is evaluated by utilizing both simulated and experimental data in order to complete the analysis. By utilizing optimal compression approaches, we show that it is possible to design an NN-based equalizer that is simpler to implement and has better performance than the conventional digital back-propagation (DBP) equalizer with only one step per span. This is accomplished by reducing the number of multipliers used in the NN equalizer after applying the weighted clustering and pruning algorithms. Furthermore, we demonstrate that an equalizer based on NN can also achieve superior performance while still maintaining the same degree of complexity as the full electronic chromatic dispersion compensation block. We conclude our analysis by highlighting open questions and existing challenges, as well as possible future research directions.
    The European AI Liability Directives -- Critique of a Half-Hearted Approach and Lessons for the Future. (arXiv:2211.13960v2 [cs.CY] UPDATED)
    The optimal liability framework for AI systems remains an unsolved problem across the globe. In a much-anticipated move, the European Commission advanced two proposals outlining the European approach to AI liability in September 2022: a novel AI Liability Directive and a revision of the Product Liability Directive. They constitute the final, and much-anticipated, cornerstone of AI regulation in the EU. Crucially, the liability proposals and the EU AI Act are inherently intertwined: the latter does not contain any individual rights of affected persons, and the former lack specific, substantive rules on AI development and deployment. Taken together, these acts may well trigger a Brussels effect in AI regulation, with significant consequences for the US and other countries. This paper makes three novel contributions. First, it examines in detail the Commission proposals and shows that, while making steps in the right direction, they ultimately represent a half-hearted approach: if enacted as foreseen, AI liability in the EU will primarily rest on disclosure of evidence mechanisms and a set of narrowly defined presumptions concerning fault, defectiveness and causality. Hence, second, the article suggests amendments, which are collected in an Annex at the end of the paper. Third, based on an analysis of the key risks AI poses, the final part of the paper maps out a road for the future of AI liability and regulation, in the EU and beyond. This includes: a comprehensive framework for AI liability; provisions to support innovation; an extension to non-discrimination/algorithmic fairness, as well as explainable AI; and sustainability. I propose to jump-start sustainable AI regulation via sustainability impact assessments in the AI Act and sustainable design defects in the liability regime. In this way, the law may help spur not only fair AI and XAI, but potentially also sustainable AI (SAI).
    Scaling Up Dynamic Graph Representation Learning via Spiking Neural Networks. (arXiv:2208.10364v2 [cs.NE] UPDATED)
    Recent years have seen a surge in research on dynamic graph representation learning, which aims to model temporal graphs that are dynamic and evolving constantly over time. However, current work typically models graph dynamics with recurrent neural networks (RNNs), making them suffer seriously from computation and memory overheads on large temporal graphs. So far, scalability of dynamic graph representation learning on large temporal graphs remains one of the major challenges. In this paper, we present a scalable framework, namely SpikeNet, to efficiently capture the temporal and structural patterns of temporal graphs. We explore a new direction in that we can capture the evolving dynamics of temporal graphs with spiking neural networks (SNNs) instead of RNNs. As a low-power alternative to RNNs, SNNs explicitly model graph dynamics as spike trains of neuron populations and enable spike-based propagation in an efficient way. Experiments on three large real-world temporal graph datasets demonstrate that SpikeNet outperforms strong baselines on the temporal node classification task with lower computational costs. Particularly, SpikeNet generalizes to a large temporal graph (2.7M nodes and 13.9M edges) with significantly fewer parameters and computation overheads.
    Spectral Diffusion Processes. (arXiv:2209.14125v2 [stat.ML] UPDATED)
    Score-based generative modelling (SGM) has proven to be a very effective method for modelling densities on finite-dimensional spaces. In this work we propose to extend this methodology to learn generative models over functional spaces. To do so, we represent functional data in spectral space to dissociate the stochastic part of the processes from their space-time part. Using dimensionality reduction techniques we then sample from their stochastic component using finite dimensional SGM. We demonstrate our method's effectiveness for modelling various multimodal datasets.
    An Interpretable Joint Nonnegative Matrix Factorization-Based Point Cloud Distance Measure. (arXiv:2207.05112v2 [cs.LG] UPDATED)
    In this paper, we propose a new method for determining shared features of and measuring the distance between data sets or point clouds. Our approach uses the joint factorization of two data matrices $X_1,X_2$ into non-negative matrices $X_1 = AS_1, X_2 = AS_2$ to derive a similarity measure that determines how well the shared basis $A$ approximates $X_1, X_2$. We also propose a point cloud distance measure built upon this method and the learned factorization. Our method reveals structural differences in both image and text data. Potential applications include classification, detecting plagiarism or other manipulation, data denoising, and transfer learning.
    High-precision Density Mapping of Marine Debris and Floating Plastics via Satellite Imagery. (arXiv:2210.05468v2 [eess.IV] UPDATED)
    Combining multi-spectral satellite data and machine learning has been suggested as a method for monitoring plastic pollutants in the ocean environment. Recent studies have made theoretical progress regarding the identification of marine plastic via machine learning. However, no study has assessed the application of these methods for mapping and monitoring marine-plastic density. As such, this paper comprises three main components: (1) the development of a machine learning model, (2) the construction of the MAP-Mapper, an automated tool for mapping marine-plastic density, and finally (3) an evaluation of the whole system for out-of-distribution test locations. The findings from this paper leverage the fact that machine learning models need to be high-precision to reduce the impact of false positives on results. The developed MAP-Mapper architectures provide users choices to reach high-precision ($\textit{abbv.}$ -HP) or optimum precision-recall ($\textit{abbv.}$ -Opt) values in terms of the training/test data set. Our MAP-Mapper-HP model greatly increased the precision of plastic detection to 95\%, whilst MAP-Mapper-Opt reaches precision-recall pair of 87\%-88\%. The MAP-Mapper contributes to the literature with the first tool to exploit advanced deep/machine learning and multi-spectral imagery to map marine-plastic density in automated software. The proposed data pipeline has taken a novel approach to map plastic density in ocean regions. As such, this enables an initial assessment of the challenges and opportunities of this method to help guide future work and scientific study.
    Learning with an Evolving Class Ontology. (arXiv:2210.04993v3 [cs.CV] UPDATED)
    Lifelong learners must recognize concept vocabularies that evolve over time. A common yet underexplored scenario is learning with class labels over time that refine/expand old classes. For example, humans learn to recognize ${\tt dog}$ before dog breeds. In practical settings, dataset $\textit{versioning}$ often introduces refinement to ontologies, such as autonomous vehicle benchmarks that refine a previous ${\tt vehicle}$ class into ${\tt school-bus}$ as autonomous operations expand to new cities. This paper formalizes a protocol for studying the problem of $\textit{Learning with Evolving Class Ontology}$ (LECO). LECO requires learning classifiers in distinct time periods (TPs); each TP introduces a new ontology of "fine" labels that refines old ontologies of "coarse" labels (e.g., dog breeds that refine the previous ${\tt dog}$). LECO explores such questions as whether to annotate new data or relabel the old, how to leverage coarse labels, and whether to finetune the previous TP's model or train from scratch. To answer these questions, we leverage insights from related problems such as class-incremental learning. We validate them under the LECO protocol through the lens of image classification (CIFAR and iNaturalist) and semantic segmentation (Mapillary). Our experiments lead to surprising conclusions; while the current status quo is to relabel existing datasets with new ontologies (such as COCO-to-LVIS or Mapillary1.2-to-2.0), LECO demonstrates that a far better strategy is to annotate $\textit{new}$ data with the new ontology. However, this produces an aggregate dataset with inconsistent old-vs-new labels, complicating learning. To address this challenge, we adopt methods from semi-supervised and partial-label learning. Such strategies can surprisingly be made near-optimal, approaching an "oracle" that learns on the aggregate dataset exhaustively labeled with the newest ontology.
    AI for Porosity and Permeability Prediction from Geologic Core X-Ray Micro-Tomography. (arXiv:2205.13189v2 [cs.LG] UPDATED)
    Geologic cores are rock samples that are extracted from deep under the ground during the well drilling process. They are used for petroleum reservoirs' performance characterization. Traditionally, physical studies of cores are carried out by the means of manual time-consuming experiments. With the development of deep learning, scientists actively started working on developing machine-learning-based approaches to identify physical properties without any manual experiments. Several previous works used machine learning to determine the porosity and permeability of the rocks, but these methods were either inaccurate or computationally expensive. We are proposing to use self-supervised pretraining of the very small CNN-transformer-based model to predict the physical properties of the rocks with high accuracy in a time-efficient manner. We show that this technique prevents overfitting even for extremely small datasets. Github: https://github.com/Shahbozjon/porosity-and-permeability-prediction
    A Data Driven Method for Multi-step Prediction of Ship Roll Motion in High Sea States. (arXiv:2207.12673v2 [cs.LG] UPDATED)
    Ship roll motion in high sea state has large amplitude and nonlinear dynamics, and its prediction is significant for the operability, safety and survivability. This paper presents a novel data-driven methodology to provide multi-step prediction of the ship roll motion in high sea states. A hybrid neural network is proposed that combines long short-term memory (LSTM) and convolutional neural network (CNN) in parallel. The motivation is to extract the nonlinear dynamics characteristics and the hydrodynamic memory information through the advantage of CNN and LSTM, respectively. For the feature selection, the time histories of motion states and wave heights are selected to involve sufficient information. Taking a scaled KCS as the study object, the ship motions in sea state 7 irregular long crested waves are simulated and used for the validation. The results show that at least one period of roll motion can be accurately predicted by using the proposed method. Compared with the single LSTM and CNN method, the proposed method has better performance in the prediction of the amplitude of roll angles. Besides, the comparison results also demonstrate that selecting motion states and wave heights as feature space improves the prediction accuracy, verifying the effectiveness of the proposed method.
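    The parallel hybrid described in the abstract can be sketched roughly as follows; dimensions and feature choices are illustrative, not the paper's.

    ```python
    import torch
    import torch.nn as nn

    class ParallelCNNLSTM(nn.Module):
        """A CNN branch and an LSTM branch run in parallel over the same window
        of motion states and wave heights; their features are fused to predict
        the next `horizon` roll-angle steps."""
        def __init__(self, n_features: int = 4, horizon: int = 20):
            super().__init__()
            self.cnn = nn.Sequential(                  # local nonlinear dynamics
                nn.Conv1d(n_features, 32, kernel_size=5, padding=2), nn.ReLU(),
                nn.AdaptiveAvgPool1d(1),
            )
            self.lstm = nn.LSTM(n_features, 32, batch_first=True)  # memory effects
            self.head = nn.Linear(32 + 32, horizon)

        def forward(self, x):                          # x: (batch, time, features)
            c = self.cnn(x.transpose(1, 2)).squeeze(-1)   # (batch, 32)
            h = self.lstm(x)[0][:, -1]                    # last hidden state
            return self.head(torch.cat([c, h], dim=-1))   # (batch, horizon)
    ```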
    Application of Deep Q Learning with Simulation Results for Elevator Optimization. (arXiv:2210.00065v2 [cs.LG] UPDATED)
    This paper presents a methodology for combining programming and mathematics to optimize elevator wait times. Based on simulated user data generated according to the canonical three-peak model of elevator traffic, we first develop a naive model from an intuitive understanding of the logic behind elevators. We take into consideration a general array of features including capacity, acceleration, and maximum wait time thresholds to adequately model realistic circumstances. Using the same evaluation framework, we proceed to develop a Deep Q Learning model in an attempt to match the hard-coded naive approach for elevator control. Throughout the majority of the paper, we work under a Markov Decision Process (MDP) schema, but later explore how the assumption fails to characterize the highly stochastic overall Elevator Group Control System (EGCS).
    GLCC: A General Framework for Graph-level Clustering. (arXiv:2210.11879v3 [cs.LG] UPDATED)
    This paper studies the problem of graph-level clustering, which is a novel yet challenging task. This problem is critical in a variety of real-world applications such as protein clustering and genome analysis in bioinformatics. Recent years have witnessed the success of deep clustering coupled with graph neural networks (GNNs). However, existing methods focus on clustering among nodes given a single graph, while exploring clustering on multiple graphs is still under-explored. In this paper, we propose a general graph-level clustering framework named Graph-Level Contrastive Clustering (GLCC) given multiple graphs. Specifically, GLCC first constructs an adaptive affinity graph to explore instance- and cluster-level contrastive learning (CL). Instance-level CL leverages graph Laplacian based contrastive loss to learn clustering-friendly representations while cluster-level CL captures discriminative cluster representations incorporating neighbor information of each sample. Moreover, we utilize neighbor-aware pseudo-labels to reward the optimization of representation learning. The two steps can be alternatively trained to collaborate and benefit each other. Experiments on a range of well-known datasets demonstrate the superiority of our proposed GLCC over competitive baselines.
    Generalizing Downsampling from Regular Data to Graphs. (arXiv:2208.03523v2 [cs.LG] UPDATED)
    Downsampling produces coarsened, multi-resolution representations of data and it is used, for example, to produce lossy compression and visualization of large images, reduce computational costs, and boost deep neural representation learning. Unfortunately, due to their lack of a regular structure, there is still no consensus on how downsampling should apply to graphs and linked data. Indeed reductions in graph data are still needed for the goals described above, but reduction mechanisms do not have the same focus on preserving topological structures and properties, while allowing for resolution-tuning, as is the case in regular data downsampling. In this paper, we take a step in this direction, introducing a unifying interpretation of downsampling in regular and graph data. In particular, we define a graph coarsening mechanism which is a graph-structured counterpart of controllable equispaced coarsening mechanisms in regular data. We prove theoretical guarantees for distortion bounds on path lengths, as well as the ability to preserve key topological properties in the coarsened graphs. We leverage these concepts to define a graph pooling mechanism that we empirically assess in graph classification tasks, providing a greedy algorithm that allows efficient parallel implementation on GPUs, and showing that it compares favorably against pooling methods in literature.
    Simplifying Clustering with Graph Neural Networks. (arXiv:2207.08779v2 [cs.LG] UPDATED)
    The objective functions used in spectral clustering are usually composed of two terms: i) a term that minimizes the local quadratic variation of the cluster assignments on the graph and; ii) a term that balances the clustering partition and helps avoiding degenerate solutions. This paper shows that a graph neural network, equipped with suitable message passing layers, can generate good cluster assignments by optimizing only a balancing term. Results on attributed graph datasets show the effectiveness of the proposed approach in terms of clustering performance and computation time.
    Doubly-Asynchronous Value Iteration: Making Value Iteration Asynchronous in Actions. (arXiv:2207.01613v2 [cs.LG] UPDATED)
    Value iteration (VI) is a foundational dynamic programming method, important for learning and planning in optimal control and reinforcement learning. VI proceeds in batches, where the update to the value of each state must be completed before the next batch of updates can begin. Completing a single batch is prohibitively expensive if the state space is large, rendering VI impractical for many applications. Asynchronous VI helps to address the large state space problem by updating one state at a time, in-place and in an arbitrary order. However, Asynchronous VI still requires a maximization over the entire action space, making it impractical for domains with large action space. To address this issue, we propose doubly-asynchronous value iteration (DAVI), a new algorithm that generalizes the idea of asynchrony from states to states and actions. More concretely, DAVI maximizes over a sampled subset of actions that can be of any user-defined size. This simple approach of using sampling to reduce computation maintains similarly appealing theoretical properties to VI without the need to wait for a full sweep through the entire action space in each update. In this paper, we show DAVI converges to the optimal value function with probability one, converges at a near-geometric rate with probability 1-delta, and returns a near-optimal policy in computation time that nearly matches a previously established bound for VI. We also empirically demonstrate DAVI's effectiveness in several experiments.
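    As an illustration of the core idea in the abstract (a reading of the abstract only, not the paper's code; see the paper for the exact update and its convergence conditions), a tabular sketch of value iteration made asynchronous in both states and actions:

    ```python
    import numpy as np

    def davi_sketch(V, P, R, gamma, rng, k=4, n_updates=10_000):
        """Pick one state at a time (asynchronous in states) and back it up
        using the max over a random subset of k actions (asynchronous in
        actions). P: (S, A, S) transitions, R: (S, A) rewards, V: (S,) values."""
        S, A = R.shape
        for _ in range(n_updates):
            s = rng.integers(S)
            acts = rng.choice(A, size=min(k, A), replace=False)
            V[s] = (R[s, acts] + gamma * P[s, acts] @ V).max()  # partial-max backup
        return V
    ```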
    Multivariate rank via entropic optimal transport: sample efficiency and generative modeling. (arXiv:2111.00043v3 [stat.ML] UPDATED)
    The framework of optimal transport has been leveraged to extend the notion of rank to the multivariate setting while preserving desirable properties of the resulting goodness-of-fit (GoF) statistics. In particular, the rank energy (RE) and rank maximum mean discrepancy (RMMD) are distribution-free under the null, exhibit high power in statistical testing, and are robust to outliers. In this paper, we point to and alleviate some of the practical shortcomings of these proposed GoF statistics, namely their high computational cost, high statistical sample complexity, and lack of differentiability with respect to the data. We show that all these practically important issues are addressed by considering entropy-regularized optimal transport maps in place of the rank map, which we refer to as the soft rank. We consequently propose two new statistics, the soft rank energy (sRE) and soft rank maximum mean discrepancy (sRMMD), which exhibit several desirable properties. Given $n$ sample data points, we provide non-asymptotic convergence rates for the sample estimate of the entropic transport map to its population version that are essentially of the order $n^{-1/2}$ when the starting measure is subgaussian and the target measure has compact support. This result is novel compared to existing results which achieve a rate of $n^{-1}$ but crucially rely on both measures having compact support. We leverage this result to demonstrate fast convergence of sample sRE and sRMMD to their population version making them useful for high-dimensional GoF testing. Our statistics are differentiable and amenable to popular machine learning frameworks that rely on gradient methods. We leverage these properties towards showcasing the utility of the proposed statistics for generative modeling on two important problems: image generation and generating valid knockoffs for controlled feature selection.
    Local Spatiotemporal Representation Learning for Longitudinally-consistent Neuroimage Analysis. (arXiv:2206.04281v3 [cs.CV] UPDATED)
    Recent self-supervised advances in medical computer vision exploit global and local anatomical self-similarity for pretraining prior to downstream tasks such as segmentation. However, current methods assume i.i.d. image acquisition, which is invalid in clinical study designs where follow-up longitudinal scans track subject-specific temporal changes. Further, existing self-supervised methods for medically-relevant image-to-image architectures exploit only spatial or temporal self-similarity and only do so via a loss applied at a single image-scale, with naive multi-scale spatiotemporal extensions collapsing to degenerate solutions. To these ends, this paper makes two contributions: (1) It presents a local and multi-scale spatiotemporal representation learning method for image-to-image architectures trained on longitudinal images. It exploits the spatiotemporal self-similarity of learned multi-scale intra-subject features for pretraining and develops several feature-wise regularizations that avoid collapsed identity representations; (2) During finetuning, it proposes a surprisingly simple self-supervised segmentation consistency regularization to exploit intra-subject correlation. Benchmarked in the one-shot segmentation setting, the proposed framework outperforms both well-tuned randomly-initialized baselines and current self-supervised techniques designed for both i.i.d. and longitudinal datasets. These improvements are demonstrated across both longitudinal neurodegenerative adult MRI and developing infant brain MRI and yield both higher performance and longitudinal consistency.
    Continual Learning Beyond a Single Model. (arXiv:2202.09826v2 [cs.LG] UPDATED)
    A growing body of research in continual learning focuses on the catastrophic forgetting problem. While many attempts have been made to alleviate this problem, the majority of the methods assume a single model in the continual learning setup. In this work, we question this assumption and show that employing ensemble models can be a simple yet effective method to improve continual performance. However, ensembles' training and inference costs can increase significantly as the number of models grows. Motivated by this limitation, we study different ensemble models to understand their benefits and drawbacks in continual learning scenarios. Finally, to overcome the high compute cost of ensembles, we leverage recent advances in neural network subspace to propose a computationally cheap algorithm with similar runtime to a single model yet enjoying the performance benefits of ensembles.
    Knowledge Distillation for 6D Pose Estimation by Aligning Distributions of Local Predictions. (arXiv:2205.14971v2 [cs.CV] UPDATED)
    Knowledge distillation facilitates the training of a compact student network by using a deep teacher one. While this has achieved great success in many tasks, it remains completely unstudied for image-based 6D object pose estimation. In this work, we introduce the first knowledge distillation method driven by the 6D pose estimation task. To this end, we observe that most modern 6D pose estimation frameworks output local predictions, such as sparse 2D keypoints or dense representations, and that the compact student network typically struggles to predict such local quantities precisely. Therefore, instead of imposing prediction-to-prediction supervision from the teacher to the student, we propose to distill the teacher's \emph{distribution} of local predictions into the student network, facilitating its training. Our experiments on several benchmarks show that our distillation method yields state-of-the-art results with different compact student models and for both keypoint-based and dense prediction-based architectures.
    Cross-Lingual Transfer Learning for Statistical Type Inference. (arXiv:2107.00157v3 [cs.AI] UPDATED)
    Hitherto, statistical type inference systems have relied heavily on supervised learning approaches, which require laborious manual effort to collect and label large amounts of data. Most Turing-complete imperative languages share similar control- and data-flow structures, which make it possible to transfer knowledge learned from one language to another. In this paper, we propose a cross-lingual transfer learning framework, PLATO, for statistical type inference, which allows us to leverage prior knowledge learned from the labeled dataset of one language and transfer it to the others, e.g., Python to JavaScript, Java to JavaScript, etc. PLATO is powered by a novel kernelized attention mechanism to constrain the attention scope of the backbone Transformer model such that model is forced to base its prediction on commonly shared features among languages. In addition, we propose the syntax enhancement that augments the learning on the feature overlap among language domains. Furthermore, PLATO can also be used to improve the performance of the conventional supervised-based type inference by introducing cross-language augmentation, which enables the model to learn more general features across multiple languages. We evaluated PLATO under two settings: 1) under the cross-domain scenario that the target language data is not labeled or labeled partially, the results show that PLATO outperforms the state-of-the-art domain transfer techniques by a large margin, e.g., it improves the Python to TypeScript baseline by +14.6%@EM, +18.6%@weighted-F1, and 2) under the conventional monolingual supervised scenario, PLATO improves the Python baseline by +4.10%@EM, +1.90%@weighted-F1 with the introduction of the cross-lingual augmentation.
    Quantum Lazy Training. (arXiv:2202.08232v4 [quant-ph] UPDATED)
    In the training of over-parameterized model functions via gradient descent, sometimes the parameters do not change significantly and remain close to their initial values. This phenomenon is called lazy training, and motivates consideration of the linear approximation of the model function around the initial parameters. In the lazy regime, this linear approximation imitates the behavior of the parameterized function whose associated kernel, called the tangent kernel, specifies the training performance of the model. Lazy training is known to occur in the case of (classical) neural networks with large widths. In this paper, we show that the training of geometrically local parameterized quantum circuits enters the lazy regime for large numbers of qubits. More precisely, we prove bounds on the rate of changes of the parameters of such a geometrically local parameterized quantum circuit in the training process, and on the precision of the linear approximation of the associated quantum model function; both of these bounds tend to zero as the number of qubits grows. We support our analytic results with numerical simulations.
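    To make the lazy-training setup concrete, below is a minimal sketch (for a generic classical parameterized model, not the paper's quantum circuits) of the linearization around the initial parameters and the associated tangent kernel. The toy model `f` and the finite-difference Jacobian are purely illustrative.

```python
# Minimal sketch of lazy-training linearization and the tangent kernel for a
# generic parameterized model f(x, theta). Names and the toy model are
# illustrative assumptions, not the paper's quantum circuits.
import numpy as np

def f(x, theta):
    # toy two-layer model standing in for a parameterized model's scalar output
    w1 = theta[:4].reshape(2, 2)
    w2 = theta[4:6]
    return w2 @ np.tanh(w1 @ x)

def jacobian(x, theta, eps=1e-6):
    # finite-difference gradient d f(x, theta) / d theta
    g = np.zeros_like(theta)
    for i in range(theta.size):
        d = np.zeros_like(theta); d[i] = eps
        g[i] = (f(x, theta + d) - f(x, theta - d)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
theta0 = rng.normal(size=6)
x1, x2 = rng.normal(size=2), rng.normal(size=2)
J1, J2 = jacobian(x1, theta0), jacobian(x2, theta0)

# lazy (linear) approximation around theta0:
# f_lin(x, theta) = f(x, theta0) + <grad_theta f(x, theta0), theta - theta0>
def f_lin(x, J, theta):
    return f(x, theta0) + J @ (theta - theta0)

# tangent kernel entry: K(x1, x2) = <grad_theta f(x1), grad_theta f(x2)>
print("tangent kernel K(x1, x2) =", J1 @ J2)
theta = theta0 + 1e-3
print("lazy approximation error at a small step:",
      abs(f(x1, theta) - f_lin(x1, J1, theta)))
```

    In the lazy regime the error of `f_lin` stays small throughout training; the paper's bounds say both the parameter movement and this linearization error vanish as the number of qubits grows.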
    Visual Pre-training for Navigation: What Can We Learn from Noise?. (arXiv:2207.00052v2 [cs.CV] UPDATED)
    In visual navigation, one powerful paradigm is to predict actions from observations directly. Training such an end-to-end system allows representations that are useful for downstream tasks to emerge automatically. However, the lack of inductive bias makes this system data-hungry. We hypothesize that a sufficient representation of the current view and the goal view for a navigation policy can be learned by predicting the location and size of a crop of the current view that corresponds to the goal. We further show that training such random crop prediction in a self-supervised fashion purely on synthetic noise images transfers well to natural home images. The learned representation can then be bootstrapped to learn a navigation policy efficiently with little interaction data. The code is available at https://yanweiw.github.io/noise2ptz/
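    A hedged sketch of the pretext task this abstract describes: generate a noise image as the "current view", take a random crop as the "goal view", and regress the crop's location and size. Function and variable names are illustrative, not from the paper's code.

```python
# Sketch of the crop-prediction pretext task on synthetic noise images.
# All names here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def sample_pretext_example(h=64, w=64, min_size=16):
    img = rng.uniform(size=(h, w))            # pure-noise "current view"
    size = rng.integers(min_size, min(h, w))  # crop side length
    top = rng.integers(0, h - size)
    left = rng.integers(0, w - size)
    goal = img[top:top + size, left:left + size]
    # normalized regression target: crop center (cx, cy) and relative size
    target = np.array([(left + size / 2) / w,
                       (top + size / 2) / h,
                       size / min(h, w)])
    return img, goal, target

img, goal, target = sample_pretext_example()
print("regression target (cx, cy, rel_size):", np.round(target, 3))
# A small CNN taking (img, goal) and regressing `target` with an L2 loss
# could then be pre-trained purely on such noise pairs before any
# navigation data is collected.
```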
    Online Dynamics Learning for Predictive Control with an Application to Aerial Robots. (arXiv:2207.09344v2 [cs.RO] UPDATED)
    In this work, we consider the task of improving the accuracy of dynamic models for model predictive control (MPC) in an online setting. Although prediction models can be learned and applied to model-based controllers, these models are often learned offline. In this offline setting, training data is first collected and a prediction model is learned through an elaborated training procedure. However, since the model is learned offline, it does not adapt to disturbances or model errors observed during deployment. To improve the adaptiveness of the model and the controller, we propose an online dynamics learning framework that continually improves the accuracy of the dynamic model during deployment. We adopt knowledge-based neural ordinary differential equations (KNODE) as the dynamic models, and use techniques inspired by transfer learning to continually improve the model accuracy. We demonstrate the efficacy of our framework with a quadrotor, and verify the framework in both simulations and physical experiments. Results show that our approach can account for disturbances that are possibly time-varying, while maintaining good trajectory tracking performance.
    The Separation Capacity of Random Neural Networks. (arXiv:2108.00207v2 [cs.LG] UPDATED)
    Neural networks with random weights appear in a variety of machine learning applications, most prominently as the initialization of many deep learning algorithms and as a computationally cheap alternative to fully learned neural networks. In the present article, we enhance the theoretical understanding of random neural networks by addressing the following data separation problem: under what conditions can a random neural network make two classes $\mathcal{X}^-, \mathcal{X}^+ \subset \mathbb{R}^d$ (with positive distance) linearly separable? We show that a sufficiently large two-layer ReLU-network with standard Gaussian weights and uniformly distributed biases can solve this problem with high probability. Crucially, the number of required neurons is explicitly linked to geometric properties of the underlying sets $\mathcal{X}^-, \mathcal{X}^+$ and their mutual arrangement. This instance-specific viewpoint allows us to overcome the usual curse of dimensionality (exponential width of the layers) in non-pathological situations where the data carries low-complexity structure. We quantify the relevant structure of the data in terms of a novel notion of mutual complexity (based on a localized version of Gaussian mean width), which leads to sound and informative separation guarantees. We connect our result with related lines of work on approximation, memorization, and generalization.
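    The separation claim is easy to probe empirically. The sketch below builds two classes with positive distance that are not linearly separable in input space (concentric spheres), embeds them with one random ReLU layer using standard Gaussian weights and uniform biases as in the abstract, and fits a linear separator on the random features. Widths and bias ranges are arbitrary choices for illustration.

```python
# Empirical check: random ReLU features make concentric spheres linearly separable.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
d, n, width = 10, 200, 2000

# two classes with positive mutual distance, not linearly separable in R^d:
# points on concentric spheres of radius 1 and 2
U = rng.normal(size=(2 * n, d))
U /= np.linalg.norm(U, axis=1, keepdims=True)
X = np.vstack([U[:n], 2.0 * U[n:]])
y = np.array([0] * n + [1] * n)

W = rng.normal(size=(d, width))      # standard Gaussian weights
b = rng.uniform(-2, 2, size=width)   # uniformly distributed biases
phi = np.maximum(X @ W + b, 0.0)     # a single random ReLU layer

raw = LinearSVC(max_iter=20000).fit(X, y).score(X, y)
feat = LinearSVC(max_iter=20000).fit(phi, y).score(phi, y)
print(f"linear separator on raw inputs: {raw:.2f}, "
      f"on random ReLU features: {feat:.2f}")  # typically ~1.0 at large width
```

    The paper's contribution is the quantitative version of this picture: how large `width` must be, expressed through the mutual complexity of the two sets rather than the ambient dimension.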
    KSD Aggregated Goodness-of-fit Test. (arXiv:2202.00824v4 [stat.ML] UPDATED)
    We investigate properties of goodness-of-fit tests based on the Kernel Stein Discrepancy (KSD). We introduce a strategy to construct a test, called KSDAgg, which aggregates multiple tests with different kernels. KSDAgg avoids splitting the data to perform kernel selection (which leads to a loss in test power), and rather maximises the test power over a collection of kernels. We provide non-asymptotic guarantees on the power of KSDAgg: we show it achieves the smallest uniform separation rate of the collection, up to a logarithmic term. For compactly supported densities with bounded model score function, we derive the rate for KSDAgg over restricted Sobolev balls; this rate corresponds to the minimax optimal rate over unrestricted Sobolev balls, up to an iterated logarithmic term. KSDAgg can be computed exactly in practice as it relies either on a parametric bootstrap or on a wild bootstrap to estimate the quantiles and the level corrections. In particular, for the crucial choice of bandwidth of a fixed kernel, it avoids resorting to arbitrary heuristics (such as median or standard deviation) or to data splitting. We find on both synthetic and real-world data that KSDAgg outperforms other state-of-the-art quadratic-time adaptive KSD-based goodness-of-fit testing procedures.
    A posteriori learning for quasi-geostrophic turbulence parametrization. (arXiv:2204.03911v2 [physics.flu-dyn] UPDATED)
    The use of machine learning to build subgrid parametrizations for climate models is receiving growing attention. State-of-the-art strategies address the problem as a supervised learning task and optimize algorithms that predict subgrid fluxes based on information from coarse resolution models. In practice, training data are generated from higher resolution numerical simulations transformed in order to mimic coarse resolution simulations. In essence, these strategies optimize subgrid parametrizations to meet so-called $\textit{a priori}$ criteria. But the actual purpose of a subgrid parametrization is to obtain good performance in terms of $\textit{a posteriori}$ metrics, which imply computing entire model trajectories. In this paper, we focus on the representation of energy backscatter in two-dimensional quasi-geostrophic turbulence and compare parametrizations obtained with different learning strategies at fixed computational complexity. We show that strategies based on $\textit{a priori}$ criteria yield parametrizations that tend to be unstable in direct simulations and describe how subgrid parametrizations can alternatively be trained end-to-end in order to meet $\textit{a posteriori}$ criteria. We illustrate that end-to-end learning strategies yield parametrizations that outperform known empirical and data-driven schemes in terms of performance, stability and ability to apply to different flow configurations. These results support the relevance of differentiable programming paradigms for climate models in the future.
    Planted Dense Subgraphs in Dense Random Graphs Can Be Recovered using Graph-based Machine Learning. (arXiv:2201.01825v2 [cs.LG] UPDATED)
    Multiple methods of finding the vertices belonging to a planted dense subgraph in a random dense $G(n, p)$ graph have been proposed, with an emphasis on planted cliques. Such methods can identify the planted subgraph in polynomial time, but all are limited to a few specific subgraph structures. Here, we present PYGON, a graph neural network-based algorithm, which is insensitive to the structure of the planted subgraph. This is the first algorithm that uses advanced learning tools for recovering dense subgraphs. We show that PYGON can recover cliques of size $\Theta\left(\sqrt{n}\right)$, where $n$ is the size of the background graph, comparable with the state of the art. We also show that the same algorithm can recover multiple other planted subgraphs of size $\Theta\left(\sqrt{n}\right)$, in both directed and undirected graphs. We conjecture that no polynomial-time PAC-learning algorithm can detect planted dense subgraphs of size smaller than $O\left(\sqrt{n}\right)$, even if in principle one could find dense subgraphs of logarithmic size.
    Deep Attention-Based Supernovae Classification of Multi-Band Light-Curves. (arXiv:2201.08482v3 [astro-ph.IM] UPDATED)
    In astronomical surveys, such as the Zwicky Transient Facility, supernovae (SNe) are relatively uncommon objects compared to other classes of variable events. Along with this scarcity, the processing of multi-band light-curves is a challenging task due to the highly irregular cadence, long time gaps, missing values, few observations, etc. These issues are particularly detrimental to the analysis of transient events: SN-like light-curves. We offer three main contributions: 1) Based on temporal modulation and attention mechanisms, we propose a Deep attention model (TimeModAttn) to classify multi-band light-curves of different SN types, avoiding photometric or hand-crafted feature computations, missing-value assumptions, and explicit imputation/interpolation methods. 2) We propose a model for the synthetic generation of SN multi-band light-curves based on the Supernova Parametric Model, allowing us to increase the number of samples and the diversity of cadence. Thus, the TimeModAttn model is first pre-trained using synthetic light-curves. Then, a fine-tuning process is performed. The TimeModAttn model outperformed other Deep Learning models, based on Recurrent Neural Networks, in two scenarios: late-classification and early-classification. Also, the TimeModAttn model outperformed a Balanced Random Forest (BRF) classifier (trained with real data), increasing the balanced $F_1$ score from $\approx.525$ to $\approx.596$. When trained with synthetic data, the BRF achieved performance similar to the proposed TimeModAttn model while retaining additional advantages. 3) We conducted interpretability experiments. High attention scores were obtained for observations earlier than and close to the SN brightness peaks. This also correlated with early high variability in the learned temporal modulation.
    You Can't Count on Luck: Why Decision Transformers and RvS Fail in Stochastic Environments. (arXiv:2205.15967v2 [cs.LG] UPDATED)
    Recently, methods such as Decision Transformer that reduce reinforcement learning to a prediction task and solve it via supervised learning (RvS) have become popular due to their simplicity, robustness to hyperparameters, and strong overall performance on offline RL tasks. However, simply conditioning a probabilistic model on a desired return and taking the predicted action can fail dramatically in stochastic environments, since trajectories that result in a return may have only achieved that return due to luck. In this work, we describe the limitations of RvS approaches in stochastic environments and propose a solution. Rather than simply conditioning on the return of a single trajectory as is standard practice, our proposed method, ESPER, learns to cluster trajectories and conditions on average cluster returns, which are independent from environment stochasticity. Doing so allows ESPER to achieve strong alignment between target return and expected performance in real environments. We demonstrate this in several challenging stochastic offline-RL tasks, including the puzzle game 2048 and Connect Four played against a stochastic opponent. In all tested domains, ESPER achieves significantly better alignment between the target return and achieved return than simply conditioning on returns. ESPER also achieves higher maximum performance than even the value-based baselines.
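    A hedged sketch of the return-relabeling idea: cluster offline trajectories, then condition the RvS policy on each trajectory's cluster-average return rather than its own luck-contaminated return. Clustering raw trajectory features with k-means is a simplification for illustration; the paper learns the clustering so that membership is independent of environment chance.

```python
# Sketch of ESPER-style return relabeling on a toy offline dataset.
# The feature construction and k-means clustering are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_traj, horizon, k = 500, 20, 8

# toy offline dataset: trajectory feature = flattened state-action sequence
traj_features = rng.normal(size=(n_traj, horizon * 3))
returns = traj_features[:, 0] + rng.normal(scale=2.0, size=n_traj)  # noisy, "lucky"

clusters = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(traj_features)

# relabel each trajectory with its cluster's expected return
avg_return = np.array([returns[clusters == c].mean() for c in range(k)])
conditioning_targets = avg_return[clusters]

# An RvS policy would now be trained conditioned on `conditioning_targets`
# instead of `returns`, so targets reflect expected rather than lucky outcomes.
print("per-cluster average returns:", np.round(avg_return, 2))
```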
    Hessian Averaging in Stochastic Newton Methods Achieves Superlinear Convergence. (arXiv:2204.09266v2 [math.OC] UPDATED)
    We consider minimizing a smooth and strongly convex objective function using a stochastic Newton method. At each iteration, the algorithm is given an oracle access to a stochastic estimate of the Hessian matrix. The oracle model includes popular algorithms such as Subsampled Newton and Newton Sketch. Despite using second-order information, these existing methods do not exhibit superlinear convergence, unless the stochastic noise is gradually reduced to zero during the iteration, which would lead to a computational blow-up in the per-iteration cost. We propose to address this limitation with Hessian averaging: instead of using the most recent Hessian estimate, our algorithm maintains an average of all the past estimates. This reduces the stochastic noise while avoiding the computational blow-up. We show that this scheme exhibits local $Q$-superlinear convergence with a non-asymptotic rate of $(\Upsilon\sqrt{\log (t)/t}\,)^{t}$, where $\Upsilon$ is proportional to the level of stochastic noise in the Hessian oracle. A potential drawback of this (uniform averaging) approach is that the averaged estimates contain Hessian information from the global phase of the method, i.e., before the iterates converge to a local neighborhood. This leads to a distortion that may substantially delay the superlinear convergence until long after the local neighborhood is reached. To address this drawback, we study a number of weighted averaging schemes that assign larger weights to recent Hessians, so that the superlinear convergence arises sooner, albeit with a slightly slower rate. Remarkably, we show that there exists a universal weighted averaging scheme that transitions to local convergence at an optimal stage, and still exhibits a superlinear convergence rate nearly (up to a logarithmic factor) matching that of uniform Hessian averaging.
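    The uniform-averaging scheme is simple to state in code. Below is a minimal sketch on a strongly convex quadratic with a noisy Hessian oracle; the problem instance and noise level are illustrative, and the weighted variants discussed above would replace the running mean with decaying weights.

```python
# Stochastic Newton with uniform Hessian averaging on a quadratic objective.
import numpy as np

rng = np.random.default_rng(0)
d = 20
A = rng.normal(size=(d, d))
H_true = A @ A.T + np.eye(d)        # fixed Hessian of a strongly convex quadratic
x_star = rng.normal(size=d)

def grad(x):                        # exact gradient of 0.5 (x-x*)^T H (x-x*)
    return H_true @ (x - x_star)

def noisy_hessian():                # symmetric stochastic Hessian oracle
    E = rng.normal(size=(d, d))
    return H_true + 0.1 * (E + E.T) / 2

x = np.zeros(d)
H_avg = np.zeros((d, d))
for t in range(1, 51):
    H_avg += (noisy_hessian() - H_avg) / t   # running average of all past estimates
    x = x - np.linalg.solve(H_avg, grad(x))
    if t % 10 == 0:
        print(f"iter {t:2d}  error {np.linalg.norm(x - x_star):.2e}")
# The averaged Hessian's noise shrinks like 1/sqrt(t), which is what drives the
# superlinear rate (Upsilon * sqrt(log t / t))^t in the analysis above.
```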
    A Unified Approach to Reinforcement Learning, Quantal Response Equilibria, and Two-Player Zero-Sum Games. (arXiv:2206.05825v2 [cs.LG] UPDATED)
    Algorithms designed for single-agent reinforcement learning (RL) generally fail to converge to equilibria in two-player zero-sum (2p0s) games. On the other hand, game-theoretic algorithms for approximating Nash and regularized equilibria in 2p0s games are not typically competitive for RL and can be difficult to scale. As a result, algorithms for these two cases are generally developed and evaluated separately. In this work, we show that a single algorithm can produce strong results in both settings, despite their fundamental differences. This algorithm, which we call magnetic mirror descent (MMD), is a simple extension to mirror descent and a special case of a non-Euclidean proximal gradient algorithm. From a theoretical standpoint, we prove a novel linear convergence for this non-Euclidean proximal gradient algorithm for a class of variational inequality problems. It follows from this result that MMD converges linearly to quantal response equilibria (i.e., entropy regularized Nash equilibria) in extensive-form games; this is the first time linear convergence has been proven for a first-order solver. Moreover, applied as a tabular Nash equilibrium solver via self-play, we show empirically that MMD produces results competitive with CFR; this is the first time that a standard RL algorithm has done so. Furthermore, for single-agent deep RL, on a small collection of Atari and Mujoco tasks, we show that MMD can produce results competitive with those of PPO. Lastly, for multi-agent deep RL, we show MMD can outperform NFSP in 3x3 Abrupt Dark Hex.
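    A hedged sketch of a magnetic-mirror-descent-style update on the simplex, applied in self-play to rock-paper-scissors. With the negative-entropy mirror map and KL proximity to both the "magnet" and the last iterate, the update has the closed form below (derived from the first-order conditions); step sizes are illustrative, not the paper's.

```python
# MMD-style self-play on rock-paper-scissors; the uniform magnet targets a QRE.
import numpy as np

G = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]], float)  # row player's payoff

def mmd_step(x, g, magnet, eta=0.1, alpha=0.1):
    # argmax_x  eta<g,x> - eta*alpha*KL(x||magnet) - KL(x||x_t)  on the simplex
    logits = (eta * g + np.log(x) + eta * alpha * np.log(magnet)) / (1 + eta * alpha)
    z = np.exp(logits - logits.max())
    return z / z.sum()

x = np.ones(3) / 3
y = np.ones(3) / 3
magnet = np.ones(3) / 3            # uniform magnet
for t in range(2000):
    gx, gy = G @ y, -G.T @ x       # each player's payoff gradient
    x, y = mmd_step(x, gx, magnet), mmd_step(y, gy, magnet)

print("row strategy:", np.round(x, 3))  # -> near uniform, the QRE/Nash here
```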
    A Kernel Perspective of Skip Connections in Convolutional Networks. (arXiv:2211.14810v1 [cs.LG])
    Over-parameterized residual networks (ResNets) are amongst the most successful convolutional neural architectures for image processing. Here we study their properties through their Gaussian Process and Neural Tangent kernels. We derive explicit formulas for these kernels, analyze their spectra, and provide bounds on their implied condition numbers. Our results indicate that (1) with ReLU activation, the eigenvalues of these residual kernels decay polynomially at a similar rate compared to the same kernels when skip connections are not used, thus maintaining a similar frequency bias; (2) however, residual kernels are more locally biased. Our analysis further shows that the matrices obtained from these residual kernels have more favorable condition numbers at finite depths than those obtained without skip connections, therefore enabling faster convergence of training with gradient descent.
    Neural Circuit Architectural Priors for Embodied Control. (arXiv:2201.05242v2 [cs.LG] UPDATED)
    Artificial neural networks for motor control usually adopt generic architectures like fully connected MLPs. While general, these tabula rasa architectures rely on large amounts of experience to learn, are not easily transferable to new bodies, and have internal dynamics that are difficult to interpret. In nature, animals are born with highly structured connectivity in their nervous systems shaped by evolution; this innate circuitry acts synergistically with learning mechanisms to provide inductive biases that enable most animals to function well soon after birth and learn efficiently. Convolutional networks inspired by visual circuitry have encoded useful biases for vision. However, the extent to which ANN architectures inspired by neural circuitry can yield useful biases for other AI domains is unknown. In this work, we ask what advantages biologically inspired ANN architecture can provide in the domain of motor control. Specifically, we translate C. elegans locomotion circuits into an ANN model controlling a simulated Swimmer agent. On a locomotion task, our architecture achieves good initial performance and asymptotic performance comparable with MLPs, while dramatically improving data efficiency and requiring orders of magnitude fewer parameters. Our architecture is interpretable and transfers to new body designs. An ablation analysis shows that constrained excitation/inhibition is crucial for learning, while weight initialization contributes to good initial performance. Our work demonstrates several advantages of biologically inspired ANN architecture and encourages future work in more complex embodied control.
    Curious Exploration via Structured World Models Yields Zero-Shot Object Manipulation. (arXiv:2206.11403v2 [cs.LG] UPDATED)
    It has been a long-standing dream to design artificial agents that explore their environment efficiently via intrinsic motivation, similar to how children perform curious free play. Despite recent advances in intrinsically motivated reinforcement learning (RL), sample-efficient exploration in object manipulation scenarios remains a significant challenge as most of the relevant information lies in the sparse agent-object and object-object interactions. In this paper, we propose to use structured world models to incorporate relational inductive biases in the control loop to achieve sample-efficient and interaction-rich exploration in compositional multi-object environments. By planning for future novelty inside structured world models, our method generates free-play behavior that starts to interact with objects early on and develops more complex behavior over time. Instead of using models only to compute intrinsic rewards, as commonly done, our method showcases that the self-reinforcing cycle between good models and good exploration also opens up another avenue: zero-shot generalization to downstream tasks via model-based planning. After the entirely intrinsic task-agnostic exploration phase, our method solves challenging downstream tasks such as stacking, flipping, pick & place, and throwing, and generalizes to unseen numbers and arrangements of objects without any additional training.
    Learning Task-Aware Energy Disaggregation: a Federated Approach. (arXiv:2204.06767v2 [cs.LG] UPDATED)
    We consider the problem of learning energy disaggregation signals for residential load data. This task is referred to as non-intrusive load monitoring (NILM): in order to find individual devices' power consumption profiles from aggregated meter measurements, a machine learning model is usually trained on a large amount of data coming from a number of residential homes. Yet collecting such residential load datasets requires both substantial effort and customers' approval to share metering data, while load data coming from different regions or electricity users may exhibit heterogeneous usage patterns. Both practical concerns make training a single, centralized NILM model challenging. In this paper, we propose a decentralized and task-adaptive learning scheme for NILM tasks, in which nested meta learning and federated learning steps are designed to learn task-specific models collectively. Simulation results on a benchmark dataset validate the proposed algorithm's performance in efficiently inferring appliance-level consumption for a variety of homes and appliances.
    Combating Unknown Bias with Effective Bias-Conflicting Scoring and Gradient Alignment. (arXiv:2111.13108v2 [cs.LG] UPDATED)
    Models notoriously suffer from dataset biases which are detrimental to robustness and generalization. The identify-emphasize paradigm shows a promising effect in dealing with unknown biases. However, we find that it is still plagued by two challenges: A, the quality of the identified bias-conflicting samples is far from satisfactory; B, the emphasizing strategies just yield suboptimal performance. In this work, for challenge A, we propose an effective bias-conflicting scoring method to boost the identification accuracy with two practical strategies -- peer-picking and epoch-ensemble. For challenge B, we point out that the gradient contribution statistics can be a reliable indicator to inspect whether the optimization is dominated by bias-aligned samples. Then, we propose gradient alignment, which employs gradient statistics to balance the contributions of the mined bias-aligned and bias-conflicting samples dynamically throughout the learning process, forcing models to leverage intrinsic features to make fair decisions. Experiments are conducted on multiple datasets in various settings, demonstrating that the proposed solution can alleviate the impact of unknown biases and achieve state-of-the-art performance.
    AdaTerm: Adaptive T-Distribution Estimated Robust Moments towards Noise-Robust Stochastic Gradient Optimizer. (arXiv:2201.06714v2 [cs.LG] UPDATED)
    With deep learning applications becoming more practical, practitioners are inevitably faced with datasets corrupted by a variety of noise, such as measurement errors, mislabeling and estimated surrogate inputs/outputs, which can have negative impacts on the optimization results. As a safety net, it is natural to improve the robustness to noise of the optimization algorithm which updates the network parameters in the final process of learning. Previous works revealed that the first momentum used in Adam-like stochastic gradient descent optimizers can be modified based on the Student's t-distribution to produce updates robust to noise. In this paper, we propose AdaTerm, which derives not only the first momentum but all of the involved statistics based on the Student's t-distribution, providing for the first time a unified treatment of the optimization process under the t-distribution statistical model. When the computed gradients statistically appear to be aberrant, AdaTerm excludes them from the update and reinforces its robustness for subsequent updates; otherwise, it updates the network parameters as usual and relaxes its robustness for the following updates. With this noise-adaptive behavior, AdaTerm's learning performance was confirmed on typical optimization problems across several cases where the noise ratio differs and/or is unknown. In addition, we prove a new general technique for deriving a theoretical regret bound without AMSGrad.
    Tight Lower Bounds on Worst-Case Guarantees for Zero-Shot Learning with Attributes. (arXiv:2205.13068v2 [cs.LG] UPDATED)
    We develop a rigorous mathematical analysis of zero-shot learning with attributes. In this setting, the goal is to label novel classes with no training data, only detectors for attributes and a description of how those attributes are correlated with the target classes, called the class-attribute matrix. We develop the first non-trivial lower bound on the worst-case error of the best map from attributes to classes for this setting, even with perfect attribute detectors. The lower bound characterizes the theoretical intrinsic difficulty of the zero-shot problem based on the available information -- the class-attribute matrix -- and the bound is practically computable from it. Our lower bound is tight, as we show that we can always find a randomized map from attributes to classes whose expected error is upper bounded by the value of the lower bound. We show that our analysis can be predictive of how standard zero-shot methods behave in practice, including which classes will likely be confused with others.
    Heterogeneous Treatment Effect Estimation using machine learning for Healthcare application: tutorial and benchmark. (arXiv:2109.12769v4 [cs.LG] UPDATED)
    Developing new drugs for target diseases is a time-consuming and expensive task, so drug repurposing has become a popular topic in the drug development field. As more health claims data become available, many studies have been conducted on such data. Real-world data are noisy and sparse and have many confounding factors. In addition, many studies have shown that drug effects are heterogeneous across the population. Many advanced machine learning models for estimating heterogeneous treatment effects (HTE) have emerged in recent years and have been applied in the econometrics and machine learning communities. These studies acknowledge medicine and drug development as the main application area, but there has been limited translational research from HTE methodology to drug development. We aim to introduce the HTE methodology to the healthcare area and provide feasibility considerations for translating the methodology, with benchmark experiments on healthcare administrative claims data. We also use the benchmark experiments to show how to interpret and evaluate the model when it is applied to healthcare research. By introducing recent HTE techniques to a broad readership in the biomedical informatics community, we expect to promote the wide adoption of causal inference using machine learning and to demonstrate the feasibility of HTE for personalized drug effectiveness.
    Data synthesis and adversarial networks: A review and meta-analysis in cancer imaging. (arXiv:2107.09543v2 [eess.IV] UPDATED)
    Despite technological and medical advances, the detection, interpretation, and treatment of cancer based on imaging data continue to pose significant challenges. These include inter-observer variability, class imbalance, dataset shifts, inter- and intra-tumour heterogeneity, malignancy determination, and treatment effect uncertainty. Given the recent advancements in Generative Adversarial Networks (GANs), data synthesis, and adversarial training, we assess the potential of these technologies to address a number of key challenges of cancer imaging. We categorise these challenges into (a) data scarcity and imbalance, (b) data access and privacy, (c) data annotation and segmentation, (d) cancer detection and diagnosis, and (e) tumour profiling, treatment planning and monitoring. Based on our analysis of 164 publications that apply adversarial training techniques in the context of cancer imaging, we highlight multiple underexplored solutions with research potential. We further contribute the Synthesis Study Trustworthiness Test (SynTRUST), a meta-analysis framework for assessing the validation rigour of medical image synthesis studies. SynTRUST is based on 26 concrete measures of thoroughness, reproducibility, usefulness, scalability, and tenability. Based on SynTRUST, we analyse 16 of the most promising cancer imaging challenge solutions and observe a high validation rigour in general, but also several desirable improvements. With this work, we strive to bridge the gap between the needs of the clinical cancer imaging community and the current and prospective research on data synthesis and adversarial networks in the artificial intelligence community.
    Accelerating Fully Connected Neural Network on Optical Network-on-Chip (ONoC). (arXiv:2109.14878v1 [cs.DC] CROSS LISTED)
    Fully Connected Neural Network (FCNN) is a class of Artificial Neural Networks widely used in computer science and engineering, whereas the training process can take a long time with large datasets in existing many-core systems. Optical Network-on-Chip (ONoC), an emerging chip-scale optical interconnection technology, has great potential to accelerate the training of FCNN with low transmission delay, low power consumption, and high throughput. However, existing methods based on Electrical Network-on-Chip (ENoC) cannot fit in ONoC because of the unique properties of ONoC. In this paper, we propose a fine-grained parallel computing model for accelerating FCNN training on ONoC and derive the optimal number of cores for each execution stage with the objective of minimizing the total amount of time to complete one epoch of FCNN training. To allocate the optimal number of cores for each execution stage, we present three mapping strategies and compare their advantages and disadvantages in terms of hotspot level, memory requirement, and state transitions. Simulation results show that the average prediction error for the optimal number of cores in NN benchmarks is within 2.3%. We further carry out extensive simulations which demonstrate that FCNN training time can be reduced by 22.28% and 4.91% on average using our proposed scheme, compared with traditional parallel computing methods that either allocate a fixed number of cores or allocate as many cores as possible, respectively. Compared with ENoC, simulation results show that under batch sizes of 64 and 128, on average ONoC can achieve 21.02% and 12.95% on reducing training time with 47.85% and 39.27% on saving energy, respectively.
    Controlled Gaussian Process Dynamical Models with Application to Robotic Cloth Manipulation. (arXiv:2103.06615v3 [cs.RO] UPDATED)
    Over the last years, robotic cloth manipulation has gained relevance within the research community. While significant advances have been made in robotic manipulation of rigid objects, the manipulation of non-rigid objects such as cloth garments is still a challenging problem. The uncertainty on how cloth behaves often requires the use of model-based approaches. However, cloth models have a very high dimensionality. Therefore, it is difficult to find a middle point between providing a manipulator with a dynamics model of cloth and working with a state space of tractable dimensionality. For this reason, most cloth manipulation approaches in the literature perform static or quasi-static manipulation. In this paper, we propose a variation of Gaussian Process Dynamical Models (GPDMs) to model cloth dynamics in a low-dimensional manifold. GPDMs project a high-dimensional state space into a smaller-dimensional latent space which is capable of keeping the dynamic properties. Using this approach, we add control variables to the original formulation. In this way, it is possible to take into account the robot commands exerted on the cloth dynamics. We call this new version Controlled Gaussian Process Dynamical Model (CGPDM). Moreover, we propose an alternative parametric structure for the model that is richer than the one employed in previous GPDM realizations. The modeling capacity of our proposal has been tested in both a simulated and a real scenario, where CGPDM proved to be capable of generalizing over a wide range of movements and correctly predicting the cloth motions obtained by previously unseen sequences of control actions.
    Differentiable Meta Multigraph Search with Partial Message Propagation on Heterogeneous Information Networks. (arXiv:2211.14752v1 [cs.LG])
    Heterogeneous information networks (HINs) are widely employed for describing real-world data with intricate entities and relationships. To automatically utilize their semantic information, graph neural architecture search has recently been developed for various tasks on HINs. Existing methods, however, suffer from instability and inflexibility. To address these issues, we propose a novel method called Partial Message Meta Multigraph search (PMMM) to automatically optimize the neural architecture design on HINs. Specifically, to learn how graph neural networks (GNNs) propagate messages along various types of edges, PMMM adopts an efficient differentiable framework to search for a meaningful meta multigraph, which can capture more flexible and complex semantic relations than a meta graph. The differentiable search typically suffers from performance instability, so we further propose a stable algorithm called partial message search to ensure that the searched meta multigraph consistently surpasses the manually designed meta-structures, i.e., meta-paths. Extensive experiments on six benchmark datasets over two representative tasks, including node classification and recommendation, demonstrate the effectiveness of the proposed method. Our approach outperforms the state-of-the-art heterogeneous GNNs, finds meaningful meta multigraphs, and is significantly more stable.
    ABC-FL: Anomalous and Benign client Classification in Federated Learning. (arXiv:2108.04551v3 [cs.LG] UPDATED)
    Federated Learning is a distributed machine learning framework designed for data privacy preservation, i.e., local data remain private throughout the entire training and testing procedure. Federated Learning is gaining popularity because it allows one to use machine learning techniques while preserving privacy. However, it inherits the vulnerabilities and susceptibilities of deep learning techniques. For instance, Federated Learning is particularly vulnerable to data poisoning attacks that may deteriorate its performance and integrity due to its distributed nature and inaccessibility to the raw data. In addition, it is extremely difficult to correctly identify malicious clients due to non-Independently and/or Identically Distributed (non-IID) data. Real-world data can be complex and diverse, making it hard to distinguish from malicious data without direct access to the raw data. Prior research has focused on detecting malicious clients while treating only the clients having IID data as benign. In this study, we propose a method that detects and classifies anomalous clients from benign clients when benign ones have non-IID data. Our proposed method leverages feature dimension reduction, dynamic clustering, and cosine similarity-based clipping. The experimental results validate that our proposed method not only classifies the malicious clients but also mitigates their negative influence on the entire procedure. Our findings may be used in future studies to effectively eliminate anomalous clients when building a model with diverse data.
    Multi-Objective Loss Balancing for Physics-Informed Deep Learning. (arXiv:2110.09813v2 [cs.LG] UPDATED)
    Physics-Informed Neural Networks (PINN) are algorithms from deep learning leveraging physical laws by including partial differential equations together with a respective set of boundary and initial conditions as penalty terms in their loss function. In this work, we observe the significant role of correctly weighting the combination of multiple competitive loss functions for training PINNs effectively. To this end, we implement and evaluate different methods aiming at balancing the contributions of multiple terms of the PINN loss function and their gradients. After reviewing three existing loss scaling approaches (Learning Rate Annealing, GradNorm and SoftAdapt), we propose a novel self-adaptive loss balancing scheme for PINNs named \emph{ReLoBRaLo} (Relative Loss Balancing with Random Lookback). We extensively evaluate the performance of the aforementioned balancing schemes by solving both forward as well as inverse problems on three benchmark PDEs for PINNs: Burgers' equation, Kirchhoff's plate bending equation and Helmholtz's equation. The results show that ReLoBRaLo is able to consistently outperform the baseline of existing scaling methods in terms of accuracy, while also inducing significantly less computational overhead.
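    A hedged sketch of what a ReLoBRaLo-style balancing step could look like, under one plausible reading of the scheme: weights come from a softmax over each loss's progress relative to a reference step, blended with an exponential moving average, with a Bernoulli "random lookback" that occasionally resets the reference to the initial losses. The exact formula and constants in the paper may differ; everything below is an illustrative assumption.

```python
# Illustrative ReLoBRaLo-style loss balancing (not the paper's exact formula).
import numpy as np

rng = np.random.default_rng(0)

def relobralo_weights(losses_t, losses_prev, losses_0, lam_prev,
                      alpha=0.9, tau=1.0, rho_p=0.99):
    m = len(losses_t)
    def bal(ref):  # softmax of loss ratios relative to a reference step
        r = np.array(losses_t) / (tau * np.array(ref) + 1e-12)
        e = np.exp(r - r.max())
        return m * e / e.sum()
    rho = rng.random() < rho_p                 # random lookback draw
    hist = rho * lam_prev + (1 - rho) * bal(losses_0)
    return alpha * hist + (1 - alpha) * bal(losses_prev)

# usage inside a PINN training loop (losses as floats detached from the graph):
lam = np.ones(3)
losses_0 = [1.0, 5.0, 0.5]                     # e.g. PDE, boundary, initial terms
losses_prev, losses_t = [0.9, 4.0, 0.5], [0.8, 3.9, 0.45]
lam = relobralo_weights(losses_t, losses_prev, losses_0, lam)
print("balanced weights:", np.round(lam, 3))   # total loss = sum(lam_i * L_i)
```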
    Memory-efficient array redistribution through portable collective communication. (arXiv:2112.01075v2 [cs.DC] UPDATED)
    Modern large-scale deep learning workloads highlight the need for parallel execution across many devices in order to fit model data into hardware accelerator memories. In these settings, array redistribution may be required during a computation, but can also become a bottleneck if not done efficiently. In this paper we address the problem of redistributing multi-dimensional array data in SPMD computations, the most prevalent form of parallelism in deep learning. We present a type-directed approach to synthesizing array redistributions as sequences of MPI-style collective operations. We prove formally that our synthesized redistributions are memory-efficient and perform no excessive data transfers. Array redistribution for SPMD computations using collective operations has also been implemented in the context of the XLA SPMD partitioner, a production-grade tool for partitioning programs across accelerator systems. We evaluate our approach against the XLA implementation and find that our approach delivers a geometric mean speedup of $1.22\times$, with maximum speedups as high as $5.7\times$, while offering provable memory guarantees, making our system particularly appealing for large-scale models.
    Local Explanations for Reinforcement Learning. (arXiv:2202.03597v2 [cs.LG] UPDATED)
    Many works in explainable AI have focused on explaining black-box classification models. Explaining deep reinforcement learning (RL) policies in a manner that could be understood by domain users has received much less attention. In this paper, we propose a novel perspective to understanding RL policies based on identifying important states from automatically learned meta-states. The key conceptual difference between our approach and many previous ones is that we form meta-states based on locality governed by the expert policy dynamics rather than based on similarity of actions, and that we do not assume any particular knowledge of the underlying topology of the state space. Theoretically, we show that our algorithm to find meta-states converges and the objective that selects important states from each meta-state is submodular leading to efficient high quality greedy selection. Experiments on four domains (four rooms, door-key, minipacman, and pong) and a carefully conducted user study illustrate that our perspective leads to better understanding of the policy. We conjecture that this is a result of our meta-states being more intuitive in that the corresponding important states are strong indicators of tractable intermediate goals that are easier for humans to interpret and follow.
    A Physics-informed Diffusion Model for High-fidelity Flow Field Reconstruction. (arXiv:2211.14680v1 [cs.LG])
    Machine learning models are gaining increasing popularity in the domain of fluid dynamics for their potential to accelerate the production of high-fidelity computational fluid dynamics data. However, many recently proposed machine learning models for high-fidelity data reconstruction require low-fidelity data for model training. This requirement restricts the applicability of these models, since their reconstruction accuracy drops significantly when the low-fidelity input data at test time deviate substantially from the training data. To overcome this restriction, we propose a diffusion model which only uses high-fidelity data at training. With different configurations, our model is able to reconstruct high-fidelity data from either a regular low-fidelity sample or a sparsely measured sample, and is also able to gain an accuracy increase by using physics-informed conditioning information from a known partial differential equation when that is available. Experimental results demonstrate that our model can produce accurate reconstruction results for 2D turbulent flows based on different input sources without retraining.
    Data-free Backdoor Removal based on Channel Lipschitzness. (arXiv:2208.03111v2 [cs.LG] UPDATED)
    Recent studies have shown that Deep Neural Networks (DNNs) are vulnerable to backdoor attacks, which lead to malicious behaviors of DNNs when specific triggers are attached to the input images. It was further demonstrated that the infected DNNs possess a collection of channels that are more sensitive to the backdoor triggers than normal channels. Pruning these channels was then shown to be effective in mitigating the backdoor behaviors. To locate those channels, it is natural to consider their Lipschitzness, which measures their sensitivity against worst-case perturbations of the inputs. In this work, we introduce a novel concept called the Channel Lipschitz Constant (CLC), defined as the Lipschitz constant of the mapping from the input images to the output of each channel. We then provide empirical evidence of the strong correlation between an upper bound of the CLC (UCLC) and the trigger-activated change in channel activation. Since the UCLC can be calculated directly from the weight matrices, we can detect potential backdoor channels in a data-free manner and perform simple pruning on the infected DNN to repair the model. The proposed Channel Lipschitzness based Pruning (CLP) method is super fast, simple, data-free and robust to the choice of the pruning threshold. Extensive experiments are conducted to evaluate the efficiency and effectiveness of CLP, which achieves state-of-the-art results among the mainstream defense methods even without any data. Source code is available at https://github.com/rkteddy/channel-Lipschitzness-based-pruning.
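    Since the bound comes straight from the weights, the detection step is easy to sketch in a data-free way: upper-bound each output channel's Lipschitz constant by the spectral norm of its reshaped kernel and flag outliers with a mean-plus-u-standard-deviations rule. Operating on a raw weight array keeps the sketch framework-agnostic; the planted outlier channel is a contrived stand-in for a backdoor-sensitive channel.

```python
# Data-free sketch of channel-Lipschitz-based backdoor channel detection.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 32, 3, 3))   # conv weight: (out_c, in_c, k, k)
W[7] *= 8.0                           # plant one "backdoor-like" outlier channel

# UCLC proxy per output channel: largest singular value of the flattened kernel
uclc = np.array([
    np.linalg.svd(W[c].reshape(W.shape[1], -1), compute_uv=False)[0]
    for c in range(W.shape[0])
])

u = 3.0                               # single pruning-threshold hyperparameter
mask = uclc > uclc.mean() + u * uclc.std()
print("channels flagged for pruning:", np.where(mask)[0])
W[mask] = 0.0                         # "prune" by zeroing the flagged channels
```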
    Boundary Graph Neural Networks for 3D Simulations. (arXiv:2106.11299v4 [cs.LG] UPDATED)
    The abundance of data has given machine learning considerable momentum in natural sciences and engineering, though modeling of physical processes is often difficult. A particularly tough problem is the efficient representation of geometric boundaries. Triangularized geometric boundaries are well understood and ubiquitous in engineering applications. However, it is notoriously difficult to integrate them into machine learning approaches due to their heterogeneity with respect to size and orientation. In this work, we introduce an effective theory to model particle-boundary interactions, which leads to our new Boundary Graph Neural Networks (BGNNs) that dynamically modify graph structures to obey boundary conditions. The new BGNNs are tested on complex 3D granular flow processes of hoppers, rotating drums and mixers, which are all standard components of modern industrial machinery but still have complicated geometry. BGNNs are evaluated in terms of computational efficiency as well as prediction accuracy of particle flows and mixing entropies. BGNNs are able to accurately reproduce 3D granular flows within simulation uncertainties over hundreds of thousands of simulation timesteps. Most notably, in our experiments, particles stay within the geometric objects without using handcrafted conditions or restrictions.
    Unsupervised Wildfire Change Detection based on Contrastive Learning. (arXiv:2211.14654v1 [cs.CV])
    The accurate characterization of the severity of a wildfire event strongly contributes to the characterization of the fuel conditions in fire-prone areas, and provides valuable information for disaster response. The aim of this study is to develop an autonomous system built on top of high-resolution multispectral satellite imagery, with an advanced deep learning method for detecting burned area change. This work proposes an initial exploration of using an unsupervised model for feature extraction in wildfire scenarios. It is based on the contrastive learning technique SimCLR, which is trained to minimize the cosine distance between augmentations of images. The distance between encoded images can also be used for change detection. We propose changes to this method that allow it to be used for unsupervised burned area detection and subsequent downstream tasks. We show that our proposed method outperforms the tested baseline approaches.
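    A hedged sketch of the change-detection step this describes: embed pre- and post-event tiles with a contrastively pre-trained encoder and score change as cosine distance between the embeddings. The random-projection `encoder` below is a toy stand-in for the SimCLR-trained network; all names are illustrative.

```python
# Cosine-distance change scoring with a placeholder encoder.
import numpy as np

def cosine_change_score(encoder, tile_before, tile_after):
    za, zb = encoder(tile_before), encoder(tile_after)
    za, zb = za / np.linalg.norm(za), zb / np.linalg.norm(zb)
    return 1.0 - float(za @ zb)    # 0 = unchanged, up to 2 = maximally changed

# toy stand-in encoder: random projection of flattened multispectral tiles
rng = np.random.default_rng(0)
P = rng.normal(size=(32 * 32 * 4, 128))
encoder = lambda tile: tile.reshape(-1) @ P

before = rng.uniform(size=(32, 32, 4))
after = before + 0.5 * rng.uniform(size=(32, 32, 4))   # simulated burn scar
print("change score:", round(cosine_change_score(encoder, before, after), 3))
# Thresholding such scores over a tile grid yields a burned-area change mask.
```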
    On the convex hull of convex quadratic optimization problems with indicators. (arXiv:2201.00387v2 [math.OC] UPDATED)
    We consider the convex quadratic optimization problem with indicator variables and arbitrary constraints on the indicators. We show that a convex hull description of the associated mixed-integer set in an extended space with a quadratic number of additional variables consists of a single positive semidefinite constraint (explicitly stated) and linear constraints. In particular, convexification of this class of problems reduces to describing a polyhedral set in an extended formulation. While the vertex representation of this polyhedral set is exponential and an explicit linear inequality description may not be readily available in general, we derive a compact mixed-integer linear formulation whose solutions coincide with the vertices of the polyhedral set. We also give descriptions in the original space of variables: we provide a description based on an infinite number of conic-quadratic inequalities, which are ``finitely generated." In particular, it is possible to characterize whether a given inequality is necessary to describe the convex hull. The new theory presented here unifies several previously established results, and paves the way toward utilizing polyhedral methods to analyze the convex hull of mixed-integer nonlinear sets.
    Latent SHAP: Toward Practical Human-Interpretable Explanations. (arXiv:2211.14797v1 [cs.LG])
    Model agnostic feature attribution algorithms (such as SHAP and LIME) are ubiquitous techniques for explaining the decisions of complex classification models, such as deep neural networks. However, since complex classification models produce superior performance when trained on low-level (or encoded) features, in many cases, the explanations generated by these algorithms are neither interpretable nor usable by humans. Methods proposed in recent studies that support the generation of human-interpretable explanations are impractical, because they require a fully invertible transformation function that maps the model's input features to the human-interpretable features. In this work, we introduce Latent SHAP, a black-box feature attribution framework that provides human-interpretable explanations, without the requirement for a fully invertible transformation function. We demonstrate Latent SHAP's effectiveness using (1) a controlled experiment where invertible transformation functions are available, which enables robust quantitative evaluation of our method, and (2) celebrity attractiveness classification (using the CelebA dataset) where invertible transformation functions are not available, which enables thorough qualitative evaluation of our method.
    Neural Networks as Paths through the Space of Representations. (arXiv:2206.10999v2 [cs.LG] UPDATED)
    Deep neural networks implement a sequence of layer-by-layer operations that are each relatively easy to understand, but the resulting overall computation is generally difficult to understand. We consider a simple hypothesis for interpreting the layer-by-layer construction of useful representations: perhaps the role of each layer is to reformat information to reduce the "distance" to the desired outputs. With this framework, the layer-wise computation implemented by a deep neural network can be viewed as a path through a high-dimensional representation space. We formalize this intuitive idea of a "path" by leveraging recent advances in *metric* representational similarity. We extend existing representational distance methods by computing geodesics, angles, and projections of representations, going beyond mere layer distances. We then demonstrate these tools by visualizing and comparing the paths taken by ResNet and VGG architectures on CIFAR-10. We conclude by sketching additional ways that this kind of representational geometry can be used to understand and interpret network training, and to describe novel kinds of similarities between different models.
    Neural Network Verification as Piecewise Linear Optimization: Formulations for the Composition of Staircase Functions. (arXiv:2211.14706v1 [cs.LG])
    We present a technique for neural network verification using mixed-integer programming (MIP) formulations. We derive a \emph{strong formulation} for each neuron in a network using piecewise linear activation functions. Additionally, since these formulations may in general require an exponential number of inequalities, we also derive a separation procedure that runs in super-linear time in the input dimension. We first introduce and develop our technique on the class of \emph{staircase} functions, which generalizes the ReLU, binarized, and quantized activation functions. We then use the results for staircase activation functions to obtain a separation method for general piecewise linear activation functions. Empirically, using our strong formulation and separation technique, we can reduce the computational time in exact verification settings based on MIP and improve the false negative rate for inexact verifiers relying on the relaxation of the MIP formulation.
    Mean-Shifted Contrastive Loss for Anomaly Detection. (arXiv:2106.03844v2 [cs.CV] UPDATED)
    Deep anomaly detection methods learn representations that separate between normal and anomalous images. Although self-supervised representation learning is commonly used, small dataset sizes limit its effectiveness. It was previously shown that utilizing external, generic datasets (e.g. ImageNet classification) can significantly improve anomaly detection performance. One approach is outlier exposure, which fails when the external datasets do not resemble the anomalies. We take the approach of transferring representations pre-trained on external datasets for anomaly detection. Anomaly detection performance can be significantly improved by fine-tuning the pre-trained representations on the normal training images. In this paper, we first demonstrate and analyze that contrastive learning, the most popular self-supervised learning paradigm, cannot be naively applied to pre-trained features. The reason is that pre-trained feature initialization causes poor conditioning for standard contrastive objectives, resulting in bad optimization dynamics. Based on our analysis, we provide a modified contrastive objective, the Mean-Shifted Contrastive Loss. Our method is highly effective and achieves a new state-of-the-art anomaly detection performance including $98.6\%$ ROC-AUC on the CIFAR-10 dataset.
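    A hedged sketch of the mean-shifted representation and a contrastive loss on it: features are re-centered at the mean of the normal training features before angular comparison, which is the modification motivated by the conditioning analysis above. The toy feature data and the temperature are illustrative; the loss form is a standard NT-Xent-style objective applied to the shifted features.

```python
# Mean-shifted contrastive loss sketch on toy "pre-trained" features.
import numpy as np

def mean_shift(feats, center):
    shifted = feats - center
    return shifted / np.linalg.norm(shifted, axis=1, keepdims=True)

def msc_loss(z1, z2, temperature=0.25):
    # z1, z2: mean-shifted features of two augmentations, shape (n, d)
    z = np.vstack([z1, z2]); n = len(z1)
    sim = z @ z.T / temperature
    np.fill_diagonal(sim, -np.inf)                 # exclude self-similarity
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])  # positive pairs
    log_prob = sim[np.arange(2 * n), pos] - np.log(np.exp(sim).sum(axis=1))
    return -log_prob.mean()

rng = np.random.default_rng(0)
pretrained = rng.normal(size=(256, 128)) + 5.0    # off-center pre-trained features
center = pretrained.mean(axis=0)                  # mean of normal training features
z1 = mean_shift(pretrained + 0.1 * rng.normal(size=pretrained.shape), center)
z2 = mean_shift(pretrained + 0.1 * rng.normal(size=pretrained.shape), center)
print("mean-shifted contrastive loss:", round(msc_loss(z1, z2), 4))
```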
    Reading Between the Lines: Modeling User Behavior and Costs in AI-Assisted Programming. (arXiv:2210.14306v2 [cs.SE] UPDATED)
    AI code-recommendation systems (CodeRec), such as Copilot, can assist programmers inside an IDE by suggesting and autocompleting arbitrary code, potentially improving their productivity. To understand how these systems help programmers in a coding session, we need to understand how they affect programmers' behavior. To make progress, we studied GitHub Copilot and developed CUPS -- a taxonomy of 12 programmer activities common to AI code completion systems. We then conducted a study with 21 programmers who completed coding tasks and used our labeling tool to retrospectively label their sessions with CUPS. We analyze over 3000 label instances and visualize the results with timelines and state machines to profile programmer-CodeRec interaction. This reveals novel insights into the distribution and patterns of programmer behavior, as well as inefficiencies and time costs. Finally, we use these insights to inform future interventions to improve AI-assisted programming and human-AI interaction.
    Deep Learning Training Procedure Augmentations. (arXiv:2211.14395v1 [cs.CV])
    Recent advances in Deep Learning have greatly improved performance on various tasks such as object detection, image segmentation, and sentiment analysis. Until very recently, the focus of most research has been on beating state-of-the-art results. This has materialized in the use of bigger and bigger models and of techniques which help the training procedure extract more predictive power out of a given dataset. While this has led to great results, many with real-world applications, other relevant aspects of deep learning have remained neglected and unknown. In this work, we present several novel deep learning training techniques which, while offering significant performance gains, also reveal several interesting analysis results regarding convergence speed, optimization landscape smoothness, and adversarial robustness. The methods presented in this work are the following: $\bullet$ Perfect Ordering Approximation: a generalized, model-agnostic curriculum learning approach. The results show the effectiveness of the technique for improving training time and offer new insight into the training process of deep networks. $\bullet$ Cascading Sum Augmentation: an extension of mixup capable of utilizing more data points for linear interpolation by leveraging a smoother optimization landscape. This can be used in computer vision tasks to improve both prediction performance and passive model robustness.
    Game Theoretic Mixed Experts for Combinational Adversarial Machine Learning. (arXiv:2211.14669v1 [cs.LG])
    Recent advances in adversarial machine learning have shown that defenses considered to be robust are actually susceptible to adversarial attacks which are specifically tailored to target their weaknesses. These defenses include Barrage of Random Transforms (BaRT), Friendly Adversarial Training (FAT), Trash is Treasure (TiT) and ensemble models made up of Vision Transformers (ViTs), Big Transfer models and Spiking Neural Networks (SNNs). A natural question arises: how can one best leverage a combination of adversarial defenses to thwart such attacks? In this paper, we provide a game-theoretic framework for ensemble adversarial attacks and defenses which answers this question. In addition to our framework, we produce the first adversarial defense transferability study to further motivate the need for combinational defenses utilizing a diverse set of defense architectures. Our framework is called Game theoretic Mixed Experts (GaME) and is designed to find the Mixed-Nash strategy for a defender when facing an attacker employing compositional adversarial attacks. We show that this framework creates an ensemble of defenses with greater robustness than multiple state-of-the-art, single-model defenses in addition to combinational defenses with uniform probability distributions. Overall, our framework and analyses advance the field of adversarial machine learning by yielding new insights into compositional attack and defense formulations.
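    The core computation such a framework needs is a Mixed-Nash strategy for a zero-sum attacker/defender payoff matrix, which can be found with the standard linear-programming formulation sketched below. The payoff entries (defender utility, e.g. robust accuracy, under each attack/defense pair) are made up for illustration; this is the textbook LP, not the paper's full GaME procedure.

```python
# Mixed-Nash defender strategy for a zero-sum payoff matrix via linear programming.
import numpy as np
from scipy.optimize import linprog

# rows = defenses, cols = attacks; entries are hypothetical robust accuracies
U = np.array([[0.61, 0.22, 0.45],
              [0.38, 0.50, 0.30],
              [0.52, 0.35, 0.28]])

n_def, n_atk = U.shape
# maximize v  s.t.  sum_i x_i U[i, j] >= v for every attack j,  x in the simplex
c = np.zeros(n_def + 1); c[-1] = -1.0               # linprog minimizes, so use -v
A_ub = np.hstack([-U.T, np.ones((n_atk, 1))])       # v - x^T U[:, j] <= 0
b_ub = np.zeros(n_atk)
A_eq = np.hstack([np.ones((1, n_def)), np.zeros((1, 1))])
b_eq = np.array([1.0])
bounds = [(0, None)] * n_def + [(None, None)]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
x, v = res.x[:-1], res.x[-1]
print("defender mixture:", np.round(x, 3), " game value:", round(v, 3))
```

    At inference time, the defender samples a defense from the mixture `x` per input, which is what makes the ensemble robust to an attacker optimizing against any single fixed defense.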
    Class-aware Information for Logit-based Knowledge Distillation. (arXiv:2211.14773v1 [cs.CV])
    Knowledge distillation aims to transfer knowledge to the student model by utilizing the predictions/features of the teacher model, and feature-based distillation has recently shown its superiority over logit-based distillation. However, due to the cumbersome computation and storage of extra feature transformations, the training overhead of feature-based methods is much higher than that of logit-based distillation. In this work, we revisit logit-based knowledge distillation and observe that existing logit-based distillation methods treat the prediction logits only at the instance level, while much other useful semantic information is overlooked. To address this issue, we propose a Class-aware Logit Knowledge Distillation (CLKD) method, which extends logit distillation to both the instance level and the class level. CLKD enables the student model to mimic higher-level semantic information from the teacher model, hence improving distillation performance. We further introduce a novel loss, the Class Correlation Loss, to force the student to learn the inherent class-level correlations of the teacher. Empirical comparisons demonstrate the superiority of the proposed method over several prevailing logit-based and feature-based methods, in which CLKD achieves compelling results on various visual classification tasks and outperforms the state-of-the-art baselines.
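    A hedged sketch combining instance-level logit distillation with a class-level term. The class-level loss below (matching the class-class correlation structure of per-class average prediction vectors between teacher and student) is one plausible reading of a class correlation objective, not the paper's exact definition; all constants are illustrative.

```python
# Instance-level KD plus an assumed class-level correlation term.
import numpy as np

def softmax(x, T=1.0):
    e = np.exp(x / T - (x / T).max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def clkd_loss(student_logits, teacher_logits, labels, n_classes, T=4.0, beta=1.0):
    ps, pt = softmax(student_logits, T), softmax(teacher_logits, T)
    # instance-level term: KL(teacher || student) on softened predictions
    inst = (pt * (np.log(pt + 1e-12) - np.log(ps + 1e-12))).sum(axis=1).mean()
    # class-level statistics: average prediction vector per ground-truth class
    cs = np.stack([ps[labels == c].mean(axis=0) for c in range(n_classes)])
    ct = np.stack([pt[labels == c].mean(axis=0) for c in range(n_classes)])
    ns = np.linalg.norm(cs, axis=1); nt = np.linalg.norm(ct, axis=1)
    corr_s = cs @ cs.T / (ns[:, None] * ns[None, :])
    corr_t = ct @ ct.T / (nt[:, None] * nt[None, :])
    cls = ((corr_s - corr_t) ** 2).mean()    # match class correlation structure
    return inst + beta * cls

rng = np.random.default_rng(0)
labels = np.arange(64) % 5                   # every class present in the batch
loss = clkd_loss(rng.normal(size=(64, 5)), rng.normal(size=(64, 5)), labels, 5)
print("CLKD-style loss:", round(loss, 4))
```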
    Two Sides of the Same Coin: Heterophily and Oversmoothing in Graph Convolutional Neural Networks. (arXiv:2102.06462v8 [cs.LG] UPDATED)
    In node classification tasks, graph convolutional neural networks (GCNs) have demonstrated competitive performance over traditional methods on diverse graph data. However, it is known that the performance of GCNs degrades with an increasing number of layers (oversmoothing problem), and recent studies have also shown that GCNs may perform worse in heterophilous graphs, where neighboring nodes tend to belong to different classes (heterophily problem). These two problems are usually viewed as unrelated, and thus are studied independently, often at the graph filter level from a spectral perspective. We are the first to take a unified perspective to jointly explain the oversmoothing and heterophily problems at the node level. Specifically, we profile the nodes via two quantitative metrics: the relative degree of a node (compared to its neighbors) and the node-level heterophily. Our theory shows that the interplay of these two profiling metrics defines three cases of node behaviors, which explain the oversmoothing and heterophily problems jointly and can predict the performance of GCNs. Based on insights from our theory, we show theoretically and empirically the effectiveness of two strategies: structure-based edge correction, which learns corrected edge weights from structural properties (i.e., degrees), and feature-based edge correction, which learns signed edge weights from node features. Compared to other approaches, which tend to handle well either heterophily or oversmoothing, we show that our model, GGCN, which incorporates the two strategies, performs well on both problems.
    ReGrAt: Regularization in Graphs using Attention to handle class imbalance. (arXiv:2211.14770v1 [cs.LG])
    Node classification is an important task in graph-based learning. Even though a lot of work has been done in this field, class imbalance is often neglected. Real-world data is not perfect and is imbalanced in its representations most of the time. Apart from text and images, data can be represented using graphs, and thus addressing imbalance in graphs has become of paramount importance. In the context of node classification, one class has fewer examples than others. Changing the data composition is a popular way to address imbalance in node classification. This is done by resampling the data to balance the dataset. However, that can sometimes lead to loss of information or add noise to the dataset. Therefore, in this work, we implicitly address the problem by changing the model loss. Specifically, we study how attention networks can help tackle imbalance. Moreover, we observe that using a regularizer to assign larger weights to minority nodes helps to mitigate the imbalance. We achieve state-of-the-art results, surpassing existing methods on several standard citation benchmark datasets.
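    The simplest version of re-weighting the loss in favor of minority nodes looks like the sketch below; the paper's attention-based regularizer is more involved, so treat this as a generic illustration of changing the model loss rather than the data.

        # Generic inverse-frequency loss weighting; not the paper's regularizer.
        import torch
        import torch.nn.functional as F

        def imbalance_weighted_loss(logits, labels, num_classes):
            # Inverse-frequency weights: minority classes get larger weights.
            counts = torch.bincount(labels, minlength=num_classes).float()
            weights = counts.sum() / (num_classes * counts.clamp(min=1))
            return F.cross_entropy(logits, labels, weight=weights)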
    Link Prediction with Non-Contrastive Learning. (arXiv:2211.14394v1 [cs.LG])
    A recent focal area in the space of graph neural networks (GNNs) is graph self-supervised learning (SSL), which aims to derive useful node representations without labeled data. Notably, many state-of-the-art graph SSL methods are contrastive methods, which use a combination of positive and negative samples to learn node representations. Owing to challenges in negative sampling (slowness and model sensitivity), recent literature introduced non-contrastive methods, which instead only use positive samples. Though such methods have shown promising performance in node-level tasks, their suitability for link prediction tasks, which are concerned with predicting link existence between pairs of nodes (and have broad applicability to recommendation-system contexts), has yet to be explored. In this work, we extensively evaluate the performance of existing non-contrastive methods for link prediction in both transductive and inductive settings. While most existing non-contrastive methods perform poorly overall, we find that, surprisingly, BGRL generally performs well in transductive settings. However, it performs poorly in the more realistic inductive settings, where the model has to generalize to links to/from unseen nodes. We find that non-contrastive models tend to overfit to the training graph and use this analysis to propose T-BGRL, a novel non-contrastive framework that incorporates cheap corruptions to improve the generalization ability of the model. This simple modification strongly improves inductive performance on 5/6 of our datasets, with up to a 120% improvement in Hits@50--all with comparable speed to other non-contrastive baselines and up to 14x faster than the best-performing contrastive baseline. Our work imparts interesting findings about non-contrastive learning for link prediction and paves the way for future researchers to further expand upon this area.
    A Particle-based Sparse Gaussian Process Optimizer. (arXiv:2211.14517v1 [cs.LG])
    Task learning in neural networks typically requires finding a globally optimal minimizer of a loss function objective. Conventional designs of swarm-based optimization methods apply a fixed update rule, possibly with an adaptive step size for gradient-descent-based optimization. While these methods have achieved huge success in solving different optimization problems, there are cases where such schemes are either inefficient or suffer from local minima. We present a new particle-swarm-based framework utilizing Gaussian Process Regression to learn the underlying dynamical process of descent. The biggest advantage of this approach is greater exploration around the current state before deciding on a descent direction. Empirical results show our approach can escape from local minima, compared with widely used state-of-the-art optimizers, when solving non-convex optimization problems. We also test our approach in a high-dimensional parameter-space case, namely, an image classification task.
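    The core loop can be caricatured in a few lines: fit a GP surrogate to the particles' losses, explore candidate states around the current best, and step to the most promising one. The candidate-sampling and acquisition choices below are illustrative assumptions, not the paper's algorithm.

        # Toy GP-surrogate descent step; scheme and names are assumptions.
        import numpy as np
        from sklearn.gaussian_process import GaussianProcessRegressor
        from sklearn.gaussian_process.kernels import RBF

        def gp_descent_step(loss_fn, particles, step_scale=0.1, n_candidates=64):
            losses = np.array([loss_fn(p) for p in particles])
            gp = GaussianProcessRegressor(kernel=RBF()).fit(particles, losses)
            best = particles[losses.argmin()]
            # Explore around the current best state using the surrogate.
            candidates = best + step_scale * np.random.randn(n_candidates, best.size)
            pred, std = gp.predict(candidates, return_std=True)
            # A lower-confidence bound trades off exploitation and exploration.
            return candidates[(pred - std).argmin()]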
    Target-Free Text-guided Image Manipulation. (arXiv:2211.14544v1 [cs.CV])
    We tackle the problem of target-free text-guided image manipulation, which requires one to modify the input reference image based on the given text instruction, while no ground truth target image is observed during training. To address this challenging task, we propose a Cyclic-Manipulation GAN (cManiGAN) in this paper, which learns where and how to edit the image regions of interest. Specifically, the image editor in cManiGAN learns to identify and complete the input image, while a cross-modal interpreter and a reasoner are deployed to verify the semantic correctness of the output image based on the input instruction. While the former utilizes factual/counterfactual description learning to authenticate the image semantics, the latter predicts the "undo" instruction and provides pixel-level supervision for the training of cManiGAN. With such operational cycle-consistency, our cManiGAN can be trained in the above weakly supervised setting. We conduct extensive experiments on the CLEVR and COCO datasets, and the effectiveness and generalizability of our proposed method are successfully verified. Project page: https://sites.google.com/view/wancyuanfan/projects/cmanigan.
    Learning Bimanual Scooping Policies for Food Acquisition. (arXiv:2211.14652v1 [cs.RO])
    A robotic feeding system must be able to acquire a variety of foods. Prior bite acquisition works consider single-arm spoon scooping or fork skewering, which do not generalize to foods with complex geometries and deformabilities. For example, when acquiring a group of peas, skewering could smoosh the peas while scooping without a barrier could result in chasing the peas on the plate. In order to acquire foods with such diverse properties, we propose stabilizing food items during scooping using a second arm, for example, by pushing peas against the spoon with a flat surface to prevent dispersion. The added stabilizing arm can lead to new challenges. Critically, this arm should stabilize the food scene without interfering with the acquisition motion, which is especially difficult for easily breakable high-risk food items like tofu. These high-risk foods can break between the pusher and spoon during scooping, which can lead to food waste falling out of the spoon. We propose a general bimanual scooping primitive and an adaptive stabilization strategy that enables successful acquisition of a diverse set of food geometries and physical properties. Our approach, CARBS: Coordinated Acquisition with Reactive Bimanual Scooping, learns to stabilize without impeding task progress by identifying high-risk foods and robustly scooping them using closed-loop visual feedback. We find that CARBS is able to generalize across food shape, size, and deformability and is additionally able to manipulate multiple food items simultaneously. CARBS achieves 87.0% success on scooping rigid foods, which is 25.8% more successful than a single-arm baseline, and reduces food breakage by 16.2% compared to an analytical baseline. Videos can be found at https://sites.google.com/view/bimanualscoop-corl22/home .
    Interpreting Unfairness in Graph Neural Networks via Training Node Attribution. (arXiv:2211.14383v1 [cs.LG])
    Graph Neural Networks (GNNs) have emerged as the leading paradigm for solving graph analytical problems in various real-world applications. Nevertheless, GNNs could potentially render biased predictions towards certain demographic subgroups. Understanding how the bias in predictions arises is critical, as it guides the design of GNN debiasing mechanisms. However, most existing works overwhelmingly focus on GNN debiasing, but fall short on explaining how such bias is induced. In this paper, we study the novel problem of interpreting GNN unfairness by attributing it to the influence of training nodes. Specifically, we propose a novel strategy named Probabilistic Distribution Disparity (PDD) to measure the bias exhibited in GNNs, and develop an algorithm to efficiently estimate the influence of each training node on such bias. We verify the validity of PDD and the effectiveness of influence estimation through experiments on real-world datasets. Finally, we also demonstrate how the proposed framework can be used to debias GNNs. Open-source code can be found at https://github.com/yushundong/BIND.
    Paying Attention to Astronomical Transients: Introducing the Time-series Transformer for Photometric Classification. (arXiv:2105.06178v2 [astro-ph.IM] UPDATED)
    Future surveys such as the Legacy Survey of Space and Time (LSST) of the Vera C. Rubin Observatory will observe an order of magnitude more astrophysical transient events than any previous survey. With this deluge of photometric data, it will be impossible for all such events to be classified by humans alone. Recent efforts have sought to leverage machine learning methods to tackle the challenge of astronomical transient classification, with ever improving success. Transformers are a recently developed deep learning architecture, first proposed for natural language processing, that have shown a great deal of recent success. In this work we develop a new transformer architecture, which uses multi-head self-attention at its core, for general multi-variate time-series data. Furthermore, the proposed time-series transformer architecture supports the inclusion of an arbitrary number of additional features, while also offering interpretability. We apply the time-series transformer to the task of photometric classification, minimising the reliance on expert domain knowledge for feature selection, while achieving results comparable to state-of-the-art photometric classification methods. We achieve a logarithmic-loss of 0.507 on imbalanced data in a representative setting using data from the Photometric LSST Astronomical Time-Series Classification Challenge (PLAsTiCC). Moreover, we achieve a micro-averaged receiver operating characteristic area under curve of 0.98 and a micro-averaged precision-recall area under curve of 0.87.
    Spatio-Temporal Meta-Graph Learning for Traffic Forecasting. (arXiv:2211.14701v1 [cs.LG])
    Traffic forecasting, as a canonical task of multivariate time series forecasting, has been a significant research topic in the AI community. To address the spatio-temporal heterogeneity and non-stationarity implied in the traffic stream, in this study we propose Spatio-Temporal Meta-Graph Learning as a novel Graph Structure Learning mechanism for spatio-temporal data. Specifically, we implement this idea in the Meta-Graph Convolutional Recurrent Network (MegaCRN) by plugging a Meta-Graph Learner, powered by a Meta-Node Bank, into a GCRN encoder-decoder. We conduct a comprehensive evaluation on two benchmark datasets (METR-LA and PEMS-BAY) and a new large-scale traffic speed dataset that contains traffic incident information. Our model outperforms the state of the art by a large margin on all three datasets (over 27% in MAE and 34% in RMSE). Besides, through a series of qualitative evaluations, we demonstrate that our model can explicitly disentangle road links and time slots with different patterns and robustly adapt to anomalous traffic situations. Codes and datasets are available at https://github.com/deepkashiwa20/MegaCRN.
    Wiener Graph Deconvolutional Network Improves Graph Self-Supervised Learning. (arXiv:2206.12933v3 [cs.LG] UPDATED)
    Graph self-supervised learning (SSL) has been widely employed to learn representations from unlabeled graphs. Existing methods can be roughly divided into predictive learning and contrastive learning, where the latter attracts more research attention with better empirical performance. We argue, however, that predictive models equipped with a powerful decoder could achieve comparable or even better representation power than contrastive models. In this work, we propose a Wiener Graph Deconvolutional Network (WGDN), an augmentation-adaptive decoder empowered by the graph Wiener filter to perform information reconstruction. Theoretical analysis proves the superior reconstruction ability of the graph Wiener filter. Extensive experimental results on various datasets demonstrate the effectiveness of our approach.
    BeGin: Extensive Benchmark Scenarios and An Easy-to-use Framework for Graph Continual Learning. (arXiv:2211.14568v1 [cs.LG])
    Continual Learning (CL) is the process of ceaselessly learning a sequence of tasks. Most existing CL methods deal with independent data (e.g., images and text) for which many benchmark frameworks and results under standard experimental settings are available. CL methods for graph data, however, are surprisingly underexplored because of (a) the lack of standard experimental settings, especially regarding how to deal with the dependency between instances, (b) the lack of benchmark datasets and scenarios, and (c) high complexity in implementation and evaluation due to the dependency. In this paper, regarding (a), we define four standard incremental settings (task-, class-, domain-, and time-incremental settings) for graph data, which are naturally applied to many node-, link-, and graph-level problems. Regarding (b), we provide 23 benchmark scenarios based on 14 real-world graphs. Regarding (c), we develop BeGin, an easy and fool-proof framework for graph CL. BeGin is easily extended since it is modularized with reusable modules for data processing, algorithm design, and evaluation. In particular, the evaluation module is completely separated from user code to eliminate potential mistakes in evaluation. Using all of the above, we report extensive benchmark results of seven graph CL methods. Compared to the latest benchmark for graph CL, using BeGin, we cover three times more combinations of incremental settings and levels of problems.
    Who is Gambling? Finding Cryptocurrency Gamblers Using Multi-modal Retrieval Methods. (arXiv:2211.14779v1 [cs.CR])
    With the popularity of cryptocurrencies and the remarkable development of blockchain technology, decentralized applications have emerged as a revolutionary force for the Internet. Meanwhile, decentralized applications have also attracted intense attention from the online gambling community, with more and more decentralized gambling platforms created with the help of smart contracts. Compared with conventional gambling platforms, decentralized gambling platforms have transparent rules and a low participation threshold, attracting a substantial number of gamblers. In order to discover gambling behaviors and identify the contracts and addresses involved in gambling, we propose a tool termed ETHGamDet. The tool automatically detects the smart contracts and addresses involved in gambling by scrutinizing smart contract code and address transaction records. Interestingly, we present a novel LightGBM model with memory components, which possesses the ability to learn from its own misclassifications. As a side contribution, we construct and release a large-scale gambling dataset at https://github.com/AwesomeHuang/Bitcoin-Gambling-Dataset to facilitate future research in this field. Empirically, ETHGamDet achieves F1-scores of 0.72 and 0.89 in address classification and contract classification, respectively, and offers novel and interesting insights.
    Convergence Rate Analysis for Optimal Computing Budget Allocation Algorithms. (arXiv:2211.14722v1 [stat.ML])
    Ordinal optimization (OO) is a widely-studied technique for optimizing discrete-event dynamic systems (DEDS). It evaluates the performance of the system designs in a finite set by sampling and aims to correctly make ordinal comparisons of the designs. A well-known method in OO is the optimal computing budget allocation (OCBA). It builds optimality conditions for the number of samples allocated to each design, and the sample allocation that satisfies these optimality conditions is shown to asymptotically maximize the probability of correctly selecting the best design. In this paper, we investigate two popular OCBA algorithms. With known variances for samples of each design, we characterize their convergence rates with respect to different performance measures. We first demonstrate that the two OCBA algorithms achieve the optimal convergence rate under the measures of probability of correct selection and expected opportunity cost. This fills the void of convergence analysis for OCBA algorithms. Next, we extend our analysis to the measure of cumulative regret, a principal measure studied in the field of machine learning. We show that with minor modification, the two OCBA algorithms can reach the optimal convergence rate under cumulative regret. This indicates the potential for broader use of algorithms designed based on the OCBA optimality conditions.
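    For context, the classical OCBA allocation rule that such algorithms build on can be sketched as follows. This is an illustrative implementation of the well-known optimality conditions, assuming known variances and a unique best design; it is not the specific pair of algorithms analyzed in the paper.

        # Classical OCBA budget split across designs (illustrative sketch).
        import numpy as np

        def ocba_allocation(means, stds, total_budget):
            means, stds = np.asarray(means, float), np.asarray(stds, float)
            b = means.argmin()                 # assume smaller mean is better
            delta = means - means[b]           # optimality gaps (unique best)
            ratios = np.ones_like(means)
            mask = np.arange(len(means)) != b
            ref = mask.nonzero()[0][0]         # reference non-best design
            # N_i / N_j = (sigma_i / delta_i)^2 / (sigma_j / delta_j)^2
            ratios[mask] = (stds[mask] / delta[mask])**2 / (stds[ref] / delta[ref])**2
            # N_b = sigma_b * sqrt(sum over i != b of (N_i / sigma_i)^2)
            ratios[b] = stds[b] * np.sqrt(np.sum((ratios[mask] / stds[mask])**2))
            alloc = total_budget * ratios / ratios.sum()
            return np.round(alloc).astype(int)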
    Interval-censored Hawkes processes. (arXiv:2104.07932v4 [cs.LG] UPDATED)
    Interval-censored data solely records the aggregated counts of events during specific time intervals - such as the number of patients admitted to the hospital or the volume of vehicles passing traffic loop detectors - and not the exact occurrence times of the events. It is currently not understood how to fit Hawkes point processes to this kind of data. Its typical loss function (the point process log-likelihood) cannot be computed without exact event times. Furthermore, the Hawkes process does not have the independent increments property needed to use the Poisson likelihood. This work builds a novel point process, a set of tools, and approximations for fitting Hawkes processes within interval-censored data scenarios. First, we define the Mean Behavior Poisson process (MBPP), a novel Poisson process with a direct parameter correspondence to the popular self-exciting Hawkes process. We fit MBPP in the interval-censored setting using an interval-censored Poisson log-likelihood (IC-LL). We use the parameter equivalence to uncover the parameters of the associated Hawkes process. Second, we introduce two novel exogenous functions to distinguish the exogenous from the endogenous events. We propose the multi-impulse exogenous function - for when the exogenous events are observed as event times - and the latent homogeneous Poisson process exogenous function - for when the exogenous events are presented as interval-censored volumes. Third, we provide several approximation methods to estimate the intensity and compensator function of MBPP when no analytical solution exists. Fourth and finally, we connect the interval-censored loss of MBPP to a broader class of Bregman divergence-based functions. Using the connection, we show that the popularity estimation algorithm Hawkes Intensity Process (HIP) is a particular case of the MBPP. We verify our models through empirical testing on synthetic data and real-world data.
    Unsupervised Representation Learning in Deep Reinforcement Learning: A Review. (arXiv:2208.14226v2 [cs.LG] UPDATED)
    This review addresses the problem of learning abstract representations of the measurement data in the context of Deep Reinforcement Learning (DRL). While the data are often ambiguous, high-dimensional, and complex to interpret, many dynamical systems can be effectively described by a low-dimensional set of state variables. Discovering these state variables from the data is a crucial aspect of improving the data efficiency, robustness, and generalization of DRL methods, tackling the \textit{curse of dimensionality}, and bringing interpretability and insights into black-box DRL. This review provides a comprehensive overview of unsupervised representation learning in DRL by describing the main Deep Learning tools used for learning representations of the world, providing a systematic view of the methods and principles, summarizing applications, benchmarks and evaluation strategies, and discussing open challenges and future directions.
    Acceptability Judgements via Examining the Topology of Attention Maps. (arXiv:2205.09630v2 [cs.CL] CROSS LISTED)
    The role of the attention mechanism in encoding linguistic knowledge has received special interest in NLP. However, the ability of the attention heads to judge the grammatical acceptability of a sentence has been underexplored. This paper approaches the paradigm of acceptability judgments with topological data analysis (TDA), showing that the geometric properties of the attention graph can be efficiently exploited for two standard practices in linguistics: binary judgments and linguistic minimal pairs. Topological features enhance the BERT-based acceptability classifier scores by $8$%-$24$% on CoLA in three languages (English, Italian, and Swedish). By revealing the topological discrepancy between attention maps of minimal pairs, we achieve human-level performance on the BLiMP benchmark, outperforming nine statistical and Transformer LM baselines. At the same time, TDA provides the foundation for analyzing the linguistic functions of attention heads and interpreting the correspondence between the graph features and grammatical phenomena.
    Where to Pay Attention in Sparse Training for Feature Selection?. (arXiv:2211.14627v1 [cs.LG])
    A new line of research on feature selection based on neural networks has recently emerged. Despite its superiority to classical methods, it requires many training iterations to converge and detect informative features. The computational time becomes prohibitively long for datasets with a large number of samples or a very high-dimensional feature space. In this paper, we present a new efficient unsupervised method for feature selection based on sparse autoencoders. In particular, we propose a new sparse training algorithm that optimizes a model's sparse topology during training to pay attention to informative features quickly. The attention-based adaptation of the sparse topology enables fast detection of informative features after a few training iterations. We performed extensive experiments on 10 datasets of different types, including image, speech, text, artificial, and biological data. They cover a wide range of characteristics, such as low- and high-dimensional feature spaces and small and large numbers of training samples. Our proposed approach outperforms the state-of-the-art methods in terms of selecting informative features while reducing training iterations and computational costs substantially. Moreover, the experiments show the robustness of our method in extremely noisy environments.
    Distribution Free Prediction Sets for Node Classification. (arXiv:2211.14555v1 [stat.ML])
    Graph Neural Networks (GNNs) are able to achieve high classification accuracy on many large real-world datasets, but provide no rigorous notion of predictive uncertainty. We leverage recent advances in conformal prediction to construct prediction sets for node classification in inductive learning scenarios, and verify the efficacy of our approach across standard benchmark datasets using popular GNN models. The code is available at \href{https://github.com/jase-clarkson/graph_cp}{this link}.
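    The recipe behind such conformal prediction sets is simple enough to sketch. The snippet below shows generic split-conformal classification (calibrate a score quantile on held-out labeled nodes, then collect all classes under the threshold). It illustrates the general construction under standard exchangeability assumptions, not the authors' exact inductive-GNN procedure, and all names are illustrative.

        # Generic split-conformal prediction sets for classification.
        import numpy as np

        def conformal_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
            n = len(cal_labels)
            # Nonconformity score: one minus the softmax mass on the true class.
            scores = 1.0 - cal_probs[np.arange(n), cal_labels]
            level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
            q = np.quantile(scores, level, method="higher")
            # A class enters the set if its score does not exceed the threshold;
            # the sets then cover the true label with probability >= 1 - alpha.
            return [np.where(1.0 - p <= q)[0] for p in test_probs]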
    Why Neural Networks Work. (arXiv:2211.14632v1 [cs.LG])
    We argue that many properties of fully-connected feedforward neural networks (FCNNs), also called multi-layer perceptrons (MLPs), are explainable from the analysis of a single pair of operations, namely a random projection into a higher-dimensional space than the input, followed by a sparsification operation. For convenience, we call this pair of successive operations expand-and-sparsify, following the terminology of Dasgupta. We show how expand-and-sparsify can explain the observed phenomena that have been discussed in the literature, such as the so-called Lottery Ticket Hypothesis, the surprisingly good performance of randomly-initialized untrained neural networks, the efficacy of Dropout in training, and, most importantly, the mysterious generalization ability of overparameterized models, first highlighted by Zhang et al. and subsequently identified even in non-neural-network models by Belkin et al.
    A Path Towards Clinical Adaptation of Accelerated MRI. (arXiv:2208.12835v3 [eess.IV] UPDATED)
    Accelerated MRI reconstructs images of clinical anatomies from sparsely sampled signal data to reduce patient scan times. While recent works have leveraged deep learning to accomplish this task, such approaches have often only been explored in simulated environments where there is no signal corruption or resource limitations. In this work, we explore augmentations to neural network MRI image reconstructors to enhance their clinical relevancy. Namely, we propose a ConvNet model for detecting sources of image artifacts that achieves a classifier $F_2$ score of 79.1%. We also demonstrate that training reconstructors on MR signal data with variable acceleration factors can improve their average performance during a clinical patient scan by up to 2%. We offer a loss function to overcome catastrophic forgetting when models learn to reconstruct MR images of multiple anatomies and orientations. Finally, we propose a method for using simulated phantom data to pre-train reconstructors in situations with limited clinically acquired datasets and compute capabilities. Our results provide a potential path forward for clinical adaptation of accelerated MRI.
    Direct-Effect Risk Minimization for Domain Generalization. (arXiv:2211.14594v1 [cs.LG])
    We study the problem of out-of-distribution (o.o.d.) generalization, where spurious correlations of attributes vary across training and test domains. This is known as the problem of correlation shift and has raised concerns about the reliability of machine learning. In this work, we introduce the concepts of direct and indirect effects from causal inference to the domain generalization problem. We argue that models that learn direct effects minimize the worst-case risk across correlation-shifted domains. To eliminate the indirect effects, our algorithm consists of two stages: in the first stage, we learn an indirect-effect representation by minimizing the prediction error of domain labels using the representation and the class label; in the second stage, we remove the indirect effects learned in the first stage by matching each data point with another data point that has a similar indirect-effect representation but a different class label. We also propose a new model selection method that matches the validation set in the same way, which is shown to improve the generalization performance of existing models on correlation-shifted datasets. Experiments on 5 correlation-shifted datasets and the DomainBed benchmark verify the effectiveness of our approach.
    Deep neuroevolution to predict primary brain tumor grade from functional MRI adjacency matrices. (arXiv:2211.14500v1 [cs.NE])
    Whereas MRI produces anatomic information about the brain, functional MRI (fMRI) tells us about neural activity within the brain, including how various regions communicate with each other. The full chorus of conversations within the brain is summarized elegantly in the adjacency matrix. Although information-rich, adjacency matrices typically provide little in the way of intuition. Whereas trained radiologists viewing anatomic MRI can readily distinguish between different kinds of brain cancer, a similar determination using adjacency matrices would exceed any expert's grasp. Artificial intelligence (AI) in radiology usually analyzes anatomic imaging, providing assistance to radiologists. For non-intuitive data types such as adjacency matrices, AI moves beyond the role of helpful assistant, emerging as indispensable. We sought here to show that AI can learn to discern between two important brain tumor types, high-grade glioma (HGG) and low-grade glioma (LGG), based on adjacency matrices. We trained a convolutional neural network (CNN) with the method of deep neuroevolution (DNE) because of the latter's recent promising results; DNE has produced remarkably accurate CNNs even when relying on small and noisy training sets or performing nuanced tasks. After training on just 30 adjacency matrices, our CNN could tell HGG apart from LGG with perfect testing set accuracy. Saliency maps revealed that the network learned highly sophisticated and complex features to achieve its success. Hence, we have shown that it is possible for AI to recognize brain tumor type from functional connectivity. In future work, we will apply DNE to other noisy and somewhat cryptic forms of medical data, including further explorations with fMRI.
    Can Hybrid Geometric Scattering Networks Help Solve the Maximum Clique Problem?. (arXiv:2206.01506v2 [cs.LG] UPDATED)
    We propose a geometric scattering-based graph neural network (GNN) for approximating solutions of the NP-hard maximum clique (MC) problem. We construct a loss function with two terms, one which encourages the network to find highly connected nodes and the other which acts as a surrogate for the constraint that the nodes form a clique. We then use this loss to train an efficient GNN architecture that outputs a vector representing the probability for each node to be part of the MC and apply a rule-based decoder to make our final prediction. The incorporation of the scattering transform alleviates the so-called oversmoothing problem that is often encountered in GNNs and would degrade the performance of our proposed setup. Our empirical results demonstrate that our method outperforms representative GNN baselines in terms of solution accuracy and inference speed as well as conventional solvers like Gurobi with limited time budgets. Furthermore, our scattering model is very parameter efficient with only $\sim$ 0.1\% of the number of parameters compared to previous GNN baseline models.
    Learned k-NN Distance Estimation. (arXiv:2208.14210v2 [cs.DB] UPDATED)
    Big data mining is well known to be an important task for data science, because it can provide useful observations and new knowledge hidden in given large datasets. Proximity-based data analysis is particularly utilized in many real-life applications. In such analyses, the distances to the k nearest neighbors are usually employed, so the main bottleneck is data retrieval. Much effort has been made to improve the efficiency of these analyses. However, they still incur large costs, because they essentially require many data accesses. To avoid this issue, we propose a machine learning technique that quickly and accurately estimates the k-NN distances (i.e., distances to the k nearest neighbors) of a given query. We train a fully connected neural network model and utilize pivots to achieve accurate estimation. Our model is designed to have useful advantages: it infers the distances to all k nearest neighbors at once, its inference time is O(1) (no data accesses are incurred), and it maintains high accuracy. Our experimental results and case studies on real datasets demonstrate the efficiency and effectiveness of our solution.
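    The flavor of the approach can be sketched as follows: featurize a query by its distances to a handful of pivot points and regress the k-NN distances with a small network. The architecture and all names here are illustrative assumptions, not the authors' exact model.

        # Hypothetical pivot-based k-NN distance regressor (sketch only).
        import torch
        import torch.nn as nn

        class KnnDistanceEstimator(nn.Module):
            def __init__(self, num_pivots, k, hidden=128):
                super().__init__()
                self.net = nn.Sequential(
                    nn.Linear(num_pivots, hidden), nn.ReLU(),
                    nn.Linear(hidden, hidden), nn.ReLU(),
                    nn.Linear(hidden, k))   # one output per neighbor distance

            def forward(self, query, pivots):
                # O(num_pivots) distance features; no access to the full dataset.
                feats = torch.cdist(query, pivots)
                return self.net(feats)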
    Knowledge Distillation from A Stronger Teacher. (arXiv:2205.10536v2 [cs.CV] UPDATED)
    Unlike existing knowledge distillation methods, which focus on baseline settings where the teacher models and training strategies are not as strong and competitive as state-of-the-art approaches, this paper presents a method dubbed DIST to distill better from a stronger teacher. We empirically find that the discrepancy between the predictions of the student and a stronger teacher tends to be fairly severe. As a result, the exact match of predictions in KL divergence would disturb the training and make existing methods perform poorly. In this paper, we show that simply preserving the relations between the predictions of teacher and student suffices, and we propose a correlation-based loss to explicitly capture the intrinsic inter-class relations from the teacher. Besides, considering that different instances have different semantic similarities to each class, we also extend this relational match to the intra-class level. Our method is simple yet practical, and extensive experiments demonstrate that it adapts well to various architectures, model sizes, and training strategies, and can consistently achieve state-of-the-art performance on image classification, object detection, and semantic segmentation tasks. Code is available at: https://github.com/hunto/DIST_KD .
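    A hedged sketch of such a correlation-based loss is given below: rather than matching predictions exactly, it preserves their Pearson correlations across classes (inter-class) and across instances (intra-class). Temperatures and weights are placeholders; consult the released code for the authors' actual formulation.

        # Correlation-preserving distillation loss in the spirit of DIST (sketch).
        import torch
        import torch.nn.functional as F

        def pearson_loss(a, b, eps=1e-8):
            # 1 - Pearson correlation, averaged over rows.
            a = a - a.mean(dim=-1, keepdim=True)
            b = b - b.mean(dim=-1, keepdim=True)
            corr = (a * b).sum(-1) / (a.norm(dim=-1) * b.norm(dim=-1) + eps)
            return (1.0 - corr).mean()

        def dist_like_loss(s_logits, t_logits, T=4.0):
            p_s = F.softmax(s_logits / T, dim=1)
            p_t = F.softmax(t_logits / T, dim=1)
            inter = pearson_loss(p_s, p_t)          # relations across classes
            intra = pearson_loss(p_s.t(), p_t.t())  # relations across instances
            return inter + intra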
    Multi-fidelity Gaussian Process for Biomanufacturing Process Modeling with Small Data. (arXiv:2211.14493v1 [cs.LG])
    In biomanufacturing, developing an accurate model to simulate the complex dynamics of bioprocesses is an important yet challenging task. This is partially due to the uncertainty associated with bioprocesses, high data acquisition cost, and lack of data availability to learn complex relations in bioprocesses. To deal with these challenges, we propose to use a statistical machine learning approach, the multi-fidelity Gaussian process, for process modeling in biomanufacturing. Gaussian process regression is a well-established technique based on probability theory which can naturally consider uncertainty in a dataset via Gaussian noise, and multi-fidelity techniques can make use of multiple sources of information with different levels of fidelity, thus being suitable for bioprocess modeling with small data. We apply the multi-fidelity Gaussian process to solve two significant problems in biomanufacturing, bioreactor scale-up and knowledge transfer across cell lines, and demonstrate its efficacy on real-world datasets.
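    A minimal two-fidelity sketch in the Kennedy-O'Hagan spirit conveys the idea: fit a GP to cheap low-fidelity data, then fit a second GP to the high-fidelity residuals. The scaling factor rho is fixed here for brevity and would normally be estimated; this is an illustration, not the paper's exact model.

        # Two-fidelity GP sketch: high-fidelity = rho * low-fidelity + correction.
        from sklearn.gaussian_process import GaussianProcessRegressor
        from sklearn.gaussian_process.kernels import RBF

        def fit_two_fidelity(x_lo, y_lo, x_hi, y_hi, rho=1.0):
            gp_lo = GaussianProcessRegressor(kernel=RBF()).fit(x_lo, y_lo)
            # Correction GP trained on the high-fidelity residuals.
            resid = y_hi - rho * gp_lo.predict(x_hi)
            gp_delta = GaussianProcessRegressor(kernel=RBF()).fit(x_hi, resid)
            return lambda x: rho * gp_lo.predict(x) + gp_delta.predict(x)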
    Ensemble Multi-Quantile: Adaptively Flexible Distribution Prediction for Uncertainty Quantification. (arXiv:2211.14545v1 [cs.LG])
    We propose a novel, succinct, and effective approach to quantify uncertainty in machine learning. It incorporates adaptively flexible distribution prediction of $\mathbb{P}(\mathbf{y}|\mathbf{X}=x)$ in regression tasks. To predict this conditional distribution, its quantiles at probability levels spread over the interval $(0,1)$ are boosted by additive models which we design with intuition and interpretability in mind. We seek an adaptive balance between structural integrity and flexibility for $\mathbb{P}(\mathbf{y}|\mathbf{X}=x)$: the Gaussian assumption lacks flexibility for real data, while highly flexible approaches (e.g., estimating the quantiles separately without a distribution structure) inevitably have drawbacks and may not generalize well. This ensemble multi-quantiles approach, called EMQ, is totally data-driven and can gradually depart from the Gaussian and discover the optimal conditional distribution during boosting. On extensive regression tasks from UCI datasets, we show that EMQ achieves state-of-the-art performance compared with many recent uncertainty quantification methods, including Gaussian assumption-based methods, Bayesian methods, quantile regression-based methods, and traditional tree models, under the metrics of calibration, sharpness, and tail-side calibration. Visualization results show what we actually learn from the real data, and how, illustrating the necessity and the merits of such an ensemble model.
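    As a baseline picture of multi-quantile prediction, the sketch below fits one gradient-boosted model per quantile level with the pinball loss; EMQ's shared additive structure and adaptive flexibility go beyond this naive per-quantile scheme.

        # Naive per-quantile boosting baseline (not EMQ itself).
        from sklearn.ensemble import GradientBoostingRegressor

        def fit_quantile_ensemble(X, y, levels=(0.05, 0.25, 0.5, 0.75, 0.95)):
            # One model per quantile of P(y | X = x), trained with pinball loss.
            models = {q: GradientBoostingRegressor(loss="quantile", alpha=q).fit(X, y)
                      for q in levels}
            return lambda X_new: {q: m.predict(X_new) for q, m in models.items()}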
    Simulation Intelligence: Towards a New Generation of Scientific Methods. (arXiv:2112.03235v2 [cs.AI] UPDATED)
    The original "Seven Motifs" set forth a roadmap of essential methods for the field of scientific computing, where a motif is an algorithmic method that captures a pattern of computation and data movement. We present the "Nine Motifs of Simulation Intelligence", a roadmap for the development and integration of the essential algorithms necessary for a merger of scientific computing, scientific simulation, and artificial intelligence. We call this merger simulation intelligence (SI), for short. We argue the motifs of simulation intelligence are interconnected and interdependent, much like the components within the layers of an operating system. Using this metaphor, we explore the nature of each layer of the simulation intelligence operating system stack (SI-stack) and the motifs therein: (1) Multi-physics and multi-scale modeling; (2) Surrogate modeling and emulation; (3) Simulation-based inference; (4) Causal modeling and inference; (5) Agent-based modeling; (6) Probabilistic programming; (7) Differentiable programming; (8) Open-ended optimization; (9) Machine programming. We believe coordinated efforts between motifs offer immense opportunity to accelerate scientific discovery, from solving inverse problems in synthetic biology and climate science, to directing nuclear energy experiments and predicting emergent behavior in socioeconomic settings. We elaborate on each layer of the SI-stack, detailing the state-of-the-art methods, presenting examples to highlight challenges and opportunities, and advocating for specific ways to advance the motifs and the synergies from their combinations. Advancing and integrating these technologies can enable a robust and efficient hypothesis-simulation-analysis type of scientific method, which we introduce with several use-cases for human-machine teaming and automated science.
    The Principles of Data-Centric AI (DCAI). (arXiv:2211.14611v1 [cs.LG])
    Data is a crucial infrastructure to how artificial intelligence (AI) systems learn. However, these systems to date have been largely model-centric, putting a premium on the model at the expense of the data quality. Data quality issues beset the performance of AI systems, particularly in downstream deployments and in real-world applications. Data-centric AI (DCAI) as an emerging concept brings data, its quality and its dynamism to the forefront in considerations of AI systems through an iterative and systematic approach. As one of the first overviews, this article brings together data-centric perspectives and concepts to outline the foundations of DCAI. It specifically formulates six guiding principles for researchers and practitioners and gives direction for future advancement of DCAI.
    Less Data, More Knowledge: Building Next Generation Semantic Communication Networks. (arXiv:2211.14343v1 [cs.AI])
    Semantic communication is viewed as a revolutionary paradigm that can potentially transform how we design and operate wireless communication systems. However, despite a recent surge of research activities in this area, the research landscape remains limited. In this tutorial, we present the first rigorous vision of a scalable end-to-end semantic communication network that is founded on novel concepts from artificial intelligence (AI), causal reasoning, and communication theory. We first discuss how the design of semantic communication networks requires a move from data-driven networks towards knowledge-driven ones. Subsequently, we highlight the necessity of creating semantic representations of data that satisfy the key properties of minimalism, generalizability, and efficiency so as to do more with less. We then explain how those representations can form the basis of a so-called semantic language. By using semantic representations and languages, we show that the traditional transmitter and receiver now become a teacher and apprentice. Then, we define the concept of reasoning by investigating the fundamentals of causal representation learning and their role in designing semantic communication networks. We demonstrate that reasoning faculties are chiefly characterized by the ability to capture causal and associational relationships in datastreams. For such reasoning-driven networks, we propose novel and essential semantic communication metrics that include new "reasoning capacity" measures that could go beyond Shannon's bound to capture the convergence of computing and communication. Finally, we explain how semantic communications can be scaled to large-scale networks (6G and beyond). In a nutshell, we expect this tutorial to provide a comprehensive reference on how to properly build, analyze, and deploy future semantic communication networks.
    Homology-constrained vector quantization entropy regularizer. (arXiv:2211.14363v1 [cs.LG])
    This paper describes an entropy regularization term for vector quantization (VQ) based on the analysis of the persistent homology of the VQ embeddings. Higher embedding entropy positively correlates with higher codebook utilization, mitigating overfitting towards the identity and codebook collapse in VQ-based autoencoders [1]. We show that homology-constrained regularization is an effective way to increase the entropy of the VQ process (approximated to the input entropy) while preserving the approximated topology in the quantized latent space, averaged over mini-batches. This work further explores some patterns of the persistent homology diagrams of latents formed by vector quantization. We implement and test the proposed algorithm as a module integrated into a sample VQ-VAE. The linked code repository provides a functioning implementation of the proposed architecture, referred to as homology-constrained vector quantization (HC-VQ) further in this work.
    Deep Curvilinear Editing: Commutative and Nonlinear Image Manipulation for Pretrained Deep Generative Model. (arXiv:2211.14573v1 [cs.CV])
    Semantic editing of images is a fundamental goal of computer vision. Although deep learning methods, such as generative adversarial networks (GANs), are capable of producing high-quality images, they often do not have an inherent way of editing generated images semantically. Recent studies have investigated a way of manipulating the latent variable to determine the images to be generated. However, methods that assume linear semantic arithmetic have certain limitations in terms of the quality of image editing, whereas methods that discover nonlinear semantic pathways provide non-commutative editing, which is inconsistent when applied in different orders. This study proposes a novel method called deep curvilinear editing (DeCurvEd) to determine semantic commuting vector fields on the latent space. We theoretically demonstrate that owing to commutativity, the editing of multiple attributes depends only on the quantities and not on the order. Furthermore, we experimentally demonstrate that compared to previous methods, the nonlinear and commutative nature of DeCurvEd facilitates the disentanglement of image attributes and provides higher-quality editing.
    Supervised Contrastive Prototype Learning: Augmentation Free Robust Neural Network. (arXiv:2211.14424v1 [cs.LG])
    Transformations in the input space of Deep Neural Networks (DNN) lead to unintended changes in the feature space. Almost perceptually identical inputs, such as adversarial examples, can have significantly distant feature representations. On the contrary, Out-of-Distribution (OOD) samples can have highly similar feature representations to training set samples. Our theoretical analysis for DNNs trained with a categorical classification head suggests that the inflexible logit space restricted by the classification problem size is one of the root causes for the lack of $\textit{robustness}$. Our second observation is that DNNs over-fit to the training augmentation technique and do not learn $\textit{nuance invariant}$ representations. Inspired by the recent success of prototypical and contrastive learning frameworks for both improving robustness and learning nuance invariant representations, we propose a training framework, $\textbf{Supervised Contrastive Prototype Learning}$ (SCPL). We use an N-pair contrastive loss with prototypes of the same and opposite classes and replace a categorical classification head with a $\textbf{Prototype Classification Head}$ (PCH). Our approach is $\textit{sample efficient}$, does not require $\textit{sample mining}$, can be implemented on any existing DNN without modification to its architecture, and can be combined with other training augmentation techniques. We empirically evaluate the $\textbf{clean}$ robustness of our method on out-of-distribution and adversarial samples. Our framework outperforms other state-of-the-art contrastive and prototype learning approaches in $\textit{robustness}$.
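    The prototype classification head can be pictured with the minimal, hypothetical sketch below, where class logits are negative squared distances to learned prototypes. SCPL additionally uses an N-pair contrastive loss with same- and opposite-class prototypes, which is omitted here.

        # Hypothetical prototype classification head (illustrative sketch).
        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class PrototypeHead(nn.Module):
            def __init__(self, feat_dim, num_classes):
                super().__init__()
                # One learnable prototype vector per class.
                self.prototypes = nn.Parameter(torch.randn(num_classes, feat_dim))

            def forward(self, feats):
                # Negative squared Euclidean distance acts as the class logit,
                # so features close to a prototype score highly.
                return -torch.cdist(feats, self.prototypes).pow(2)

        def prototype_loss(feats, labels, head):
            # Standard cross-entropy over the prototype logits.
            return F.cross_entropy(head(feats), labels)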
    Don't Watch Me: A Spatio-Temporal Trojan Attack on Deep-Reinforcement-Learning-Augment Autonomous Driving. (arXiv:2211.14440v1 [cs.CR])
    Deep reinforcement learning (DRL) is one of the most popular algorithms for realizing an autonomous driving (AD) system. The key success factor of DRL is that it embraces the perception capability of deep neural networks, which, however, have been proven vulnerable to Trojan attacks. Trojan attacks have been widely explored in supervised learning (SL) tasks (e.g., image classification), but rarely in sequential decision-making tasks solved by DRL. Hence, in this paper, we explore Trojan attacks on DRL for AD tasks. First, we propose a spatio-temporal DRL algorithm based on the recurrent neural network and attention mechanism to prove that capturing spatio-temporal traffic features is the key factor in the effectiveness and safety of a DRL-augmented AD system. We then design a spatio-temporal Trojan attack on DRL policies, where the trigger is hidden in a sequence of spatial and temporal traffic features rather than in the single instant state used in existing Trojans on SL and DRL tasks. With our Trojan, the adversary acts as a surrounding normal vehicle and can trigger attacks via specific spatio-temporal driving behaviors rather than physical or wireless access. Through extensive experiments, we show that while capturing spatio-temporal traffic features can improve the performance of DRL for different AD tasks, they suffer from Trojan attacks, since our designed Trojan is highly stealthy (various spatio-temporal trigger patterns), effective (less than a 3.1% performance variance rate and more than a 98.5% attack success rate), and sustainable against existing advanced defenses.
    Deep Active Learning for Computer Vision: Past and Future. (arXiv:2211.14819v1 [cs.LG])
    As an important data selection schema, active learning emerges as an essential component when iterating an Artificial Intelligence (AI) model. It becomes even more critical given the dominance of deep neural network based models, which are composed of a large number of parameters and are data-hungry, in applications. Despite its indispensable role in developing AI models, research on active learning is not as intensive as that on other research directions. In this paper, we present a review of active learning through deep active learning approaches from the following perspectives: 1) technical advancements in active learning, 2) applications of active learning in computer vision, 3) industrial systems leveraging or with potential to leverage active learning for data iteration, and 4) current limitations and future research directions. We expect this paper to clarify the significance of active learning in a modern AI model manufacturing process and to bring additional research attention to active learning. By addressing data automation challenges and coping with automated machine learning systems, active learning will facilitate the democratization of AI technologies by boosting model production at scale.
    Demystifying Bitcoin Address Behavior via Graph Neural Networks. (arXiv:2211.14582v1 [cs.CR])
    Bitcoin is one of the decentralized cryptocurrencies powered by a peer-to-peer blockchain network. Parties who trade in the bitcoin network are not required to disclose any personal information. Such anonymity, however, precipitates potential malicious transactions to a certain extent. Indeed, various illegal activities such as money laundering, dark network trading, and gambling in the bitcoin network are nothing new now. While a proliferation of work has been developed to identify malicious bitcoin transactions, the behavior analysis and classification of bitcoin addresses are largely overlooked by existing tools. In this paper, we propose BAClassifier, a tool that can automatically classify bitcoin addresses based on their behaviors. Technically, we come up with the following three key designs. First, we cast the transactions of a bitcoin address into an address graph structure, for which we introduce a graph node compression technique and a graph structure augmentation method to characterize a unified graph representation. Furthermore, we leverage a graph feature network to learn the graph representation of each address and generate the graph embeddings. Finally, we aggregate all graph embeddings of an address into an address-level representation and feed it into a classification model to classify the address behavior. As a side contribution, we construct and release a large-scale annotated dataset that consists of over 2 million real-world bitcoin addresses and covers 4 types of address behaviors. Experimental results demonstrate that our proposed framework outperforms state-of-the-art bitcoin address classifiers and existing classification models, with a precision and F1-score of 96% and 95%, respectively. Our implementation and dataset are released, hoping to inspire others.
    Inverse Solvability and Security with Applications to Federated Learning. (arXiv:2211.14115v2 [stat.ML] UPDATED)
    We introduce the concepts of inverse solvability and security for a generic linear forward model and demonstrate how they can be applied to models used in federated learning. We provide examples of such models which differ in the resulting inverse solvability and security as defined in this paper. We also show how the large number of users participating in a given iteration of federated learning can be leveraged to increase both solvability and security. Finally, we discuss possible extensions of the presented concepts including the nonlinear case.
    Simple initialization and parametrization of sinusoidal networks via their kernel bandwidth. (arXiv:2211.14503v1 [cs.LG])
    Neural networks with sinusoidal activations have been proposed as an alternative to networks with traditional activation functions. Despite their promise, particularly for learning implicit models, their training behavior is not yet fully understood, leading to a number of empirical design choices that are not well justified. In this work, we first propose a simplified version of such sinusoidal neural networks, which allows both for easier practical implementation and simpler theoretical analysis. We then analyze the behavior of these networks from the neural tangent kernel perspective and demonstrate that their kernel approximates a low-pass filter with an adjustable bandwidth. Finally, we utilize these insights to inform the sinusoidal network initialization, optimizing their performance for each of a series of tasks, including learning implicit models and solving differential equations.
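    As a rough illustration of how a single frequency factor can act as the kernel-bandwidth knob analyzed here, the sketch below shows a SIREN-style sine layer with a w0 parameter and a commonly used initialization; the paper's simplified parametrization may differ in its details.

        # SIREN-style sine layer sketch; w0 plays the bandwidth role (assumed).
        import math
        import torch
        import torch.nn as nn

        class SineLayer(nn.Module):
            def __init__(self, in_dim, out_dim, w0=30.0, is_first=False):
                super().__init__()
                self.w0 = w0
                self.linear = nn.Linear(in_dim, out_dim)
                # Commonly used SIREN initialization bounds.
                bound = 1.0 / in_dim if is_first else math.sqrt(6.0 / in_dim) / w0
                nn.init.uniform_(self.linear.weight, -bound, bound)

            def forward(self, x):
                # Larger w0 yields higher-frequency features, i.e., the kernel
                # approximates a low-pass filter with a wider bandwidth.
                return torch.sin(self.w0 * self.linear(x))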
    Unsupervised User-Based Insider Threat Detection Using Bayesian Gaussian Mixture Models. (arXiv:2211.14437v1 [cs.CR])
    Insider threats are a growing concern for organizations due to the amount of damage that their members can inflict by combining their privileged access and domain knowledge. Nonetheless, the detection of such threats is challenging, precisely because of the ability of the authorized personnel to easily conduct malicious actions and because of the immense size and diversity of the audit data produced by organizations, in which the few malicious footprints are hidden. In this paper, we propose an unsupervised insider threat detection system based on audit data using Bayesian Gaussian Mixture Models. The proposed approach leverages a user-based model to optimize the modeling of specific behaviors and an automatic feature extraction system based on Word2Vec for ease of use in a real-life scenario. The solution distinguishes itself by not requiring data balancing, by not needing to be trained only on normal instances, and by the little domain knowledge required to implement it. Still, results indicate that the proposed method competes with state-of-the-art approaches, presenting a good recall of 88%, an accuracy and true negative rate of 93%, and a false positive rate of 6.9%. For our experiments, we used the benchmark dataset CERT version 4.2.
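    Since the building blocks here are standard, a minimal per-user sketch with scikit-learn's Bayesian Gaussian mixture is easy to give; feature extraction (e.g., Word2Vec over audit events) is assumed to have already produced the vectors, and the component count is a placeholder.

        # Per-user unsupervised anomaly scoring with a Bayesian GMM (sketch).
        from sklearn.mixture import BayesianGaussianMixture

        def fit_user_model(user_features, max_components=10):
            # Unsupervised fit on the user's own history; no label balancing
            # and no normal-only training set are required.
            gmm = BayesianGaussianMixture(
                n_components=max_components,
                weight_concentration_prior_type="dirichlet_process")
            return gmm.fit(user_features)

        def anomaly_scores(gmm, new_features):
            # Low log-likelihood under the user's model flags candidate threats.
            return -gmm.score_samples(new_features)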
    A Systematic Review of Robustness in Deep Learning for Computer Vision: Mind the gap?. (arXiv:2112.00639v2 [cs.CV] UPDATED)
    Deep neural networks for computer vision are deployed in increasingly safety-critical and socially-impactful applications, motivating the need to close the gap in model performance under varied, naturally occurring imaging conditions. Robustness, ambiguously used in multiple contexts including adversarial machine learning, refers here to preserving model performance under naturally-induced image corruptions or alterations. We perform a systematic review to identify, analyze, and summarize current definitions and progress towards non-adversarial robustness in deep learning for computer vision. We find this area of research has received disproportionately less attention relative to adversarial machine learning, yet a significant robustness gap exists that manifests in performance degradation similar in magnitude to adversarial conditions. Toward developing a more transparent definition of robustness, we provide a conceptual framework based on a structural causal model of the data generating process and interpret non-adversarial robustness as pertaining to a model's behavior on corrupted images corresponding to low-probability samples from the unaltered data distribution. We identify key architecture-, data augmentation-, and optimization tactics for improving neural network robustness. This robustness perspective reveals that common practices in the literature correspond to causal concepts. We offer perspectives on how future research may mind this evident and significant non-adversarial robustness gap.
    Contextual Expressive Text-to-Speech. (arXiv:2211.14548v1 [eess.AS])
    The goal of expressive Text-to-Speech (TTS) is to synthesize natural speech with desired content, prosody, emotion, or timbre, with high expressiveness. Most previous studies attempt to generate speech from given labels of styles and emotions, which over-simplifies the problem by classifying styles and emotions into a fixed number of pre-defined categories. In this paper, we introduce a new task setting, Contextual TTS (CTTS). The main idea of CTTS is that how a person speaks depends on the particular context she is in, where the context can typically be represented as text. Thus, in the CTTS task, we propose to utilize such context to guide the speech synthesis process instead of relying on explicit labels of styles and emotions. To achieve this task, we construct a synthetic dataset and develop an effective framework. Experiments show that our framework can generate high-quality expressive speech based on the given context, both on synthetic datasets and in real-world scenarios.
    Efficient Aggregated Kernel Tests using Incomplete $U$-statistics. (arXiv:2206.09194v2 [stat.ML] UPDATED)
    We propose a series of computationally efficient nonparametric tests for the two-sample, independence, and goodness-of-fit problems, using the Maximum Mean Discrepancy (MMD), Hilbert Schmidt Independence Criterion (HSIC), and Kernel Stein Discrepancy (KSD), respectively. Our test statistics are incomplete $U$-statistics, with a computational cost that interpolates between linear time in the number of samples, and quadratic time, as associated with classical $U$-statistic tests. The three proposed tests aggregate over several kernel bandwidths to detect departures from the null on various scales: we call the resulting tests MMDAggInc, HSICAggInc and KSDAggInc. This procedure provides a solution to the fundamental kernel selection problem as we can aggregate a large number of kernels with several bandwidths without incurring a significant loss of test power. For the test thresholds, we derive a quantile bound for wild bootstrapped incomplete $U$-statistics, which is of independent interest. We derive non-asymptotic uniform separation rates for MMDAggInc and HSICAggInc, and quantify exactly the trade-off between computational efficiency and the attainable rates: this result is novel for tests based on incomplete $U$-statistics, to our knowledge. We further show that in the quadratic-time case, the wild bootstrap incurs no penalty to test power over the more widespread permutation-based approach, since both attain the same minimax optimal rates (which in turn match the rates that use oracle quantiles). We support our claims with numerical experiments on the trade-off between computational efficiency and test power. In all three testing frameworks, the linear-time versions of our proposed tests perform at least as well as the current linear-time state-of-the-art tests.
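    To make the incomplete-$U$-statistic idea tangible for the two-sample (MMD) case: instead of averaging the MMD core function over all $O(n^2)$ index pairs, one averages over a random subset of pairs, trading a little power for near-linear cost. The Gaussian kernel, bandwidth grid, and pair budget in the sketch below are assumptions, and the paper's wild-bootstrap threshold calibration is omitted.

        import numpy as np

        def gauss_k(a, b, bw):
            # Gaussian kernel evaluated row-wise on paired samples.
            return np.exp(-np.sum((a - b) ** 2, axis=-1) / (2 * bw ** 2))

        def mmd_incomplete(X, Y, bw, n_pairs, rng):
            # Incomplete U-statistic: average the MMD core
            # h = k(x_i,x_j) + k(y_i,y_j) - k(x_i,y_j) - k(x_j,y_i)
            # over a random subset of index pairs rather than all of them.
            n = len(X)
            i = rng.integers(0, n, size=n_pairs)
            j = rng.integers(0, n, size=n_pairs)
            i, j = i[i != j], j[i != j]   # U-statistics exclude the diagonal
            h = (gauss_k(X[i], X[j], bw) + gauss_k(Y[i], Y[j], bw)
                 - gauss_k(X[i], Y[j], bw) - gauss_k(X[j], Y[i], bw))
            return h.mean()

        rng = np.random.default_rng(0)
        X = rng.normal(0.0, 1.0, size=(500, 2))
        Y = rng.normal(0.5, 1.0, size=(500, 2))
        # Aggregating over several bandwidths mirrors the AggInc construction.
        stats = [mmd_incomplete(X, Y, bw, 2000, rng) for bw in (0.5, 1.0, 2.0)]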
    D-Point Trigonometric Path Planning based on Q-Learning in Uncertain Environments. (arXiv:1910.12020v1 [cs.AI] CROSS LISTED)
    Finding the optimal path for a robot moving from a start position to a goal position through obstacles remains a challenging issue. This paper presents a novel path planning method, named D-point trigonometric, based on the Q-learning algorithm for dynamic and uncertain environments in which all the obstacles and the target are moving. We define new state, action, and reward functions for the Q-learning agent, by which it can find the best action in every state to reach the goal along the most appropriate path. The D-point approach minimizes the possible number of states. Moreover, experiments in Unity3D confirmed the high convergence speed, the high hit rate, and the low dependency on environmental parameters of the proposed method compared with a competing approach.
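    Since the paper's specific D-point state, action, and reward definitions are not reproduced here, the following hedged sketch shows only the standard tabular Q-learning update it builds on, with a placeholder environment standing in for the Unity3D scene.

        import numpy as np

        n_states, n_actions = 64, 8
        Q = np.zeros((n_states, n_actions))
        alpha, gamma, eps = 0.1, 0.95, 0.1   # learning rate, discount, exploration
        rng = np.random.default_rng(0)

        def step(s, a):
            # Hypothetical environment transition: (next_state, reward, done).
            return int(rng.integers(n_states)), float(rng.normal()), False

        s = int(rng.integers(n_states))
        for t in range(10000):
            # Epsilon-greedy action selection.
            a = int(rng.integers(n_actions)) if rng.random() < eps else int(np.argmax(Q[s]))
            s_next, r, done = step(s, a)
            # Q-learning temporal-difference update.
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s = int(rng.integers(n_states)) if done else s_next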
    The Impact of Racial Distribution in Training Data on Face Recognition Bias: A Closer Look. (arXiv:2211.14498v1 [cs.CV])
    Face recognition algorithms, when used in the real world, can be very useful, but they can also be dangerous when biased toward certain demographics. It is therefore essential to understand how these algorithms are trained and what factors affect their accuracy and fairness in order to build better ones. In this study, we shed some light on the effect of the racial distribution of the training data on the performance of face recognition models. We conduct 16 different experiments with varying racial distributions of faces in the training data. We analyze these trained models using accuracy metrics, clustering metrics, UMAP projections, face quality, and decision thresholds. We show that a uniform distribution of races in the training datasets alone does not guarantee bias-free face recognition algorithms, and that factors like face image quality play a crucial role. We also study the correlation between the clustering metrics and bias to understand whether clustering is a good indicator of bias. Finally, we introduce a metric called racial gradation to study the inter- and intra-race correlations in facial features and how they affect the learning ability of face recognition models. With this study, we try to bring more understanding to an essential element of face recognition training: the data. A better understanding of the impact of training data on the bias of face recognition algorithms will aid in creating better datasets and, in turn, better face recognition systems.
    Transfer learning with high-dimensional quantile regression. (arXiv:2211.14578v1 [stat.ML])
    Transfer learning has become an essential technique for exploiting information from a source domain to boost performance on a target task. Despite their prevalence in high-dimensional data, heterogeneity and/or heavy tails tend to be discounted by current transfer learning approaches and may thus undermine the resulting performance. We propose a transfer learning procedure in the framework of high-dimensional quantile regression models to accommodate heterogeneity and heavy tails in the source and target domains. We establish error bounds for the transfer learning estimator based on delicately selected transferable source domains, showing that lower error bounds can be achieved under a suitable selection criterion and with larger sample sizes for the source tasks. We further propose valid confidence interval and hypothesis test procedures for individual components of the quantile regression coefficients by advocating a one-step debiased version of the transfer learning estimator, wherein consistent variance estimation is obtained via the technique of transfer learning again. Simulation results demonstrate that the proposed method exhibits favorable performance.
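    For readers unfamiliar with the underlying objective: quantile regression estimates the conditional $\tau$-th quantile by minimizing the check (pinball) loss, typically with an $\ell_1$ penalty in high dimensions. A minimal sketch of the loss on toy residuals follows; the residuals are illustrative stand-ins.

        import numpy as np

        def pinball_loss(residual, tau):
            # Check loss rho_tau(u) = u * (tau - 1{u < 0}).
            return residual * (tau - (residual < 0).astype(float))

        rng = np.random.default_rng(0)
        u = rng.standard_normal(5)              # toy residuals y - x @ beta
        print(pinball_loss(u, tau=0.9).mean())  # objective term before the l1 penalty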
    A Time Series is Worth 64 Words: Long-term Forecasting with Transformers. (arXiv:2211.14730v1 [cs.LG])
    We propose an efficient design of Transformer-based models for multivariate time series forecasting and self-supervised representation learning. It is based on two key components: (i) segmentation of time series into subseries-level patches, which serve as input tokens to the Transformer; (ii) channel-independence, where each channel contains a single univariate time series and all channels share the same embedding and Transformer weights. The patching design naturally has a three-fold benefit: local semantic information is retained in the embedding; the computation and memory usage of the attention maps are quadratically reduced for the same look-back window; and the model can attend to a longer history. Our channel-independent patch time series Transformer (PatchTST) can significantly improve long-term forecasting accuracy compared with SOTA Transformer-based models. We also apply our model to self-supervised pre-training tasks and attain excellent fine-tuning performance, which outperforms supervised training on large datasets. Transferring masked pre-trained representations from one dataset to others also produces SOTA forecasting accuracy. Code is available at: https://github.com/yuqinie98/PatchTST.
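    A minimal sketch of the two components, assuming illustrative sizes: each univariate channel is split into overlapping patches via a strided unfold, and a shared linear layer embeds every patch into a token for the (shared) Transformer.

        import torch

        batch, channels, lookback = 32, 7, 512
        patch_len, stride = 16, 8

        x = torch.randn(batch, channels, lookback)
        # (batch, channels, n_patches, patch_len): subseries-level patches.
        patches = x.unfold(-1, patch_len, stride)
        # Channel-independence: fold channels into the batch so every channel
        # goes through the same embedding and Transformer weights.
        tokens = patches.reshape(batch * channels, patches.shape[2], patch_len)

        embed = torch.nn.Linear(patch_len, 128)   # shared patch embedding
        z = embed(tokens)                         # (batch*channels, n_patches, d_model)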
    Distribution estimation and change-point detection for time series via DNN-based GANs. (arXiv:2211.14577v1 [cs.LG])
    Generative adversarial networks (GANs) have recently been applied to estimating the distribution of independent and identically distributed data, with excellent performance. In this paper, we use the blocking technique to demonstrate the effectiveness of GANs for estimating the distribution of stationary time series. Theoretically, we obtain a non-asymptotic error bound for the Deep Neural Network (DNN)-based GAN estimator of the stationary distribution of a time series. Based on our theoretical analysis, we put forward an algorithm for detecting change-points in time series. In our first experiment, we simulate a stationary time series from a multivariate autoregressive model to test our GAN estimator, while the second experiment uses our proposed algorithm to detect a change-point in a time series sequence. Both perform very well. In the third experiment, we use our GAN estimator to learn the distribution of real financial time series data, which is not stationary; the experimental results show that our estimator cannot match the distribution of this time series exactly, but it does capture the correct trend.
    Transfer RL via the Undo Maps Formalism. (arXiv:2211.14469v1 [cs.LG])
    Transferring knowledge across domains is one of the most fundamental problems in machine learning, but doing so effectively in the context of reinforcement learning remains largely an open problem. Current methods make strong assumptions on the specifics of the task, often lack principled objectives, and -- crucially -- modify individual policies, which might be sub-optimal when the domains differ due to a drift in the state space, i.e., when the change is intrinsic to the environment and therefore affects every agent interacting with it. To address these drawbacks, we propose TvD: transfer via distribution matching, a framework to transfer knowledge across interactive domains. We approach the problem from a data-centric perspective, characterizing the discrepancy between environments by means of a (potentially complex) transformation between their state spaces, and thus posing the problem of transfer as learning to undo this transformation. To accomplish this, we introduce a novel optimization objective based on an optimal transport distance between two distributions over trajectories -- those generated by an already-learned policy in the source domain and a learnable pushforward policy in the target domain. We show this objective leads to a policy update scheme reminiscent of imitation learning, and derive an efficient algorithm to implement it. Our experiments in simple gridworlds show that this method yields successful transfer learning across a wide range of environment transformations.
    Siamese based Neural Network for Offline Writer Identification on word level data. (arXiv:2211.14443v1 [cs.CV])
    Handwriting recognition is one of the desirable attributes of document comprehension and analysis. It is concerned with a document's writing style and the characteristics that distinguish its author. The diversity of text images, notably in images with varying handwriting, makes it difficult to learn good features when little data is available. In this paper, we propose a novel scheme to identify the author of a document based on an input word image. Our method is text-independent and does not impose any constraint on the size of the input image under examination. To begin with, we detect crucial components in the handwriting and extract regions surrounding them using the Scale Invariant Feature Transform (SIFT). These patches are designed to capture individual writing features (including allographs, characters, or combinations of characters) that are likely to be unique to an individual writer. These features are then passed through a deep Convolutional Neural Network (CNN) whose weights are learned by applying the concept of similarity learning using a Siamese network. The Siamese network enhances the discrimination power of the CNN by mapping the similarity between pairs of input images. Features learned at different scales of the extracted SIFT key-points are encoded using Sparse PCA, and each component of the Sparse PCA is assigned a saliency score signifying its level of significance in discriminating between different writers. Finally, the weighted Sparse PCA features corresponding to the SIFT key-points are combined to arrive at a final classification score for each writer. The proposed algorithm was evaluated on two publicly available databases (namely IAM and CVL) and achieves promising results when compared with other deep-learning-based algorithms.
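    The Siamese/contrastive component can be sketched in a few lines of Python: a shared encoder embeds two word-image patches, and a contrastive loss pulls same-writer pairs together while pushing different-writer pairs apart. The encoder, patch size, and margin below are illustrative assumptions; the paper's SIFT extraction and Sparse PCA stages are not shown.

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class Encoder(nn.Module):
            # Small shared CNN standing in for the paper's deeper network.
            def __init__(self):
                super().__init__()
                self.net = nn.Sequential(
                    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
                    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 64))
            def forward(self, x):
                return self.net(x)

        def contrastive_loss(z1, z2, same_writer, margin=1.0):
            # Same-writer pairs are pulled together; others pushed past the margin.
            d = F.pairwise_distance(z1, z2)
            return (same_writer * d.pow(2)
                    + (1 - same_writer) * F.relu(margin - d).pow(2)).mean()

        enc = Encoder()
        a = torch.randn(8, 1, 64, 64)   # patches around SIFT key-points (stand-ins)
        b = torch.randn(8, 1, 64, 64)
        y = torch.randint(0, 2, (8,)).float()   # 1 = same writer, 0 = different
        loss = contrastive_loss(enc(a), enc(b), y)
        loss.backward()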
    Synergies Between Disentanglement and Sparsity: a Multi-Task Learning Perspective. (arXiv:2211.14666v1 [cs.LG])
    Although disentangled representations are often said to be beneficial for downstream tasks, current empirical and theoretical understanding is limited. In this work, we provide evidence that disentangled representations coupled with sparse base-predictors improve generalization. In the context of multi-task learning, we prove a new identifiability result that provides conditions under which maximally sparse base-predictors yield disentangled representations. Motivated by this theoretical result, we propose a practical approach to learn disentangled representations based on a sparsity-promoting bi-level optimization problem. Finally, we explore a meta-learning version of this algorithm based on group-Lasso multiclass SVM base-predictors, for which we derive a tractable dual formulation. It obtains competitive results on standard few-shot classification benchmarks, while each task uses only a fraction of the learned representations.
    SGCE-Font: Skeleton Guided Channel Expansion for Chinese Font Generation. (arXiv:2211.14475v1 [cs.CV])
    The automatic generation of Chinese fonts is an important problem involved in many applications. The predominant methods for Chinese font generation are based on deep generative models, especially generative adversarial networks (GANs). However, existing GAN-based methods (say, CycleGAN) for Chinese font generation usually suffer from the mode collapse issue, mainly due to the lack of effective guidance information. This paper proposes a novel information guidance module called the skeleton guided channel expansion (SGCE) module for Chinese font generation, which integrates skeleton information into the generator via channel expansion, motivated by the observation that the skeleton embodies both local and global structure information of Chinese characters. We conduct extensive experiments to show the effectiveness of the proposed module. Numerical results show that the mode collapse issue suffered by CycleGAN can be effectively alleviated by equipping it with the proposed SGCE module, and that CycleGAN equipped with SGCE outperforms state-of-the-art models in terms of four important evaluation metrics and visualization quality. Besides CycleGAN, we also show that the suggested SGCE module can be adapted to other Chinese font generation models as a plug-and-play module to further improve their performance.
    Granular-Ball Fuzzy Set and Its Implementation in SVM. (arXiv:2210.11675v2 [cs.LG] UPDATED)
    Most existing fuzzy set methods use points as their input, which is the finest granularity from the perspective of granular computing. Consequently, these methods are neither efficient nor robust to label noise. Therefore, we propose a framework called granular-ball fuzzy set, which introduces granular-ball computing into fuzzy sets. The computational framework is based on granular-ball inputs rather than point inputs; it is therefore more efficient and robust than traditional fuzzy methods, and, owing to its extensibility, can be used in various fields of fuzzy data processing. Furthermore, the framework is extended to the fuzzy support vector machine (FSVM) classifier to derive the granular-ball fuzzy SVM (GBFSVM). The experimental results demonstrate the effectiveness and efficiency of GBFSVM.
    L3Cube-HindBERT and DevBERT: Pre-Trained BERT Transformer models for Devanagari based Hindi and Marathi Languages. (arXiv:2211.11418v2 [cs.CL] UPDATED)
    The monolingual Hindi BERT models currently available on the model hub do not perform better than the multilingual models on downstream tasks. We present L3Cube-HindBERT, a Hindi BERT model pre-trained on a Hindi monolingual corpus. Further, since the Indic languages Hindi and Marathi share the Devanagari script, we train a single model for both languages. We release DevBERT, a Devanagari BERT model trained on both Marathi and Hindi monolingual datasets. We evaluate these models on downstream Hindi and Marathi text classification and named entity recognition tasks. The HindBERT- and DevBERT-based models show superior performance compared to their multilingual counterparts. These models are shared at https://huggingface.co/l3cube-pune .
    Mixture Manifold Networks: A Computationally Efficient Baseline for Inverse Modeling. (arXiv:2211.14366v1 [cs.LG])
    We propose, and show the efficacy of, a new method to address generic inverse problems. Inverse modeling is the task whereby one seeks to determine the control parameters of a natural system that produce a given set of observed measurements. Recent work has shown impressive results using deep learning, but we note that there is a trade-off between model performance and computational time. For some applications, the computational time at inference for the best-performing inverse modeling method may be prohibitive to its use. We present a new method that leverages multiple manifolds as a mixture of backward (i.e., inverse) models in a forward-backward model architecture. These multiple backward models all share a common forward model, and their training is facilitated by generating training examples from the forward model. The proposed method thus has two innovations: 1) the Mixture Manifold Network (MMN) architecture, and 2) the training procedure, which augments the backward models' training data using the forward model. We demonstrate the advantages of our method by comparing it to several baselines on four benchmark inverse problems, and we furthermore provide analysis to motivate its design.
    Multistep prediction for short-term wind speed based on the MLP and LSTM method with rankpooling. (arXiv:2211.14434v1 [cs.LG])
    Actual wind speed data are intermittent and fluctuating, which makes it very difficult to forecast wind speed with high accuracy using single or shallow models. Hence, with the purpose of improving forecasting accuracy and obtaining better forecasting results, this paper proposes a novel hybrid deep learning model for multistep forecasting of wind speed, abbreviated as LR-FFT-RP-LSTM. In this model, the rankpooling method is first presented to extract local features of the raw meteorological data, and the Fast Fourier Transform (FFT) is adopted to extract local and global features; the pre-processed data obtained from these two procedures are then integrated with the original data to produce two input datasets. A multi-layer perceptron (MLP) and a long short-term memory network (LSTM) are then adopted to predict the wind speed, and the target prediction results are obtained by integrating the preliminary predictions via linear regression. Practical wind speed data from 2010 to 2020 are exploited to evaluate the performance of the proposed model. Case study results indicate that the proposed model has a superior forecasting capability for wind speed. Moreover, the proposed hybrid model is very competitive compared with the state-of-the-art single models and the other hybrid models considered in this paper.
    Sign Language to Text Conversion in Real Time using Transfer Learning. (arXiv:2211.14446v1 [cs.CV])
    Hearing-impaired people face many obstacles in communication and require an interpreter to comprehend what a person is saying. Despite constant scientific research, existing models still lack the ability to make accurate predictions. We therefore propose a deep learning model trained on American Sign Language (ASL) that takes ASL gestures as input and translates them into text. A Convolutional Neural Network based on the VGG16 architecture is used together with a TensorFlow model for image classification, and transfer learning improves accuracy by over 4%, from 94% with a plain CNN to 98.7%. An application integrating the deep learning model has also been built.
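    The transfer learning recipe described above can be sketched with Keras: freeze ImageNet-pretrained VGG16 convolutional features and train a small classification head. The input size, the 26-class head, and the commented-out training call are illustrative assumptions, not the authors' code.

        import tensorflow as tf

        base = tf.keras.applications.VGG16(weights="imagenet", include_top=False,
                                           input_shape=(224, 224, 3))
        base.trainable = False   # freeze pre-trained convolutional features

        model = tf.keras.Sequential([
            base,
            tf.keras.layers.GlobalAveragePooling2D(),
            tf.keras.layers.Dense(256, activation="relu"),
            tf.keras.layers.Dense(26, activation="softmax"),  # hypothetical A-Z head
        ])
        model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
        # model.fit(train_ds, validation_data=val_ds, epochs=10)  # datasets assumed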
    Mitigating Relational Bias on Knowledge Graphs. (arXiv:2211.14489v1 [cs.AI])
    Knowledge graph data are prevalent in real-world applications, and knowledge graph neural networks (KGNNs) are essential techniques for knowledge graph representation learning. Although KGNNs effectively model the structural information of knowledge graphs, these frameworks amplify the underlying data bias, which leads to discrimination towards certain groups or individuals in resulting applications. Additionally, as existing debiasing approaches mainly focus on entity-wise bias, eliminating the multi-hop relational bias that pervasively exists in knowledge graphs remains an open question. However, it is very challenging to eliminate relational bias due to the sparsity of the paths that generate the bias and the non-linear proximity structure of knowledge graphs. To tackle these challenges, we propose Fair-KGNN, a KGNN framework that simultaneously alleviates multi-hop bias and preserves the entity-to-relation proximity information in knowledge graphs. The proposed framework generalizes to mitigate relational bias for all types of KGNN. We develop two instances of Fair-KGNN incorporating two state-of-the-art KGNN models, RGCN and CompGCN, to mitigate gender-occupation and nationality-salary bias. The experiments carried out on three benchmark knowledge graph datasets demonstrate that Fair-KGNN can effectively mitigate unfair situations during representation learning while preserving the predictive performance of the KGNN models.
    Deep Learning and Linear Programming for Automated Ensemble Forecasting and Interpretation. (arXiv:2201.00426v2 [cs.LG] UPDATED)
    This paper presents an ensemble forecasting method, termed DONUT (DO Not UTilize human beliefs), that shows strong results on the M4 Competition dataset by decreasing feature and model selection assumptions. Our assumption reductions, primarily consisting of auto-generated features and a more diverse model pool for the ensemble, lead to significantly outperforming the statistical, feature-based ensemble method FFORMA by Montero-Manso et al. (2020). We also investigate feature extraction with a Long Short-Term Memory (LSTM) network autoencoder and find that such features contain crucial information not captured by standard statistical feature approaches. The ensemble weighting model uses both LSTM and statistical features to combine the models accurately. The analysis of feature importance and interaction shows a slight superiority of LSTM features over the statistical ones alone. Clustering analysis shows that essential LSTM features differ from most statistical features and from each other. We also find that augmenting the ensemble with new models to increase the solution space of the weighting model is something the weighting model learns to exploit, thus explaining part of the accuracy gains. Moreover, we present a formal ex-post-facto analysis of optimal combination and selection for ensembles, quantifying differences through linear optimization on the M4 dataset. Our findings indicate that classical statistical time series features, such as trend and seasonality, alone do not capture all relevant information for forecasting a time series. On the contrary, our novel LSTM features contain significantly more predictive power than the statistical ones alone, but combining the two feature sets proved the best in practice.
    Multiclass classification using quantum convolutional neural networks with hybrid quantum-classical learning. (arXiv:2203.15368v2 [quant-ph] UPDATED)
    Multiclass classification is of great interest for various applications; for example, it is a common task in computer vision, where one needs to categorize an image into three or more classes. Here we propose a quantum machine learning approach based on quantum convolutional neural networks for solving the multiclass classification problem. The corresponding learning procedure is implemented via TensorFlow Quantum as a hybrid quantum-classical (variational) model, where quantum outputs are fed to a softmax activation function with subsequent minimization of the cross-entropy loss by optimizing the parameters of the quantum circuit. Our conceptual improvements include a new model for a quantum perceptron and an optimized structure of the quantum circuit. We use the proposed approach to solve a 4-class classification problem on the MNIST dataset using eight qubits for data encoding and four ancilla qubits; previous results had been obtained only for 3-class classification problems. Our results show that the accuracy of our solution is similar to that of classical convolutional neural networks with comparable numbers of trainable parameters. We expect that our findings provide a new step towards the use of quantum neural networks for solving relevant problems in the NISQ era and beyond.
    Transform Once: Efficient Operator Learning in Frequency Domain. (arXiv:2211.14453v1 [cs.LG])
    Spectral analysis provides one of the most effective paradigms for information-preserving dimensionality reduction, as simple descriptions of naturally occurring signals are often obtained via few terms of periodic basis functions. In this work, we study deep neural networks designed to harness structure in the frequency domain for efficient learning of long-range correlations in space or time: frequency-domain models (FDMs). Existing FDMs are based on complex-valued transforms, i.e., Fourier transforms (FT), and layers that perform computation on the spectrum and the input data separately. This design introduces considerable computational overhead: a forward and an inverse FT for each layer. Instead, this work introduces a blueprint for frequency-domain learning through a single transform: transform once (T1). To enable efficient, direct learning in the frequency domain, we derive a variance-preserving weight initialization scheme and investigate methods for frequency selection in reduced-order FDMs. Our results noticeably streamline the design process of FDMs, pruning redundant transforms and leading to speedups of 3x to 10x that increase with data resolution and model size. We perform extensive experiments on learning the solution operator of spatio-temporal dynamics, including incompressible Navier-Stokes, turbulent flows around airfoils, and high-resolution video of smoke. T1 models improve on the test performance of FDMs while requiring significantly less computation (5 hours instead of 32 for our large-scale experiment), with over 20% reduction in average predictive error across tasks.
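    The transform-once idea can be sketched in one dimension: apply a single forward FFT, stack layers that operate purely on the retained spectrum, and invert once at the end. The layer form, sizes, and initialization below are assumptions; the paper's variance-preserving initialization and mode-selection schemes are omitted.

        import torch

        class SpectralLayer(torch.nn.Module):
            # Learns one complex weight per channel and retained frequency mode.
            def __init__(self, channels, n_modes):
                super().__init__()
                w = torch.randn(channels, n_modes, dtype=torch.cfloat) / n_modes ** 0.5
                self.weight = torch.nn.Parameter(w)
            def forward(self, z):              # z: (batch, channels, n_modes), complex
                return z * self.weight

        x = torch.randn(4, 8, 256)             # (batch, channels, grid)
        z = torch.fft.rfft(x, dim=-1)          # forward transform, applied once
        layers = torch.nn.ModuleList(
            [SpectralLayer(8, z.shape[-1]) for _ in range(3)])
        for layer in layers:
            z = layer(z)                       # all computation stays in frequency space
        y = torch.fft.irfft(z, n=x.shape[-1], dim=-1)  # inverse transform, applied once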
    PatchGT: Transformer over Non-trainable Clusters for Learning Graph Representations. (arXiv:2211.14425v1 [cs.LG])
    Recently, the Transformer structure has shown good performance in graph learning tasks. However, these Transformer models work directly on graph nodes and may have difficulty learning high-level information. Inspired by the vision transformer, which operates on image patches, we propose a new Transformer-based graph neural network: Patch Graph Transformer (PatchGT). Unlike previous Transformer-based models for learning graph representations, PatchGT learns from non-trainable graph patches, not from nodes directly. This helps save computation and improves model performance. The key idea is to segment a graph into patches based on spectral clustering without any trainable parameters, so that the model can first use GNN layers to learn patch-level representations and then use a Transformer to obtain graph-level representations. The architecture leverages the spectral information of graphs and combines the strengths of GNNs and Transformers. Further, we show the limitations of previous hierarchical trainable clusters theoretically and empirically. We also prove that the proposed non-trainable spectral clustering method is permutation invariant and can help address the information bottlenecks in the graph. PatchGT achieves higher expressiveness than 1-WL-type GNNs, and the empirical study shows that PatchGT achieves competitive performance on benchmark datasets and provides interpretability for its predictions. The implementation of our algorithm is released at our GitHub repo: https://github.com/tufts-ml/PatchGT.
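    The non-trainable patch step can be approximated with off-the-shelf spectral clustering on the adjacency matrix; the toy graph, cluster count, and downstream pooling comment below are illustrative assumptions rather than the released implementation.

        import numpy as np
        from sklearn.cluster import SpectralClustering

        rng = np.random.default_rng(0)
        A = (rng.random((40, 40)) < 0.15).astype(float)
        A = np.maximum(A, A.T)          # symmetrize: undirected toy graph
        np.fill_diagonal(A, 0.0)

        # Parameter-free patch formation: cluster nodes via spectral clustering.
        labels = SpectralClustering(n_clusters=5, affinity="precomputed",
                                    random_state=0).fit_predict(A)
        patches = [np.where(labels == c)[0] for c in range(5)]
        # A GNN would pool node embeddings within each patch, and a Transformer
        # would then attend over the resulting patch-level tokens.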
    BEAR: Physics-Principled Building Environment for Control and Reinforcement Learning. (arXiv:2211.14744v1 [eess.SY])
    Recent advancements in reinforcement learning algorithms have opened doors for researchers to operate and optimize building energy management systems autonomously. However, the lack of an easily configurable building dynamical model and energy management task simulation and evaluation platform has arguably slowed the progress in developing advanced and dedicated reinforcement learning (RL) and control algorithms for building operation tasks. Here we propose "BEAR", a physics-principled Building Environment for Control And Reinforcement Learning. The platform allows researchers to benchmark both model-based and model-free controllers using a broad collection of standard building models in Python without co-simulation using external building simulators. In this paper, we discuss the design of this platform and compare it with other existing building simulation frameworks. We demonstrate the compatibility and performance of BEAR with different controllers, including both model predictive control (MPC) and several state-of-the-art RL methods with two case studies.
    How Crucial is Transformer in Decision Transformer?. (arXiv:2211.14655v1 [cs.LG])
    Decision Transformer (DT) is a recently proposed architecture for Reinforcement Learning that frames the decision-making process as an auto-regressive sequence modeling problem and uses a Transformer model to predict the next action in a sequence of states, actions, and rewards. In this paper, we analyze how crucial the Transformer model is in the complete DT architecture on continuous control tasks. Namely, we replace the Transformer by an LSTM model while keeping the other parts unchanged to obtain what we call a Decision LSTM model. We compare it to DT on continuous control tasks, including pendulum swing-up and stabilization, in simulation and on physical hardware. Our experiments show that DT struggles with continuous control problems, such as inverted pendulum and Furuta pendulum stabilization. On the other hand, the proposed Decision LSTM is able to achieve expert-level performance on these tasks, in addition to learning a swing-up controller on the real system. These results suggest that the strength of the Decision Transformer for continuous control tasks may lie in the overall sequential modeling architecture and not in the Transformer per se.
    Elements of effective machine learning datasets in astronomy. (arXiv:2211.14401v1 [astro-ph.IM])
    In this work, we identify elements of effective machine learning datasets in astronomy and present suggestions for their design and creation. Machine learning has become an increasingly important tool for analyzing and understanding the large-scale flood of data in astronomy. To take advantage of these tools, datasets are required for training and testing. However, building machine learning datasets for astronomy can be challenging. Astronomical data is collected from instruments built to explore science questions in a traditional fashion rather than to conduct machine learning. Thus, it is often the case that raw data, or even downstream processed data is not in a form amenable to machine learning. We explore the construction of machine learning datasets and we ask: what elements define effective machine learning datasets? We define effective machine learning datasets in astronomy to be formed with well-defined data points, structure, and metadata. We discuss why these elements are important for astronomical applications and ways to put them in practice. We posit that these qualities not only make the data suitable for machine learning, they also help to foster usable, reusable, and replicable science practices.
    Mixture of Decision Trees for Interpretable Machine Learning. (arXiv:2211.14617v1 [cs.LG])
    This work introduces a novel interpretable machine learning method called Mixture of Decision Trees (MoDT). It constitutes a special case of the Mixture of Experts ensemble architecture, which utilizes a linear model as gating function and decision trees as experts. Our proposed method is ideally suited for problems that cannot be satisfactorily learned by a single decision tree, but which can alternatively be divided into subproblems. Each subproblem can then be learned well from a single decision tree. Therefore, MoDT can be considered as a method that improves performance while maintaining interpretability by making each of its decisions understandable and traceable to humans. Our work is accompanied by a Python implementation, which uses an interpretable gating function, a fast learning algorithm, and a direct interface to fine-tuned interpretable visualization methods. The experiments confirm that the implementation works and, more importantly, show the superiority of our approach compared to single decision trees and random forests of similar complexity.
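    An illustrative two-stage approximation of the architecture (not the paper's joint training): partition the input space, fit a linear, interpretable gate to reproduce the partition, and train one shallow decision tree per region. All dataset and model choices below are assumptions.

        import numpy as np
        from sklearn.cluster import KMeans
        from sklearn.datasets import make_classification
        from sklearn.linear_model import LogisticRegression
        from sklearn.tree import DecisionTreeClassifier

        X, y = make_classification(n_samples=600, n_features=10, random_state=0)

        # Stage 1: carve the data into regions and fit a linear gate on them.
        regions = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
        gate = LogisticRegression(max_iter=1000).fit(X, regions)

        # Stage 2: one shallow (readable) decision-tree expert per region.
        experts = [DecisionTreeClassifier(max_depth=3).fit(X[regions == e], y[regions == e])
                   for e in range(3)]

        def predict(x_row):
            # Route through the gate, then let the selected expert decide.
            e = int(gate.predict(x_row[None])[0])
            return experts[e].predict(x_row[None])[0]

        print(predict(X[0]), y[0])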
    Generalizing Gaussian Smoothing for Random Search. (arXiv:2211.14721v1 [cs.LG])
    Gaussian smoothing (GS) is a derivative-free optimization (DFO) algorithm that estimates the gradient of an objective using perturbations of the current parameters sampled from a standard normal distribution. We generalize it to sampling perturbations from a larger family of distributions. Based on an analysis of DFO for non-convex functions, we propose to choose a distribution for perturbations that minimizes the mean squared error (MSE) of the gradient estimate. We derive three such distributions with provably smaller MSE than Gaussian smoothing. We conduct evaluations of the three sampling distributions on linear regression, reinforcement learning, and DFO benchmarks in order to validate our claims. Our proposal improves on GS at the same computational complexity, and is usually competitive with and often outperforms Guided ES and Orthogonal ES, two computationally more expensive algorithms that adapt the covariance matrix of normally distributed perturbations.
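    For reference, the baseline GS estimator the paper generalizes can be written in a few lines; the antithetic (mirrored) form below is one common variant, and the toy objective is an assumption.

        import numpy as np

        def gs_grad(f, x, sigma=0.1, n_samples=64, rng=None):
            # Gaussian-smoothing gradient estimate with antithetic sampling:
            # grad f(x) ~ E[ (f(x + sigma*u) - f(x - sigma*u)) / (2*sigma) * u ],
            # u ~ N(0, I). The paper swaps N(0, I) for distributions chosen to
            # minimize the MSE of this estimate.
            rng = rng or np.random.default_rng(0)
            u = rng.standard_normal((n_samples, x.size))
            fwd = np.array([f(x + sigma * ui) for ui in u])
            bwd = np.array([f(x - sigma * ui) for ui in u])
            return ((fwd - bwd)[:, None] * u).mean(axis=0) / (2 * sigma)

        f = lambda x: np.sum(x ** 2)      # toy objective; true gradient is 2x
        x = np.array([1.0, -2.0, 0.5])
        print(gs_grad(f, x))              # approximately [2.0, -4.0, 1.0]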
    Machine Learning Algorithms for Time Series Analysis and Forecasting. (arXiv:2211.14387v1 [cs.LG])
    Time series data are used everywhere, from sales records to patients' health evolution metrics. The ability to deal with such data has become a necessity, and time series analysis and forecasting serve that purpose. These are important tools for any machine learning practitioner, as they deepen the understanding of the characteristics of data. Forecasting is used to predict the value of a variable in the future based on its past occurrences. This paper presents a detailed survey of the various methods used for forecasting. The complete forecasting process, from preprocessing to validation, is also explained thoroughly. Various statistical and deep learning models are considered, notably ARIMA, Prophet, and LSTMs. Hybrid versions of machine learning models are also explored and elucidated. Our work can be used by anyone to develop a good understanding of the forecasting process and to identify the various state-of-the-art models in use today.
    Pac-Man Pete: An extensible framework for building AI in VEX Robotics. (arXiv:2211.14385v1 [cs.RO])
    This technical report details VEX Robotics team BLRSAI's development of a fully autonomous robot for VEX Robotics' Tipping Point AI Competition. We identify and develop three separate critical components: a Unity simulation and reinforcement learning model training pipeline, a malleable computer vision pipeline, and a data transfer pipeline to offload large computations from the VEX V5 Brain/micro-controller to an external computer. We give the community access to all of these components in the hope that they can be reused and improved upon in the future, and that they will spark new ideas for autonomy as well as the necessary infrastructure and programs for AI in educational robotics.
    A Survey of Text Representation Methods and Their Genealogy. (arXiv:2211.14591v1 [cs.CL])
    In recent years, with the advent of highly scalable artificial-neural-network-based text representation methods the field of natural language processing has seen unprecedented growth and sophistication. It has become possible to distill complex linguistic information of text into multidimensional dense numeric vectors with the use of the distributional hypothesis. As a consequence, text representation methods have been evolving at such a quick pace that the research community is struggling to retain knowledge of the methods and their interrelations. We contribute threefold to this lack of compilation, composition, and systematization by providing a survey of current approaches, by arranging them in a genealogy, and by conceptualizing a taxonomy of text representation methods to examine and explain the state-of-the-art. Our research is a valuable guide and reference for artificial intelligence researchers and practitioners interested in natural language processing applications such as recommender systems, chatbots, and sentiment analysis.
    Automated Deep Aberration Detection from Chromosome Karyotype Images. (arXiv:2211.14312v1 [q-bio.QM])
    Chromosome analysis is essential for diagnosing genetic disorders. For hematologic malignancies, identification of somatic clonal aberrations by karyotype analysis remains the standard of care. However, karyotyping is costly and time-consuming because of the largely manual process and the expertise required to identify and annotate aberrations. Efforts to automate karyotype analysis have to date fallen short in aberration detection. Using a training set of ~10k patient specimens and ~50k karyograms from over 5 years from the Fred Hutchinson Cancer Center, we created a labeled set of images representing individual chromosomes. These individual chromosomes were used to train and assess deep learning models for classifying the 24 human chromosomes and identifying chromosomal aberrations. The top-accuracy models utilized the recently introduced Topological Vision Transformers (TopViTs) with 2-level-block-Toeplitz masking to incorporate structural inductive bias. TopViT outperformed CNN (Inception) models with >99.3% accuracy for chromosome identification, and exhibited accuracies >99% for detecting most aberrations. Notably, we were able to show high-quality performance even in "few shot" learning scenarios. Incorporating the definition of clonality substantially improved both precision and recall (sensitivity). When applied to "zero shot" scenarios, the model captured aberrations without training, with perfect precision at >50% recall. Together these results show that modern deep learning models can approach expert-level performance for chromosome aberration detection. To our knowledge, this is the first study demonstrating the downstream effectiveness of TopViTs. These results open up exciting opportunities for not only expediting patient results but also providing a scalable technology for early screening of low-abundance chromosomal lesions.
    The applicability of transperceptual and deep learning approaches to the study and mimicry of complex cartilaginous tissues. (arXiv:2211.14314v1 [cs.CV])
    Complex soft tissues, for example the knee meniscus, play a crucial role in mobility and joint health, but when damaged are incredibly difficult to repair and replace. This is due to their highly hierarchical and porous nature, which in turn leads to their unique mechanical properties. In order to design tissue substitutes, the internal architecture of the native tissue needs to be understood and replicated. Here we explore a combined audio-visual approach - so called transperceptual - to generate artificial architectures mimicking the native ones. The proposed method uses both traditional imagery and sound generated from each image as a way of rapidly comparing and contrasting the porosity and pore size within the samples. We have trained and tested a generative adversarial network (GAN) on the 2D image stacks. The impact of the training set of images on the similarity of the artificial dataset to the original one was assessed by analyzing two samples: the first consisting of n=478 pairs of audio and image files, for which the images were downsampled to 64 $\times$ 64 pixels; the second consisting of n=7640 pairs, for which the full resolution of 256 $\times$ 256 pixels is retained but each image is divided into 16 squares to maintain the limit of 64 $\times$ 64 pixels required by the GAN. We reconstruct the 2D stacks of artificially generated datasets into 3D objects and run image analysis algorithms to statistically characterize the architectural parameters - pore size, tortuosity, and pore connectivity - and compare them with the original dataset. Results show that the artificially generated dataset that undergoes downsampling performs better in terms of parameter matching. Our audio-visual approach has the potential to be extended to larger datasets to explore how similarities and differences can be audibly recognized across multiple samples.
    BERN-NN: Tight Bound Propagation For Neural Networks Using Bernstein Polynomial Interval Arithmetic. (arXiv:2211.14438v1 [cs.LG])
    In this paper, we present BERN-NN, an efficient tool for performing bound propagation on Neural Networks (NNs). Bound propagation is a critical step in a wide range of NN model checkers and reachability analysis tools. Given a bounded input set, bound propagation algorithms aim to compute tight bounds on the output of the NN. So far, linear and convex optimizations have been used to perform bound propagation. Since neural networks are highly non-convex, state-of-the-art bound propagation techniques suffer from introducing large errors. To circumvent such a drawback, BERN-NN approximates the bounds of each neuron using a class of polynomials called Bernstein polynomials. Bernstein polynomials enjoy several interesting properties that allow BERN-NN to obtain tighter bounds compared to those relying on linear and convex approximations. BERN-NN is efficiently parallelized on graphics processing units (GPUs). Extensive numerical results show that bounds obtained by BERN-NN are orders of magnitude tighter than those obtained by state-of-the-art verifiers such as linear programming and linear interval arithmetic. Moreover, BERN-NN is both faster and produces tighter outputs compared to convex programming approaches like alpha-CROWN.
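    The key property behind the approach is the Bernstein enclosure: on [0, 1], a polynomial is bounded below and above by the minimum and maximum of its Bernstein coefficients. A one-dimensional sketch follows (BERN-NN itself handles multivariate neurons and GPU parallelism, which are not shown).

        import numpy as np
        from math import comb

        def bernstein_bounds(power_coeffs):
            # Convert p(x) = sum_k a_k x^k (degree n) to Bernstein coefficients
            # b_i = sum_{k<=i} C(i,k)/C(n,k) * a_k; by the enclosure property,
            # min(b) <= p(x) <= max(b) for all x in [0, 1].
            a = np.asarray(power_coeffs, dtype=float)
            n = len(a) - 1
            b = np.array([sum(comb(i, k) / comb(n, k) * a[k] for k in range(i + 1))
                          for i in range(n + 1)])
            return b.min(), b.max()

        # p(x) = 1 - 3x + 2x^2 has true range [-0.125, 1] on [0, 1].
        lo, hi = bernstein_bounds([1.0, -3.0, 2.0])
        print(lo, hi)   # (-0.5, 1.0): a sound, conservative enclosure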
    Neuroevolution deep learning architecture search for estimation of river surface elevation from photogrammetric Digital Surface Models. (arXiv:2112.12510v2 [cs.NE] UPDATED)
    The development of new methods for surface water observation is crucial in the perspective of increasingly frequent extreme hydrological events related to global warming and increasing demand for water. Orthophotos and digital surface models (DSMs) obtained using UAV photogrammetry can be used to determine the Water Surface Elevation (WSE) of a river. However, this task is difficult due to disturbances of the water surface on DSMs caused by limitations of photogrammetric algorithms. In this study, machine learning was used to extract a WSE value from disturbed photogrammetric data. A brand new dataset was prepared specifically for this purpose by hydrology and photogrammetry experts. The new method is an important step toward automating water surface level measurements with high spatial and temporal resolution. Such data can be used to validate and calibrate hydrological, hydraulic, and hydrodynamic models, making hydrological forecasts more accurate, in particular for predicting extreme and dangerous events such as floods or droughts. To our knowledge, this is the first approach in which a dataset was created for this purpose and deep learning models were used for this task. Additionally, a neuroevolution algorithm was set up to explore different architectures to find locally optimal models, and non-gradient search was performed to fine-tune the model parameters. The achieved results have better accuracy compared to manual methods of determining WSE from photogrammetric DSMs.
    An Analysis of Social Biases Present in BERT Variants Across Multiple Languages. (arXiv:2211.14402v1 [cs.CL])
    Although large pre-trained language models have achieved great success in many NLP tasks, it has been shown that they reflect human biases from their pre-training corpora. This bias may lead to undesirable outcomes when these models are applied in real-world settings. In this paper, we investigate the bias present in monolingual BERT models across a diverse set of languages (English, Greek, and Persian). While recent research has mostly focused on gender-related biases, we analyze religious and ethnic biases as well and propose a template-based method to measure any kind of bias, based on sentence pseudo-likelihood, that can handle morphologically complex languages with gender-based adjective declensions. We analyze each monolingual model via this method and visualize cultural similarities and differences across different dimensions of bias. Ultimately, we conclude that current methods of probing for bias are highly language-dependent, necessitating cultural insights regarding the unique ways bias is expressed in each language and culture (e.g. through coded language, synecdoche, and other similar linguistic concepts). We also hypothesize that higher measured social biases in the non-English BERT models correlate with user-generated content in their training.
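    A hedged sketch of the template-scoring ingredient — sentence pseudo-log-likelihood under a masked language model, obtained by masking each token in turn — using the HuggingFace transformers API; the model name and example templates are illustrative, not the paper's evaluation set.

        import torch
        from transformers import AutoModelForMaskedLM, AutoTokenizer

        name = "bert-base-uncased"   # any monolingual BERT variant would do
        tok = AutoTokenizer.from_pretrained(name)
        mlm = AutoModelForMaskedLM.from_pretrained(name).eval()

        def pseudo_log_likelihood(sentence):
            # Mask each position and sum the log-probability of the true token.
            ids = tok(sentence, return_tensors="pt")["input_ids"][0]
            total = 0.0
            for pos in range(1, len(ids) - 1):   # skip [CLS] and [SEP]
                masked = ids.clone()
                masked[pos] = tok.mask_token_id
                with torch.no_grad():
                    logits = mlm(input_ids=masked[None]).logits[0, pos]
                total += torch.log_softmax(logits, -1)[ids[pos]].item()
            return total

        # Comparing PLLs across template variants probes which completion the
        # model finds more "natural" -- the basis of bias scores of this kind.
        print(pseudo_log_likelihood("The doctor said he would be late."))
        print(pseudo_log_likelihood("The doctor said she would be late."))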
    Asymptotic Optimality of Myopic Ranking and Selection Procedures. (arXiv:2211.14723v1 [stat.ML])
    Ranking and selection (R&S) is a popular model for studying discrete-event dynamic systems. It aims to select the best design (the design with the largest mean performance) from a finite set, where the mean of each design is unknown and has to be learned from samples. Great research efforts have been devoted to this problem in the literature, developing procedures with superior empirical performance and showing their optimality. Among these efforts, myopic procedures have been popular. They select the best design using a 'naive' mechanism of iteratively and myopically improving an approximation of the objective measure. Although they are based on simple heuristics and lack theoretical support, they have turned out to be highly effective, often achieving empirical performance competitive with procedures that were proposed later and shown to be asymptotically optimal. In this paper, we theoretically analyze these myopic procedures and prove that they also satisfy the optimality conditions of R&S, just like some other popular R&S methods. This explains the good performance of myopic procedures in various numerical tests, and provides good insight into the structure and theoretical development of efficient R&S procedures.
    GLAMI-1M: A Multilingual Image-Text Fashion Dataset. (arXiv:2211.14451v1 [cs.CV])
    We introduce GLAMI-1M: the largest multilingual image-text classification dataset and benchmark. The dataset contains images of fashion products with item descriptions, each in 1 of 13 languages. Categorization into 191 classes has high-quality annotations: all 100k images in the test set and 75% of the 1M training set were human-labeled. The paper presents baselines for image-text classification, showing that the dataset presents a challenging fine-grained classification problem: the best-scoring EmbraceNet model using both visual and textual features achieves 69.7% accuracy. Experiments with a modified Imagen model show the dataset is also suitable for image generation conditioned on text. The dataset, source code and model checkpoints are published at https://github.com/glami/glami-1m .
    When Spectral Modeling Meets Convolutional Networks: A Method for Discovering Reionization-era Lensed Quasars in Multi-band Imaging Data. (arXiv:2211.14543v1 [astro-ph.GA])
    Over the last two decades, around three hundred quasars have been discovered at $z\gtrsim6$, yet only one was identified as being strong-gravitationally lensed. We explore a new approach, enlarging the permitted spectral parameter space while introducing a new spatial geometry veto criterion, implemented via image-based deep learning. We made the first application of this approach in a systematic search for reionization-era lensed quasars, using data from the Dark Energy Survey, the Visible and Infrared Survey Telescope for Astronomy Hemisphere Survey, and the Wide-field Infrared Survey Explorer. Our search method consists of two main parts: (i) pre-selection of the candidates based on their spectral energy distributions (SEDs) using catalog-level photometry and (ii) relative probabilities calculation of being a lens or some contaminant utilizing a convolutional neural network (CNN) classification. The training datasets are constructed by painting deflected point-source lights over actual galaxy images to generate realistic galaxy-quasar lens models, optimized to find systems with small image separations, i.e., Einstein radii of $\theta_\mathrm{E} \leq 1$ arcsec. Visual inspection is then performed for sources with CNN scores of $P_\mathrm{lens} > 0.1$, which led us to obtain 36 newly-selected lens candidates, waiting for spectroscopic confirmation. These findings show that automated SED modeling and deep learning pipelines, supported by modest human input, are a promising route for detecting strong lenses from large catalogs that can overcome the veto limitations of primarily dropout-based SED selection approaches.
    A Contextual Master-Slave Framework on Urban Region Graph for Urban Village Detection. (arXiv:2211.14633v1 [cs.LG])
    Urban villages (UVs) are underdeveloped informal settlements that have fallen behind the rapid urbanization of a city. Since there are high levels of social inequality and social risk in these UVs, it is critical for city managers to discover all UVs in order to make appropriate renovation policies. Existing approaches to detecting UVs are labor-intensive or have not fully addressed the unique challenges of UV detection, such as the scarcity of labeled UVs and the diverse urban patterns in different regions. To this end, we first build an urban region graph (URG) to model the urban area in a hierarchically structured way. Then, we design a novel contextual master-slave framework to effectively detect urban villages from the URG. The core idea of this framework is to first pre-train a basis (or master) model over the URG, and then to adaptively derive specific (or slave) models from the basis model for different regions. The proposed framework can learn to balance generality and specificity for UV detection in an urban area. Finally, we conduct extensive experiments in three cities to demonstrate the effectiveness of our approach.
    Rectified Pessimistic-Optimistic Learning for Stochastic Continuum-armed Bandit with Constraints. (arXiv:2211.14720v1 [cs.LG])
    This paper studies the problem of stochastic continuum-armed bandit with constraints (SCBwC), where we optimize a black-box reward function $f(x)$ subject to a black-box constraint function $g(x)\leq 0$ over a continuous space $\mathcal X$. We model reward and constraint functions via Gaussian processes (GPs) and propose a Rectified Pessimistic-Optimistic Learning framework (RPOL), a penalty-based method incorporating optimistic and pessimistic GP bandit learning for reward and constraint functions, respectively. We consider the metric of cumulative constraint violation $\sum_{t=1}^T(g(x_t))^{+},$ which is strictly stronger than the traditional long-term constraint violation $\sum_{t=1}^Tg(x_t).$ The rectified design for the penalty update and the pessimistic learning for the constraint function in RPOL guarantee the cumulative constraint violation is minimal. RPOL can achieve sublinear regret and cumulative constraint violation for SCBwC and its variants (e.g., under delayed feedback and non-stationary environment). These theoretical results match their unconstrained counterparts. Our experiments justify RPOL outperforms several existing baseline algorithms.
    Similarity-based Cooperation. (arXiv:2211.14468v1 [cs.GT])
    As machine learning agents act more autonomously in the world, they will increasingly interact with each other. Unfortunately, in many social dilemmas like the one-shot Prisoner's Dilemma, standard game theory predicts that ML agents will fail to cooperate with each other. Prior work has shown that one way to enable cooperative outcomes in the one-shot Prisoner's Dilemma is to make the agents mutually transparent to each other, i.e., to allow them to access one another's source code (Rubinstein 1998, Tennenholtz 2004) -- or weights in the case of ML agents. However, full transparency is often unrealistic, whereas partial transparency is commonplace. Moreover, it is challenging for agents to learn their way to cooperation in the full transparency setting. In this paper, we introduce a more realistic setting in which agents only observe a single number indicating how similar they are to each other. We prove that this allows for the same set of cooperative outcomes as the full transparency setting. We also demonstrate experimentally that cooperation can be learned using simple ML methods.
    Utility Assessment of Synthetic Data Generation Methods. (arXiv:2211.14428v1 [cs.LG])
    Big data analysis poses the dual problem of privacy preservation and utility, i.e., how accurate data analyses remain after transforming original data in order to protect the privacy of the individuals that the data is about - and whether they are accurate enough to be meaningful. In this paper, we investigate across several datasets whether different methods of generating fully synthetic data vary in their utility a priori (when the specific analyses to be performed on the data are not yet known), how closely their results conform to analyses on original data a posteriori, and whether these two effects are correlated. We find that some methods (decision-tree based) perform better than others across the board, that some choices of imputation parameters (notably the number of released datasets) have sizeable effects, that there is no correlation between broad utility metrics and analysis accuracy, and that correlations vary for narrow metrics. We also obtained promising findings for classification tasks when using synthetic data for training machine learning models, which we consider worth exploring further, also in terms of mitigating privacy attacks against ML models such as membership inference and model inversion.
    A Theoretical Study of Inductive Biases in Contrastive Learning. (arXiv:2211.14699v1 [cs.LG])
    Understanding self-supervised learning is important but challenging. Previous theoretical works study the role of pretraining losses, and view neural networks as general black boxes. However, the recent work of Saunshi et al. argues that the model architecture -- a component largely ignored by previous works -- also has significant influences on the downstream performance of self-supervised learning. In this work, we provide the first theoretical analysis of self-supervised learning that incorporates the effect of inductive biases originating from the model class. In particular, we focus on contrastive learning -- a popular self-supervised learning method that is widely used in the vision domain. We show that when the model has limited capacity, contrastive representations would recover certain special clustering structures that are compatible with the model architecture, but ignore many other clustering structures in the data distribution. As a result, our theory can capture the more realistic setting where contrastive representations have much lower dimensionality than the number of clusters in the data distribution. We instantiate our theory on several synthetic data distributions, and provide empirical evidence to support the theory.
    The smooth output assumption, and why deep networks are better than wide ones. (arXiv:2211.14347v1 [cs.LG])
    When several models have similar training scores, classical model selection heuristics follow Occam's razor and advise choosing the one with the least capacity. Yet, modern practice with large neural networks has often led to situations where two networks with exactly the same number of parameters score similarly on the training set, but the deeper one generalizes better to unseen examples. With this in mind, it is well accepted that deep networks are superior to shallow wide ones. However, theoretically there is no difference between the two; in fact, both are universal approximators. In this work we propose a new unsupervised measure that predicts how well a model will generalize. We call it the output sharpness, and it is based on the fact that, in reality, boundaries between concepts are generally unsharp. We test this new measure on several neural network settings and architectures, and show how generally strong the correlation is between our metric and test set performance. Having established this measure, we give a mathematical probabilistic argument that predicts network depth to be correlated with our proposed measure. After verifying this on real data, we are able to formulate the key argument of the work: output sharpness hampers generalization; deep networks have an in-built bias against it; therefore, deep networks beat wide ones. All in all, the work not only provides a helpful predictor of overfitting that can be used in practice for model selection (or even regularization), but also provides a much-needed theoretical grounding for the success of modern deep neural networks.
    A Data-driven Pricing Scheme for Optimal Routing through Artificial Currencies. (arXiv:2211.14793v1 [eess.SY])
    Mobility systems often suffer from a high price of anarchy due to the uncontrolled behavior of selfish users. This may result in societal costs that are significantly higher compared to what could be achieved by a centralized system-optimal controller. Monetary tolling schemes can effectively align the behavior of selfish users with the system-optimum. Yet, they inevitably discriminate within the population in terms of income. Artificial currencies were recently presented as an effective alternative that can achieve the same performance, whilst guaranteeing fairness among the population. However, those studies were based on behavioral models that may differ from practical implementations. This paper presents a data-driven approach to automatically adapt artificial-currency tolls within repetitive-game settings. We first consider a parallel-arc setting whereby users commute on a daily basis from a unique origin to a unique destination, choosing a route in exchange for an artificial-currency price or reward while accounting for the impact of the choices of the other users on travel discomfort. Second, we devise a model-based reinforcement learning controller that autonomously learns the optimal pricing policy by interacting with the proposed framework, considering the closeness of the observed aggregate flows to a desired system-optimal distribution as a reward function. Our numerical results show that the proposed data-driven pricing scheme can effectively align the users' flows with the system optimum, significantly reducing the societal costs with respect to the uncontrolled flows (by about 15% and 25% depending on the scenario), and respond to environmental changes in a robust and efficient manner.  ( 2 min )
    Supervised Pretraining for Molecular Force Fields and Properties Prediction. (arXiv:2211.14429v1 [physics.chem-ph])
    Machine learning approaches have become popular for molecular modeling tasks, including molecular force fields and properties prediction. Traditional supervised learning methods suffer from the scarcity of labeled data for particular tasks, motivating the use of large-scale datasets for other relevant tasks. We propose to pretrain neural networks on a dataset of 86 million molecules, with atom charges and 3D geometries as inputs and molecular energies as labels. Experiments show that, compared to training from scratch, fine-tuning the pretrained model can significantly improve the performance on seven molecular property prediction tasks and two force field tasks. We also demonstrate that the learned representations from the pretrained model contain adequate information about molecular structures, by showing that linear probing of the representations can predict many kinds of molecular information, including atom types, interatomic distances, classes of molecular scaffolds, and the existence of molecular fragments. Our results show that supervised pretraining is a promising research direction in molecular modeling.  ( 2 min )
    Nocturne: a scalable driving benchmark for bringing multi-agent learning one step closer to the real world. (arXiv:2206.09889v2 [cs.MA] UPDATED)
    We introduce \textit{Nocturne}, a new 2D driving simulator for investigating multi-agent coordination under partial observability. The focus of Nocturne is to enable research into inference and theory of mind in real-world multi-agent settings without the computational overhead of computer vision and feature extraction from images. Agents in this simulator only observe an obstructed view of the scene, mimicking human visual sensing constraints. Unlike existing benchmarks that are bottlenecked by rendering human-like observations directly using a camera input, Nocturne uses efficient intersection methods to compute a vectorized set of visible features in a C++ back-end, allowing the simulator to run at $2000+$ steps-per-second. Using open-source trajectory and map data, we construct a simulator to load and replay arbitrary trajectories and scenes from real-world driving data. Using this environment, we benchmark reinforcement-learning and imitation-learning agents and demonstrate that the agents are quite far from human-level coordination ability and deviate significantly from the expert trajectories.
    On the Impossibility of Learning to Cooperate with Adaptive Partner Strategies in Repeated Games. (arXiv:2206.10614v2 [cs.GT] UPDATED)
    Learning to cooperate with other agents is challenging when those agents also possess the ability to adapt to our own behavior. Practical and theoretical approaches to learning in cooperative settings typically assume that other agents' behaviors are stationary, or else make very specific assumptions about other agents' learning processes. The goal of this work is to understand whether we can reliably learn to cooperate with other agents without such restrictive assumptions, which are unlikely to hold in real-world applications. Our main contribution is a set of impossibility results, which show that no learning algorithm can reliably learn to cooperate with all possible adaptive partners in a repeated matrix game, even if that partner is guaranteed to cooperate with some stationary strategy. Motivated by these results, we then discuss potential alternative assumptions which capture the idea that an adaptive partner will only adapt rationally to our behavior.
    3D Reconstruction of Protein Complex Structures Using Synthesized Multi-View AFM Images. (arXiv:2211.14662v1 [cs.CV])
    Recent developments in deep learning-based methods have demonstrated their potential to predict 3D protein structures using inputs such as protein sequences, Cryo-Electron microscopy (Cryo-EM) images of proteins, etc. However, these methods struggle to predict protein complexes (PC), i.e., structures with more than one protein. In this work, we explore atomic force microscope (AFM) assisted deep learning-based methods to predict the 3D structure of PCs. The images produced by AFM capture the protein structure in different and random orientations. These multi-view images can help train a neural network to predict the 3D structure of protein complexes. However, obtaining a dataset of actual AFM images is time-consuming and impractical. We propose a virtual AFM imaging pipeline that takes a 'PDB' protein file and generates multi-view 2D virtual AFM images using volume rendering techniques. With this, we created a dataset of around 8K proteins. We train a neural network for 3D reconstruction called Pix2Vox++ using the synthesized multi-view 2D AFM image dataset. We compare the predicted structures obtained using different numbers of views and obtain an intersection over union (IoU) value of 0.92 on the training dataset and 0.52 on the validation dataset. We believe this approach will lead to better prediction of the structure of protein complexes.
    A Maximum Log-Likelihood Method for Imbalanced Few-Shot Learning Tasks. (arXiv:2211.14668v1 [cs.CV])
    Few-shot learning is a rapidly evolving area of research in machine learning where the goal is to classify unlabeled data with only one or "a few" labeled exemplary samples. Neural networks are typically trained to minimize a distance metric between labeled exemplary samples and a query set. Early few-shot approaches use an episodic training process to sub-sample the training data into few-shot batches. This training process matches the sub-sampling done at evaluation. Recently, conventional supervised training coupled with a cosine distance has achieved superior performance for few-shot learning. Despite the diversity of few-shot approaches over the past decade, most methods still rely on a cosine or Euclidean distance layer between the latent features of the trained network. In this work, we investigate the distributions of trained few-shot features and demonstrate that they can be roughly approximated as exponential distributions. Under this assumption of an exponential distribution, we propose a new maximum log-likelihood metric for few-shot architectures. We demonstrate that the proposed metric achieves superior accuracy w.r.t. conventional similarity metrics (e.g., cosine, Euclidean, etc.) and achieves state-of-the-art inductive few-shot performance. Further, additional gains can be achieved by carefully combining multiple metrics, and neither of our methods requires post-processing feature transformations, which are common to many algorithms. Finally, we demonstrate a novel iterative algorithm designed around our maximum log-likelihood approach that achieves state-of-the-art transductive few-shot performance when the evaluation data is imbalanced. We have made our code publicly available at https://github.com/samuelhess/MLL_FSL/.
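    To make the metric above concrete, here is a minimal numpy sketch of a maximum log-likelihood classifier under a per-dimension exponential model fitted to non-negative (e.g., post-ReLU) support features. The function name, the per-dimension independence assumption, and the rate estimate lam = 1/mean are illustrative assumptions, not the paper's exact formulation.

        import numpy as np

        def exp_mll_scores(support, query, eps=1e-8):
            # support: dict class -> (n_shot, d) array of non-negative features.
            # query:   (d,) non-negative feature vector.
            # Score each class by the log-likelihood of the query under a
            # per-dimension exponential model fitted to that class's support set:
            # log f(x; lam) = log(lam) - lam * x, summed over dimensions.
            scores = {}
            for c, feats in support.items():
                lam = 1.0 / (feats.mean(axis=0) + eps)  # MLE rate = 1 / sample mean
                scores[c] = float(np.sum(np.log(lam) - lam * query))
            return scores

        # toy usage with two classes in a 16-dim feature space
        rng = np.random.default_rng(0)
        support = {0: rng.exponential(1.0, (5, 16)), 1: rng.exponential(2.0, (5, 16))}
        query = rng.exponential(2.0, 16)
        scores = exp_mll_scores(support, query)
        print(max(scores, key=scores.get))  # predicted class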
    Hierarchical Delta-Attention Method for Multimodal Fusion. (arXiv:2011.10916v2 [cs.CV] UPDATED)
    In vision and linguistics, the main input modalities are facial expressions, speech patterns, and the words uttered. The issue with analyzing any single mode of expression (visual, verbal, or vocal) is that a lot of contextual information can be lost. This compels researchers to inspect multiple modalities to obtain a thorough understanding of the cross-modal dependencies and the temporal context of the situation. This work attempts to preserve the long-range dependencies within and across different modalities, which would otherwise be bottlenecked by the use of recurrent networks, and adds the concept of delta-attention to focus on local differences per modality to capture the idiosyncrasies of different people. We explore a cross-attention fusion technique to obtain a global view of the emotion expressed through these delta-self-attended modalities, in order to fuse all the local nuances and the global context together. The addition of attention is new to the multimodal fusion field, and the stage at which the attention mechanism should be used is still under scrutiny; this work achieves competitive accuracy for overall and per-class classification, close to the current state of the art with almost half the number of parameters.
    Looking at the posterior: on the origin of uncertainty in neural-network classification. (arXiv:2211.14605v1 [cs.LG])
    Bayesian inference can quantify uncertainty in the predictions of neural networks using posterior distributions for model parameters and network output. By looking at these posterior distributions, one can separate the origin of uncertainty into aleatoric and epistemic. We use the joint distribution of predictive uncertainty and epistemic uncertainty to quantify how this interpretation of uncertainty depends upon model architecture, dataset complexity, and data distributional shifts in image classification tasks. We conclude that the origin of uncertainty depends on the particular neural network, and that the quantification of the uncertainty induced by data distributional shifts depends on the complexity of the underlying dataset. Furthermore, we show that the joint distribution of predictive and epistemic uncertainty can be used to identify data domains where the model is most accurate. To arrive at these results, we use two common posterior approximation methods, Monte-Carlo dropout and deep ensembles, for fully-connected, convolutional and attention-based neural networks.
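    As a concrete companion to the aleatoric/epistemic separation described above, the following sketch computes the standard mutual-information decomposition from stochastic forward passes (e.g., MC dropout samples or ensemble members); the paper may use different uncertainty measures, so treat this as an illustrative assumption.

        import numpy as np

        def uncertainty_decomposition(probs, eps=1e-12):
            # probs: (T, N, C) softmax outputs from T stochastic forward passes
            # (e.g., MC dropout) or T ensemble members, for N inputs and C classes.
            mean_p = probs.mean(axis=0)                                   # (N, C)
            # total predictive uncertainty: entropy of the mean prediction
            total = -np.sum(mean_p * np.log(mean_p + eps), axis=1)
            # aleatoric part: mean entropy of the individual predictions
            aleatoric = -np.sum(probs * np.log(probs + eps), axis=2).mean(axis=0)
            # epistemic part: mutual information between prediction and parameters
            epistemic = total - aleatoric
            return total, aleatoric, epistemic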
    Sentence-Level Sign Language Recognition Framework. (arXiv:2211.14447v1 [cs.CV])
    We present two solutions to sentence-level SLR. Sentence-level SLR requires mapping videos of sign language sentences to sequences of gloss labels. Connectionist Temporal Classification (CTC) has been used as the classifier level of both models; CTC is used to avoid pre-segmenting the sentences into individual words. The first model is an LRCN-based model, and the second model is a Multi-Cue Network. LRCN is a model in which a CNN serving as a feature extractor is applied to each frame before the frames are fed into an LSTM. In the first approach, no prior knowledge is leveraged; raw frames are fed into an 18-layer LRCN with a CTC on top. In the second approach, three main characteristics (hand shape, hand position, and hand movement information) associated with each sign are extracted using Mediapipe. 2D landmarks of the hand shape are used to create a skeleton of the hands, which is then fed to a CONV-LSTM model. Hand locations and hand positions as relative distances to the head are fed to separate LSTMs. All three sources of information are then integrated into a Multi-Cue network with a CTC classification layer. We evaluated the performance of the proposed models on RWTH-PHOENIX-Weather. After an extensive search over model hyper-parameters such as the number of feature maps, input size, batch size, sequence length, LSTM memory cell, regularization, and dropout, we achieved a Word Error Rate (WER) of 35.  ( 2 min )
    EasyMLServe: Easy Deployment of REST Machine Learning Services. (arXiv:2211.14417v1 [cs.LG])
    Various research domains use machine learning approaches because they can solve complex tasks by learning from data. Deploying machine learning models, however, is not trivial, and developers have to implement complete solutions which are often installed locally and include Graphical User Interfaces (GUIs). Distributing software to various users on-site has several problems. Therefore, we propose a concept to deploy software in the cloud. There are several frameworks available based on Representational State Transfer (REST) which can be used to implement cloud-based machine learning services. However, machine learning services for scientific users have special requirements that state-of-the-art REST frameworks do not cover completely. We contribute the EasyMLServe software framework to deploy machine learning services in the cloud using REST interfaces and generic local or web-based GUIs. Furthermore, we apply our framework to two real-world applications, i.e., energy time-series forecasting and cell instance segmentation. The EasyMLServe framework and the use cases are available on GitHub.
    Receptive Field Refinement for Convolutional Neural Networks Reliably Improves Predictive Performance. (arXiv:2211.14487v1 [cs.CV])
    Minimal changes to neural architectures (e.g. changing a single hyperparameter in a key layer), can lead to significant gains in predictive performance in Convolutional Neural Networks (CNNs). In this work, we present a new approach to receptive field analysis that can yield these types of theoretical and empirical performance gains across twenty well-known CNN architectures examined in our experiments. By further developing and formalizing the analysis of receptive field expansion in convolutional neural networks, we can predict unproductive layers in an automated manner before ever training a model. This allows us to optimize the parameter-efficiency of a given architecture at low cost. Our method is computationally simple and can be done in an automated manner or even manually with minimal effort for most common architectures. We demonstrate the effectiveness of this approach by increasing parameter efficiency across past and current top-performing CNN-architectures. Specifically, our approach is able to improve ImageNet1K performance across a wide range of well-known, state-of-the-art (SOTA) model classes, including: VGG Nets, MobileNetV1, MobileNetV3, NASNet A (mobile), MnasNet, EfficientNet, and ConvNeXt - leading to a new SOTA result for each model class.
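    For readers unfamiliar with receptive field analysis, the sketch below computes the receptive field of a plain convolution/pooling stack using the standard recurrence; the paper's own receptive-field expansion analysis is more general, so this is only an illustrative baseline.

        def receptive_field(layers):
            # layers: list of (kernel_size, stride) tuples, input-to-output order.
            # Standard recurrence: r <- r + (k - 1) * j, then j <- j * s,
            # where j is the cumulative stride ("jump") of the layer's input grid.
            r, j = 1, 1
            for k, s in layers:
                r += (k - 1) * j
                j *= s
            return r

        # e.g. two 3x3/stride-1 convs, a stride-2 pool, then another 3x3 conv
        print(receptive_field([(3, 1), (3, 1), (2, 2), (3, 1)]))  # -> 10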
    How to Backpropagate through Hungarian in Your DETR?. (arXiv:2211.14448v1 [cs.CV])
    The DEtection TRansformer (DETR) approach, which uses a transformer encoder-decoder architecture and a set-based global loss, has become a building block in many transformer-based applications. However, as originally presented, the assignment cost and the global loss are not aligned, i.e., reducing the former is likely but not guaranteed to reduce the latter. Moreover, the issue of gradients is ignored when a combinatorial solver such as the Hungarian algorithm is used. In this paper we show that the global loss can be expressed as the sum of an assignment-independent term and an assignment-dependent term which can be used to define the assignment cost matrix. Recent results on generalized gradients of the optimal assignment cost with respect to the parameters of an assignment problem are then used to define generalized gradients of the loss with respect to network parameters, and backpropagation is carried out properly. Our experiments using the same loss weights show interesting convergence properties and a potential for further performance improvements.
    Self-attention Presents Low-dimensional Knowledge Graph Embeddings for Link Prediction. (arXiv:2112.10644v3 [cs.LG] UPDATED)
    A few models have tried to tackle the link prediction problem, also known as knowledge graph completion, by embedding knowledge graphs in comparably lower dimensions. However, the state-of-the-art results are attained at the cost of considerably increasing the dimensionality of embeddings, which causes scalability issues in the case of huge knowledge bases. Transformers have recently been used successfully as powerful encoders for knowledge graphs, but available models still have scalability issues. To address this limitation, we introduce a Transformer-based model to gain expressive low-dimensional embeddings. We utilize a large number of self-attention heads as the key to applying query-dependent projections to capture mutual information between entities and relations. Empirical results on WN18RR and FB15k-237 as standard link prediction benchmarks demonstrate that our model performs comparably with the current state-of-the-art models. Notably, we obtain these promising results with a 66.9% reduction in the dimensionality of embeddings, on average, compared to the five best recent state-of-the-art competitors.
    Photo Rater: Photographs Auto-Selector with Deep Learning. (arXiv:2211.14420v1 [cs.CV])
    Photo Rater is a computer vision project that uses neural networks to help photographers select the best photos among those taken of the same scene. This process is usually referred to as "culling" in photography, and it can be tedious and time-consuming if done manually. Photo Rater utilizes three separate neural networks to complete this task: one for general image quality assessment, one for classifying whether the photo is blurry (whether due to unsteady hands or to being out of focus), and one for assessing general aesthetics (including the composition of the photo, among other factors). After feeding the image through each neural network, Photo Rater outputs a final score for each image, ranking the images by this score and presenting the ranking to the user.
    Using Sequential Statistical Tests for Efficient Hyperparameter Tuning. (arXiv:2112.12438v2 [cs.LG] UPDATED)
    Hyperparameter tuning is one of the most time-consuming parts of machine learning. Despite the existence of modern optimization algorithms that minimize the number of evaluations needed, evaluations of a single setting may still be expensive. Usually a resampling technique is used, where the machine learning method has to be fitted a fixed number of k times on different training datasets. The respective mean performance of the k fits is then used as the performance estimator. Many hyperparameter settings could be discarded after fewer than k resampling iterations if they are clearly inferior to high-performing settings. However, resampling is often performed until the very end, wasting a lot of computational effort. To this end, we propose the Sequential Random Search (SQRS), which extends the regular random search algorithm by a sequential testing procedure aimed at detecting and eliminating inferior parameter configurations early. We compared our SQRS with regular random search using multiple publicly available regression and classification datasets. Our simulation study showed that the SQRS is able to find similarly well-performing parameter settings while requiring noticeably fewer evaluations. Our results underscore the potential for integrating sequential tests into hyperparameter tuning.
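    The following sketch illustrates the idea of combining random search with sequential elimination. The use of a paired t-test, the alpha level, and the two-fold warm-up are assumptions for illustration; the paper's actual sequential testing procedure may differ.

        import numpy as np
        from scipy import stats

        def sqrs(configs, evaluate, k=10, alpha=0.01):
            # configs:  list of hyperparameter settings (already randomly sampled).
            # evaluate: evaluate(config, fold) -> score on one resampling fold.
            # After each fold, settings whose paired scores are significantly
            # worse than the current leader's are dropped, saving their
            # remaining fold fits.
            scores = {id(c): [] for c in configs}
            alive = list(configs)
            for fold in range(k):
                for c in alive:
                    scores[id(c)].append(evaluate(c, fold))
                if fold >= 2:  # need a few paired observations before testing
                    leader = max(alive, key=lambda c: np.mean(scores[id(c)]))
                    keep = [leader]
                    for c in alive:
                        if c is leader:
                            continue
                        # one-sided paired t-test: is c worse than the leader?
                        t, p = stats.ttest_rel(scores[id(c)], scores[id(leader)])
                        if not (t < 0 and p / 2 < alpha):  # keep unless clearly worse
                            keep.append(c)
                    alive = keep
            return max(alive, key=lambda c: np.mean(scores[id(c)]))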
    Test-time adaptation with slot-centric models. (arXiv:2203.11194v2 [cs.CV] UPDATED)
    Current supervised visual detectors, though impressive within their training distribution, often fail to segment out-of-distribution scenes into their constituent entities. Recent test-time adaptation methods use auxiliary self-supervised losses to adapt the network parameters to each test example independently and have shown promising results towards generalization outside the training distribution for the task of image classification. In our work, we find evidence that these losses can be insufficient for instance segmentation tasks, without also considering architectural inductive biases. For image segmentation, recent slot-centric generative models break such dependence on supervision by attempting to segment scenes into entities in a self-supervised manner by reconstructing pixels. Drawing upon these two lines of work, we propose Slot-TTA, a semi-supervised instance segmentation model equipped with a slot-centric inductive bias, that is adapted per scene at test time through gradient descent on reconstruction or novel view synthesis objectives. We show that test-time adaptation in Slot-TTA greatly improves instance segmentation in out-of-distribution scenes. We evaluate Slot-TTA in several 3D and 2D scene instance segmentation benchmarks and show substantial out-of-distribution performance improvements against state-of-the-art supervised feed-forward detectors and self-supervised test-time adaptation methods.  ( 2 min )
    Optimal Approximation Rates for Deep ReLU Neural Networks on Sobolev Spaces. (arXiv:2211.14400v1 [stat.ML])
    We study the problem of how efficiently, in terms of the number of parameters, deep neural networks with the ReLU activation function can approximate functions in the Sobolev space $W^s(L_q(\Omega))$ on a bounded domain $\Omega$, where the error is measured in $L_p(\Omega)$. This problem is important for studying the application of neural networks in scientific computing and has previously been solved only in the case $p=q=\infty$. Our contribution is to provide a solution for all $1\leq p,q\leq \infty$ and $s > 0$. Our results show that deep ReLU networks significantly outperform classical methods of approximation, but that this comes at the cost of parameters which are not encodable.  ( 2 min )
    Condensed Gradient Boosting. (arXiv:2211.14599v1 [cs.LG])
    This paper presents a computationally efficient variant of gradient boosting for multi-class classification and multi-output regression tasks. Standard gradient boosting uses a 1-vs-all strategy for classification tasks with more than two classes, which entails training one tree per class at each iteration. In this work, we propose using multi-output regressors as base models to handle the multi-class problem as a single task. In addition, the proposed modification allows the model to learn multi-output regression problems. An extensive comparison with other multi-output based gradient boosting methods is carried out in terms of generalization and computational efficiency. The proposed method showed the best trade-off between generalization ability and training and prediction speeds.
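    A minimal sketch of the core idea, assuming a softmax cross-entropy loss and scikit-learn's multi-output regression trees: one tree per boosting iteration fits the full matrix of per-class negative gradients, instead of one tree per class. Function names and hyperparameters are illustrative, not the paper's implementation.

        import numpy as np
        from sklearn.tree import DecisionTreeRegressor

        def softmax(z):
            e = np.exp(z - z.max(axis=1, keepdims=True))
            return e / e.sum(axis=1, keepdims=True)

        def condensed_gb_fit(X, y, n_classes, n_iter=100, lr=0.1, max_depth=3):
            F = np.zeros((len(X), n_classes))   # raw scores
            Y = np.eye(n_classes)[y]            # one-hot targets
            trees = []
            for _ in range(n_iter):
                residual = Y - softmax(F)       # negative gradient of softmax CE
                tree = DecisionTreeRegressor(max_depth=max_depth)
                tree.fit(X, residual)           # sklearn trees support multi-output
                F += lr * tree.predict(X)
                trees.append(tree)
            return trees

        def condensed_gb_predict(trees, X, n_classes, lr=0.1):
            F = np.zeros((len(X), n_classes))
            for tree in trees:
                F += lr * tree.predict(X)
            return F.argmax(axis=1)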
    c-TPE: Generalizing Tree-structured Parzen Estimator with Inequality Constraints for Continuous and Categorical Hyperparameter Optimization. (arXiv:2211.14411v1 [cs.LG])
    Hyperparameter optimization (HPO) is crucial for strong performance of deep learning algorithms. A widely-used versatile HPO method is a variant of Bayesian optimization called tree-structured Parzen estimator (TPE), which splits data into good and bad groups and uses the density ratio of those groups as an acquisition function (AF). However, real-world applications often have some constraints, such as memory requirements, or latency. In this paper, we present an extension of TPE to constrained optimization (c-TPE) via simple factorization of AFs. The experiments demonstrate c-TPE is robust to various constraint levels and exhibits the best average rank performance among existing methods with statistical significance on search spaces with categorical parameters on 81 settings.
    Learning Visuo-Haptic Skewering Strategies for Robot-Assisted Feeding. (arXiv:2211.14648v1 [cs.RO])
    Acquiring food items with a fork poses an immense challenge to a robot-assisted feeding system, due to the wide range of material properties and visual appearances present across food groups. Deformable foods necessitate different skewering strategies than firm ones, but inferring such characteristics for several previously unseen items on a plate remains nontrivial. Our key insight is to leverage visual and haptic observations during interaction with an item to rapidly and reactively plan skewering motions. We learn a generalizable, multimodal representation for a food item from raw sensory inputs which informs the optimal skewering strategy. Given this representation, we propose a zero-shot framework to sense visuo-haptic properties of a previously unseen item and reactively skewer it, all within a single interaction. Real-robot experiments with foods of varying levels of visual and textural diversity demonstrate that our multimodal policy outperforms baselines which do not exploit both visual and haptic cues or do not reactively plan. Across 6 plates of different food items, our proposed framework achieves 71\% success over 69 skewering attempts total. Supplementary material, datasets, code, and videos can be found on our $\href{https://sites.google.com/view/hapticvisualnet-corl22/home}{website}$.
    Machine Learning for Postprocessing Ensemble Streamflow Forecasts. (arXiv:2106.09547v3 [cs.LG] UPDATED)
    Skillful streamflow forecasts can inform decisions in various areas of water policy and management. We integrate numerical weather prediction ensembles, a distributed hydrological model, and machine learning to generate ensemble streamflow forecasts at medium-range lead times (1 - 7 days). We present a case study of machine learning applications for postprocessing ensemble streamflow forecasts in the Upper Susquehanna River basin in the eastern United States. Our results show that the machine learning postprocessor can improve streamflow forecasts relative to low-complexity forecasts (e.g., climatological and temporal persistence) as well as standalone hydrometeorological modeling and a standalone neural network. The relative gain in forecast skill from the postprocessor is generally higher at medium-range timescales compared to shorter lead times, for high flows compared to low-to-moderate flows, and in warm seasons compared to cool ones. Overall, our results highlight the benefits of machine learning for improving both the skill and the reliability of streamflow forecasts.
    Instance-level Heterogeneous Domain Adaptation for Limited-labeled Sketch-to-Photo Retrieval. (arXiv:2211.14515v1 [cs.CV])
    Although sketch-to-photo retrieval has a wide range of applications, it is costly to obtain paired and richly labeled ground truth. In contrast, photo retrieval data is easier to acquire. Therefore, previous works pre-train their models on rich-labeled photo retrieval data (i.e., the source domain) and then fine-tune them on the limited-labeled sketch-to-photo retrieval data (i.e., the target domain). However, without co-training source and target data, source domain knowledge might be forgotten during the fine-tuning process, while simply co-training them may cause negative transfer due to domain gaps. Moreover, the identity label spaces of the source data and target data are generally disjoint, and therefore conventional category-level Domain Adaptation (DA) is not directly applicable. To address these issues, we propose an Instance-level Heterogeneous Domain Adaptation (IHDA) framework. We apply the fine-tuning strategy for identity label learning, aiming to transfer instance-level knowledge in an inductive transfer manner. Meanwhile, labeled attributes from the source data are selected to form a shared label space for the source and target domains. Guided by shared attributes, DA is utilized to bridge cross-dataset domain gaps and heterogeneous domain gaps, transferring instance-level knowledge in a transductive transfer manner. Experiments show that our method has set a new state of the art on three sketch-to-photo image retrieval benchmarks without extra annotations, which opens the door to training more effective models on limited-labeled heterogeneous image retrieval tasks. Related code is available at \url{https://github.com/fandulu/IHDA}.
    Accelerated Riemannian Optimization: Handling Constraints with a Prox to Bound Geometric Penalties. (arXiv:2211.14645v1 [math.OC])
    We propose a globally-accelerated, first-order method for the optimization of smooth and (strongly or not) geodesically-convex functions in a wide class of Hadamard manifolds. We achieve the same convergence rates as Nesterov's accelerated gradient descent, up to a multiplicative geometric penalty and log factors. Crucially, we can enforce our method to stay within a compact set we define. Prior fully accelerated works \textit{resort to assuming} that the iterates of their algorithms stay in some pre-specified compact set, except for two previous methods of limited applicability. For our manifolds, this solves the open question in [KY22] about obtaining global general acceleration without iterates assumptively staying in the feasible set.
    Identifying good directions to escape the NTK regime and efficiently learn low-degree plus sparse polynomials. (arXiv:2206.03688v2 [cs.LG] UPDATED)
    A recent goal in the theory of deep learning is to identify how neural networks can escape the "lazy training," or Neural Tangent Kernel (NTK) regime, where the network is coupled with its first order Taylor expansion at initialization. While the NTK is minimax optimal for learning dense polynomials (Ghorbani et al, 2021), it cannot learn features, and hence has poor sample complexity for learning many classes of functions including sparse polynomials. Recent works have thus aimed to identify settings where gradient based algorithms provably generalize better than the NTK. One such example is the "QuadNTK" approach of Bai and Lee (2020), which analyzes the second-order term in the Taylor expansion. Bai and Lee (2020) show that the second-order term can learn sparse polynomials efficiently; however, it sacrifices the ability to learn general dense polynomials. In this paper, we analyze how gradient descent on a two-layer neural network can escape the NTK regime by utilizing a spectral characterization of the NTK (Montanari and Zhong, 2020) and building on the QuadNTK approach. We first expand upon the spectral analysis to identify "good" directions in parameter space in which we can move without harming generalization. Next, we show that a wide two-layer neural network can jointly use the NTK and QuadNTK to fit target functions consisting of a dense low-degree term and a sparse high-degree term -- something neither the NTK nor the QuadNTK can do on their own. Finally, we construct a regularizer which encourages our parameter vector to move in the "good" directions, and show that gradient descent on the regularized loss will converge to a global minimizer, which also has low test error. This yields an end to end convergence and generalization guarantee with provable sample complexity improvement over both the NTK and QuadNTK on their own.
    Interpretability Analysis of Deep Models for COVID-19 Detection. (arXiv:2211.14372v1 [eess.AS])
    During the outbreak of the COVID-19 pandemic, several research areas joined efforts to mitigate the damage caused by SARS-CoV-2. In this paper we present an interpretability analysis of a convolutional neural network based model for COVID-19 detection in audio. We investigate which features are important for the model's decision process, considering spectrograms, F0, F0 standard deviation, sex, and age. We then analyse model decisions by generating heat maps for the trained models to capture their attention during the decision process. Adopting an explainable Artificial Intelligence approach, we show that the studied models can take unbiased decisions even in the presence of spurious data in the training set, given adequate preprocessing steps. Our best model achieves 94.44% detection accuracy, with results indicating that the models favor spectrograms in the decision process, particularly high-energy areas of the spectrogram related to prosodic domains, while F0 also leads to efficient COVID-19 detection.
    Deep neuroevolution for limited, heterogeneous data: proof-of-concept application to Neuroblastoma brain metastasis using a small virtual pooled image collection. (arXiv:2211.14499v1 [cs.NE])
    Artificial intelligence (AI) in radiology has made great strides in recent years, but many hurdles remain. Overfitting and lack of generalizability represent important ongoing challenges hindering accurate and dependable clinical deployment. If AI algorithms can avoid overfitting and achieve true generalizability, they can go from the research realm to the forefront of clinical work. Recently, small data AI approaches such as deep neuroevolution (DNE) have avoided overfitting small training sets. We seek to address both overfitting and generalizability by applying DNE to a virtually pooled data set consisting of images from various institutions. Our use case is classifying neuroblastoma brain metastases on MRI. Neuroblastoma is well-suited for our goals because it is a rare cancer; hence, studying this pediatric disease requires a small data approach. Because we are a tertiary care center, the neuroblastoma images in our local Picture Archiving and Communication System (PACS) are largely from outside institutions. These multi-institutional images provide a heterogeneous data set that can simulate real-world clinical deployment. As in prior DNE work, we used a small training set, consisting of 30 normal and 30 metastasis-containing post-contrast MRI brain scans, with 37% outside images. The testing set was enriched with 83% outside images. DNE converged to a testing set accuracy of 97%. The algorithm was thus able to predict image class with near-perfect accuracy on a testing set that simulates real-world data. The work described here therefore represents a considerable contribution toward clinically feasible AI.
    MDA: Availability-Aware Federated Learning Client Selection. (arXiv:2211.14391v1 [cs.LG])
    Recently, a new distributed learning scheme called Federated Learning (FL) has been introduced. FL is designed so that the server never collects user-owned data, making it well suited to preserving privacy. FL's process starts with the server sending a model to clients; the clients then train that model using their data and send the updated model back to the server. Afterward, the server aggregates all the updates and modifies the global model. This process is repeated until the model converges. This study focuses on an FL setting called cross-device FL, which trains on a large number of clients. Since many devices may be unavailable in cross-device FL, and communication between the server and all clients is extremely costly, only a fraction of clients gets selected for training in each round. In vanilla FL, clients are selected randomly, which results in acceptable accuracy but is not ideal from the overall training time perspective, since some clients are slow and can cause some training rounds to be slow. If only fast clients were selected, learning would speed up, but it would be biased toward the fast clients' data, and accuracy would degrade. Consequently, new client selection techniques have been proposed to improve the training time by considering individual clients' resources and speed. This paper introduces the first availability-aware selection strategy, called MDA. The results show that our approach makes learning faster than vanilla FL by up to 6.5%. Moreover, we show that resource heterogeneity-aware techniques are effective but can become even better when combined with our approach, making learning faster than with the state-of-the-art selectors by up to 16%. Lastly, our approach selects more unique clients for training compared to client selectors that only select fast clients, which reduces our technique's bias.
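    Since the abstract does not spell out the selection rule, the sketch below only illustrates what an availability-aware client selector could look like: availability estimates bias part of the selection while the rest stays random to limit bias. The greedy fraction and the scoring are invented for illustration and are not MDA's actual rule.

        import random

        def availability_aware_select(clients, availability, m, greedy_frac=0.5):
            # clients:      list of client ids.
            # availability: dict id -> estimated probability the client is
            #               reachable this round (e.g., empirical average of
            #               past rounds).
            # m:            number of clients to select.
            by_avail = sorted(clients, key=lambda c: availability[c], reverse=True)
            n_greedy = int(m * greedy_frac)
            chosen = by_avail[:n_greedy]            # most-available clients
            rest = [c for c in clients if c not in chosen]
            chosen += random.sample(rest, m - n_greedy)  # random remainder
            return chosen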
    Constrained Pure Exploration Multi-Armed Bandits with a Fixed Budget. (arXiv:2211.14768v1 [cs.LG])
    We consider a constrained, pure exploration, stochastic multi-armed bandit formulation under a fixed budget. Each arm is associated with an unknown, possibly multi-dimensional distribution and is described by multiple attributes that are a function of this distribution. The aim is to optimize a particular attribute subject to user-defined constraints on the other attributes. This framework models applications such as financial portfolio optimization, where it is natural to perform risk-constrained maximization of mean return. We assume that the attributes can be estimated using samples from the arms' distributions and that these estimators satisfy suitable concentration inequalities. We propose an algorithm called \textsc{Constrained-SR} based on the Successive Rejects framework, which recommends an optimal arm and flags the instance as being feasible or infeasible. A key feature of this algorithm is that it is designed on the basis of an information theoretic lower bound for two-armed instances. We characterize an instance-dependent upper bound on the probability of error under \textsc{Constrained-SR}, that decays exponentially with respect to the budget. We further show that the associated decay rate is nearly optimal relative to an information theoretic lower bound in certain special cases.
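    The sketch below follows the classical Successive Rejects phase schedule and adds a constraint check, as a hedged illustration of the \textsc{Constrained-SR} structure; the concrete rejection rule and feasibility flag shown here are illustrative choices, not the paper's exact algorithm.

        import numpy as np

        def constrained_sr(arms, budget, objective, constraint, threshold):
            # arms:       list of callables, arm() -> one sample as a dict of
            #             attribute values.
            # objective:  attribute to maximize; constraint: attribute that must
            #             stay below `threshold`.
            K = len(arms)
            log_bar = 0.5 + sum(1.0 / i for i in range(2, K + 1))
            samples = {i: [] for i in range(K)}
            alive = set(range(K))
            prev_n = 0
            for phase in range(1, K):
                # classical Successive Rejects phase length
                n_k = int(np.ceil((budget - K) / (log_bar * (K + 1 - phase))))
                for i in alive:
                    samples[i].extend(arms[i]() for _ in range(n_k - prev_n))
                prev_n = n_k
                means = {i: {a: np.mean([s[a] for s in samples[i]])
                             for a in (objective, constraint)} for i in alive}
                # illustrative rule: reject infeasible-looking arms first,
                # otherwise the worst feasible arm
                infeasible = [i for i in alive if means[i][constraint] > threshold]
                worst = (max(infeasible, key=lambda i: means[i][constraint])
                         if infeasible
                         else min(alive, key=lambda i: means[i][objective]))
                alive.discard(worst)
            best = next(iter(alive))
            feasible = np.mean([s[constraint] for s in samples[best]]) <= threshold
            return best, feasible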
    Predictive linguistic cues for fake news: a societal artificial intelligence problem. (arXiv:2211.14505v1 [cs.CL])
    Media news shape a large part of public opinion and, therefore, must not be fake. News on web sites, blogs, and social media must be analyzed before being published. In this paper, we present linguistic characteristics of media news items that differentiate between fake news and real news using machine learning algorithms. Neural fake news generation, headlines created by machines, and semantic incongruities in machine-generated text and image captions are other types of fake news problems. These approaches use neural networks that mainly control distributional features rather than evidence. We propose applying the correlation between the feature set and the class, and the correlation among the features, to compute a correlation attribute evaluation metric, and a covariance metric to compute the variance of attributes over the news items. The features unique, negative, positive, and cardinal numbers, which score highly on these metrics, are observed to provide a high area under the curve (AUC) and F1-score.  ( 2 min )
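    As a hedged reading of the proposed metrics, the following sketch computes the correlation of each linguistic feature with the class label, the correlation among features, and the covariance of attributes over the news items; the exact metric definitions in the paper may differ.

        import numpy as np

        def feature_metrics(X, y):
            # X: (n_items, n_features) matrix of linguistic feature values.
            # y: (n_items,) binary labels (0 = real, 1 = fake).
            Xc = X - X.mean(axis=0)
            yc = y - y.mean()
            # Pearson correlation between each feature column and the class label
            corr_with_class = (Xc * yc[:, None]).sum(0) / (
                np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc) + 1e-12)
            # feature-feature correlation and covariance over the news items
            corr_between = np.corrcoef(X, rowvar=False)
            cov = np.cov(X, rowvar=False)
            return corr_with_class, corr_between, cov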
    Communication-Efficient Collaborative Best Arm Identification. (arXiv:2208.09029v2 [cs.LG] UPDATED)
    We investigate top-$m$ arm identification, a basic problem in bandit theory, in a multi-agent learning model in which agents collaborate to learn an objective function. We are interested in designing collaborative learning algorithms that achieve maximum speedup (compared to single-agent learning algorithms) using minimum communication cost, as communication is frequently the bottleneck in multi-agent learning. We give both algorithmic and impossibility results, and conduct a set of experiments to demonstrate the effectiveness of our algorithms.  ( 2 min )
    Decentralized Complete Dictionary Learning via $\ell^{4}$-Norm Maximization. (arXiv:2211.03628v2 [cs.LG] UPDATED)
    With the rapid development of information technologies, centralized data processing is subject to many limitations, such as computational overheads, communication delays, and data privacy leakage. Decentralized data processing over networked terminal nodes has become an important technology in the era of big data. Dictionary learning is a powerful representation learning method to exploit the low-dimensional structure of high-dimensional data. By exploiting the low-dimensional structure, the storage and the processing overhead of data can be effectively reduced. In this paper, we propose a novel decentralized complete dictionary learning algorithm, which is based on $\ell^{4}$-norm maximization. Compared with existing decentralized dictionary learning algorithms, comprehensive numerical experiments show that the novel algorithm has significant advantages in terms of per-iteration computational complexity, communication cost, and convergence rate in many scenarios. Moreover, a rigorous theoretical analysis shows that the dictionaries learned by the proposed algorithm can converge to the one learned by a centralized dictionary learning algorithm at a linear rate with high probability under certain conditions.
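    For readers unfamiliar with the objective, the sketch below shows one centralized $\ell^{4}$-norm maximization step over the orthogonal group (the MSP fixed-point iteration of Zhai et al.); the decentralized algorithm in the paper distributes this computation across nodes, which the sketch does not attempt.

        import numpy as np

        def msp_step(A, Y):
            # Maximizing sum((A @ Y)**4) over orthogonal A has the fixed-point
            # update A <- Polar((A Y)^{o3} Y^T), with ^{o3} the entrywise cube.
            G = (A @ Y) ** 3 @ Y.T        # gradient direction of the l4 objective
            U, _, Vt = np.linalg.svd(G)   # polar factor = projection onto O(n)
            return U @ Vt

        # toy run: Y = D X with D orthogonal and X sparse; A should approach
        # a signed permutation of D^T
        rng = np.random.default_rng(1)
        n, p = 10, 5000
        D, _ = np.linalg.qr(rng.normal(size=(n, n)))
        X = rng.normal(size=(n, p)) * (rng.random((n, p)) < 0.3)  # sparse codes
        Y = D @ X
        A = np.linalg.qr(rng.normal(size=(n, n)))[0]
        for _ in range(50):
            A = msp_step(A, Y)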
    Traditional Classification Neural Networks are Good Generators: They are Competitive with DDPMs and GANs. (arXiv:2211.14794v1 [cs.CV])
    Classifiers and generators have long been separated. We break down this separation and showcase that conventional neural network classifiers can generate high-quality images of a large number of categories, comparable to state-of-the-art generative models (e.g., DDPMs and GANs). We achieve this by computing the partial derivative of the classification loss function with respect to the input and optimizing the input to produce an image. Since it is widely known that directly optimizing the inputs is similar to targeted adversarial attacks, which are incapable of generating human-meaningful images, we propose a mask-based stochastic reconstruction module to make the gradients semantic-aware and synthesize plausible images. We further propose a progressive-resolution technique to guarantee fidelity, which produces photorealistic images. Furthermore, we introduce a distance metric loss and a non-trivial distribution loss to ensure that classification neural networks can synthesize diverse and high-fidelity images. Using traditional neural network classifiers, we can generate good-quality images at 256$\times$256 resolution on ImageNet. Intriguingly, our method is also applicable to text-to-image generation by regarding image-text foundation models as generalized classifiers. Proving that classifiers have learned the data distribution and are ready for image generation has far-reaching implications, for classifiers are much easier to train than generative models like DDPMs and GANs. We do not even need to train classification models, because plenty of public ones are available for download. This also holds great potential for the interpretability and robustness of classifiers.
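    The mechanism at the heart of the paragraph above, optimizing the input by the gradient of the classification loss, can be sketched in a few lines of PyTorch. Without the paper's mask-based stochastic reconstruction and progressive-resolution modules, this bare loop typically produces adversarial-looking rather than photorealistic images; it is shown only to illustrate the gradient-w.r.t.-input step, and the step count, learning rate, and clamping range are illustrative assumptions.

        import torch
        import torch.nn.functional as F

        def generate_from_classifier(model, target_class,
                                     shape=(1, 3, 224, 224), steps=200, lr=0.05):
            model.eval()
            x = torch.randn(shape, requires_grad=True)   # start from noise
            opt = torch.optim.Adam([x], lr=lr)
            target = torch.tensor([target_class])
            for _ in range(steps):
                opt.zero_grad()
                # descend the classification loss w.r.t. the INPUT, not weights
                loss = F.cross_entropy(model(x), target)
                loss.backward()
                opt.step()
                x.data.clamp_(-3, 3)   # keep inputs in a plausible range
            return x.detach()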
    DigGAN: Discriminator gradIent Gap Regularization for GAN Training with Limited Data. (arXiv:2211.14694v1 [cs.LG])
    Generative adversarial nets (GANs) have been remarkably successful at learning to sample from distributions specified by a given dataset, particularly if the given dataset is reasonably large compared to its dimensionality. However, given limited data, classical GANs have struggled, and strategies like output regularization, data augmentation, use of pre-trained models, and pruning have been shown to lead to improvements. Notably, the applicability of these strategies is 1) often constrained to particular settings, e.g., availability of a pretrained GAN, or 2) increases training time, e.g., when using pruning. In contrast, we propose a Discriminator gradIent Gap regularized GAN (DigGAN) formulation which can be added to any existing GAN. DigGAN augments existing GANs by encouraging the gap between the norm of the gradient of a discriminator's prediction w.r.t.\ real images and w.r.t.\ the generated samples to narrow. We observe this formulation to avoid bad attractors within the GAN loss landscape, and we find DigGAN to significantly improve the results of GAN training when limited data is available. Code is available at \url{https://github.com/AilsaF/DigGAN}.
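    A minimal PyTorch sketch of the described penalty, assuming a scalar-output discriminator D; the penalty weight (called lambda_dig here, a name invented for illustration) and the exact formulation in the paper may differ slightly.

        import torch

        def dig_penalty(D, real, fake):
            # squared gap between the gradient norm of D on real images and on
            # generated ones; add `lambda_dig * dig_penalty(D, real, fake)` to
            # the discriminator loss
            def grad_norm(x):
                x = x.detach().requires_grad_(True)
                out = D(x).sum()
                (g,) = torch.autograd.grad(out, x, create_graph=True)
                return g.flatten(1).norm(2, dim=1).mean()
            return (grad_norm(real) - grad_norm(fake)) ** 2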
    FedSysID: A Federated Approach to Sample-Efficient System Identification. (arXiv:2211.14393v1 [cs.LG])
    We study the problem of learning a linear system model from the observations of $M$ clients. The catch: Each client is observing data from a different dynamical system. This work addresses the question of how multiple clients collaboratively learn dynamical models in the presence of heterogeneity. We pose this problem as a federated learning problem and characterize the tension between achievable performance and system heterogeneity. Furthermore, our federated sample complexity result provides a constant factor improvement over the single agent setting. Finally, we describe a meta federated learning algorithm, FedSysID, that leverages existing federated algorithms at the client level.
    Sampling Neural Radiance Fields for Refractive Objects. (arXiv:2211.14799v1 [cs.CV])
    Recently, differentiable volume rendering in neural radiance fields (NeRF) has gained a lot of popularity, and its variants have attained many impressive results. However, existing methods usually assume the scene is a homogeneous volume, so that a ray is cast along a straight path. In this work, the scene is instead a heterogeneous volume with a piecewise-constant refractive index, where the path is curved wherever it crosses boundaries between different refractive indices. For novel view synthesis of refractive objects, our NeRF-based framework aims to optimize the radiance fields of the bounded volume and its boundary from multi-view posed images with refractive object silhouettes. To tackle this challenging problem, the refractive index of a scene is reconstructed from the silhouettes. Given the refractive index, we extend the stratified and hierarchical sampling techniques in NeRF to allow drawing samples along a curved path tracked by the Eikonal equation. The results indicate that our framework outperforms the state-of-the-art method both quantitatively and qualitatively, demonstrating better performance on the perceptual similarity metric and an apparent improvement in rendering quality on several synthetic and real scenes.
    Autonomous Racing using a Hybrid Imitation-Reinforcement Learning Architecture. (arXiv:2110.05437v2 [cs.RO] UPDATED)
    In this work, we present a rigorous end-to-end control strategy for autonomous vehicles aimed at minimizing lap times in a time attack racing event. We also introduce AutoRACE Simulator developed as a part of this research project, which was employed to simulate accurate vehicular and environmental dynamics along with realistic audio-visual effects. We adopted a hybrid imitation-reinforcement learning architecture and crafted a novel reward function to train a deep neural network policy to drive (using imitation learning) and race (using reinforcement learning) a car autonomously in less than 20 hours. Deployment results were reported as a direct comparison of 10 autonomous laps against 100 manual laps by 10 different human players. The autonomous agent not only exhibited superior performance by gaining 0.96 seconds over the best manual lap, but it also dominated the human players by 1.46 seconds with regard to the mean lap time. This dominance could be justified in terms of better trajectory optimization and lower reaction time of the autonomous agent.
    Generating 2D and 3D Master Faces for Dictionary Attacks with a Network-Assisted Latent Space Evolution. (arXiv:2211.13964v2 [cs.CR] UPDATED)
    A master face is a face image that passes face-based identity authentication for a high percentage of the population. These faces can be used to impersonate, with a high probability of success, any user, without having access to any user information. We optimize these faces for 2D and 3D face verification models, by using an evolutionary algorithm in the latent embedding space of the StyleGAN face generator. For 2D face verification, multiple evolutionary strategies are compared, and we propose a novel approach that employs a neural network to direct the search toward promising samples, without adding fitness evaluations. The results we present demonstrate that it is possible to obtain a considerable coverage of the identities in the LFW or RFW datasets with less than 10 master faces, for six leading deep face recognition systems. In 3D, we generate faces using the 2D StyleGAN2 generator and predict a 3D structure using a deep 3D face reconstruction network. When employing two different 3D face recognition systems, we are able to obtain a coverage of 40%-50%. Additionally, we present the generation of paired 2D RGB and 3D master faces, which simultaneously match 2D and 3D models with high impersonation rates.
    Detecting DeFi Securities Violations from Token Smart Contract Code. (arXiv:2112.02731v3 [cs.LG] UPDATED)
    Decentralized Finance (DeFi) is a system of financial products and services built and delivered through smart contracts on various blockchains. In the past year, DeFi has gained popularity and market capitalization. However, it has also been connected to crime, in particular, various types of securities violations. The lack of Know Your Customer requirements in DeFi poses challenges to governments trying to mitigate potential offending in this space. This study aims to uncover whether this problem is suited to a machine learning approach, namely, whether we can identify DeFi projects potentially engaging in securities violations based on their tokens' smart contract code. We adapt prior work on detecting specific types of securities violations across Ethereum, building classifiers based on features extracted from DeFi projects' tokens' smart contract code. The final logistic regression model achieves a 98.9% F1-score; the final random forest classifier achieves a 98.6% F1-score. From further feature-level analysis, we find a single feature makes this a highly detectable problem. The high reliance on a single feature means that, at this stage, a complex machine learning model may not be necessary or desirable for this problem. However, this may change as DeFi securities violations become more sophisticated. Another contribution of our study is a new dataset, comprised of (a) a verified ground truth dataset for tokens involved in securities violations and (b) a set of legitimate tokens from a reputable DeFi aggregator. This paper further discusses the potential use of a model like ours by prosecutors in enforcement efforts and connects it to the wider legal context.
    Deep representation learning: Fundamentals, Perspectives, Applications, and Open Challenges. (arXiv:2211.14732v1 [cs.LG])
    Machine Learning algorithms have had a profound impact on the field of computer science over the past few decades. These algorithms' performance is greatly influenced by the representations derived from the data in the learning process. The representations learned in a successful learning process should be concise, discrete, meaningful, and applicable across a variety of tasks. A recent effort has been directed toward developing Deep Learning models, which have proven particularly effective at capturing high-dimensional, non-linear, and multi-modal characteristics. In this work, we discuss the principles and developments that have been made in learning representations and in converting them into desirable applications. In addition, for each framework or model, we examine the key issues and open challenges, as well as the advantages.
    Doubly robust nearest neighbors in factor models. (arXiv:2211.14297v2 [stat.ML] UPDATED)
    In this technical note, we introduce an improved variant of nearest neighbors for counterfactual inference in panel data settings where multiple units are assigned multiple treatments over multiple time points, each sampled with constant probabilities. We call this estimator a doubly robust nearest neighbor estimator and provide a high probability non-asymptotic error bound for the mean parameter corresponding to each unit at each time. Our guarantee shows that the doubly robust estimator provides a (near-)quadratic improvement in the error compared to nearest neighbor estimators analyzed in prior work for these settings.
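    The note's estimator can be loosely sketched as a row-plus-column-minus-cross combination over nearest neighbors, in the spirit of doubly robust estimation; the neighbor definitions and the exact form below are assumptions for illustration, not the note's precise estimator.

        import numpy as np

        def doubly_robust_nn(Y, obs, i, t, row_nbrs, col_nbrs):
            # Y:        (units, times) panel of outcomes, NaN where unobserved.
            # obs:      boolean mask of observed entries.
            # row_nbrs: units j whose observed rows look similar to unit i.
            # col_nbrs: time points s whose observed columns look similar to t.
            row = np.nanmean([Y[i, s] for s in col_nbrs if obs[i, s]])
            col = np.nanmean([Y[j, t] for j in row_nbrs if obs[j, t]])
            cross = np.nanmean([Y[j, s] for j in row_nbrs for s in col_nbrs
                                if obs[j, s]])
            # combine row average, column average, and cross-term correction
            return row + col - cross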
    A Comprehensive Study of Radiomics-based Machine Learning for Fibrosis Detection. (arXiv:2211.14396v1 [cs.CV])
    Objectives: Early detection of liver fibrosis can help cure the disease or prevent disease progression. We perform a comprehensive study of machine learning-based fibrosis detection in CT images using radiomic features to develop a non-invasive approach to fibrosis detection. Methods: Two sets of radiomic features were extracted from spherical ROIs in CT images of 182 patients who underwent simultaneous liver biopsy and CT examinations, one set corresponding to biopsy locations and another distant from biopsy locations. Combinations of contrast, normalization, machine learning model, feature selection method, bin width, and kernel radius were investigated, each of which was trained and evaluated 100 times with randomized development and test cohorts. The best settings were evaluated based on their mean test AUC, and the best features were determined based on their frequency among the best settings. Results: Logistic regression models with NC images normalized using Gamma correction with $\gamma = 1.5$ performed best for fibrosis detection. Boruta was the best radiomic feature selection method. Training a model using these optimal settings and the features first-order energy, first-order kurtosis, and first-order skewness resulted in a model that achieved mean test AUCs of 0.7549 and 0.7166 on biopsy-based and non-biopsy ROIs, respectively, outperforming the baseline and the best models found during the initial study. Conclusions: Logistic regression models trained on radiomic features from NC images normalized using Gamma correction with $\gamma = 1.5$ that underwent Boruta feature selection are effective for liver fibrosis detection. Energy, kurtosis, and skewness are particularly effective features for fibrosis detection.
    PatchBlender: A Motion Prior for Video Transformers. (arXiv:2211.14449v1 [cs.CV])
    Transformers have become one of the dominant architectures in the field of computer vision. However, there are still several challenges when applying such architectures to video data. Most notably, these models struggle to model the temporal patterns of video data effectively. Directly targeting this issue, we introduce PatchBlender, a learnable blending function that operates over patch embeddings across the temporal dimension of the latent space. We show that our method is successful at enabling vision transformers to encode the temporal component of video data. On Something-Something v2 and MOVi-A, we show that our method improves the performance of a ViT-B. PatchBlender has the advantage of being compatible with almost any Transformer architecture, and since it is learnable, the model can adaptively turn the prior on or off. It is also extremely lightweight compute-wise, adding only 0.005% of the GFLOPs of a ViT-B.
    Identifying Chemicals Through Dimensionality Reduction. (arXiv:2211.14708v1 [q-bio.QM])
    Civilizations have tried to make drinking water safe to consume for thousands of years. The process of determining water contaminants has evolved with the complexity of the contaminants due to pesticides and heavy metals. The routine procedure to determine water safety is to use targeted analysis, which searches for specific substances from some known list; however, we do not explicitly know which substances should be on this list. Before experimentally determining which substances are contaminants, how do we answer the sampling problem of identifying all the substances in the water? Here, we present an approach that builds on the work of Jaanus Liigand et al., which used non-targeted analysis, conducting a broader search on the sample, to develop a random-forest regression model that predicts the names of all the substances in a sample, as well as their respective concentrations [1]. This work utilizes techniques from dimensionality reduction and linear decompositions to present a more accurate model, using data from the European Massbank Metabolome Library, to produce a global list of chemicals that researchers can then identify and test for when purifying water.
  • Open

    Convergence Rate Analysis for Optimal Computing Budget Allocation Algorithms. (arXiv:2211.14722v1 [stat.ML])
    Ordinal optimization (OO) is a widely-studied technique for optimizing discrete-event dynamic systems (DEDS). It evaluates the performance of the system designs in a finite set by sampling and aims to correctly make ordinal comparisons of the designs. A well-known method in OO is the optimal computing budget allocation (OCBA). It derives optimality conditions for the number of samples allocated to each design, and the sample allocation that satisfies these conditions is shown to asymptotically maximize the probability of correctly selecting the best design. In this paper, we investigate two popular OCBA algorithms. With known variances for samples of each design, we characterize their convergence rates with respect to different performance measures. We first demonstrate that the two OCBA algorithms achieve the optimal convergence rate under the measures of probability of correct selection and expected opportunity cost. This fills a gap in the convergence analysis of OCBA algorithms. Next, we extend our analysis to the measure of cumulative regret, a main measure studied in the field of machine learning. We show that with minor modification, the two OCBA algorithms can reach the optimal convergence rate under cumulative regret. This indicates the potential for broader use of algorithms designed based on the OCBA optimality conditions.  ( 2 min )
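    A minimal sketch of a sequential allocation loop built on the standard OCBA optimality conditions, with known variances as in the paper's setting (the batch size and the multinomial rounding are illustrative choices):

        import numpy as np

        def ocba_ratios(means, sigmas):
            # Target allocation ratios from the classical OCBA conditions.
            b = means.argmax()
            delta = means[b] - means                # gaps to the current best
            nb = delta != 0
            r = np.zeros_like(means)
            r[nb] = (sigmas[nb] / delta[nb]) ** 2
            r[b] = sigmas[b] * np.sqrt(((r[nb] / sigmas[nb]) ** 2).sum())
            return r / r.sum()

        rng = np.random.default_rng(0)
        mu, sigma = np.array([1.0, 1.2, 1.5, 1.4]), np.ones(4)
        n = np.full(4, 5)                           # initial samples per design
        means = np.array([rng.normal(m, s, k).mean() for m, s, k in zip(mu, sigma, n)])
        for _ in range(50):                         # sequential allocation rounds
            alloc = rng.multinomial(10, ocba_ratios(means, sigma))
            for i, a in enumerate(alloc):
                if a:
                    means[i] = (means[i] * n[i] + rng.normal(mu[i], sigma[i], a).sum()) / (n[i] + a)
                    n[i] += a
        print("selected design:", means.argmax())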
    Transductive Kernels for Gaussian Processes on Graphs. (arXiv:2211.15322v1 [cs.LG])
    Kernels on graphs have had limited options for node-level problems. To address this, we present a novel, generalized kernel for graphs with node feature data for semi-supervised learning. The kernel is derived from a regularization framework by treating the graph and feature data as two Hilbert spaces. We also show how numerous kernel-based models on graphs are instances of our design. A kernel defined this way has transductive properties, and this leads to improved ability to learn on fewer training points, as well as better handling of highly non-Euclidean data. We demonstrate these advantages using synthetic data where the distribution of the whole graph can inform the pattern of the labels. Finally, by utilizing a flexible polynomial of the graph Laplacian within the kernel, the model also performed effectively in semi-supervised classification on graphs of various levels of homophily.  ( 2 min )
    Why Neural Networks Work. (arXiv:2211.14632v1 [cs.LG])
    We argue that many properties of fully-connected feedforward neural networks (FCNNs), also called multi-layer perceptrons (MLPs), are explainable from the analysis of a single pair of operations, namely a random projection into a higher-dimensional space than the input, followed by a sparsification operation. For convenience, we call this pair of successive operations expand-and-sparsify, following the terminology of Dasgupta. We show how expand-and-sparsify can explain the observed phenomena that have been discussed in the literature, such as the so-called Lottery Ticket Hypothesis, the surprisingly good performance of randomly-initialized untrained neural networks, the efficacy of Dropout in training, and, most importantly, the mysterious generalization ability of overparameterized models, first highlighted by Zhang et al. and subsequently identified even in non-neural network models by Belkin et al.
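    A minimal NumPy sketch of the expand-and-sparsify pair of operations (the expansion factor and the top-k fraction are illustrative):

        import numpy as np

        def expand_and_sparsify(x, expansion=20, k_frac=0.05, seed=0):
            # Random projection into a higher-dimensional space, followed by
            # sparsification that keeps only the top-k activations per input.
            rng = np.random.default_rng(seed)
            d = x.shape[-1]
            W = rng.normal(size=(d, expansion * d)) / np.sqrt(d)
            h = x @ W
            k = max(1, int(k_frac * h.shape[-1]))
            thresh = np.partition(h, -k, axis=-1)[..., [-k]]
            return np.where(h >= thresh, h, 0.0)

        x = np.random.randn(32, 100)                    # a batch of inputs
        print((expand_and_sparsify(x) != 0).mean())     # ~0.05 of units survive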
    Asymptotic Optimality of Myopic Ranking and Selection Procedures. (arXiv:2211.14723v1 [stat.ML])
    Ranking and selection (R&S) is a popular model for studying discrete-event dynamic systems. It aims to select the best design (the design with the largest mean performance) from a finite set, where the mean of each design is unknown and has to be learned from samples. Great research efforts have been devoted to this problem in the literature, developing procedures with superior empirical performance and showing their optimality. Among these efforts, myopic procedures have been popular. They select the best design using a 'naive' mechanism of iteratively and myopically improving an approximation of the objective measure. Although they are based on simple heuristics and lack theoretical support, they have turned out to be highly effective, often achieving empirical performance competitive with procedures that were proposed later and shown to be asymptotically optimal. In this paper, we theoretically analyze these myopic procedures and prove that they also satisfy the optimality conditions of R&S, just like some other popular R&S methods. This explains the good performance of myopic procedures in various numerical tests, and provides insight into the structure and theoretical development of efficient R&S procedures.
    A Theoretical Study of Inductive Biases in Contrastive Learning. (arXiv:2211.14699v1 [cs.LG])
    Understanding self-supervised learning is important but challenging. Previous theoretical works study the role of pretraining losses, and view neural networks as general black boxes. However, the recent work of Saunshi et al. argues that the model architecture -- a component largely ignored by previous works -- also has significant influences on the downstream performance of self-supervised learning. In this work, we provide the first theoretical analysis of self-supervised learning that incorporates the effect of inductive biases originating from the model class. In particular, we focus on contrastive learning -- a popular self-supervised learning method that is widely used in the vision domain. We show that when the model has limited capacity, contrastive representations would recover certain special clustering structures that are compatible with the model architecture, but ignore many other clustering structures in the data distribution. As a result, our theory can capture the more realistic setting where contrastive representations have much lower dimensionality than the number of clusters in the data distribution. We instantiate our theory on several synthetic data distributions, and provide empirical evidence to support the theory.
    Identifying good directions to escape the NTK regime and efficiently learn low-degree plus sparse polynomials. (arXiv:2206.03688v2 [cs.LG] UPDATED)
    A recent goal in the theory of deep learning is to identify how neural networks can escape the "lazy training," or Neural Tangent Kernel (NTK) regime, where the network is coupled with its first order Taylor expansion at initialization. While the NTK is minimax optimal for learning dense polynomials (Ghorbani et al, 2021), it cannot learn features, and hence has poor sample complexity for learning many classes of functions including sparse polynomials. Recent works have thus aimed to identify settings where gradient based algorithms provably generalize better than the NTK. One such example is the "QuadNTK" approach of Bai and Lee (2020), which analyzes the second-order term in the Taylor expansion. Bai and Lee (2020) show that the second-order term can learn sparse polynomials efficiently; however, it sacrifices the ability to learn general dense polynomials. In this paper, we analyze how gradient descent on a two-layer neural network can escape the NTK regime by utilizing a spectral characterization of the NTK (Montanari and Zhong, 2020) and building on the QuadNTK approach. We first expand upon the spectral analysis to identify "good" directions in parameter space in which we can move without harming generalization. Next, we show that a wide two-layer neural network can jointly use the NTK and QuadNTK to fit target functions consisting of a dense low-degree term and a sparse high-degree term -- something neither the NTK nor the QuadNTK can do on their own. Finally, we construct a regularizer which encourages our parameter vector to move in the "good" directions, and show that gradient descent on the regularized loss will converge to a global minimizer, which also has low test error. This yields an end to end convergence and generalization guarantee with provable sample complexity improvement over both the NTK and QuadNTK on their own.
    Characterization and Greedy Learning of Gaussian Structural Causal Models under Unknown Interventions. (arXiv:2211.14897v1 [stat.ME])
    We consider the problem of recovering the causal structure underlying observations from different experimental conditions when the targets of the interventions in each experiment are unknown. We assume a linear structural causal model with additive Gaussian noise and consider interventions that perturb their targets while maintaining the causal relationships in the system. Different models may entail the same distributions, offering competing causal explanations for the given observations. We fully characterize this equivalence class and offer identifiability results, which we use to derive a greedy algorithm called GnIES to recover the equivalence class of the data-generating model without knowledge of the intervention targets. In addition, we develop a novel procedure to generate semi-synthetic data sets with known causal ground truth but distributions closely resembling those of a real data set of choice. We leverage this procedure and evaluate the performance of GnIES on synthetic, real, and semi-synthetic data sets. Despite the strong Gaussian distributional assumption, GnIES is robust to an array of model violations and competitive in recovering the causal structure in small- to large-sample settings. We provide, in the Python packages "gnies" and "sempler", implementations of GnIES and our semi-synthetic data generation procedure.
    Label Alignment Regularization for Distribution Shift. (arXiv:2211.14960v1 [cs.LG])
    Recent work reported the label alignment property in a supervised learning setting: the vector of all labels in the dataset is mostly in the span of the top few singular vectors of the data matrix. Inspired by this observation, we derive a regularization method for unsupervised domain adaptation. Instead of regularizing representation learning as done by popular domain adaptation methods, we regularize the classifier so that the target domain predictions can to some extent ``align" with the top singular vectors of the unsupervised data matrix from the target domain. In a linear regression setting, we theoretically justify the label alignment property and characterize the optimality of the solution of our regularization by bounding its distance to the optimal solution. We conduct experiments to show that our method can work well on the label shift problems, where classic domain adaptation methods are known to fail. We also report mild improvement over domain adaptation baselines on a set of commonly seen MNIST-USPS domain adaptation tasks and on cross-lingual sentiment analysis tasks.
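    A minimal sketch of this regularizer in the linear regression setting: penalize the component of the target-domain predictions lying outside the span of the top-k left singular vectors of the target data matrix (k and the regularization strength are illustrative assumptions):

        import numpy as np

        def label_aligned_fit(Xs, ys, Xt, k=5, lam=1.0):
            # min_w ||Xs w - ys||^2 + lam * ||P Xt w||^2, where P projects
            # onto the trailing left singular directions of the target matrix.
            U, _, _ = np.linalg.svd(Xt, full_matrices=False)
            P = np.eye(Xt.shape[0]) - U[:, :k] @ U[:, :k].T
            A = Xs.T @ Xs + lam * Xt.T @ P @ Xt
            return np.linalg.solve(A, Xs.T @ ys)

        Xs, ys = np.random.randn(200, 20), np.random.randn(200)
        Xt = np.random.randn(150, 20)                 # unlabeled target data
        w = label_aligned_fit(Xs, ys, Xt)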
    KSD Aggregated Goodness-of-fit Test. (arXiv:2202.00824v4 [stat.ML] UPDATED)
    We investigate properties of goodness-of-fit tests based on the Kernel Stein Discrepancy (KSD). We introduce a strategy to construct a test, called KSDAgg, which aggregates multiple tests with different kernels. KSDAgg avoids splitting the data to perform kernel selection (which leads to a loss in test power), and rather maximises the test power over a collection of kernels. We provide non-asymptotic guarantees on the power of KSDAgg: we show it achieves the smallest uniform separation rate of the collection, up to a logarithmic term. For compactly supported densities with bounded model score function, we derive the rate for KSDAgg over restricted Sobolev balls; this rate corresponds to the minimax optimal rate over unrestricted Sobolev balls, up to an iterated logarithmic term. KSDAgg can be computed exactly in practice as it relies either on a parametric bootstrap or on a wild bootstrap to estimate the quantiles and the level corrections. In particular, for the crucial choice of bandwidth of a fixed kernel, it avoids resorting to arbitrary heuristics (such as median or standard deviation) or to data splitting. We find on both synthetic and real-world data that KSDAgg outperforms other state-of-the-art quadratic-time adaptive KSD-based goodness-of-fit testing procedures.
    Accelerated Gradient Methods for Sparse Statistical Learning with Nonconvex Penalties. (arXiv:2009.10629v4 [math.OC] UPDATED)
    Nesterov's accelerated gradient (AG) is a popular technique for optimizing objective functions comprising two components: a convex loss and a penalty function. While AG methods perform well for convex penalties, such as the LASSO, convergence issues may arise when they are applied to nonconvex penalties, such as SCAD. A recent proposal generalizes Nesterov's AG method to the nonconvex setting, but the proposed algorithm requires specification of several hyperparameters for its practical application. Aside from some general conditions, there is no explicit rule for selecting the hyperparameters, nor guidance on how different selections affect convergence of the algorithm. In this article, we propose a hyperparameter setting based on the complexity upper bound to accelerate convergence, and consider the application of this nonconvex AG algorithm to high-dimensional linear and logistic sparse learning problems. We further establish the rate of convergence and present a simple and useful bound to characterize our proposed optimal damping sequence. Simulation studies show that convergence is, on average, considerably faster than that of the conventional proximal gradient algorithm. Our experiments also show that the proposed method generally outperforms the current state-of-the-art methods in terms of signal recovery.
    Synergies Between Disentanglement and Sparsity: a Multi-Task Learning Perspective. (arXiv:2211.14666v1 [cs.LG])
    Although disentangled representations are often said to be beneficial for downstream tasks, current empirical and theoretical understanding is limited. In this work, we provide evidence that disentangled representations coupled with sparse base-predictors improve generalization. In the context of multi-task learning, we prove a new identifiability result that provides conditions under which maximally sparse base-predictors yield disentangled representations. Motivated by this theoretical result, we propose a practical approach to learn disentangled representations based on a sparsity-promoting bi-level optimization problem. Finally, we explore a meta-learning version of this algorithm based on group Lasso multiclass SVM base-predictors, for which we derive a tractable dual formulation. It obtains competitive results on standard few-shot classification benchmarks, while each task uses only a fraction of the learned representations.
    Spectral Diffusion Processes. (arXiv:2209.14125v2 [stat.ML] UPDATED)
    Score-based generative modelling (SGM) has proven to be a very effective method for modelling densities on finite-dimensional spaces. In this work we propose to extend this methodology to learn generative models over functional spaces. To do so, we represent functional data in spectral space to dissociate the stochastic part of the processes from their space-time part. Using dimensionality reduction techniques we then sample from their stochastic component using finite dimensional SGM. We demonstrate our method's effectiveness for modelling various multimodal datasets.
    Statistical Learning and Inverse Problems: A Stochastic Gradient Approach. (arXiv:2209.14967v3 [stat.ML] UPDATED)
    Inverse problems are paramount in science and engineering. In this paper, we consider the setup of the Statistical Inverse Problem (SIP) and demonstrate how Stochastic Gradient Descent (SGD) algorithms can be used in the linear SIP setting. We provide consistency and finite sample bounds for the excess risk. We also propose a modification of the SGD algorithm where we leverage machine learning methods to smooth the stochastic gradients and improve empirical performance. We illustrate the algorithm in a setting of great current interest: the functional linear regression model. Here we consider a synthetic data example as well as a real-data classification problem.
    Looking at the posterior: on the origin of uncertainty in neural-network classification. (arXiv:2211.14605v1 [cs.LG])
    Bayesian inference can quantify uncertainty in the predictions of neural networks using posterior distributions for model parameters and network output. By looking at these posterior distributions, one can separate the origin of uncertainty into aleatoric and epistemic components. We use the joint distribution of predictive uncertainty and epistemic uncertainty to quantify how this interpretation of uncertainty depends upon model architecture, dataset complexity, and data distributional shifts in image classification tasks. We conclude that the origin of uncertainty is specific to each neural network and that the quantification of the uncertainty induced by data distributional shifts depends on the complexity of the underlying dataset. Furthermore, we show that the joint distribution of predictive and epistemic uncertainty can be used to identify data domains where the model is most accurate. To arrive at these results, we use two common posterior approximation methods, Monte-Carlo dropout and deep ensembles, for fully-connected, convolutional and attention-based neural networks.
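    A minimal sketch of one common entropy-based decomposition of predictive uncertainty from stochastic forward passes such as MC dropout (the paper's exact posterior-based quantities may differ):

        import numpy as np

        def uncertainty_decomposition(probs):
            # probs: (T, num_classes) class probabilities from T stochastic
            # forward passes for a single input.
            eps = 1e-12
            mean_p = probs.mean(axis=0)
            predictive = -(mean_p * np.log(mean_p + eps)).sum()             # total
            aleatoric = -(probs * np.log(probs + eps)).sum(axis=1).mean()   # data
            return predictive, aleatoric, predictive - aleatoric            # epistemic

        probs = np.random.dirichlet(np.ones(10), size=50)   # placeholder samples
        print(uncertainty_decomposition(probs))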
    FaiREE: Fair Classification with Finite-Sample and Distribution-Free Guarantee. (arXiv:2211.15072v1 [stat.ML])
    Algorithmic fairness plays an increasingly critical role in machine learning research. Several group fairness notions and algorithms have been proposed. However, the fairness guarantee of existing fair classification methods mainly depends on specific data distributional assumptions, often requiring large sample sizes, and fairness could be violated when there is a modest number of samples, which is often the case in practice. In this paper, we propose FaiREE, a fair classification algorithm that can satisfy group fairness constraints with finite-sample and distribution-free theoretical guarantees. FaiREE can be adapted to satisfy various group fairness notions (e.g., Equality of Opportunity, Equalized Odds, Demographic Parity, etc.) and achieve the optimal accuracy. These theoretical guarantees are further supported by experiments on both synthetic and real data. FaiREE is shown to have favorable performance over state-of-the-art algorithms.
    Constrained Pure Exploration Multi-Armed Bandits with a Fixed Budget. (arXiv:2211.14768v1 [cs.LG])
    We consider a constrained, pure exploration, stochastic multi-armed bandit formulation under a fixed budget. Each arm is associated with an unknown, possibly multi-dimensional distribution and is described by multiple attributes that are a function of this distribution. The aim is to optimize a particular attribute subject to user-defined constraints on the other attributes. This framework models applications such as financial portfolio optimization, where it is natural to perform risk-constrained maximization of mean return. We assume that the attributes can be estimated using samples from the arms' distributions and that these estimators satisfy suitable concentration inequalities. We propose an algorithm called \textsc{Constrained-SR} based on the Successive Rejects framework, which recommends an optimal arm and flags the instance as being feasible or infeasible. A key feature of this algorithm is that it is designed on the basis of an information theoretic lower bound for two-armed instances. We characterize an instance-dependent upper bound on the probability of error under \textsc{Constrained-SR}, that decays exponentially with respect to the budget. We further show that the associated decay rate is nearly optimal relative to an information theoretic lower bound in certain special cases.
    Incentive-Aware Recommender Systems in Two-Sided Markets. (arXiv:2211.15381v1 [cs.IR])
    Online platforms in the Internet Economy commonly incorporate recommender systems that recommend arms (e.g., products) to agents (e.g., users). In such platforms, a myopic agent has a natural incentive to exploit, by choosing the best product given the current information rather than to explore various alternatives to collect information that will be used for other agents. We propose a novel recommender system that respects agents' incentives and enjoys asymptotically optimal performance as measured by regret in repeated games. We model such an incentive-aware recommender system as a multi-agent bandit problem in a two-sided market which is equipped with an incentive constraint induced by agents' opportunity costs. If the opportunity costs are known to the principal, we show that there exists an incentive-compatible recommendation policy, which pools recommendations across a genuinely good arm and an unknown arm via a randomized and adaptive approach. On the other hand, if the opportunity costs are unknown to the principal, we propose a policy that randomly pools recommendations across all arms and uses each arm's cumulative loss as feedback for exploration. We show that both policies also satisfy an ex-post fairness criterion, which protects agents from over-exploitation.
    Efficient Aggregated Kernel Tests using Incomplete $U$-statistics. (arXiv:2206.09194v2 [stat.ML] UPDATED)
    We propose a series of computationally efficient nonparametric tests for the two-sample, independence, and goodness-of-fit problems, using the Maximum Mean Discrepancy (MMD), Hilbert Schmidt Independence Criterion (HSIC), and Kernel Stein Discrepancy (KSD), respectively. Our test statistics are incomplete $U$-statistics, with a computational cost that interpolates between linear time in the number of samples, and quadratic time, as associated with classical $U$-statistic tests. The three proposed tests aggregate over several kernel bandwidths to detect departures from the null on various scales: we call the resulting tests MMDAggInc, HSICAggInc and KSDAggInc. This procedure provides a solution to the fundamental kernel selection problem as we can aggregate a large number of kernels with several bandwidths without incurring a significant loss of test power. For the test thresholds, we derive a quantile bound for wild bootstrapped incomplete $U$-statistics, which is of independent interest. We derive non-asymptotic uniform separation rates for MMDAggInc and HSICAggInc, and quantify exactly the trade-off between computational efficiency and the attainable rates: this result is novel for tests based on incomplete $U$-statistics, to our knowledge. We further show that in the quadratic-time case, the wild bootstrap incurs no penalty to test power over the more widespread permutation-based approach, since both attain the same minimax optimal rates (which in turn match the rates that use oracle quantiles). We support our claims with numerical experiments on the trade-off between computational efficiency and test power. In all three testing frameworks, the linear-time versions of our proposed tests perform at least as well as the current linear-time state-of-the-art tests.
    Physics-informed neural networks with unknown measurement noise. (arXiv:2211.15498v1 [stat.ML])
    Physics-informed neural networks (PINNs) constitute a flexible approach to both finding solutions and identifying parameters of partial differential equations. Most works on the topic assume noiseless data, or data contaminated by weak Gaussian noise. We show that the standard PINN framework breaks down in the case of non-Gaussian noise. We resolve this fundamental issue by proposing to jointly train an energy-based model (EBM) to learn the correct noise distribution. We illustrate the improved performance of our approach using multiple examples.
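    For context, a minimal sketch of the standard squared-error PINN objective (data term plus physics residual) that the paper shows breaks down under non-Gaussian noise, here for the toy ODE u' = -u; the EBM extension is omitted:

        import torch
        import torch.nn as nn

        net = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))
        t_data = torch.rand(50, 1)
        u_data = torch.exp(-t_data) + 0.05 * torch.randn_like(t_data)  # noisy obs.
        t_col = torch.rand(200, 1, requires_grad=True)                 # collocation

        opt = torch.optim.Adam(net.parameters(), lr=1e-3)
        for _ in range(2000):
            u = net(t_col)
            du = torch.autograd.grad(u, t_col, torch.ones_like(u), create_graph=True)[0]
            loss = ((net(t_data) - u_data) ** 2).mean() + ((du + u) ** 2).mean()
            opt.zero_grad(); loss.backward(); opt.step()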
    Using Sequential Statistical Tests for Efficient Hyperparameter Tuning. (arXiv:2112.12438v2 [cs.LG] UPDATED)
    Hyperparameter tuning is one of the most time-consuming parts of machine learning. Despite the existence of modern optimization algorithms that minimize the number of evaluations needed, evaluations of a single setting may still be expensive. Usually a resampling technique is used, where the machine learning method has to be fitted a fixed number of k times on different training datasets. The mean performance of the k fits is then used as the performance estimator. Many hyperparameter settings could be discarded after fewer than k resampling iterations if they are clearly inferior to high-performing settings. However, resampling is often performed until the very end, wasting a lot of computational effort. To this end, we propose the Sequential Random Search (SQRS), which extends the regular random search algorithm with a sequential testing procedure aimed at detecting and eliminating inferior parameter configurations early. We compared SQRS with regular random search using multiple publicly available regression and classification datasets. Our simulation study showed that SQRS is able to find similarly well-performing parameter settings while requiring noticeably fewer evaluations. Our results underscore the potential of integrating sequential tests into hyperparameter tuning.
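    A minimal sketch of the idea: resample a configuration fold by fold and drop it early when a paired test says it is clearly worse than the incumbent (the concrete test and thresholds here are illustrative, not the authors' exact procedure):

        import numpy as np
        from scipy.stats import ttest_rel

        def sqrs(eval_fold, sample_config, n_configs=20, k=10, alpha=0.01, min_folds=3):
            best_cfg, best_scores = None, None
            for _ in range(n_configs):
                cfg, scores = sample_config(), []
                for fold in range(k):
                    scores.append(eval_fold(cfg, fold))
                    if best_scores is not None and len(scores) >= min_folds:
                        t, p = ttest_rel(best_scores[:len(scores)], scores)
                        if t > 0 and p / 2 < alpha:   # incumbent clearly better:
                            break                     # stop resampling this config
                if len(scores) == k and (best_scores is None
                                         or np.mean(scores) > np.mean(best_scores)):
                    best_cfg, best_scores = cfg, scores
            return best_cfg

        rng = np.random.default_rng(0)
        best = sqrs(lambda cfg, fold: -(cfg - 0.3) ** 2 + 0.01 * rng.normal(),
                    lambda: rng.uniform(0, 1))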
    Copula Density Neural Estimation. (arXiv:2211.15353v1 [cs.LG])
    Probability density estimation from observed data constitutes a central task in statistics. Recent advancements in machine learning offer new tools but also pose new challenges. The big data era demands analysis of long-range spatial and long-term temporal dependencies in large collections of raw data, rendering neural networks an attractive solution for density estimation. In this paper, we exploit the concept of copula to explicitly build an estimate of the probability density function associated with any observed data. In particular, we separate the univariate marginal distributions from the joint dependence structure in the data, the copula itself, and we model the latter with a neural network-based method referred to as copula density neural estimation (CODINE). Results show that the novel learning approach is capable of modeling complex distributions and can be applied to mutual information estimation and data generation.
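    A minimal sketch of the copula factorization step: map each marginal to pseudo-uniforms with the empirical CDF, then fit a density on the unit hypercube (a Gaussian KDE stands in here for the neural copula estimator):

        import numpy as np
        from scipy.stats import gaussian_kde, rankdata

        x = np.random.multivariate_normal([0, 0], [[1, .8], [.8, 1]], size=1000)

        # Pseudo-observations: the empirical CDF maps each marginal to (0, 1).
        u = rankdata(x, axis=0) / (x.shape[0] + 1)

        copula_density = gaussian_kde(u.T)    # stand-in for the neural estimator
        print(copula_density(np.array([[0.5], [0.5]])))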
    A Permutation-free Kernel Two-Sample Test. (arXiv:2211.14908v1 [stat.ME])
    The kernel Maximum Mean Discrepancy (MMD) is a popular multivariate distance metric between distributions that has found utility in two-sample testing. The usual kernel-MMD test statistic is a degenerate U-statistic under the null, and thus it has an intractable limiting distribution. Hence, to design a level-$\alpha$ test, one usually selects the rejection threshold as the $(1-\alpha)$-quantile of the permutation distribution. The resulting nonparametric test has finite-sample validity but suffers from large computational cost, since every permutation takes quadratic time. We propose the cross-MMD, a new quadratic-time MMD test statistic based on sample-splitting and studentization. We prove that under mild assumptions, the cross-MMD has a limiting standard Gaussian distribution under the null. Importantly, we also show that the resulting test is consistent against any fixed alternative, and when using the Gaussian kernel, it has minimax rate-optimal power against local alternatives. For large sample sizes, our new cross-MMD provides a significant speedup over the MMD, for only a slight loss in power.
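    A minimal sketch of the sample-splitting-and-studentization recipe (this follows the general idea and may differ in details from the paper's exact statistic):

        import numpy as np
        from scipy.stats import norm

        def rbf(a, b, gamma=0.5):
            return np.exp(-gamma * ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1))

        def cross_mmd_test(x, y, alpha=0.05):
            # Split each sample in half; evaluate the witness built from the
            # second halves on the first halves, then studentize the mean.
            n, m = len(x) // 2, len(y) // 2
            x1, x2, y1, y2 = x[:n], x[n:], y[:m], y[m:]
            u = rbf(x1, x2).mean(1) - rbf(x1, y2).mean(1)
            v = rbf(y1, x2).mean(1) - rbf(y1, y2).mean(1)
            stat = (u.mean() - v.mean()) / np.sqrt(u.var(ddof=1) / n + v.var(ddof=1) / m)
            return stat, stat > norm.ppf(1 - alpha)    # approx. Gaussian null

        x, y = np.random.randn(200, 2), np.random.randn(200, 2) + 0.5
        print(cross_mmd_test(x, y))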
    Synthetic Principal Component Design: Fast Covariate Balancing with Synthetic Controls. (arXiv:2211.15241v1 [econ.EM])
    The optimal design of experiments typically involves solving an NP-hard combinatorial optimization problem. In this paper, we aim to develop a globally convergent and practically efficient optimization algorithm. Specifically, we consider a setting where the pre-treatment outcome data is available and the synthetic control estimator is invoked. The average treatment effect is estimated via the difference between the weighted average outcomes of the treated and control units, where the weights are learned from the observed data. Under this setting, we surprisingly observed that the optimal experimental design problem can be reduced to a so-called phase synchronization problem. We solve this problem via a normalized variant of the generalized power method with spectral initialization. On the theoretical side, we establish the first global optimality guarantee for experiment design when pre-treatment data is sampled from certain data-generating processes. Empirically, we conduct extensive experiments to demonstrate the effectiveness of our method on both the US Bureau of Labor Statistics and the Abadie-Diamond-Hainmueller California Smoking Data. In terms of the root mean square error, our algorithm surpasses the random design by a large margin.
    Lightning Fast Video Anomaly Detection via Adversarial Knowledge Distillation. (arXiv:2211.15597v1 [cs.CV])
    We propose a very fast frame-level model for anomaly detection in video, which learns to detect anomalies by distilling knowledge from multiple highly accurate object-level teacher models. To improve the fidelity of our student, we distill the low-resolution anomaly maps of the teachers by jointly applying standard and adversarial distillation, introducing an adversarial discriminator for each teacher to distinguish between target and generated anomaly maps. We conduct experiments on three benchmarks (Avenue, ShanghaiTech, UCSD Ped2), showing that our method is over 7 times faster than the fastest competing method, and between 28 and 62 times faster than object-centric models, while obtaining comparable results to recent methods. Our evaluation also indicates that our model achieves the best trade-off between speed and accuracy, due to its previously unheard-of speed of 1480 FPS. In addition, we carry out a comprehensive ablation study to justify our architectural design choices.
    Heterogeneous Treatment Effect Bounds under Sample Selection with an Application to the Effects of Social Media on Political Polarization. (arXiv:2209.04329v2 [econ.EM] UPDATED)
    We propose a method for estimation and inference for bounds for heterogeneous causal effect parameters in general sample selection models where the treatment can affect whether an outcome is observed and no exclusion restrictions are available. The method provides conditional effect bounds as functions of policy relevant pre-treatment variables. It allows for conducting valid statistical inference on the unidentified conditional effects. We use a flexible debiased/double machine learning approach that can accommodate non-linear functional forms and high-dimensional confounders. Easily verifiable high-level conditions for estimation and misspecification robust inference guarantees are provided as well. Re-analyzing data from a large scale field experiment on Facebook, we find significant depolarization effects of counter-attitudinal news subscription nudges. The effect bounds are highly heterogeneous and suggest strong depolarization effects for moderates, conservatives, and younger users.
    Meta-analysis of individualized treatment rules via sign-coherency. (arXiv:2211.15476v1 [stat.ML])
    Medical treatments tailored to a patient's baseline characteristics hold the potential of improving patient outcomes while reducing negative side effects. Learning individualized treatment rules (ITRs) often requires aggregation of multiple datasets (sites); however, current ITR methodology does not take between-site heterogeneity into account, which can hurt model generalizability when deploying back to each site. To address this problem, we develop a method for individual-level meta-analysis of ITRs, which jointly learns site-specific ITRs while borrowing information about feature sign-coherency via a scientifically-motivated directionality principle. We also develop an adaptive procedure for model tuning, using information criteria tailored to the ITR learning problem. We study the proposed methods through numerical experiments to understand their performance under different levels of between-site heterogeneity and apply the methodology to estimate ITRs in a large multi-center database of electronic health records. This work extends several popular methodologies for estimating ITRs (A-learning, weighted learning) to the multiple-sites setting.
    Traditional Classification Neural Networks are Good Generators: They are Competitive with DDPMs and GANs. (arXiv:2211.14794v1 [cs.CV])
    Classifiers and generators have long been separated. We break down this separation and showcase that conventional neural network classifiers can generate high-quality images of a large number of categories, being comparable to the state-of-the-art generative models (e.g., DDPMs and GANs). We achieve this by computing the partial derivative of the classification loss function with respect to the input to optimize the input to produce an image. Since it is widely known that directly optimizing the inputs is similar to targeted adversarial attacks incapable of generating human-meaningful images, we propose a mask-based stochastic reconstruction module to make the gradients semantic-aware to synthesize plausible images. We further propose a progressive-resolution technique to guarantee fidelity, which produces photorealistic images. Furthermore, we introduce a distance metric loss and a non-trivial distribution loss to ensure classification neural networks can synthesize diverse and high-fidelity images. Using traditional neural network classifiers, we can generate good-quality images of 256$\times$256 resolution on ImageNet. Intriguingly, our method is also applicable to text-to-image generation by regarding image-text foundation models as generalized classifiers. Proving that classifiers have learned the data distribution and are ready for image generation has far-reaching implications, for classifiers are much easier to train than generative models like DDPMs and GANs. We don't even need to train classification models because tons of public ones are available for download. Also, this holds great potential for the interpretability and robustness of classifiers.
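    A minimal sketch of the core step, optimizing the input by differentiating the classification loss (the paper's mask-based stochastic reconstruction and progressive-resolution modules, which make the result look natural, are omitted; the torchvision weights string assumes a recent torchvision):

        import torch
        import torch.nn.functional as F
        from torchvision.models import resnet50

        model = resnet50(weights="IMAGENET1K_V2").eval()
        x = torch.randn(1, 3, 224, 224, requires_grad=True)   # start from noise
        target = torch.tensor([207])                          # golden retriever
        opt = torch.optim.Adam([x], lr=0.05)

        for _ in range(200):
            loss = F.cross_entropy(model(x), target)
            opt.zero_grad(); loss.backward(); opt.step()
        # Without the semantic-aware gradient machinery, x typically ends up
        # looking like an adversarial pattern rather than a natural image.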
    Multivariate rank via entropic optimal transport: sample efficiency and generative modeling. (arXiv:2111.00043v3 [stat.ML] UPDATED)
    The framework of optimal transport has been leveraged to extend the notion of rank to the multivariate setting while preserving desirable properties of the resulting goodness-of-fit (GoF) statistics. In particular, the rank energy (RE) and rank maximum mean discrepancy (RMMD) are distribution-free under the null, exhibit high power in statistical testing, and are robust to outliers. In this paper, we point to and alleviate some of the practical shortcomings of these proposed GoF statistics, namely their high computational cost, high statistical sample complexity, and lack of differentiability with respect to the data. We show that all these practically important issues are addressed by considering entropy-regularized optimal transport maps in place of the rank map, which we refer to as the soft rank. We consequently propose two new statistics, the soft rank energy (sRE) and soft rank maximum mean discrepancy (sRMMD), which exhibit several desirable properties. Given $n$ sample data points, we provide non-asymptotic convergence rates for the sample estimate of the entropic transport map to its population version that are essentially of the order $n^{-1/2}$ when the starting measure is subgaussian and the target measure has compact support. This result is novel compared to existing results which achieve a rate of $n^{-1}$ but crucially rely on both measures having compact support. We leverage this result to demonstrate fast convergence of sample sRE and sRMMD to their population version making them useful for high-dimensional GoF testing. Our statistics are differentiable and amenable to popular machine learning frameworks that rely on gradient methods. We leverage these properties towards showcasing the utility of the proposed statistics for generative modeling on two important problems: image generation and generating valid knockoffs for controlled feature selection.
    Hessian Averaging in Stochastic Newton Methods Achieves Superlinear Convergence. (arXiv:2204.09266v2 [math.OC] UPDATED)
    We consider minimizing a smooth and strongly convex objective function using a stochastic Newton method. At each iteration, the algorithm is given an oracle access to a stochastic estimate of the Hessian matrix. The oracle model includes popular algorithms such as Subsampled Newton and Newton Sketch. Despite using second-order information, these existing methods do not exhibit superlinear convergence, unless the stochastic noise is gradually reduced to zero during the iteration, which would lead to a computational blow-up in the per-iteration cost. We propose to address this limitation with Hessian averaging: instead of using the most recent Hessian estimate, our algorithm maintains an average of all the past estimates. This reduces the stochastic noise while avoiding the computational blow-up. We show that this scheme exhibits local $Q$-superlinear convergence with a non-asymptotic rate of $(\Upsilon\sqrt{\log (t)/t}\,)^{t}$, where $\Upsilon$ is proportional to the level of stochastic noise in the Hessian oracle. A potential drawback of this (uniform averaging) approach is that the averaged estimates contain Hessian information from the global phase of the method, i.e., before the iterates converge to a local neighborhood. This leads to a distortion that may substantially delay the superlinear convergence until long after the local neighborhood is reached. To address this drawback, we study a number of weighted averaging schemes that assign larger weights to recent Hessians, so that the superlinear convergence arises sooner, albeit with a slightly slower rate. Remarkably, we show that there exists a universal weighted averaging scheme that transitions to local convergence at an optimal stage, and still exhibits a superlinear convergence rate nearly (up to a logarithmic factor) matching that of uniform Hessian averaging.
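    A minimal NumPy sketch of uniform Hessian averaging inside a stochastic Newton iteration, on a toy quadratic with a noisy Hessian oracle:

        import numpy as np

        def averaged_newton(grad, hess_sample, x0, steps=100):
            # Keep a running average of all past stochastic Hessian estimates
            # and use it in place of the most recent one.
            x, H_bar = x0.copy(), None
            for t in range(1, steps + 1):
                H = hess_sample(x)
                H_bar = H if H_bar is None else H_bar + (H - H_bar) / t
                x = x - np.linalg.solve(H_bar, grad(x))
            return x

        rng = np.random.default_rng(0)
        A = np.diag(np.linspace(1.0, 5.0, 10))    # strongly convex quadratic

        def hess_sample(x):
            N = rng.standard_normal((10, 10))
            return A + 0.1 * (N + N.T) / 2        # symmetric noisy Hessian

        x_star = averaged_newton(lambda x: A @ x, hess_sample, np.ones(10))
        print(np.linalg.norm(x_star))             # should be near 0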
    Causal Deep Reinforcement Learning using Observational Data. (arXiv:2211.15355v1 [cs.LG])
    Deep reinforcement learning (DRL) requires the collection of large amounts of interventional data, which is sometimes expensive and even unethical in the real world, such as in autonomous driving and medicine. Offline reinforcement learning promises to alleviate this issue by exploiting the vast amount of observational data available in the real world. However, observational data may mislead the learning agent to undesirable outcomes if the behavior policy that generates the data depends on unobserved random variables (i.e., confounders). In this paper, we propose two deconfounding methods in DRL to address this problem. The methods first calculate the importance degree of different samples based on causal inference techniques, and then adjust the impact of different samples on the loss function by reweighting or resampling the offline dataset to ensure its unbiasedness. These deconfounding methods can be flexibly combined with existing model-free DRL algorithms such as soft actor-critic and deep Q-learning, provided that a weak condition is satisfied by the loss functions of these algorithms. We prove the effectiveness of our deconfounding methods and validate them experimentally.
    Boundary Graph Neural Networks for 3D Simulations. (arXiv:2106.11299v4 [cs.LG] UPDATED)
    The abundance of data has given machine learning considerable momentum in natural sciences and engineering, though modeling of physical processes is often difficult. A particularly tough problem is the efficient representation of geometric boundaries. Triangularized geometric boundaries are well understood and ubiquitous in engineering applications. However, it is notoriously difficult to integrate them into machine learning approaches due to their heterogeneity with respect to size and orientation. In this work, we introduce an effective theory to model particle-boundary interactions, which leads to our new Boundary Graph Neural Networks (BGNNs) that dynamically modify graph structures to obey boundary conditions. The new BGNNs are tested on complex 3D granular flow processes of hoppers, rotating drums and mixers, which are all standard components of modern industrial machinery but still have complicated geometry. BGNNs are evaluated in terms of computational efficiency as well as prediction accuracy of particle flows and mixing entropies. BGNNs are able to accurately reproduce 3D granular flows within simulation uncertainties over hundreds of thousands of simulation timesteps. Most notably, in our experiments, particles stay within the geometric objects without using handcrafted conditions or restrictions.
    Differentiable Meta Multigraph Search with Partial Message Propagation on Heterogeneous Information Networks. (arXiv:2211.14752v1 [cs.LG])
    Heterogeneous information networks (HINs) are widely employed for describing real-world data with intricate entities and relationships. To automatically utilize their semantic information, graph neural architecture search has recently been developed for various tasks on HINs. Existing works, however, suffer from instability and inflexibility. To address these issues, we propose a novel method called Partial Message Meta Multigraph search (PMMM) to automatically optimize the neural architecture design on HINs. Specifically, to learn how graph neural networks (GNNs) propagate messages along various types of edges, PMMM adopts an efficient differentiable framework to search for a meaningful meta multigraph, which can capture more flexible and complex semantic relations than a meta graph. Since differentiable search typically suffers from performance instability, we further propose a stable algorithm called partial message search to ensure that the searched meta multigraph consistently surpasses the manually designed meta-structures, i.e., meta-paths. Extensive experiments on six benchmark datasets over two representative tasks, including node classification and recommendation, demonstrate the effectiveness of the proposed method. Our approach outperforms the state-of-the-art heterogeneous GNNs, finds meaningful meta multigraphs, and is significantly more stable.
    Optimal Approximation Rates for Deep ReLU Neural Networks on Sobolev Spaces. (arXiv:2211.14400v1 [stat.ML])
    We study the problem of how efficiently, in terms of the number of parameters, deep neural networks with the ReLU activation function can approximate functions in the Sobolev space $W^s(L_q(\Omega))$ on a bounded domain $\Omega$, where the error is measured in $L_p(\Omega)$. This problem is important for studying the application of neural networks in scientific computing and has previously been solved only in the case $p=q=\infty$. Our contribution is to provide a solution for all $1\leq p,q\leq \infty$ and $s > 0$. Our results show that deep ReLU networks significantly outperform classical methods of approximation, but that this comes at the cost of parameters which are not encodable.  ( 2 min )
    Distribution Free Prediction Sets for Node Classification. (arXiv:2211.14555v1 [stat.ML])
    Graph Neural Networks (GNNs) are able to achieve high classification accuracy on many large real world datasets, but provide no rigorous notion of predictive uncertainty. We leverage recent advances in conformal prediction to construct prediction sets for node classification in inductive learning scenarios, and verify the efficacy of our approach across standard benchmark datasets using popular GNN models. The code is available at https://github.com/jase-clarkson/graph_cp.  ( 2 min )
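    A minimal sketch of the split-conformal construction for classification that such methods build on (the GNN-specific score in the paper may differ; the softmax outputs and labels here are placeholders):

        import numpy as np

        def conformal_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
            # Calibrate a threshold on held-out nodes, then include every
            # class whose nonconformity score clears it.
            n = len(cal_labels)
            scores = 1.0 - cal_probs[np.arange(n), cal_labels]
            q = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n,
                            method="higher")
            return [np.where(1.0 - p <= q)[0] for p in test_probs]

        cal_p = np.random.dirichlet(np.ones(5), size=500)   # placeholder outputs
        cal_y = np.random.randint(0, 5, 500)                # placeholder labels
        sets = conformal_sets(cal_p, cal_y, np.random.dirichlet(np.ones(5), size=10))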
    On the Robustness of Median Sampling in Noisy Evolutionary Optimization. (arXiv:1907.13100v2 [cs.NE] UPDATED)
    Evolutionary algorithms (EAs) are a class of nature-inspired metaheuristics with wide applications in various practical optimization problems. In these problems, objective evaluations are usually inaccurate, because noise is almost inevitable in the real world, and it is a crucial issue to weaken the negative effect caused by noise. Sampling is a popular strategy: it evaluates the objective several times and employs the mean of these evaluation results as an estimate of the objective value. In this work, we introduce a novel sampling method, median sampling, into EAs, and illustrate its properties and usefulness theoretically by solving OneMax, the problem of maximizing the number of 1s in a bit string. Instead of the mean, median sampling employs the median of the evaluation results as an estimate. Through rigorous theoretical analysis on OneMax under the commonly used one-bit noise, we show that median sampling reduces the expected runtime exponentially. Next, through two special noise models, we show that when the 2-quantile of the noisy fitness increases with the true fitness, median sampling can be better than mean sampling; otherwise, it may fail and mean sampling can be better. The results may guide us to employ median sampling properly in practical applications.  ( 2 min )
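    A minimal sketch of a (1+1)-EA on OneMax under one-bit noise, using the median of k noisy evaluations as the fitness estimate (the noise level and k are illustrative):

        import numpy as np

        def noisy_onemax(x, rng, p=0.3):
            # One-bit noise: with probability p, evaluate with one bit flipped.
            if rng.random() < p:
                x = x.copy()
                x[rng.integers(len(x))] ^= 1
            return x.sum()

        def one_plus_one_ea(n=50, k=5, steps=5000, seed=0):
            rng = np.random.default_rng(seed)
            f = lambda x: np.median([noisy_onemax(x, rng) for _ in range(k)])
            x = rng.integers(0, 2, n)
            for _ in range(steps):
                y = np.where(rng.random(n) < 1 / n, 1 - x, x)  # standard mutation
                if f(y) >= f(x):
                    x = y
            return x.sum()

        print(one_plus_one_ea())   # close to n when median sampling helps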
    Interval-censored Hawkes processes. (arXiv:2104.07932v4 [cs.LG] UPDATED)
    Interval-censored data solely records the aggregated counts of events during specific time intervals - such as the number of patients admitted to the hospital or the volume of vehicles passing traffic loop detectors - and not the exact occurrence time of the events. It is currently not understood how to fit the Hawkes point processes to this kind of data. Its typical loss function (the point process log-likelihood) cannot be computed without exact event times. Furthermore, it does not have the independent increments property to use the Poisson likelihood. This work builds a novel point process, a set of tools, and approximations for fitting Hawkes processes within interval-censored data scenarios. First, we define the Mean Behavior Poisson process (MBPP), a novel Poisson process with a direct parameter correspondence to the popular self-exciting Hawkes process. We fit MBPP in the interval-censored setting using an interval-censored Poisson log-likelihood (IC-LL). We use the parameter equivalence to uncover the parameters of the associated Hawkes process. Second, we introduce two novel exogenous functions to distinguish the exogenous from the endogenous events. We propose the multi-impulse exogenous function - for when the exogenous events are observed as event time - and the latent homogeneous Poisson process exogenous function - for when the exogenous events are presented as interval-censored volumes. Third, we provide several approximation methods to estimate the intensity and compensator function of MBPP when no analytical solution exists. Fourth and finally, we connect the interval-censored loss of MBPP to a broader class of Bregman divergence-based functions. Using the connection, we show that the popularity estimation algorithm Hawkes Intensity Process (HIP) is a particular case of the MBPP. We verify our models through empirical testing on synthetic data and real-world data.  ( 3 min )
    Linear Classification of Neural Manifolds with Correlated Variability. (arXiv:2211.14961v1 [q-bio.NC])
    Understanding how the statistical and geometric properties of neural activations relate to network performance is a key problem in theoretical neuroscience and deep learning. In this letter, we calculate how correlations between object representations affect the capacity, a measure of linear separability. We show that for spherical object manifolds, introducing correlations between centroids effectively pushes the spheres closer together, while introducing correlations between the spheres' axes effectively shrinks their radii, revealing a duality between neural correlations and geometry. We then show that our results can be used to accurately estimate the capacity with real neural data.  ( 2 min )
    Transfer learning with high-dimensional quantile regression. (arXiv:2211.14578v1 [stat.ML])
    Transfer learning has become an essential technique for exploiting information from the source domain to boost performance on the target task. Despite its prevalence in high-dimensional data, heterogeneity and/or heavy tails tend to be discounted in current transfer learning approaches and may thus undermine the resulting performance. We propose a transfer learning procedure in the framework of high-dimensional quantile regression models to accommodate heterogeneity and heavy tails in the source and target domains. We establish error bounds for the transfer learning estimator based on delicately selected transferable source domains, showing that lower error bounds can be achieved for critical selection criteria and larger sample sizes of source tasks. We further propose valid confidence interval and hypothesis test procedures for individual components of the quantile regression coefficients, advocating a one-step debiased estimator of the transfer learning estimator, where consistent variance estimation is again achieved via transfer learning. Simulation results demonstrate that the proposed method exhibits favorable performance.  ( 2 min )
    On the Sample Complexity of Representation Learning in Multi-task Bandits with Global and Local structure. (arXiv:2211.15129v1 [stat.ML])
    We investigate the sample complexity of learning the optimal arm for multi-task bandit problems. Arms consist of two components: one that is shared across tasks (that we call representation) and one that is task-specific (that we call predictor). The objective is to learn the optimal (representation, predictor)-pair for each task, under the assumption that the optimal representation is common to all tasks. Within this framework, efficient learning algorithms should transfer knowledge across tasks. We consider the best-arm identification problem for a fixed confidence, where, in each round, the learner actively selects both a task, and an arm, and observes the corresponding reward. We derive instance-specific sample complexity lower bounds satisfied by any $(\delta_G,\delta_H)$-PAC algorithm (such an algorithm identifies the best representation with probability at least $1-\delta_G$, and the best predictor for a task with probability at least $1-\delta_H$). We devise an algorithm OSRL-SC whose sample complexity approaches the lower bound, and scales at most as $H(G\log(1/\delta_G)+ X\log(1/\delta_H))$, with $X,G,H$ being, respectively, the number of tasks, representations and predictors. By comparison, this scaling is significantly better than the classical best-arm identification algorithm that scales as $HGX\log(1/\delta)$.  ( 2 min )
    Beyond Invariance: Test-Time Label-Shift Adaptation for Distributions with "Spurious" Correlations. (arXiv:2211.15646v1 [stat.ML])
    Spurious correlations, or correlations that change across domains where a model can be deployed, present significant challenges to real-world applications of machine learning models. However, such correlations are not always "spurious"; often, they provide valuable prior information for a prediction beyond what can be extracted from the input alone. Here, we present a test-time adaptation method that exploits the spurious correlation phenomenon, in contrast to recent approaches that attempt to eliminate spurious correlations through invariance. We consider situations where the prior distribution $p(y, z)$, which models the marginal dependence between the class label $y$ and the nuisance factors $z$, may change across domains, but the generative model for features $p(\mathbf{x}|y, z)$ is constant. We note that this is an expanded version of the label shift assumption, where the labels now also include the nuisance factors $z$. Based on this observation, we train a classifier to predict $p(y, z|\mathbf{x})$ on the source distribution, and implement a test-time label shift correction that adapts to changes in the marginal distribution $p(y, z)$ using unlabeled samples from the target domain. We call our method "Test-Time Label-Shift Adaptation" or TTLSA. We apply our method to two different image datasets -- the CheXpert chest X-ray dataset and the colored MNIST dataset -- and show that it gives better downstream results than methods that try to train classifiers which are invariant to the changes in prior distribution. Code reproducing experiments is available at https://github.com/nalzok/test-time-label-shift .  ( 2 min )
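    A minimal sketch of a test-time prior correction in the expanded label space (y, z), using the classic EM-style label-shift reweighting as a stand-in (the paper's exact adaptation procedure may differ):

        import numpy as np

        def em_label_shift(posteriors, source_prior, iters=100):
            # posteriors: (n, K) source-classifier outputs p_s(y,z | x) on
            # unlabeled target inputs; K indexes (y, z) pairs jointly.
            prior = source_prior.copy()
            for _ in range(iters):
                p = posteriors * (prior / source_prior)   # reweight posteriors
                p /= p.sum(axis=1, keepdims=True)         # adapted p_t(y,z | x)
                prior = p.mean(axis=0)                    # new target prior
            return prior, p

        K = 4                                              # e.g. 2 labels x 2 nuisances
        post = np.random.dirichlet(np.ones(K), size=1000)  # placeholder outputs
        prior_t, adapted = em_label_shift(post, np.full(K, 1 / K))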
    Domain Generalization for Robust Model-Based Offline Reinforcement Learning. (arXiv:2211.14827v1 [cs.LG])
    Existing offline reinforcement learning (RL) algorithms typically assume that training data is either: 1) generated by a known policy, or 2) of entirely unknown origin. We consider multi-demonstrator offline RL, a middle ground where we know which demonstrators generated each dataset, but make no assumptions about the underlying policies of the demonstrators. This is the most natural setting when collecting data from multiple human operators, yet remains unexplored. Since different demonstrators induce different data distributions, we show that this can be naturally framed as a domain generalization problem, with each demonstrator corresponding to a different domain. Specifically, we propose Domain-Invariant Model-based Offline RL (DIMORL), where we apply Risk Extrapolation (REx) (Krueger et al., 2020) to the process of learning dynamics and rewards models. Our results show that models trained with REx exhibit improved domain generalization performance when compared with the natural baseline of pooling all demonstrators' data. We observe that the resulting models frequently enable the learning of superior policies in the offline model-based RL setting, can improve the stability of the policy learning process, and potentially enable increased exploration.  ( 2 min )
    Online Kernel CUSUM for Change-Point Detection. (arXiv:2211.15070v1 [stat.ME])
    We develop an online kernel Cumulative Sum (CUSUM) procedure, which consists of a parallel set of kernel statistics with different window sizes to account for the unknown change-point location. Compared with many existing sliding window-based kernel change-point detection procedures, which correspond to the Shewhart chart-type procedure, the proposed procedure is more sensitive to small changes. We further present a recursive computation of detection statistics, which is crucial for online procedures to achieve a constant computational and memory complexity, such that we do not need to calculate and remember the entire Gram matrix, which can be a computational bottleneck otherwise. We obtain precise analytic approximations of the two fundamental performance metrics, the Average Run Length (ARL) and Expected Detection Delay (EDD). Furthermore, we establish the optimal window size on the order of $\log ({\rm ARL})$ such that there is nearly no power loss compared with an oracle procedure, which is analogous to the classic result for window-limited Generalized Likelihood Ratio (GLR) procedure. We present extensive numerical experiments to validate our theoretical results and the competitive performance of the proposed method.  ( 2 min )
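    A simplified sketch of the parallel-window idea, without the recursive updates that make the real procedure constant-memory: for several window sizes, compare the most recent window against reference data with a kernel statistic and alarm on the maximum (the scaling and threshold here are illustrative):

        import numpy as np

        def rbf(a, b, gamma=1.0):
            return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)

        def kernel_cusum_scan(stream, reference, windows=(10, 20, 40), thresh=2.0):
            for t in range(max(windows), len(stream)):
                stats = []
                for w in windows:
                    x = stream[t - w:t]
                    s = (rbf(x, x).mean() - 2 * rbf(x, reference).mean()
                         + rbf(reference, reference).mean())
                    stats.append(w * s)          # scale favours larger windows
                if max(stats) > thresh:
                    return t                     # alarm time
            return None

        ref = np.random.randn(200)
        stream = np.concatenate([np.random.randn(100), np.random.randn(100) + 1.5])
        print("alarm at t =", kernel_cusum_scan(stream, ref))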

  • Open

    DSC Weekly 29 Nov 2022 – Winter is Coming
    I live near the top of one of the foothills of the Cascade Mountains in the Puget Sound, my first year of living here after spending the last decade in a valley near sea level, only a few miles away. Warily, we're watching the heavy, sullen clouds move in even as a cold weather system presses in from the north, a guaranteed recipe for snow. Winter is coming.  ( 21 min )
    Top 10 Blockchain Development Companies in India 2023
    The demand for cutting-edge, custom, and futuristic blockchain applications is expanding in India, and the number of blockchain development service providers has grown accordingly.  ( 21 min )
    How to Save Money on Freight Rates Shipping
    As shipping costs have increased in recent years, it is essential to understand how to save money on freight rates. There are various ways you can reduce your costs while still providing quality customer service.  ( 20 min )
    How to Check all the Existing SQL Constraints on a Table?
    In this article, we will learn about the constraints in SQL and how to check all the existing SQL Constraints in the table along with examples to understand the concept much better. The post How to Check all the Existing SQL Constraints on a Table? appeared first on Data Science Central.  ( 22 min )
    Cicero from Meta may foreshadow hybrid AI future architectures
    Last week, Meta made a game-changing announcement called Cicero that points to a possible new future for AI. The post Cicero from Meta may foreshadow hybrid AI future architectures appeared first on Data Science Central.  ( 19 min )
    Accounting Trends To Expect In 2023: Cloud Computing, Automation, And More
    Like many other industries, accounting depends on technological advancements to maintain a competitive edge. The efficiency and precision of accounting and associated duties have greatly benefited from the advent of digital technology and the widespread availability of specialist software. Consequently, accountants will have more time to devote to strategic planning and new product development, which… The post Accounting Trends To Expect In 2023: Cloud Computing, Automation, And More appeared first on Data Science Central.  ( 20 min )
    Cloud Computing Services, Features and Benefits
    Cloud computing services have grown in popularity significantly over the years. Many sectors are moving to cloud computing services for business operations. Cloud computing enables businesses to store, manage, and process essential data using remote servers hosted on the internet. The post Cloud Computing Services, Features and Benefits appeared first on Data Science Central.  ( 19 min )
    5 Tech Solutions to Lessen the Impact of Physician Burnout
    It’s no surprise that the Coronavirus global health emergency pushed physicians and healthcare workers to a breaking point. Physician burnout isn’t a new phenomenon; it was a problem long before 2020, but the COVID-19 pandemic has made it significantly worse. It created new challenges for healthcare providers, like remote patient care, which never existed… The post 5 Tech Solutions to Lessen the Impact of Physician Burnout appeared first on Data Science Central.  ( 22 min )
  • Open

    AI Dream 122 - 94sec Stuck in Maze of Time TEASER
    submitted by /u/LordPewPew777 [link] [comments]  ( 45 min )
    What will Gpt-4 mean for developers?
    I know this post has been done before, but I'm looking for fresh opinions since everything seems to be changing so fast. I'm a mid-level developer and I can't help but feel that GPT-4 will be my doom. Am I crazy? submitted by /u/SylviaSelva [link] [comments]  ( 48 min )
    Recommendations for CV makers
    What are some of the best free/paid CV or resume generators on the market? submitted by /u/jav4script [link] [comments]  ( 47 min )
    Multivariate Normal Distribution Explained
    submitted by /u/Personal-Trainer-541 [link] [comments]  ( 48 min )
    Amazing Depth Map To Image Extension In Stable Diffusion!
    submitted by /u/PuppetHere [link] [comments]  ( 76 min )
    How to Understand the Pros and Cons of AI Writing in 7 Steps
    Are you interested in understanding the pros and cons of AI writing? Artificial Intelligence (AI) has become a powerful tool in the world of writing, offering numerous advantages and disadvantages. In this blog post, we will explore the various advantages and disadvantages of AI writing and help you understand how you can use it in your writing projects. Step 1: What is AI Writing? AI writing is the use of artificial intelligence to produce written content. AI writing can be used to generate content quickly and efficiently, with minimal effort on the part of the writer. AI writing can also be used to detect errors in written content and suggest corrections. Step 2: Advantages of AI Writing One of the main advantages of AI writing is its ability to produce high-quality content quickly a…  ( 50 min )
    New Machine Learning HD Video Transformer AI | New Neuralink Brain Computer Interface Rival Uses Photonics To Transmit Information Through The Retina | New AI Invents Millions of New Materials
    submitted by /u/kenickh [link] [comments]  ( 45 min )
    What are the best chatbots available (end of 2022)
    I'm very impressed with OpenAI's Playground chatbot (it uses GPT-3). Do you know if there are any other similarly good chatbots available right now? I suppose that Google must have a great one, but I can't find a place where you can test it. submitted by /u/Best_Strawberry_2255 [link] [comments]  ( 45 min )
    What is the difference between machine learning and deep learning?
    submitted by /u/sergiCrack9 [link] [comments]  ( 48 min )
    Philosopher discusses epistemology, ethics, and human AI friendship w GPT-3 over the course of a year.
    submitted by /u/picardstrikesback [link] [comments]  ( 52 min )
    OpenAI's latest GPT-3 model generates better and longer texts
    submitted by /u/Zirius_Sadfaces [link] [comments]  ( 44 min )
    Looking for an AI Text Summarizer in German
    Hey guys, I stumbled upon AI Notion a couple of days ago. I am already on the waitlist, but that will probably take some months. Wondering if anyone can recommend some other AI that can help me summarize text in German? I have seen openai.com, but that seems to have some flaws with German text. Thank you :) submitted by /u/Dabudda93 [link] [comments]  ( 46 min )
    Looking for an AI to write a first draft of a novel
    I've been reading descriptions of AI Writing programs for the last few days, and they all seem to focus on generating ideas and plotlines. That isn't what I need. I have more ideas and plotlines than I'll ever be able to use. But 25 years ago I had a little brain episode that left me unable to write stories. I can write other stuff. I can blog, for example. But when I try to write fiction everything freezes up on me. So what I'm looking for is a program where I can basically type in an outline and then have it give me a first draft. Does anything like that exist yet? submitted by /u/ZeitgeistClown [link] [comments]  ( 47 min )
    MIT researchers have developed a technique for enabling artificial intelligence agents to think much farther into the future, which can improve the long-term performance of cooperative or competitive AI agents.
    submitted by /u/qptbook [link] [comments]  ( 45 min )
    What's New in GPT-3? Unpacking the Latest Update
    OpenAI released an update to their davinci model in GPT-3 (the text-davinci-003 model is the new one). I wanted to have some fun with testing it out, so I gave it this prompt: Describe what qualities would make post titles perform well and perform poorly on the /r/artificial subreddit. Give an example of a good post title and bad post title for a post about the latest update to GPT-3. response: Good Post Title: "What's New in GPT-3? Unpacking the Latest Update" Bad Post Title: "Look What I Found! The Latest GPT-3 Version Out Now" Good post titles on /r/artificial have some combination of the following qualities: they are descriptive and to the point, they demonstrate knowledge of the technology discussed, they are written in a conversational style, they are explanatory rather than promotional, they use relevant keywords, and they provide something of value to the readers. Poor post titles on /r/artificial may contain clickbait language, appear to be written for the purpose of promotion, lack punchy words, be too vague, or include offensive language. submitted by /u/lnfinity [link] [comments]  ( 46 min )
    made some symmetric eyes
    submitted by /u/MoistOutrage4453 [link] [comments]  ( 45 min )
    How Is Artificial Intelligence Changing Art History?
    submitted by /u/rosemarycrackle [link] [comments]  ( 47 min )
    Robert Quinn - CPU Breakthroughs & Semiconductor Trends
    submitted by /u/timothy-ventura [link] [comments]  ( 50 min )
    Machine-Learning Model Reveals Protein-Folding Physics
    submitted by /u/codingai [link] [comments]  ( 47 min )
    Opinions about Artificial Intelligence: A modern approach book
    What do you think about the book “Artificial Intelligence: A modern approach”? I’m considering buying this book (4th edition) to learn and improve my knowledge about the foundations of AI and make sure I understand the concepts of Convolutional Neural Networks, Genetic Algorithms, and Neural Networks in general. Because sometimes I wonder if I fully understand the math and logic behind those fields. submitted by /u/linear_xp [link] [comments]  ( 50 min )
  • Open

    [D] Building pytorch with dynamically linked CUDA?
    I've been trying to build PyTorch from source with dynamically linked CUDA in order to save ~6 GB in my Dockerfile. The build itself is not too difficult; the issue is that I want my build options to match the official release build options almost exactly, so I'm not missing out on some hidden speedup. From looking around: https://discuss.pytorch.org/t/what-is-the-official-release-build-options/43317, it seems like people don't know the official build options for PyTorch releases. I was wondering if anyone here knows of an "official" / "endorsed" Dockerfile that will build dynamically-linked PyTorch with all the important optimizations enabled. submitted by /u/vanilla-acc [link] [comments]  ( 64 min )
    [P] torchegranate: a PyTorch rewrite of the pomegranate library for probabilistic modeling
    Hello all! A while ago, I used to advertise a library for probabilistic modeling called pomegranate that I was writing. Now, I'm here to advertise torchegranate, which is a temporary repository for a pomegranate rewrite using PyTorch as the backend. The results are fantastic: huge speed improvements for individual probability distributions, as well as for mixture models and hidden Markov models. There were three goals for the rewrite: (1) speed, particularly making use of GPU-based calculations, (2) community contribution, because PyTorch is way easier to understand and write in than Cython, and (3) interoperability, allowing the probabilistic models in pomegranate to seamlessly integrate with deep learning models implemented in PyTorch as loss functions or internal components. I've redesigned the API a bit to be less cumbersome and to match scikit-learn at key places. I'm looking for user feedback to help guide the project going forward, so please give it a whirl with pip install torchegranate. Check out the GitHub repo: https://github.com/jmschrei/torchegranate Check out this release thread: https://twitter.com/jmschreiber91/status/1597653345623474176?s=20&t=d_2C1YsSEbVcgUhUFCoAeQ Thanks!!! submitted by /u/ants_rock [link] [comments]  ( 67 min )
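    A usage sketch based on the post's description; the module paths and class names below are assumptions inferred from the announcement and may not match the actual package:

    ```python
    # Minimal usage sketch; `torchegranate.distributions.Normal` and
    # `torchegranate.gmm.GeneralMixtureModel` are assumed names, check the repo.
    import torch
    from torchegranate.distributions import Normal
    from torchegranate.gmm import GeneralMixtureModel

    X = torch.randn(1000, 5)

    # Fit a two-component Gaussian mixture with a scikit-learn-like API,
    # which the author says the rewrite now follows at key places.
    model = GeneralMixtureModel([Normal(), Normal()]).fit(X)
    labels = model.predict(X)   # hard cluster assignments
    ```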
    [D] Looking for papers on bitext word alignment
    Information on this task seems a little sparse. I have found this: https://arxiv.org/pdf/2101.08231.pdf But I'm also looking around for different techniques, whether statistical or neural. I tried looking around on the NLP Progress GitHub page, but couldn't find anything. The paper I linked uses mBERT and has achieved high accuracy with zero-shot performance from the supported languages. However, I'm looking for something that is specific to two languages, like aligning English-Spanish, and hopefully something more lightweight than mBERT. Any resources would be great. Thanks for your help. submitted by /u/itsyourboiirow [link] [comments]  ( 69 min )
    [R][P] An arxiv-sanity-like view of NeurIPS 2022 papers
    I like to browse conference proceedings similar to what's done in arxiv-sanity: an image thumbnail of a paper accompanied by an abstract and other metadata. I have done similar overviews previously for NeurIPS 2021 and ICLR, and now ordered all NeurIPS 2022 papers (both from the main conference and the datasets track) based on average review scores in the form of a thumbnail, abstract and other metadata (such as the "tldr" section). The overview is available here: https://www.confviews.com/neurips2022/ The code is here: https://github.com/tanelp/confviews submitted by /u/tanelai [link] [comments]  ( 63 min )
    [D] Are problems with massive amount of input features feasible?
    Hello, I am trying to figure out a classification problem with a non-trivial quantity of input features. I am looking at binary classification of long videos (~a million frames), but right now I am stuck at barely 70,000 frames. Is there some trick to dealing with these types of problems? The only thing that comes to my mind at this point is to compress/decimate my frames to shrink the input features in a way that ML can still predict something from them. The other way would be to manually label a lot of frames one-by-one and construct some sort of meta algorithm, but I'd like to try something less labour-intensive first. submitted by /u/Vae94 [link] [comments]  ( 68 min )
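    One common trick matching the poster's compress/decimate idea is temporal subsampling plus a frozen pretrained feature extractor, so a downstream classifier sees short feature sequences instead of raw frames. A minimal sketch, where the backbone choice and stride are illustrative:

    ```python
    import torch
    from torchvision import models, transforms

    # Decimate the video temporally, then compress each kept frame into a small
    # feature vector with a frozen pretrained backbone; a lightweight classifier
    # can then operate on (num_kept_frames, 512) features instead of raw pixels.
    backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
    backbone.fc = torch.nn.Identity()   # strip the classification head
    backbone.eval()

    preprocess = transforms.Compose([
        transforms.Resize(224),
        transforms.CenterCrop(224),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])

    @torch.no_grad()
    def video_to_features(frames, stride=30):
        # frames: (T, 3, H, W) float tensor in [0, 1]; keep every `stride`-th frame.
        kept = frames[::stride]
        return backbone(preprocess(kept))   # (T // stride, 512)
    ```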
    [r] The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable - LessWrong
    https://www.lesswrong.com/posts/mkbGjzxD8d8XqKHzA/the-singular-value-decompositions-of-transformer-weight If we take the SVD of the weight matrices of the OV circuit and of MLP layers of GPT models, and project them to token embedding space, we notice this results in highly interpretable semantic clusters. This means that the network learns to align the principal directions of each MLP weight matrix or attention head to read from or write to semantically interpretable directions in the residual stream. We can use this to both improve our understanding of transformer language models and edit their representations. We use this finding to design both a natural language query locator, where you can write a set of natural language concepts and find all weight directions in the network which correspond to it, and also to edit the network's representations by deleting specific singular vectors, which results in relatively large effects on the logits related to the semantics of that vector and relatively small effects on semantically different clusters Looks like a thoughtful article and it has nice visuals. submitted by /u/visarga [link] [comments]  ( 66 min )
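    The projection trick is easy to try on a small model. Below is a hedged sketch with GPT-2 via HuggingFace, taking the SVD of one MLP output matrix and reading off the nearest tokens of a singular direction; the layer choice is arbitrary and the weight layout assumes HF's Conv1D convention:

    ```python
    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tok = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")

    with torch.no_grad():
        # MLP output matrix of an arbitrary middle layer; in HF's Conv1D
        # convention the weight is (d_mlp, d_model), rows indexing MLP neurons.
        W_out = model.transformer.h[6].mlp.c_proj.weight
        W_E = model.transformer.wte.weight    # (vocab, d_model), tied unembedding

        U, S, Vh = torch.linalg.svd(W_out, full_matrices=False)
        scores = Vh @ W_E.T                   # project singular directions to vocab

        # Tokens most aligned with the top singular direction.
        top = torch.topk(scores[0], k=10).indices
        print(tok.convert_ids_to_tokens(top.tolist()))
    ```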
    [R] Swin v2 Sequential self-attention computation
    Hello, I just finished reading the Swin v2 paper and there is one detail I didn't understand. They describe a technique to reduce memory consumption called sequential self-attention computation. Here is their explanation for a large model with a large resolution and window size: the self-attention module constitutes a bottleneck; to alleviate this problem, we implement self-attention computation sequentially, instead of using the previous batch computation approach. This optimization is applied to the layers in the first two stages and has little impact on the overall training speed. So what is the previous batch computation approach? And isn't the self-attention done sequentially anyway? Thanks 🙏 submitted by /u/Meddhouib10 [link] [comments]  ( 64 min )
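    One plausible reading of "batch vs. sequential": window attention is normally computed for all windows in one batched tensor, so peak memory holds every window's attention map at once; computing windows one at a time trades a loop for a much smaller peak. A sketch of the two variants, with shapes illustrative:

    ```python
    import torch

    def batched_window_attention(q, k, v):
        # q, k, v: (num_windows, tokens_per_window, dim); all windows at once.
        # Peak memory holds num_windows attention maps simultaneously.
        attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        return attn @ v

    def sequential_window_attention(q, k, v):
        # Same computation, one window at a time: peak memory holds a single
        # (tokens, tokens) attention map, at the cost of a Python-level loop.
        out = torch.empty_like(v)
        for i in range(q.shape[0]):
            attn = torch.softmax(
                q[i] @ k[i].transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
            out[i] = attn @ v[i]
        return out
    ```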
    [N] Towards Deep Learning for Relational Databases
    Generalizing deep learning architectures for natural integration with the principles and practice of relational databases. In this article, we go through the topic of deep relational learning with a concrete example on relational databases. submitted by /u/Lukas_Zahradnik [link] [comments]  ( 60 min )
    [R] On-Device Training Under 256KB Memory @ NeurIPS'22
    Historically, DNN training happens on the cloud due to the huge memory cost. Edge platforms used to only perform inference, making it difficult to adapt to new sensory data. Can we train on the edge to make a device continually improve its predictions? In this work, we enable on-device training under 256KB SRAM and 1MB Flash, using less than 1/1000 of the memory of PyTorch while matching the accuracy on the visual wake words application. It enables the model to adapt to newly collected sensor data, and users can enjoy customized services without uploading the data to the cloud, thus protecting privacy. Details below: Website: https://tinytraining.mit.edu/ Paper: https://arxiv.org/abs/2206.15472 Demo: https://youtu.be/XaDCO8YtmBw Code: https://github.com/mit-han-lab/tiny-training On-…  ( 69 min )
    [D] Very vague concept, but I feel like my idea might be cool. Someone hear me out :)
    Context: on Wikipedia they have a section labeled vital articles which includes 5 levels. (To give u an idea of what I am describing) Now level 1 includes these articles: The arts, Earth, Human, Human history, Life, Mathematics, Philosophy, Science, Society, Technology. As the levels progress the number of articles increases, so level 2 has 101 and includes sections for the articles like: History (9 articles), Geography (12), etc. So, here is the idea: we feed articles from level 1 to level 3 maybe, (idk how much they can process.) Can my A.I. be smart enough to draw conclusions from the info I've given it? Is this even possible to do? What would I use to make something like this? And where could I start? If you took the time to read all of this, (I love you), if u respond with help thanks in advance. Also, no idea what I want to accomplish, just wanted to see how smart we can make a 'blank slate' if we made it read wiki. submitted by /u/anonymousmoonkey [link] [comments]  ( 66 min )
  • Open

    Busy GPUs: Sampling and pipelining method speeds up deep learning on large graphs
    New technique significantly reduces training and inference time on extensive datasets to keep pace with fast-moving data in finance, social networks, and fraud detection in cryptocurrency.  ( 11 min )
    Breaking the scaling limits of analog computing
    New technique could diminish errors that hamper the performance of super-fast analog optical neural networks.  ( 10 min )
  • Open

    "Melting Pot 2.0", Agapiou et al 2022 {DM} (more enviroments + pretrained agents for multi-agent/population RL evaluation)
    submitted by /u/gwern [link] [comments]  ( 56 min )
    Ideas for Reinforcement Learning project( in robotics) ?
    Hi everyone, I am a final-year computer science student and my final semester comprises a project. The project duration is 5 months. I am good at ML and deep learning, and have done some projects on reinforcement learning, like playing Atari games using deep Q-learning. Can someone recommend me some good reinforcement learning projects? If you can, could you recommend some robotics RL tasks? submitted by /u/Santhosh999 [link] [comments]  ( 55 min )
    When there are multiple envs running in parallel, how do you deal with the fact that one of those environments might be done before the maximum number of steps while the others are still running?
    Right now, there is a bug in my code: when one of those environments is done, I reset it, BUT instead of treating the post-reset observation as the start of a new episode (which is what I should do), I append it to the same episode (basically re-starting it after the reset). This is because I'm not sure how to implement the case where env1 might already be running episode 3 while env2 is still running episode 2. Could any of you help? submitted by /u/No_Possibility_7588 [link] [comments]  ( 54 min )
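    The standard fix is to keep per-env episode buffers: when an env reports done, close out that env's episode and treat the post-reset observation as the first observation of the next episode. A minimal sketch, assuming the classic 4-tuple gym step API:

    ```python
    # Minimal per-env bookkeeping sketch: each parallel env tracks its own
    # episode independently, so env1 can be on episode 3 while env2 is on 2.
    import gym

    envs = [gym.make("CartPole-v1") for _ in range(4)]
    obs = [env.reset() for env in envs]
    episodes = [[] for _ in envs]          # transitions of the *current* episode
    finished_episodes = []

    for step in range(1000):
        actions = [env.action_space.sample() for env in envs]  # your policy here
        for i, env in enumerate(envs):
            next_obs, reward, done, info = env.step(actions[i])
            episodes[i].append((obs[i], actions[i], reward, next_obs, done))
            if done:
                finished_episodes.append(episodes[i])  # close out this episode
                episodes[i] = []                       # start a fresh one
                obs[i] = env.reset()   # reset obs belongs to the NEW episode
            else:
                obs[i] = next_obs
    ```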
    Wrapper of Stable-baselines3 for IsaacGym?
    Hi, has anybody tried to use Stable-Baselines3 with the recent version of Isaac Gym preview and can point me to a relevant GitHub repo? Thank you submitted by /u/Fun-Moose-3841 [link] [comments]  ( 57 min )
    Robot magician Vs UR3
    Why do people use the UR3 for research instead of cheaper alternatives like the magician? What extra capabilities do these robots provide that justify the expensive price? I'm looking to do Reinforcement Learning research using a robot arm in python. submitted by /u/SuperDuperDooken [link] [comments]  ( 57 min )
    Proof by existence that extrinsic reward is not enough to achieve human-level learning:
    I read the DishBrain paper and replicated their experiment in an RL environment where the goal was to play Pong, but instead of extrinsic reward, random noise was applied to the state space as a penalty. Given this, if our RL learning algorithm is similar to that of biological systems, then it should learn to play Pong when put in this environment, just like in the DishBrain paper. I implemented Deep Active Inference, which was basically just SAC with a state-predictive intrinsic reward (Friston/VFE minimization theory). This was able to learn in the environment, whereas a pure extrinsic-reward-maximization algorithm could never learn in the env (extrinsic reward was just 0). Kind of a trivial project at the implementation level, but on the theory side it convinced me that a predictive signal is necessary for human-level learning. Curious what y'all think; personally it has inspired me to look into the power of future-state-predictive objectives, such as Forward Mutual Information state abstractions and emergent social learning in MARL envs. submitted by /u/jms4607 [link] [comments]  ( 62 min )
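    For concreteness, a state-predictive intrinsic reward of the kind described above can be as simple as the negated prediction error of a learned forward model. This is a generic sketch, not the poster's exact deep active inference objective:

    ```python
    import torch
    import torch.nn as nn

    class ForwardModel(nn.Module):
        # Predicts the next state from (state, action); its prediction error can
        # serve as an intrinsic reward signal even when extrinsic reward is zero.
        def __init__(self, state_dim, action_dim, hidden=128):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, state_dim))

        def forward(self, s, a):
            return self.net(torch.cat([s, a], dim=-1))

    def intrinsic_reward(model, s, a, s_next):
        # Surprise = squared prediction error; rewarding its negation drives
        # the agent toward predictable (noise-free) regions of the state space.
        with torch.no_grad():
            return -((model(s, a) - s_next) ** 2).mean(dim=-1)
    ```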
    How to deal with non-synchronous decisions (actions) in multi-agent reinforcement learning?
    While designing a multi-agent reinforcement learning environment, I found that the number of agents in the function "step(action)" is not fixed. Some agents reach the destination earlier, while some arrive later, hence they pick actions at different times. In this situation, how should the environment be designed for reinforcement learning? submitted by /u/Low_Letterhead_23 [link] [comments]  ( 54 min )
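    One common pattern is an event-driven environment: step() only collects actions from agents currently at a decision point, and simulated time advances to the next completion. A toy sketch where an action's value doubles as its duration:

    ```python
    import heapq

    class AsyncMultiAgentEnv:
        """Toy event-driven env: each action takes a per-agent, action-dependent
        amount of time; step() only asks for actions from agents whose previous
        action has completed, so decisions are naturally non-synchronous."""

        def __init__(self, agent_ids):
            self.t = 0.0
            self.events = []                  # (finish_time, agent_id) heap
            self.waiting = set(agent_ids)     # agents needing an action now

        def step(self, actions):
            # `actions` maps each currently-waiting agent to its action; here
            # the action's "duration" is just the action value itself.
            for agent, a in actions.items():
                heapq.heappush(self.events, (self.t + float(a), agent))
            # Advance time to the next completion; all agents finishing at
            # that instant become the next decision-makers.
            finish_t, agent = heapq.heappop(self.events)
            self.t = finish_t
            self.waiting = {agent}
            while self.events and self.events[0][0] == finish_t:
                self.waiting.add(heapq.heappop(self.events)[1])
            return self.t, self.waiting       # obs/reward omitted for brevity
    ```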
  • Open

    Research Focus: Week of November 28, 2022
    This special edition of Research Focus highlights some of the 100+ papers from Microsoft Research that were accepted for publication at NeurIPS 2022 – the thirty-sixth annual Conference on Neural Information Processing Systems. Few-shot Task-agnostic Neural Architecture Search for Distilling Large Language Models Dongkuan Xu, Subhabrata Mukherjee, Xiaodong Liu, Debadeepta Dey, Wenhui Wang, Xiang Zhang, Ahmed […] The post Research Focus: Week of November 28, 2022 appeared first on Microsoft Research.  ( 13 min )
  • Open

    Better Language Models Without Massive Compute
    Posted by Jason Wei and Yi Tay, Research Scientists, Google Research, Brain Team In recent years, language models (LMs) have become more prominent in natural language processing (NLP) research and are also becoming increasingly impactful in practice. Scaling up LMs has been shown to improve performance across a range of NLP tasks. For instance, scaling up language models can improve perplexity across seven orders of magnitude of model sizes, and new abilities such as multi-step reasoning have been observed to arise as a result of model scale. However, one of the challenges of continued scaling is that training new, larger models requires great amounts of computational resources. Moreover, new models are often trained from scratch and do not leverage the weights from previously existing mo…  ( 92 min )
  • Open

    Siemens Taps Omniverse Replicator on AWS for Synthetic Data Generation to Accelerate Defect Detection Model Development by 5X
    Industrial leader Siemens is accelerating development of defect detection models with 3D synthetic data generation from NVIDIA Omniverse, the latest manufacturing gains to emerge from an extended partnership for the industrial metaverse that aims to advance digital twins. The Siemens Xcelerator and NVIDIA Omniverse platforms are building connections to enable full-design-fidelity, live digital twins that Read article > The post Siemens Taps Omniverse Replicator on AWS for Synthetic Data Generation to Accelerate Defect Detection Model Development by 5X appeared first on NVIDIA Blog.  ( 6 min )
    3D Artist and Educator Hsin-Chien Huang Takes VR to the World Stage This Week ‘In the NVIDIA Studio’
    3D artist, virtual reality expert, storyteller and educator Hsin-Chien Huang shares his unique creator journey and award-winning artwork Samsara this week In the NVIDIA Studio. The post 3D Artist and Educator Hsin-Chien Huang Takes VR to the World Stage This Week ‘In the NVIDIA Studio’ appeared first on NVIDIA Blog.  ( 6 min )
  • Open

    AWS Unveils New AI Service Features and Enhancements at re:Invent 2022
    Over the last 5 years, artificial intelligence (AI) and machine learning (ML) have evolved from a niche activity to a rapidly growing mainstream endeavor. Today, more than 100,000 customers across numerous industries rely on AWS for ML and AI initiatives that infuse AI into a broad range of business use cases to automate repetitive and […]  ( 9 min )
  • Open

    On the Robustness of Average Losses for Partial-Label Learning. (arXiv:2106.06152v2 [cs.LG] UPDATED)
    Partial-label learning (PLL) utilizes instances with PLs, where a PL includes several candidate labels but only one is the true label (TL). In PLL, the identification-based strategy (IBS) purifies each PL on the fly to select the (most likely) TL for training; the average-based strategy (ABS) treats all candidate labels equally for training and lets trained models predict the TL. Although PLL research has focused on IBS for better performance, ABS is also worthy of study, since modern IBS behaves like ABS at the beginning of training to prepare for PL purification and TL selection. In this paper, we analyze why ABS was unsatisfactory and propose how to improve it. Theoretically, we formalize five problem settings of PLL and prove that average PL losses (APLLs) with bounded multi-class losses are always robust, while APLLs with unbounded losses may be non-robust, which is the first robustness analysis for PLL. Experimentally, we have two promising findings: ABS using bounded losses can match/exceed the state-of-the-art performance of IBS using unbounded losses; after using robust APLLs to warm start, IBS can further improve upon itself. Our work draws attention to ABS research, which can in turn boost IBS and push forward the whole field of PLL.  ( 2 min )
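    For concreteness, an average PL loss is just the mean of a per-label loss over the candidate set, and the robustness result hinges on whether that base loss is bounded. A sketch contrasting a bounded (MAE) and an unbounded (cross-entropy) choice; the details are illustrative, not the paper's exact formulation:

    ```python
    import torch

    def average_pll_loss(logits, candidate_mask, bounded=True):
        """Average partial-label loss: mean of a per-label loss over the
        candidate set. With a bounded base loss (MAE here) the average loss
        is robust per the paper; with unbounded CE it may not be.

        logits: (batch, num_classes); candidate_mask: (batch, num_classes) in {0,1}.
        """
        probs = torch.softmax(logits, dim=-1)
        if bounded:
            # MAE between the predicted distribution and each candidate
            # one-hot equals 2 * (1 - p_y), hence bounded in [0, 2].
            per_label = (1.0 - probs) * 2.0
        else:
            per_label = -torch.log(probs.clamp_min(1e-12))   # unbounded CE
        k = candidate_mask.sum(dim=-1).clamp_min(1)
        return ((per_label * candidate_mask).sum(dim=-1) / k).mean()
    ```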
    RLOps: Development Life-cycle of Reinforcement Learning Aided Open RAN. (arXiv:2111.06978v2 [cs.NI] UPDATED)
    Radio access network (RAN) technologies continue to evolve, with Open RAN gaining the most recent momentum. In the O-RAN specifications, the RAN intelligent controllers (RICs) are software-defined orchestration and automation functions for the intelligent management of RAN. This article introduces principles for machine learning (ML), in particular, reinforcement learning (RL) applications in the O-RAN stack. Furthermore, we review the state-of-the-art research in wireless networks and cast it onto the RAN framework and the hierarchy of the O-RAN architecture. We provide a taxonomy for the challenges faced by ML/RL models throughout the development life-cycle: from the system specification to production deployment (data acquisition, model design, testing and management, etc.). To address the challenges, we integrate a set of existing MLOps principles with unique characteristics when RL agents are considered. This paper discusses a systematic model development, testing and validation life-cycle, termed: RLOps. We discuss fundamental parts of RLOps, which include: model specification, development, production environment serving, operations monitoring and safety/security. Based on these principles, we propose the best practices for RLOps to achieve an automated and reproducible model development process. At last, a holistic data analytics platform rooted in the O-RAN deployment is designed and implemented, aiming to embrace and fulfil the aforementioned principles and best practices of RLOps.  ( 3 min )
    Temporal Representation Learning on Monocular Videos for 3D Human Pose Estimation. (arXiv:2012.01511v5 [cs.CV] UPDATED)
    In this paper we propose an unsupervised feature extraction method to capture temporal information on monocular videos, where we detect and encode the subject of interest in each frame and leverage contrastive self-supervised (CSS) learning to extract rich latent vectors. Instead of simply treating the latent features of nearby frames as positive pairs and those of temporally-distant ones as negative pairs as in other CSS approaches, we explicitly disentangle each latent vector into a time-variant component and a time-invariant one. We then show that applying contrastive loss only to the time-variant features, and encouraging a gradual transition on them between nearby and distant frames while also reconstructing the input, extracts rich temporal features well-suited for human pose estimation. Our approach reduces error by about 50% compared to the standard CSS strategies, outperforms other unsupervised single-view methods and matches the performance of multi-view techniques. When 2D pose is available, our approach can extract even richer latent features and improve the 3D pose estimation accuracy, outperforming other state-of-the-art weakly supervised methods.  ( 2 min )
    Multi-Class Uncertainty Calibration via Mutual Information Maximization-based Binning. (arXiv:2006.13092v7 [cs.LG] UPDATED)
    Post-hoc multi-class calibration is a common approach for providing high-quality confidence estimates of deep neural network predictions. Recent work has shown that widely used scaling methods underestimate their calibration error, while alternative Histogram Binning (HB) methods often fail to preserve classification accuracy. When classes have small prior probabilities, HB also faces the issue of severe sample-inefficiency after the conversion into K one-vs-rest class-wise calibration problems. The goal of this paper is to resolve the identified issues of HB in order to provide calibrated confidence estimates using only a small holdout calibration dataset for bin optimization while preserving multi-class ranking accuracy. From an information-theoretic perspective, we derive the I-Max concept for binning, which maximizes the mutual information between labels and quantized logits. This concept mitigates potential loss in ranking performance due to lossy quantization, and by disentangling the optimization of bin edges and representatives allows simultaneous improvement of ranking and calibration performance. To improve the sample efficiency and estimates from a small calibration set, we propose a shared class-wise (sCW) calibration strategy, sharing one calibrator among similar classes (e.g., with similar class priors) so that the training sets of their class-wise calibration problems can be merged to train the single calibrator. The combination of sCW and I-Max binning outperforms the state-of-the-art calibration methods on various evaluation metrics across different benchmark datasets and models, using a small calibration set (e.g., 1k samples for ImageNet).  ( 2 min )
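    For context, the vanilla HB baseline that I-Max refines can be sketched in a few lines: equal-width bins with empirical-accuracy representatives, applied one-vs-rest per class. Note this is the baseline, not I-Max itself:

    ```python
    import numpy as np

    def histogram_binning(confidences, labels, n_bins=15):
        """Vanilla equal-width histogram binning for one class (one-vs-rest):
        each bin's calibrated confidence is the empirical accuracy inside it.
        I-Max instead places bin edges to maximize mutual information between
        labels and binned logits."""
        edges = np.linspace(0.0, 1.0, n_bins + 1)
        bin_ids = np.clip(np.digitize(confidences, edges) - 1, 0, n_bins - 1)
        reps = np.zeros(n_bins)
        for b in range(n_bins):
            mask = bin_ids == b
            # Empirical accuracy as the bin representative (fallback: bin center).
            reps[b] = labels[mask].mean() if mask.any() else (edges[b] + edges[b + 1]) / 2

        def calibrate(new_conf):
            nb = np.clip(np.digitize(new_conf, edges) - 1, 0, n_bins - 1)
            return reps[nb]

        return calibrate
    ```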
    ifMixup: Interpolating Graph Pair to Regularize Graph Classification. (arXiv:2110.09344v3 [cs.LG] UPDATED)
    We present a simple and yet effective interpolation-based regularization technique, aiming to improve the generalization of Graph Neural Networks (GNNs) on supervised graph classification. We leverage Mixup, an effective regularizer for vision, where random sample pairs and their labels are interpolated to create synthetic images for training. Unlike images with grid-like coordinates, graphs have arbitrary structure and topology, which can be very sensitive to any modification that alters the graph's semantic meaning. This poses two unanswered questions for Mixup-like regularization schemes: Can we directly mix up a pair of graph inputs? If so, how well does such a mixing strategy regularize the learning of GNNs? To answer these two questions, we propose ifMixup, which first adds dummy nodes to make two graphs have the same input size and then simultaneously performs linear interpolation between the aligned node feature vectors and the aligned edge representations of the two graphs. We empirically show that such a simple mixing scheme can effectively regularize classification learning, resulting in superior predictive accuracy to popular graph augmentation and GNN methods.  ( 2 min )
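    The pad-then-interpolate step is straightforward on dense representations. A minimal sketch of the idea, assuming dense adjacency matrices and leaving label mixing to the caller:

    ```python
    import torch

    def if_mixup(x1, a1, x2, a2, lam=0.5):
        """Sketch of the ifMixup idea: pad the smaller graph with dummy
        (all-zero) nodes so both graphs have the same size, then linearly
        interpolate node features and (dense) adjacency matrices.

        x: (n, d) node features; a: (n, n) adjacency. Label mixing
        (y_mix = lam * y1 + (1 - lam) * y2) is done outside this function.
        """
        n = max(x1.shape[0], x2.shape[0])

        def pad(x, a):
            xp = torch.zeros(n, x.shape[1]); xp[:x.shape[0]] = x
            ap = torch.zeros(n, n); ap[:a.shape[0], :a.shape[0]] = a
            return xp, ap

        x1, a1 = pad(x1, a1)
        x2, a2 = pad(x2, a2)
        return lam * x1 + (1 - lam) * x2, lam * a1 + (1 - lam) * a2
    ```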
    The Effect of Diversity in Meta-Learning. (arXiv:2201.11775v3 [cs.LG] UPDATED)
    Recent studies show that task distribution plays a vital role in the meta-learner's performance. Conventional wisdom is that task diversity should improve the performance of meta-learning. In this work, we find evidence to the contrary; (i) our experiments draw into question the efficacy of our learned models: similar manifolds can be learned with a subset of the data (lower task diversity). This finding questions the advantage of providing more data to the model, and (ii) adding diversity to the task distribution (higher task diversity) sometimes hinders the model and does not lead to a significant improvement in performance as previously believed. To strengthen our findings, we provide both empirical and theoretical evidence.  ( 2 min )
    Exploiting Higher Order Smoothness in Derivative-free Optimization and Continuous Bandits. (arXiv:2006.07862v4 [cs.LG] UPDATED)
    We study the problem of zero-order optimization of a strongly convex function. The goal is to find the minimizer of the function by a sequential exploration of its values, under measurement noise. We study the impact of higher order smoothness properties of the function on the optimization error and on the cumulative regret. To solve this problem we consider a randomized approximation of the projected gradient descent algorithm. The gradient is estimated by a randomized procedure involving two function evaluations and a smoothing kernel. We derive upper bounds for this algorithm both in the constrained and unconstrained settings and prove minimax lower bounds for any sequential search method. Our results imply that the zero-order algorithm is nearly optimal in terms of sample complexity and the problem parameters. Based on this algorithm, we also propose an estimator of the minimum value of the function achieving almost sharp oracle behavior. We compare our results with the state-of-the-art, highlighting a number of key improvements.  ( 2 min )
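    The randomized two-point gradient estimator at the heart of such methods is short to state. A sketch of the basic (unconstrained, kernel-free) version; the paper additionally projects onto the constraint set and weights the random direction with a smoothing kernel to exploit higher-order smoothness:

    ```python
    import numpy as np

    def two_point_grad(f, x, h=1e-3, rng=np.random.default_rng(0)):
        # Two function evaluations along a random direction u on the unit sphere;
        # in expectation this approximates the gradient of a smoothed f.
        u = rng.standard_normal(x.shape)
        u /= np.linalg.norm(u)
        d = x.size
        return d * (f(x + h * u) - f(x - h * u)) / (2 * h) * u

    def zero_order_descent(f, x0, steps=500, lr=0.05):
        # Plain gradient descent driven by the estimator, with a decaying step.
        x = np.asarray(x0, dtype=float)
        for t in range(steps):
            x -= lr / (t + 1) ** 0.5 * two_point_grad(f, x)
        return x

    # Example: minimize a strongly convex quadratic from function values alone.
    print(zero_order_descent(lambda z: np.sum((z - 1.0) ** 2), np.zeros(3)))
    ```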
    FairFed: Enabling Group Fairness in Federated Learning. (arXiv:2110.00857v3 [cs.LG] UPDATED)
    Training ML models which are fair across different demographic groups is of critical importance due to the increased integration of ML in crucial decision-making scenarios such as healthcare and recruitment. Federated learning has been viewed as a promising solution for collaboratively training machine learning models among multiple parties while maintaining the privacy of their local data. However, federated learning also poses new challenges in mitigating the potential bias against certain populations (e.g., demographic groups), as this typically requires centralized access to the sensitive information (e.g., race, gender) of each datapoint. Motivated by the importance and challenges of group fairness in federated learning, in this work, we propose FairFed, a novel algorithm for fairness-aware aggregation to enhance group fairness in federated learning. Our proposed approach is server-side and agnostic to the applied local debiasing thus allowing for flexible use of different local debiasing methods across clients. We evaluate FairFed empirically versus common baselines for fair ML and federated learning, and demonstrate that it provides fairer models particularly under highly heterogeneous data distributions across clients. We also demonstrate the benefits of FairFed in scenarios involving naturally distributed real-life data collected from different geographical locations or departments within an organization.  ( 2 min )
    Design of Turing Systems with Physics-Informed Neural Networks. (arXiv:2211.13464v1 [cs.LG])
    Reaction-diffusion (Turing) systems are fundamental to the formation of spatial patterns in nature and engineering. These systems are governed by a set of non-linear partial differential equations containing parameters that determine the rate of constituent diffusion and reaction. Critically, these parameters, such as diffusion coefficient, heavily influence the mode and type of the final pattern, and quantitative characterization and knowledge of these parameters can aid in bio-mimetic design or understanding of real-world systems. However, the use of numerical methods to infer these parameters can be difficult and computationally expensive. Typically, adjoint solvers may be used, but they are frequently unstable for very non-linear systems. Alternatively, massive amounts of iterative forward simulations are used to find the best match, but this is extremely effortful. Recently, physics-informed neural networks have been proposed as a means for data-driven discovery of partial differential equations, and have seen success in various applications. Thus, we investigate the use of physics-informed neural networks as a tool to infer key parameters in reaction-diffusion systems in the steady-state for scientific discovery or design. Our proof-of-concept results show that the method is able to infer parameters for different pattern modes and types with errors of less than 10%. In addition, the stochastic nature of this method can be exploited to provide multiple parameter alternatives to the desired pattern, highlighting the versatility of this method for bio-mimetic design. This work thus demonstrates the utility of physics-informed neural networks for inverse parameter inference of reaction-diffusion systems to enhance scientific discovery and design.  ( 2 min )
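    The inverse-PINN recipe, fitting a network to the observed pattern while penalizing the PDE residual with the unknown coefficient as a trainable parameter, can be shown on a toy 1D steady state. The reaction term and the "observed" pattern below are illustrative stand-ins, not the paper's 2D Turing systems:

    ```python
    import torch
    import torch.nn as nn

    # Minimal 1D illustration of the inverse-PINN idea: fit a network u(x) to an
    # observed steady-state pattern while treating the diffusion coefficient D as
    # a trainable parameter constrained by the PDE residual D*u'' + u - u^3 = 0.
    net = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 64),
                        nn.Tanh(), nn.Linear(64, 1))
    log_D = nn.Parameter(torch.tensor(0.0))          # learn D > 0 via exp
    opt = torch.optim.Adam(list(net.parameters()) + [log_D], lr=1e-3)

    x_obs = torch.linspace(-1, 1, 64).unsqueeze(1)
    u_obs = torch.tanh(3 * x_obs)                     # stand-in "observed" pattern

    for it in range(2000):
        x = x_obs.clone().requires_grad_(True)
        u = net(x)
        du = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
        d2u = torch.autograd.grad(du.sum(), x, create_graph=True)[0]
        residual = torch.exp(log_D) * d2u + u - u ** 3   # PDE residual
        loss = ((u - u_obs) ** 2).mean() + (residual ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()

    print("inferred D:", torch.exp(log_D).item())
    ```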
    On the Efficiency of the Neuro-Fuzzy Classifier for User Knowledge Modeling Systems. (arXiv:1910.12025v1 [cs.AI] CROSS LISTED)
    User knowledge modeling systems are used as the most effective technology for grabbing new users' attention. Moreover, these intelligent services increase the quality of service (QoS). This paper proposes two user knowledge classifiers based on artificial neural networks, used as one of the influential parts of knowledge modeling systems. We employed a multi-layer perceptron (MLP) and an adaptive neuro-fuzzy inference system (ANFIS) as the classifiers. Moreover, we used real data containing the user's study time, repetition number, exam performance, and learning percentage as our classifiers' inputs. Compared with well-known methods like KNN and Bayesian classifiers used in other research on the same data sets, our experiments present better performance. Although the number of samples in the training set is not large, the performance of the neuro-fuzzy classifier on the test set is 98.6%, which is the best result in comparison with the others. However, comparing MLP with ANFIS shows a performance reduction for MLP, although MLP is still more efficient than other methods like Bayesian and KNN. As our goal is evaluating and reporting the efficiency of a neuro-fuzzy classifier for user knowledge modeling systems, we utilized many different evaluation metrics, such as the Receiver Operating Characteristic and the Area Under its Curve, total accuracy, and Kappa statistics.  ( 2 min )
    Deep unfolding as iterative regularization for imaging inverse problems. (arXiv:2211.13452v1 [math.OC])
    Recently, deep unfolding methods that guide the design of deep neural networks (DNNs) through iterative algorithms have received increasing attention in the field of inverse problems. Unlike general end-to-end DNNs, unfolding methods have better interpretability and performance. However, to our knowledge, their accuracy and stability in solving inverse problems cannot be fully guaranteed. To bridge this gap, we modified the training procedure and proved that the unfolding method is an iterative regularization method. More precisely, we jointly learn a convex penalty function adversarially by an input-convex neural network (ICNN) to characterize the distance to a real data manifold, and train a DNN unfolded from the proximal gradient descent algorithm with this learned penalty. Assuming the real data manifold intersects the solutions of the inverse problem only at the unique real solution, we prove that the unfolded DNN converges to it stably. Furthermore, we demonstrate with an example of MRI reconstruction that the proposed method outperforms conventional unfolding methods and traditional regularization methods in terms of reconstruction quality, stability and convergence speed.  ( 2 min )
    Self-Contrastive Learning: Single-viewed Supervised Contrastive Framework using Sub-network. (arXiv:2106.15499v6 [cs.LG] UPDATED)
    Contrastive loss has significantly improved performance in supervised classification tasks by using a multi-viewed framework that leverages augmentation and label information. The augmentation enables contrast with another view of a single image but enlarges training time and memory usage. To exploit the strength of multi-views while avoiding the high computation cost, we introduce a multi-exit architecture that outputs multiple features of a single image in a single-viewed framework. To this end, we propose Self-Contrastive (SelfCon) learning, which self-contrasts within multiple outputs from the different levels of a single network. The multi-exit architecture efficiently replaces multi-augmented images and leverages various information from different layers of a network. We demonstrate that SelfCon learning improves the classification performance of the encoder network, and empirically analyze its advantages in terms of the single-view and the sub-network. Furthermore, we provide theoretical evidence of the performance increase based on the mutual information bound. For ImageNet classification on ResNet-50, SelfCon improves accuracy by +0.6% with 59% memory and 48% time of Supervised Contrastive learning, and a simple ensemble of multi-exit outputs boosts performance up to +1.5%. Our code is available at https://github.com/raymin0223/self-contrastive-learning.  ( 2 min )
    DHGE: Dual-view Hyper-Relational Knowledge Graph Embedding for Link Prediction and Entity Typing. (arXiv:2207.08562v2 [cs.AI] UPDATED)
    In the field of representation learning on knowledge graphs (KGs), a hyper-relational fact consists of a main triple and several auxiliary attribute-value descriptions, which is considered more comprehensive and specific than a triple-based fact. However, currently available hyper-relational KG embedding methods in a single view are limited in application because they weaken the hierarchical structure that represents the affiliation between entities. To overcome this limitation, we propose a dual-view hyper-relational KG structure (DH-KG) that contains a hyper-relational instance view for entities and a hyper-relational ontology view for concepts that are abstracted hierarchically from the entities. This paper defines link prediction and entity typing tasks on DH-KG for the first time and constructs two DH-KG datasets, JW44K-6K, extracted from Wikidata, and HTDM based on medical data. Furthermore, we propose DHGE, a DH-KG embedding model based on GRAN encoders, HGNNs, and joint learning. DHGE outperforms baseline models on DH-KG, according to experimental results. Finally, we provide an example of how this technology can be used to treat hypertension. Our model and new datasets are publicly available.  ( 2 min )
    Fast Sampling of Diffusion Models via Operator Learning. (arXiv:2211.13449v1 [cs.LG])
    Diffusion models have found widespread adoption in various areas. However, sampling from them is slow because it involves emulating a reverse process with hundreds-to-thousands of network evaluations. Inspired by the success of neural operators in accelerating differential equations solving, we approach this problem by solving the underlying neural differential equation from an operator learning perspective. We examine probability flow ODE trajectories in diffusion models and observe a compact energy spectrum that can be learned efficiently in Fourier space. With this insight, we propose diffusion Fourier neural operator (DFNO) with temporal convolution in Fourier space to parameterize the operator that maps initial condition to the solution trajectory, which is a continuous function in time. DFNO can be applied to any diffusion model and generate high-quality samples in one model forward call. Our method achieves the state-of-the-art FID of 4.72 on CIFAR-10 using only one model evaluation.  ( 2 min )
    On the Complexity of Counterfactual Reasoning. (arXiv:2211.13447v1 [cs.AI])
    We study the computational complexity of counterfactual reasoning in relation to the complexity of associational and interventional reasoning on structural causal models (SCMs). We show that counterfactual reasoning is no harder than associational or interventional reasoning on fully specified SCMs in the context of two computational frameworks. The first framework is based on the notion of treewidth and includes the classical variable elimination and jointree algorithms. The second framework is based on the more recent and refined notion of causal treewidth which is directed towards models with functional dependencies such as SCMs. Our results are constructive and based on bounding the (causal) treewidth of twin networks -- used in standard counterfactual reasoning that contemplates two worlds, real and imaginary -- to the (causal) treewidth of the underlying SCM structure. In particular, we show that the latter (causal) treewidth is no more than twice the former plus one. Hence, if associational or interventional reasoning is tractable on a fully specified SCM then counterfactual reasoning is tractable too. We extend our results to general counterfactual reasoning that requires contemplating more than two worlds and discuss applications of our results to counterfactual reasoning with a partially specified SCM that is coupled with data. We finally present empirical results that measure the gap between the complexities of counterfactual reasoning and associational/interventional reasoning on random SCMs.  ( 2 min )
    JAWS: Auditing Predictive Uncertainty Under Covariate Shift. (arXiv:2207.10716v2 [cs.LG] UPDATED)
    We propose JAWS, a series of wrapper methods for distribution-free uncertainty quantification tasks under covariate shift, centered on the core method JAW, the JAckknife+ Weighted with data-dependent likelihood-ratio weights. JAWS also includes computationally efficient Approximations of JAW using higher-order influence functions: JAWA. Theoretically, we show that JAW relaxes the jackknife+'s assumption of data exchangeability to achieve the same finite-sample coverage guarantee even under covariate shift. JAWA further approaches the JAW guarantee in the limit of the sample size or the influence function order under common regularity assumptions. Moreover, we propose a general approach to repurposing predictive interval-generating methods and their guarantees to the reverse task: estimating the probability that a prediction is erroneous, based on user-specified error criteria such as a safe or acceptable tolerance threshold around the true label. We then propose JAW-E and JAWA-E as the repurposed proposed methods for this Error assessment task. Practically, JAWS outperform state-of-the-art predictive inference baselines in a variety of biased real world data sets for interval-generation and error-assessment predictive uncertainty auditing tasks.  ( 2 min )
    MixMask: Revisiting Masked Siamese Self-supervised Learning in Asymmetric Distance. (arXiv:2210.11456v2 [cs.CV] UPDATED)
    Recent advances in self-supervised learning integrate masked modeling and siamese networks into a single framework to fully reap the advantages of both techniques. However, while the erase-based masking scheme in masked image modeling is well aligned with the patchifying mechanism of ViT, it was not originally designed for siamese networks built on ConvNets. Existing approaches simply inherit the default loss design from previous siamese networks and ignore the information loss after employing the masking operation in these frameworks. In this paper, we propose a filling-based masking strategy called MixMask to prevent the information loss caused by the randomly erased areas of an image in the vanilla masking method. We further introduce a flexible loss function design that takes into account the semantic distance change between two different mixed views, adapting the integrated architecture and avoiding mismatches between the transformed input and the objective in Masked Siamese ConvNets (MSCN). The flexible loss distance is calculated according to the proposed mix-masking scheme. Extensive experiments are conducted on the CIFAR-100, Tiny-ImageNet, and ImageNet-1K datasets. The results demonstrate that the proposed framework achieves better accuracy on linear probing, semi-supervised, and supervised finetuning, outperforming the state-of-the-art MSCN by a significant margin. We also show its superiority on the downstream tasks of object detection and segmentation. Our source code is available at https://github.com/LightnessOfBeing/MixMask.  ( 2 min )
    COCO-DR: Combating Distribution Shifts in Zero-Shot Dense Retrieval with Contrastive and Distributionally Robust Learning. (arXiv:2210.15212v2 [cs.CL] UPDATED)
    We present a new zero-shot dense retrieval (ZeroDR) method, COCO-DR, to improve the generalization ability of dense retrieval by combating the distribution shifts between source training tasks and target scenarios. To mitigate the impact of document differences, COCO-DR continues pretraining the language model on the target corpora to adapt the model to target distributions via COntinuous COntrastive learning. To prepare for unseen target queries, COCO-DR leverages implicit Distributionally Robust Optimization (iDRO) to reweight samples from different source query clusters, improving model robustness over rare queries during fine-tuning. COCO-DR achieves superior average performance on BEIR, the zero-shot retrieval benchmark. At BERT Base scale, COCO-DR Base outperforms other ZeroDR models that are 60x larger. At BERT Large scale, COCO-DR Large outperforms the giant GPT-3 embedding model, which has 500x more parameters. Our analysis shows the correlation between COCO-DR's effectiveness in combating distribution shifts and improving zero-shot accuracy. Our code and model can be found at https://github.com/OpenMatch/COCO-DR.  ( 2 min )
    Solving Bilevel Knapsack Problem using Graph Neural Networks. (arXiv:2211.13436v1 [cs.AI])
    The Bilevel Optimization Problem is a hierarchical optimization problem with two agents, a leader and a follower. The leader makes their own decisions first, and the followers make the best choices accordingly. The leader knows the information of the followers, and the goal of the problem is to find the optimal solution by considering the reactions of the followers from the leader's point of view. For the Bilevel Optimization Problem, there are no general and efficient algorithms or commercial solvers to obtain an optimal solution, and it is very difficult to get a good solution even for a simple problem. In this paper, we propose a deep learning approach using Graph Neural Networks to solve the bilevel knapsack problem. We train the model to predict the leader's solution and use it to transform the hierarchical optimization problem into a single-level optimization problem to obtain the final solution. Our model finds feasible solutions about 500 times faster than the exact algorithm, with a $1.7\%$ optimality gap. Also, our model performs well on problems of a different size from the size it was trained on.  ( 2 min )
    Multi-Job Intelligent Scheduling with Cross-Device Federated Learning. (arXiv:2211.13430v1 [cs.DC])
    Recent years have witnessed a large amount of decentralized data in various (edge) devices of end-users, while decentralized data aggregation remains complicated for machine learning jobs because of regulations and laws. As a practical approach to handling decentralized data, Federated Learning (FL) enables collaborative global machine learning model training without sharing sensitive raw data. The servers schedule devices to jobs within the training process of FL. However, device scheduling with multiple jobs in FL remains a critical and open problem. In this paper, we propose a novel multi-job FL framework, which enables the training process of multiple jobs in parallel. The multi-job FL framework is composed of a system model and a scheduling method. The system model enables a parallel training process of multiple jobs, with a cost model based on data fairness and the training time of diverse devices during the parallel training process. We propose a novel intelligent scheduling approach based on multiple scheduling methods, including an original reinforcement learning-based scheduling method and an original Bayesian optimization-based scheduling method, which incurs a small cost while scheduling devices to multiple jobs. We conduct extensive experimentation with diverse jobs and datasets. The experimental results reveal that our proposed approaches significantly outperform baseline approaches in terms of training time (up to 12.73 times faster) and accuracy (up to 46.4% higher).  ( 2 min )
    Collaborative Training of Medical Artificial Intelligence Models with non-uniform Labels. (arXiv:2211.13606v1 [cs.LG])
    Artificial intelligence (AI) methods are revolutionizing medical image analysis. However, robust AI models require large multi-site datasets for training. While multiple stakeholders have provided publicly available datasets, the ways in which these data are labeled differ widely. For example, one dataset of chest radiographs might contain labels denoting the presence of metastases in the lung, while another dataset of chest radiographs might focus on the presence of pneumonia. With conventional approaches, these data cannot be used together to train a single AI model. We propose a new framework that we call flexible federated learning (FFL) for collaborative training on such data. Using publicly available data of 695,000 chest radiographs from five institutions - each with differing labels - we demonstrate that large and heterogeneously labeled datasets can be used to train one big AI model with this framework. We find that models trained with FFL are superior to models that are trained on matching annotations only. This may pave the way for training of truly large-scale AI models that make efficient use of all existing data.
    FairAutoML: Embracing Unfairness Mitigation in AutoML. (arXiv:2111.06495v2 [cs.LG] UPDATED)
    In this work, we propose an Automated Machine Learning (AutoML) system to search for models that are not only accurate but also fair. We first investigate the necessity and impact of unfairness mitigation in the AutoML context, and establish the FairAutoML framework. The framework provides a novel design based on pragmatic abstractions, which makes it convenient to incorporate existing fairness definitions, unfairness mitigation techniques, and hyperparameter search methods into the model search and evaluation process. Following this framework, we develop a fair AutoML system on top of an existing AutoML system. The augmented system includes a resource allocation strategy to dynamically decide when and on which models to conduct unfairness mitigation according to the prediction accuracy, fairness, and resource consumption on the fly. Extensive empirical evaluation shows that our system can achieve good "fair accuracy" and high resource efficiency.
    On designing light-weight object trackers through network pruning: Use CNNs or transformers?. (arXiv:2211.13769v1 [cs.CV])
    Object trackers deployed on low-power devices need to be light-weight; however, most of the current state-of-the-art (SOTA) methods rely on compute-heavy backbones built using CNNs or transformers. The large sizes of such models do not allow their deployment in low-power conditions, and designing compressed variants of large tracking models is of great importance. This paper demonstrates how highly compressed light-weight object trackers can be designed using neural architectural pruning of large CNN- and transformer-based trackers. Further, a comparative study on the architectural choices best suited to designing light-weight trackers is provided. A comparison between SOTA trackers using CNNs, transformers, as well as the combination of the two is presented to study their stability at various compression ratios. Finally, results for extreme pruning scenarios, going as low as 1% in some cases, are shown to study the limits of network pruning in object tracking. This work provides deeper insights into designing highly efficient trackers from existing SOTA methods.
    Improving Multi-task Learning via Seeking Task-based Flat Regions. (arXiv:2211.13723v1 [cs.LG])
    Multi-Task Learning (MTL) is a widely used and powerful learning paradigm for training deep neural networks that allows learning more than one objective with a single backbone. Compared to training tasks separately, MTL significantly reduces computational costs, improves data efficiency, and potentially enhances model performance by leveraging knowledge across tasks. Hence, it has been adopted in a variety of applications, ranging from computer vision to natural language processing and speech recognition. Among them, there is an emerging line of work in MTL that focuses on manipulating the task gradient to derive an ultimate gradient descent direction that benefits all tasks. Despite achieving impressive results on many benchmarks, directly applying these approaches without using appropriate regularization techniques might lead to suboptimal solutions on real-world problems. In particular, standard training that minimizes the empirical loss on the training data can easily suffer from overfitting to low-resource tasks or be spoiled by noisy-labeled ones, which can cause negative transfer between tasks and an overall performance drop. To alleviate such problems, we propose to leverage a recently introduced training method, named Sharpness-aware Minimization, which can enhance model generalization ability on single-task learning. Accordingly, we present a novel MTL training methodology that encourages the model to find task-based flat minima for coherently improving its generalization capability on all tasks. Finally, we conduct comprehensive experiments on a variety of applications to demonstrate the merit of our proposed approach over existing gradient-based MTL methods, as suggested by our developed theory.
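    For readers unfamiliar with Sharpness-aware Minimization, the sketch below shows one single-task SAM step in PyTorch: ascend to the worst-case weights in a small ball, take the gradient there, then step from the original weights. The model, data, and radius rho are toy placeholders; the paper's task-based extension for MTL is not reproduced here.

        # One Sharpness-Aware Minimization (SAM) step, single-task core only.
        import torch

        model = torch.nn.Linear(10, 1)
        opt = torch.optim.SGD(model.parameters(), lr=0.1)
        x, y = torch.randn(32, 10), torch.randn(32, 1)
        rho = 0.05  # neighborhood radius (assumed hyperparameter)

        def loss_fn():
            return torch.nn.functional.mse_loss(model(x), y)

        loss_fn().backward()  # 1) gradient at the current weights
        grad_norm = torch.sqrt(sum((p.grad ** 2).sum() for p in model.parameters()))
        eps = []
        with torch.no_grad():  # 2) ascend to the worst case in the rho-ball
            for p in model.parameters():
                e = rho * p.grad / (grad_norm + 1e-12)
                p.add_(e)
                eps.append(e)
        model.zero_grad()
        loss_fn().backward()  # 3) gradient at the perturbed weights
        with torch.no_grad():  # undo the perturbation, then descend
            for p, e in zip(model.parameters(), eps):
                p.sub_(e)
        opt.step()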
    Generative Joint Source-Channel Coding for Semantic Image Transmission. (arXiv:2211.13772v1 [eess.IV])
    Recent works have shown that joint source-channel coding (JSCC) schemes using deep neural networks (DNNs), called DeepJSCC, provide promising results in wireless image transmission. However, these methods mostly focus on the distortion of the reconstructed signals with respect to the input image, rather than their perception by humans. Yet focusing on traditional distortion metrics alone does not necessarily result in high perceptual quality, especially in extreme physical conditions, such as very low bandwidth compression ratio (BCR) and low signal-to-noise ratio (SNR) regimes. In this work, we propose two novel JSCC schemes that leverage the perceptual quality of deep generative models (DGMs) for wireless image transmission, namely InverseJSCC and GenerativeJSCC. While the former is an inverse problem approach to DeepJSCC, the latter is an end-to-end optimized JSCC scheme. In both, we optimize a weighted sum of mean squared error (MSE) and learned perceptual image patch similarity (LPIPS) losses, which capture more semantic similarities than other distortion metrics. InverseJSCC performs denoising on the distorted reconstructions of a DeepJSCC model by solving an inverse optimization problem using a style-based generative adversarial network (StyleGAN). Our simulation results show that InverseJSCC significantly improves on the state-of-the-art (SotA) DeepJSCC in terms of perceptual quality in edge cases. In GenerativeJSCC, we carry out end-to-end training of an encoder and a StyleGAN-based decoder, and show that GenerativeJSCC significantly outperforms DeepJSCC both in terms of distortion and perceptual quality.
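    The distortion-perception objective stated above (a weighted sum of MSE and LPIPS) is easy to reproduce. The sketch below assumes the third-party lpips package (pip install lpips); the weight lam is a hypothetical hyperparameter, not a value from the paper.

        # Weighted MSE + LPIPS loss, as described in the abstract.
        import torch
        import lpips

        perc = lpips.LPIPS(net="alex")  # learned perceptual metric
        lam = 0.5                       # assumed trade-off weight

        def jscc_loss(x_hat, x):
            # x, x_hat: image batches in [-1, 1], shape (B, 3, H, W)
            mse = torch.mean((x_hat - x) ** 2)
            return mse + lam * perc(x_hat, x).mean()

        x = torch.rand(2, 3, 64, 64) * 2 - 1
        x_hat = (x + 0.1 * torch.randn_like(x)).clamp(-1, 1)
        print(jscc_loss(x_hat, x).item())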
    A Non-Classical Parameterization for Density Estimation Using Sample Moments. (arXiv:2201.04786v4 [stat.ML] UPDATED)
    Moment methods are an important means of density estimation, but they generally depend strongly on the choice of feasible functions, which severely affects their performance. In this paper, which is a preliminary version, we propose a non-classical parametrization for density estimation using sample moments that does not require choosing such functions. The parametrization is induced by the squared Hellinger distance, and its solution, which is proved to exist and to be unique subject to a simple prior that does not depend on the data, can be obtained by convex optimization. Simulation results show the performance of the proposed estimator in estimating multi-modal densities that are mixtures of different types of functions, with a comparison to the prevailing methods.
    Time delay estimation of traffic congestion propagation due to accidents based on statistical causality. (arXiv:2108.06717v3 [stat.ML] UPDATED)
    The accurate estimation of time delays is crucial in traffic congestion analysis, as this information can be used to address fundamental questions regarding the origin and propagation of traffic congestion. However, the exact measurement of time delays during congestion remains a challenge owing to the complex propagation process between roads and high uncertainty regarding future behavior. To overcome this challenge, we propose a novel time delay estimation method for the propagation of traffic congestion due to accidents using lag-specific transfer entropy (TE). The proposed method adopts Markov bootstrap techniques to quantify uncertainty in the time delay estimator. To the best of our knowledge, our proposed method is the first to estimate time delays based on causal relationships between adjacent roads. We validated the method's efficacy using simulated data, as well as real user trajectory data obtained from a major GPS navigation system in South Korea.
    Modelling Direct Messaging Networks with Multiple Recipients for Cyber Deception. (arXiv:2111.11932v2 [cs.CR] UPDATED)
    Cyber deception is emerging as a promising approach to defending networks and systems against attackers and data thieves. However, despite being relatively cheap to deploy, the generation of realistic content at scale is very costly, because rich, interactive deceptive technologies are largely hand-crafted. With recent improvements in Machine Learning, we now have the opportunity to bring scale and automation to the creation of realistic and enticing simulated content. In this work, we propose a framework to automate the generation of email and instant messaging-style group communications at scale. Such messaging platforms within organisations contain a lot of valuable information inside private communications and document attachments, making them an enticing target for an adversary. We address two key aspects of simulating this type of system: modelling when and with whom participants communicate, and generating topical, multi-party text to populate simulated conversation threads. We present the LogNormMix-Net Temporal Point Process as an approach to the first of these, building upon the intensity-free modeling approach of Shchur et al. to create a generative model for unicast and multi-cast communications. We demonstrate the use of fine-tuned, pre-trained language models to generate convincing multi-party conversation threads. A live email server is simulated by uniting our LogNormMix-Net TPP (to generate the communication timestamp, sender and recipients) with the language model, which generates the contents of the multi-party email threads. We evaluate the generated content with respect to a number of realism-based properties that encourage a model to learn to generate content that will engage the attention of an adversary to achieve a deception outcome.
    Susceptibility to Image Resolution in Face Recognition and Training Strategies. (arXiv:2107.03769v2 [cs.CV] UPDATED)
    Face recognition approaches often rely on equal image resolution for verifying faces on two images. However, in practical applications, those image resolutions are usually not in the same range due to different image capture mechanisms or sources. In this work, we first analyze the impact of image resolutions on face verification performance with a state-of-the-art face recognition model. For images synthetically reduced to $5\,\times\,5$ px resolution, the verification performance drops from $99.23\%$ down to almost $55\%$. Especially for cross-resolution image pairs (one high- and one low-resolution image), the verification accuracy decreases even further. We investigate this behavior more in-depth by looking at the feature distances for every 2-image test pair. To tackle this problem, we propose the following two methods: 1) Train a state-of-the-art face-recognition model straightforwardly with $50\%$ low-resolution images directly within each batch. 2) Train a siamese-network structure and add a cosine distance feature loss between high- and low-resolution features. Both methods show an improvement for cross-resolution scenarios and can increase the accuracy at very low resolution to approximately $70\%$. However, a disadvantage is that a specific model needs to be trained for every resolution pair. Thus, we extend the aforementioned methods by training them with multiple image resolutions at once. The performance for particular testing image resolutions is slightly worse, but the advantage is that this model can be applied to arbitrary resolution images and achieves an overall better performance ($97.72\%$ compared to $96.86\%$). Due to the lack of a benchmark for arbitrary resolution images for the cross-resolution and equal-resolution task, we propose an evaluation protocol for five well-known datasets, focusing on high, mid, and low-resolution images.
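    Method (2) above is straightforward to sketch: a siamese pass produces high- and low-resolution embeddings, and a cosine distance term pulls them together while an identity loss keeps the low-resolution branch discriminative. The backbone, classifier, and degradation below are stand-ins, not the paper's model.

        # Siamese cross-resolution training sketch with a cosine feature loss.
        import torch
        import torch.nn.functional as F

        backbone = torch.nn.Sequential(torch.nn.Flatten(),
                                       torch.nn.Linear(3 * 32 * 32, 128))
        classifier = torch.nn.Linear(128, 10)

        def cross_res_loss(img_hr, img_lr, labels):
            f_hr, f_lr = backbone(img_hr), backbone(img_lr)
            id_loss = F.cross_entropy(classifier(f_lr), labels)      # recognize from LR
            cos_loss = (1 - F.cosine_similarity(f_hr, f_lr)).mean()  # align embeddings
            return id_loss + cos_loss

        img_hr = torch.randn(8, 3, 32, 32)
        img_lr = F.interpolate(F.interpolate(img_hr, size=8), size=32)  # degrade, upsample
        labels = torch.randint(0, 10, (8,))
        cross_res_loss(img_hr, img_lr, labels).backward()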
    Does Joint Training Really Help Cascaded Speech Translation?. (arXiv:2210.13700v2 [eess.AS] UPDATED)
    Currently, in speech translation, the straightforward approach - cascading a recognition system with a translation system - delivers state-of-the-art results. However, fundamental challenges such as error propagation from the automatic speech recognition system still remain. To mitigate these problems, researchers have recently turned their attention to direct data and proposed various joint training methods. In this work, we seek to answer the question of whether joint training really helps cascaded speech translation. We review recent papers on the topic and also investigate a joint training criterion by marginalizing the transcription posterior probabilities. Our findings show that a strong cascaded baseline can diminish any improvements obtained using joint training, and we suggest alternatives to joint training. We hope this work can serve as a refresher on the current speech translation landscape and motivate research into finding more efficient and creative ways to utilize the direct data for speech translation.
    Active Learning and Novel Model Calibration Measurements for Automated Visual Inspection in Manufacturing. (arXiv:2209.05486v2 [cs.LG] UPDATED)
    Quality control is a crucial activity performed by manufacturing enterprises to ensure that their products meet quality standards and avoid potential damage to the brand's reputation. The decreased cost of sensors and connectivity has enabled increasing digitalization of manufacturing. In addition, artificial intelligence enables higher degrees of automation, reducing the overall costs and time required for defect inspection. This research compares three active learning approaches, with single and multiple oracles, for visual inspection. Six new metrics are proposed to assess the quality of calibration without the need for ground truth. Furthermore, this research explores whether existing calibrators can improve their performance by leveraging an approximate ground truth to enlarge the calibration set. The experiments were performed on real-world data provided by Philips Consumer Lifestyle BV. Our results show that the explored active learning settings can reduce the data labeling effort by between three and four percent without detriment to the overall quality goals, considering a threshold of p=0.95. Furthermore, the results show that the proposed calibration metrics successfully capture relevant information otherwise available only through ground-truth data. Therefore, the proposed metrics can be used to estimate the quality of models' probability calibration without committing to a labeling effort to obtain ground-truth data.
    Efficient Integration of Multi-Order Dynamics and Internal Dynamics in Stock Movement Prediction. (arXiv:2211.07400v2 [q-fin.ST] UPDATED)
    Advances in deep neural network (DNN) architectures have enabled new prediction techniques for stock market data. Unlike other multivariate time-series data, stock markets show two unique characteristics: (i) \emph{multi-order dynamics}, as stock prices are affected by strong non-pairwise correlations (e.g., within the same industry); and (ii) \emph{internal dynamics}, as each individual stock shows some particular behaviour. Recent DNN-based methods capture multi-order dynamics using hypergraphs, but rely on the Fourier basis in the convolution, which is both inefficient and ineffective. In addition, they largely ignore internal dynamics by adopting the same model for each stock, which implies a severe information loss. In this paper, we propose a framework for stock movement prediction to overcome the above issues. Specifically, the framework includes temporal generative filters that implement a memory-based mechanism onto an LSTM network in an attempt to learn individual patterns per stock. Moreover, we employ hypergraph attentions to capture the non-pairwise correlations. Here, using the wavelet basis instead of the Fourier basis enables us to simplify the message passing and focus on the localized convolution. Experiments with US market data over six years show that our framework outperforms state-of-the-art methods in terms of profit and stability. Our source code and data are available at \url{https://github.com/thanhtrunghuynh93/estimate}.
    Local Context-Aware Active Domain Adaptation. (arXiv:2208.12856v2 [cs.LG] UPDATED)
    Active Domain Adaptation (ADA) queries the labels of a small number of selected target samples to help adapt a model from a source domain to a target domain. The local context of the queried data is important, especially when the domain gap is large. However, this has not been fully explored by existing ADA works. In this paper, we propose a Local context-aware ADA framework, named LADA, to address this issue. To select informative target samples, we devise a novel criterion based on the local inconsistency of model predictions. Since the labeling budget is usually small, fine-tuning the model on only the queried data can be inefficient. We therefore progressively augment the labeled target data with confident neighbors in a class-balanced manner. Experiments validate that the proposed criterion chooses more informative target samples than existing active selection strategies. Furthermore, our full method surpasses recent ADA methods on various benchmarks. Code is available at https://github.com/tsun/LADA.
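    A minimal version of a local-inconsistency acquisition score, in the spirit of the criterion above, can be written with a nearest-neighbor search: target samples whose predicted class disagrees most with their feature-space neighbors are queried first. The features, predictions, and k below are synthetic placeholders.

        # Rank unlabeled target samples by local prediction inconsistency.
        import numpy as np
        from sklearn.neighbors import NearestNeighbors

        rng = np.random.default_rng(0)
        feats = rng.normal(size=(200, 16))     # target features from the model
        preds = rng.integers(0, 5, size=200)   # current model predictions

        k = 10
        nn = NearestNeighbors(n_neighbors=k + 1).fit(feats)
        _, idx = nn.kneighbors(feats)          # idx[:, 0] is the point itself

        # inconsistency = fraction of neighbors whose prediction differs
        inconsistency = (preds[idx[:, 1:]] != preds[:, None]).mean(axis=1)
        budget = 10
        query = np.argsort(-inconsistency)[:budget]
        print("query these target indices:", query)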
    Adjusting Pleasure-Arousal-Dominance for Continuous Emotional Text-to-speech Synthesizer. (arXiv:1906.05507v1 [eess.AS] CROSS LISTED)
    Emotion is not limited to discrete categories of happy, sad, angry, fear, disgust, surprise, and so on. Instead, each emotion category is projected into a set of nearly independent dimensions, named pleasure (or valence), arousal, and dominance, known as PAD. The value of each dimension varies from -1 to 1, such that the neutral emotion is in the center with all-zero values. Training an emotional continuous text-to-speech (TTS) synthesizer on the independent dimensions provides the possibility of emotional speech synthesis with unlimited emotion categories. Our end-to-end neural speech synthesizer is based on the well-known Tacotron. Empirically, we have found the optimum network architecture for injecting the 3D PADs. Moreover, the PAD values are adjusted for the speech synthesis purpose.
    Capturing Failures of Large Language Models via Human Cognitive Biases. (arXiv:2202.12299v2 [cs.CL] UPDATED)
    Large language models generate complex, open-ended outputs: instead of outputting a class label, they write summaries, generate dialogue, or produce working code. In order to assess the reliability of these open-ended generation systems, we aim to identify qualitative categories of erroneous behavior, beyond identifying individual errors. To hypothesize and test for such qualitative errors, we draw inspiration from human cognitive biases -- systematic patterns of deviation from rational judgement. Specifically, we use cognitive biases as motivation to (i) generate hypotheses for problems that models may have, and (ii) develop experiments that elicit these problems. Using code generation as a case study, we find that OpenAI's Codex errs predictably based on how the input prompt is framed, adjusts outputs towards anchors, and is biased towards outputs that mimic frequent training examples. We then use our framework to elicit high-impact errors such as incorrectly deleting files. Our results indicate that experimental methodology from cognitive science can help characterize how machine learning systems behave.
    Improving Multi-Task Generalization via Regularizing Spurious Correlation. (arXiv:2205.09797v2 [cs.LG] UPDATED)
    Multi-Task Learning (MTL) is a powerful learning paradigm to improve generalization performance via knowledge sharing. However, existing studies find that MTL could sometimes hurt generalization, especially when two tasks are less correlated. One possible reason that hurts generalization is spurious correlation, i.e., some knowledge is spurious and not causally related to task labels, but the model could mistakenly utilize it and thus fail when such correlations change. In the MTL setup, there exist several unique challenges of spurious correlation. First, the risk of having non-causal knowledge is higher, as the shared MTL model needs to encode all knowledge from different tasks, and causal knowledge for one task could be potentially spurious to another. Second, the confounder between task labels brings in a different type of spurious correlation to MTL. We theoretically prove that MTL is more prone to taking non-causal knowledge from other tasks than single-task learning, and thus generalizes worse. To solve this problem, we propose the Multi-Task Causal Representation Learning framework, aiming to represent multi-task knowledge via disentangled neural modules, and to learn which module is causally related to each task via MTL-specific invariant regularization. Experiments show that it can enhance an MTL model's performance by 5.5% on average over Multi-MNIST, MovieLens, Taskonomy, CityScape, and NYUv2, by alleviating the spurious correlation problem.
    UDC: Unified DNAS for Compressible TinyML Models. (arXiv:2201.05842v3 [cs.LG] UPDATED)
    Deploying TinyML models on low-cost IoT hardware is very challenging, due to limited device memory capacity. Neural processing unit (NPU) hardware addresses the memory challenge by using model compression to exploit weight quantization and sparsity to fit more parameters in the same footprint. However, designing compressible neural networks (NNs) is challenging, as it expands the design space across which we must make balanced trade-offs. This paper demonstrates Unified DNAS for Compressible (UDC) NNs, which explores a large search space to generate state-of-the-art compressible NNs for NPUs. ImageNet results show UDC networks are up to $3.35\times$ smaller (iso-accuracy) or 6.25% more accurate (iso-model size) than previous work.
    Backdoor Attack and Defense in Federated Generative Adversarial Network-based Medical Image Synthesis. (arXiv:2210.10886v2 [cs.CV] UPDATED)
    Deep Learning-based image synthesis techniques have been applied in healthcare research for generating medical images to support open research and augment medical datasets. Training generative adversarial neural networks (GANs) usually requires large amounts of training data. Federated learning (FL) provides a way of training a central model using distributed data while keeping raw data locally. However, given that the FL server cannot access the raw data, it is vulnerable to backdoor attacks, an adversarial attack that poisons the training data. Most backdoor attack strategies focus on classification models and centralized domains. It is still an open question whether the existing backdoor attacks can affect GAN training and, if so, how to defend against the attack in the FL setting. In this work, we investigate the overlooked issue of backdoor attacks in federated GANs (FedGANs). The success of this attack is subsequently determined to be the result of some local discriminators overfitting the poisoned data and corrupting the local GAN equilibrium, which then further contaminates other clients when averaging the generator's parameters and yields high generator loss. Therefore, we propose FedDetect, an efficient and effective way of defending against the backdoor attack in the FL setting, which allows the server to detect the client's adversarial behavior based on their losses and block the malicious clients. Our extensive experiments on two medical datasets with different modalities demonstrate that the backdoor attack on FedGANs can result in synthetic images with low fidelity. After detecting and suppressing the detected malicious clients using the proposed defense strategy, we show that FedGANs can synthesize high-quality medical datasets (with labels) for data augmentation to improve classification models' performance.
    PI-QT-Opt: Predictive Information Improves Multi-Task Robotic Reinforcement Learning at Scale. (arXiv:2210.08217v2 [cs.RO] UPDATED)
    The predictive information, the mutual information between the past and future, has been shown to be a useful representation learning auxiliary loss for training reinforcement learning agents, as the ability to model what will happen next is critical to success on many control tasks. While existing studies are largely restricted to training specialist agents on single-task settings in simulation, in this work, we study modeling the predictive information for robotic agents and its importance for general-purpose agents that are trained to master a large repertoire of diverse skills from large amounts of data. Specifically, we introduce Predictive Information QT-Opt (PI-QT-Opt), a QT-Opt agent augmented with an auxiliary loss that learns representations of the predictive information to solve up to 297 vision-based robot manipulation tasks in simulation and the real world with a single set of parameters. We demonstrate that modeling the predictive information significantly improves success rates on the training tasks and leads to better zero-shot transfer to unseen novel tasks. Finally, we evaluate PI-QT-Opt on real robots, achieving substantial and consistent improvement over QT-Opt in multiple experimental settings of varying environments, skills, and multi-task configurations.
    Medical Diffusion -- Denoising Diffusion Probabilistic Models for 3D Medical Image Generation. (arXiv:2211.03364v2 [eess.IV] UPDATED)
    Recent advances in computer vision have shown promising results in image generation. Diffusion probabilistic models in particular have generated realistic images from textual input, as demonstrated by DALL-E 2, Imagen and Stable Diffusion. However, their use in medicine, where image data typically comprises three-dimensional volumes, has not been systematically evaluated. Synthetic images may play a crucial role in privacy-preserving artificial intelligence and can also be used to augment small datasets. Here we show that diffusion probabilistic models can synthesize high-quality medical imaging data, which we demonstrate for Magnetic Resonance Imaging (MRI) and Computed Tomography (CT) images. We provide quantitative measurements of their performance through a reader study with two medical experts who rated the quality of the synthesized images in three categories: realistic image appearance, anatomical correctness, and consistency between slices. Furthermore, we demonstrate that synthetic images can be used in self-supervised pre-training and improve the performance of breast segmentation models when data is scarce (Dice score 0.91 vs. 0.95 without vs. with synthetic data).
    Graph Coloring with Physics-Inspired Graph Neural Networks. (arXiv:2202.01606v3 [cs.LG] UPDATED)
    We show how graph neural networks can be used to solve the canonical graph coloring problem. We frame graph coloring as a multi-class node classification problem and utilize an unsupervised training strategy based on the statistical physics Potts model. Generalizations to other multi-class problems such as community detection, data clustering, and the minimum clique cover problem are straightforward. We provide numerical benchmark results and illustrate our approach with an end-to-end application for a real-world scheduling use case within a comprehensive encode-process-decode framework. Our optimization approach performs on par or outperforms existing solvers, with the ability to scale to problems with millions of variables.
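    The unsupervised Potts objective lends itself to a compact sketch: treat q-coloring as soft node classification and minimize a differentiable count of monochromatic edges. The toy graph and optimizer settings below are illustrative, and the paper trains a GNN rather than free per-node logits.

        # Physics-inspired graph coloring: minimize a soft Potts loss.
        import torch

        n, q = 6, 3
        edges = torch.tensor([[0, 1], [1, 2], [2, 0], [2, 3], [3, 4], [4, 5]])
        logits = torch.randn(n, q, requires_grad=True)  # stand-in for GNN outputs
        opt = torch.optim.Adam([logits], lr=0.1)

        for step in range(200):
            p = torch.softmax(logits, dim=1)
            # sum over edges of <p_u, p_v>: expected number of same-color edges
            potts = (p[edges[:, 0]] * p[edges[:, 1]]).sum()
            opt.zero_grad()
            potts.backward()
            opt.step()

        print("coloring:", logits.argmax(dim=1).tolist())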
    Machine Learning based Extraction of Boundary Conditions from Doppler Echo Images for Patient Specific Coarctation of the Aorta: Computational Fluid Dynamics Study. (arXiv:2209.09139v3 [cs.CE] UPDATED)
    Purpose- Coarctation of the Aorta (CoA) patient-specific computational fluid dynamics (CFD) studies in resource-constrained settings are limited by the available imaging modalities for geometry and velocity data acquisition. Doppler echocardiography has been seen as a suitable velocity acquisition modality due to its higher availability and safety. This study aimed to investigate the application of classical machine learning (ML) methods to create an adequate and robust approach for obtaining boundary conditions (BCs) from Doppler echocardiography images, for haemodynamic modeling using CFD. Methods- Our proposed approach combines ML and CFD to model haemodynamic flow within the region of interest, with the key feature being the use of ML models to calibrate the inlet and outlet boundary conditions (BCs) of the CFD model. The key input variable for the ML model was the patient's heart rate, as this was the parameter that varied in time across the measured vessels within the study. ANSYS Fluent was used for the CFD component of the study, whilst the scikit-learn Python library was used for the ML component. Results- We validated our approach against a real clinical case of severe CoA before intervention. The maximum coarctation velocity of our simulations was compared to the measured maximum coarctation velocity obtained from the patient whose geometry is used within the study. Of the 5 ML models used to obtain BCs, the top model was within 5\% of the measured maximum coarctation velocity. Conclusion- The framework demonstrated that it was capable of taking variations of the patient's heart rate between measurements into account. Thus, it enables the calculation of BCs that are physiologically realistic when the heart rate is scaled across each vessel, whilst providing a reasonably accurate solution.
    Neural Graph Databases. (arXiv:2209.09732v2 [cs.LG] UPDATED)
    Graph databases (GDBs) enable processing and analysis of unstructured, complex, rich, and usually vast graph datasets. Despite the large significance of GDBs in both academia and industry, little effort has been made into integrating them with the predictive power of graph neural networks (GNNs). In this work, we show how to seamlessly combine nearly any GNN model with the computational capabilities of GDBs. For this, we observe that the majority of these systems are based on, or support, a graph data model called the Labeled Property Graph (LPG), where vertices and edges can have arbitrarily complex sets of labels and properties. We then develop LPG2vec, an encoder that transforms an arbitrary LPG dataset into a representation that can be directly used with a broad class of GNNs, including convolutional, attentional, message-passing, and even higher-order or spectral models. In our evaluation, we show that the rich information represented as LPG labels and properties is properly preserved by LPG2vec, and it increases the accuracy of predictions regardless of the targeted learning task or the used GNN model, by up to 34% compared to graphs with no LPG labels/properties. In general, LPG2vec enables combining predictive power of the most powerful GNNs with the full scope of information encoded in the LPG model, paving the way for neural graph databases, a class of systems where the vast complexity of maintained data will benefit from modern and future graph machine learning methods.
    Graph Neural Network Expressivity and Meta-Learning for Molecular Property Regression. (arXiv:2209.13410v2 [cs.LG] UPDATED)
    We demonstrate the applicability of model-agnostic algorithms for meta-learning, specifically Reptile, to GNN models in molecular regression tasks. Using meta-learning we are able to learn new chemical prediction tasks with only a few model updates, as compared to using randomly initialized GNNs, which require learning each regression task from scratch. We experimentally show that GNN layer expressivity is correlated with improved meta-learning. Additionally, we also experiment with GNN ensembles, which yield the best performance and rapid convergence for k-shot learning.
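    Since Reptile is model-agnostic, its outer loop fits in a few lines; the sketch below uses a plain linear model as a stand-in for a GNN and random tensors as stand-ins for sampled regression tasks.

        # Reptile outer loop: adapt a copy on one task, then move the
        # meta-parameters a fraction of the way toward the adapted weights.
        import copy
        import torch

        model = torch.nn.Linear(8, 1)  # stand-in for a GNN
        meta_lr, inner_lr, inner_steps = 0.1, 0.01, 5

        def inner_adapt(net, x, y):
            net = copy.deepcopy(net)
            opt = torch.optim.SGD(net.parameters(), lr=inner_lr)
            for _ in range(inner_steps):
                loss = torch.nn.functional.mse_loss(net(x), y)
                opt.zero_grad(); loss.backward(); opt.step()
            return net

        for task in range(20):
            x, y = torch.randn(16, 8), torch.randn(16, 1)  # one sampled task
            adapted = inner_adapt(model, x, y)
            with torch.no_grad():
                for p, p_task in zip(model.parameters(), adapted.parameters()):
                    p += meta_lr * (p_task - p)  # Reptile meta-update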
    Deep Inventory Management. (arXiv:2210.03137v2 [cs.LG] UPDATED)
    This work provides a Deep Reinforcement Learning approach to solving a periodic review inventory control system with stochastic vendor lead times, lost sales, correlated demand, and price matching. While this dynamic program has historically been considered intractable, our results show that several policy learning approaches are competitive with or outperform classical methods. In order to train these algorithms, we develop novel techniques to convert historical data into a simulator. On the theoretical side, we present learnability results on a subclass of inventory control problems, where we provide a provable reduction of the reinforcement learning problem to that of supervised learning. On the algorithmic side, we present a model-based reinforcement learning procedure (Direct Backprop) to solve the periodic review inventory control problem by constructing a differentiable simulator. Under a variety of metrics Direct Backprop outperforms model-free RL and newsvendor baselines, in both simulations and real-world deployments.
    A-Optimal Active Learning. (arXiv:2110.09585v2 [cs.LG] UPDATED)
    In this work we discuss the problem of active learning. We present an approach that is based on A-optimal experimental design of ill-posed problems and show how one can optimally label a data set by partially probing it, and use it to train a deep network. We present two approaches that make different assumptions on the data set. The first is based on a Bayesian interpretation of the semi-supervised learning problem, with the graph Laplacian used for the prior distribution, and the second is based on a frequentist approach that updates the estimation of the bias term based on the recovery of the labels. We demonstrate that this approach can be highly efficient for estimating labels and training a deep network.
    Efficient identification of informative features in simulation-based inference. (arXiv:2210.11915v2 [cs.LG] UPDATED)
    Simulation-based Bayesian inference (SBI) can be used to estimate the parameters of complex mechanistic models given observed model outputs without requiring access to explicit likelihood evaluations. A prime example for the application of SBI in neuroscience involves estimating the parameters governing the response dynamics of Hodgkin-Huxley (HH) models from electrophysiological measurements, by inferring a posterior over the parameters that is consistent with a set of observations. To this end, many SBI methods employ a set of summary statistics or scientifically interpretable features to estimate a surrogate likelihood or posterior. However, currently, there is no way to identify how much each summary statistic or feature contributes to reducing posterior uncertainty. To address this challenge, one could simply compare the posteriors with and without a given feature included in the inference process. However, for large or nested feature sets, this would necessitate repeatedly estimating the posterior, which is computationally expensive or even prohibitive. Here, we provide a more efficient approach based on the SBI method neural likelihood estimation (NLE): We show that one can marginalize the trained surrogate likelihood post-hoc before inferring the posterior to assess the contribution of a feature. We demonstrate the usefulness of our method by identifying the most important features for inferring parameters of an example HH neuron model. Beyond neuroscience, our method is generally applicable to SBI workflows that rely on data features for inference used in other scientific fields.
    Robustness Analysis of Deep Learning Models for Population Synthesis. (arXiv:2211.13339v1 [cs.LG])
    Deep generative models have become useful for synthetic data generation, particularly population synthesis. The models implicitly learn the probability distribution of a dataset and can draw samples from that distribution. Several models have been proposed, but their performance is typically tested on only a single cross-sectional sample. Implementing population synthesis on single datasets is a drawback that calls for further studies exploring the robustness of the models across multiple datasets. While comparison with real data can increase trust and interpretability of the models, techniques to evaluate deep generative models' robustness for population synthesis remain underexplored. In this study, we present bootstrap confidence intervals for deep generative models, an approach that computes efficient confidence intervals for mean prediction errors to evaluate the robustness of the models across multiple datasets. Specifically, we adopt the tabular-based Composite Travel Generative Adversarial Network (CTGAN) and Variational Autoencoder (VAE) to estimate the distribution of the population by generating agents with tabular data, using several samples over time from the same study area. The models are implemented on multiple travel diaries of the Montreal Origin-Destination Survey of 2008, 2013, and 2018, and we compare their predictive performance under varying sample sizes from multiple surveys. Results show that the predictive errors of CTGAN have narrower confidence intervals, indicating its robustness to multiple datasets of varying sample sizes when compared to VAE. Moreover, the evaluation of model robustness against varying sample size shows only a minimal decrease in model performance as the sample size decreases. This study directly supports agent-based modelling by enabling finer synthetic generation of populations in a reliable environment.
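    The percentile-bootstrap interval used above is simple to reproduce; the sketch below computes a 95% confidence interval for one model's mean error on synthetic error values (the gamma distribution is an arbitrary stand-in for real per-sample errors).

        # Percentile bootstrap CI for a model's mean prediction error.
        import numpy as np

        rng = np.random.default_rng(0)
        errors = rng.gamma(shape=2.0, scale=0.05, size=500)  # synthetic per-sample errors

        B = 2000
        boot_means = np.array([
            rng.choice(errors, size=errors.size, replace=True).mean()
            for _ in range(B)
        ])
        lo, hi = np.percentile(boot_means, [2.5, 97.5])
        print(f"mean error {errors.mean():.4f}, 95% CI [{lo:.4f}, {hi:.4f}]")
        # Narrower intervals across survey years / sample sizes would indicate a
        # more robust generative model, as in the CTGAN-vs-VAE comparison above.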
    Explainable and Safe Reinforcement Learning for Autonomous Air Mobility. (arXiv:2211.13474v1 [cs.LG])
    Increasing traffic demands, higher levels of automation, and communication enhancements provide novel design opportunities for future air traffic controllers (ATCs). This article presents a novel deep reinforcement learning (DRL) controller to aid conflict resolution for autonomous free flight. Although DRL has achieved important advancements in this field, existing works pay little attention to the explainability and safety issues related to DRL controllers, particularly safety under adversarial attacks. To address these two issues, we design a fully explainable DRL framework wherein we: 1) decompose the coupled Q value learning model into a safety-awareness model and an efficiency (reach the target) model; and 2) use information from surrounding intruders as inputs, eliminating the need for central controllers. In our simulated experiments, we show that by decoupling safety-awareness and efficiency, we can exceed performance on free flight control tasks while dramatically improving explainability in practice. In addition, the safety Q learning module provides rich information about the safety situation of the environment. To study safety under adversarial attacks, we additionally propose an adversarial attack strategy that can impose both safety-oriented and efficiency-oriented attacks. The adversary aims to minimize safety/efficiency by attacking the agent at only a few time steps. In the experiments, our attack strategy causes as many collisions as a uniform attack (i.e., attacking at every time step) while attacking the agent four times less often, which provides insights into the capabilities and restrictions of DRL in future ATC designs. The source code is publicly available at https://github.com/WLeiiiii/Gym-ATC-Attack-Project.
    Few-Shot Audio-Visual Learning of Environment Acoustics. (arXiv:2206.04006v2 [cs.SD] UPDATED)
    Room impulse response (RIR) functions capture how the surrounding physical environment transforms the sounds heard by a listener, with implications for various applications in AR, VR, and robotics. Whereas traditional methods to estimate RIRs assume dense geometry and/or sound measurements throughout the environment, we explore how to infer RIRs based on a sparse set of images and echoes observed in the space. Towards that goal, we introduce a transformer-based method that uses self-attention to build a rich acoustic context, then predicts RIRs of arbitrary query source-receiver locations through cross-attention. Additionally, we design a novel training objective that improves the match in the acoustic signature between the RIR predictions and the targets. In experiments using a state-of-the-art audio-visual simulator for 3D environments, we demonstrate that our method successfully generates arbitrary RIRs, outperforming state-of-the-art methods and -- in a major departure from traditional methods -- generalizing to novel environments in a few-shot manner. Project: this http URL
    CLIP-PAE: Projection-Augmentation Embedding to Extract Relevant Features for a Disentangled, Interpretable, and Controllable Text-Guided Image Manipulation. (arXiv:2210.03919v3 [cs.CV] UPDATED)
    Recently introduced Contrastive Language-Image Pre-Training (CLIP) bridges images and text by embedding them into a joint latent space. This opens the door to ample literature that aims to manipulate an input image by providing a textual explanation. However, due to the discrepancy between image and text embeddings in the joint space, using text embeddings as the optimization target often introduces undesired artifacts in the resulting images. Disentanglement, interpretability, and controllability are also hard to guarantee for manipulation. To alleviate these problems, we propose to define corpus subspaces spanned by relevant prompts to capture specific image characteristics. We introduce CLIP Projection-Augmentation Embedding (PAE) as an optimization target to improve the performance of text-guided image manipulation. Our method is a simple and general paradigm that can be easily computed and adapted, and smoothly incorporated into any CLIP-based image manipulation algorithm. To demonstrate the effectiveness of our method, we conduct several theoretical and empirical studies. As a case study, we utilize the method for text-guided semantic face editing. We quantitatively and qualitatively demonstrate that PAE facilitates a more disentangled, interpretable, and controllable image manipulation with state-of-the-art quality and accuracy.
    A comparison of latent semantic analysis and correspondence analysis of document-term matrices. (arXiv:2108.06197v4 [cs.IR] UPDATED)
    Latent semantic analysis (LSA) and correspondence analysis (CA) are two techniques that use a singular value decomposition (SVD) for dimensionality reduction. LSA has been extensively used to obtain low-dimensional representations that capture relationships among documents and terms. In this article, we present a theoretical analysis and comparison of the two techniques in the context of document-term matrices. We show that CA has some attractive properties as compared to LSA, for instance that effects of margins, i.e. sums of row elements and column elements, arising from differing document-lengths and term-frequencies are effectively eliminated, so that the CA solution is optimally suited to focus on relationships among documents and terms. A unifying framework is proposed that includes both CA and LSA as special cases. We empirically compare CA to various LSA based methods on text categorization in English and authorship attribution on historical Dutch texts, and find that CA performs significantly better. We also apply CA to a long-standing question regarding the authorship of the Dutch national anthem Wilhelmus and provide further support that it can be attributed to the author Datheen, amongst several contenders.
    SPICE, A Dataset of Drug-like Molecules and Peptides for Training Machine Learning Potentials. (arXiv:2209.10702v2 [physics.chem-ph] UPDATED)
    Machine learning potentials are an important tool for molecular simulation, but their development is held back by a shortage of high quality datasets to train them on. We describe the SPICE dataset, a new quantum chemistry dataset for training potentials relevant to simulating drug-like small molecules interacting with proteins. It contains over 1.1 million conformations for a diverse set of small molecules, dimers, dipeptides, and solvated amino acids. It includes 15 elements, charged and uncharged molecules, and a wide range of covalent and non-covalent interactions. It provides both forces and energies calculated at the {\omega}B97M-D3(BJ)/def2-TZVPPD level of theory, along with other useful quantities such as multipole moments and bond orders. We train a set of machine learning potentials on it and demonstrate that they can achieve chemical accuracy across a broad region of chemical space. It can serve as a valuable resource for the creation of transferable, ready to use potential functions for use in molecular simulations.
    Distributed representations of graphs for drug pair scoring. (arXiv:2209.09383v2 [cs.LG] UPDATED)
    In this paper we study the practicality and usefulness of incorporating distributed representations of graphs into models within the context of drug pair scoring. We argue that the real world growth and update cycles of drug pair scoring datasets subvert the limitations of transductive learning associated with distributed representations. Furthermore, we argue that the vocabulary of discrete substructure patterns induced over drug sets is not dramatically large due to the limited set of atom types and constraints on bonding patterns enforced by chemistry. Under this pretext, we explore the effectiveness of distributed representations of the molecular graphs of drugs in drug pair scoring tasks such as drug synergy, polypharmacy, and drug-drug interaction prediction. To achieve this, we present a methodology for learning and incorporating distributed representations of graphs within a unified framework for drug pair scoring. Subsequently, we augment a number of recent and state-of-the-art models to utilise our embeddings. We empirically show that the incorporation of these embeddings improves downstream performance of almost every model across different drug pair scoring tasks, even those the original model was not designed for. We publicly release all of our drug embeddings for the DrugCombDB, DrugComb, DrugbankDDI, and TwoSides datasets.
    Blackbox Attacks via Surrogate Ensemble Search. (arXiv:2208.03610v2 [cs.LG] UPDATED)
    Blackbox adversarial attacks can be categorized into transfer- and query-based attacks. Transfer methods do not require any feedback from the victim model, but provide lower success rates compared to query-based methods. Query attacks often require a large number of queries for success. To achieve the best of both approaches, recent efforts have tried to combine them, but still require hundreds of queries to achieve high success rates (especially for targeted attacks). In this paper, we propose a novel method for Blackbox Attacks via Surrogate Ensemble Search (BASES) that can generate highly successful blackbox attacks using an extremely small number of queries. We first define a perturbation machine that generates a perturbed image by minimizing a weighted loss function over a fixed set of surrogate models. To generate an attack for a given victim model, we search over the weights in the loss function using queries generated by the perturbation machine. Since the dimension of the search space is small (the same as the number of surrogate models), the search requires a small number of queries. We demonstrate that our proposed method achieves a better success rate with at least 30x fewer queries compared to state-of-the-art methods on different image classifiers trained with ImageNet. In particular, our method requires as few as 3 queries per image (on average) to achieve more than a 90% success rate for targeted attacks and 1-2 queries per image for over a 99% success rate for untargeted attacks. Our method is also effective on the Google Cloud Vision API, achieving a 91% untargeted attack success rate with 2.9 queries per image. We also show that the perturbations generated by our proposed method are highly transferable and can be adopted for hard-label blackbox attacks. Finally, we show the effectiveness of BASES for hiding attacks on object detectors.
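    The division of labor in BASES can be sketched compactly: a perturbation machine minimizes a weighted loss over surrogate models, and the scarce victim queries are spent searching over the weight vector. Everything below (toy linear models, the candidate weight list, step sizes) is an illustrative simplification of the paper's search.

        # Simplified BASES-style attack: search over surrogate weights, not pixels.
        import torch

        surrogates = [torch.nn.Linear(20, 5) for _ in range(3)]
        victim = torch.nn.Linear(20, 5)
        x = torch.randn(1, 20)
        target = torch.tensor([2])  # targeted attack class

        def perturb(w, steps=50, eps=0.5, lr=0.1):
            # Perturbation machine: minimize the w-weighted surrogate loss.
            delta = torch.zeros_like(x, requires_grad=True)
            for _ in range(steps):
                loss = sum(wi * torch.nn.functional.cross_entropy(m(x + delta), target)
                           for wi, m in zip(w, surrogates))
                loss.backward()
                with torch.no_grad():
                    delta -= lr * delta.grad.sign()
                    delta.clamp_(-eps, eps)  # keep the perturbation bounded
                delta.grad.zero_()
            return (x + delta).detach()

        def query_victim(adv):  # one query: did the targeted attack succeed?
            return victim(adv).argmax(dim=1).item() == target.item()

        candidates = [torch.ones(3) / 3] + [torch.eye(3)[i] for i in range(3)]
        for w in candidates:  # crude stand-in for the paper's weight search
            if query_victim(perturb(w)):
                print("success with weights", w.tolist())
                break
        else:
            print("no success within the query budget")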
    Network Security Modelling with Distributional Data. (arXiv:2211.13419v1 [cs.CR])
    We investigate the detection of botnet command and control (C2) hosts in massive IP traffic using machine learning methods. To this end, we use NetFlow data -- the industry standard for monitoring IP traffic -- and ML models using two sets of features: conventional NetFlow variables and distributional features based on NetFlow variables. In addition to using static summaries of NetFlow features, we use quantiles of their IP-level distributions as input features in predictive models to predict whether an IP belongs to known botnet families. These models are used to develop intrusion detection systems to predict traffic traces identified with malicious attacks. The results are validated by matching predictions to existing denylists of published malicious IP addresses and deep packet inspection. The usage of our proposed novel distributional features, combined with techniques that enable modelling complex input feature spaces, results in highly accurate predictions by our trained models.
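    The distributional features described above reduce to per-IP quantiles of flow-level variables. The sketch below builds them with pandas; the column names, quantile grid, and synthetic flows are illustrative choices, not the paper's schema.

        # Per-IP quantile features from (synthetic) NetFlow records.
        import numpy as np
        import pandas as pd

        rng = np.random.default_rng(0)
        flows = pd.DataFrame({
            "src_ip": rng.choice(["10.0.0.1", "10.0.0.2", "10.0.0.3"], size=1000),
            "bytes": rng.lognormal(8, 2, size=1000),
            "duration": rng.exponential(3.0, size=1000),
        })

        qs = [0.1, 0.25, 0.5, 0.75, 0.9]
        features = (
            flows.groupby("src_ip")[["bytes", "duration"]]
            .quantile(qs)
            .unstack()  # one row per IP, one column per (variable, quantile)
        )
        features.columns = [f"{var}_q{int(q * 100)}" for var, q in features.columns]
        print(features.head())
        # These rows become inputs to a classifier predicting botnet C2 membership.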
    Biologically-Plausible Determinant Maximization Neural Networks for Blind Separation of Correlated Sources. (arXiv:2209.12894v2 [eess.SP] UPDATED)
    Extraction of latent sources of complex stimuli is critical for making sense of the world. While the brain solves this blind source separation (BSS) problem continuously, its algorithms remain unknown. Previous work on biologically-plausible BSS algorithms assumed that observed signals are linear mixtures of statistically independent or uncorrelated sources, limiting the domain of applicability of these algorithms. To overcome this limitation, we propose novel biologically-plausible neural networks for the blind separation of potentially dependent/correlated sources. Differing from previous work, we assume some general geometric, not statistical, conditions on the source vectors allowing separation of potentially dependent/correlated sources. Concretely, we assume that the source vectors are sufficiently scattered in their domains which can be described by certain polytopes. Then, we consider recovery of these sources by the Det-Max criterion, which maximizes the determinant of the output correlation matrix to enforce a similar spread for the source estimates. Starting from this normative principle, and using a weighted similarity matching approach that enables arbitrary linear transformations adaptable by local learning rules, we derive two-layer biologically-plausible neural network algorithms that can separate mixtures into sources coming from a variety of source domains. We demonstrate that our algorithms outperform other biologically-plausible BSS algorithms on correlated source separation problems.
    Synthetic Dataset Generation for Privacy-Preserving Machine Learning. (arXiv:2210.03205v3 [cs.CR] UPDATED)
    Machine Learning (ML) has achieved enormous success in solving a variety of problems in computer vision, speech recognition, and object detection, to name a few. The principal reason for this success is the availability of huge datasets for training deep neural networks (DNNs). However, datasets cannot be publicly released if they contain sensitive information such as medical records, so data privacy becomes a major concern. Encryption methods could be a possible solution; however, their deployment on ML applications seriously impacts classification accuracy and results in substantial computational overhead. Alternatively, obfuscation techniques could be used, but maintaining a good trade-off between visual privacy and accuracy is challenging. In this paper, we propose a method to generate secure synthetic datasets from original private datasets. Given a network with Batch Normalization (BN) layers pretrained on the original dataset, we first record the class-wise BN layer statistics. Next, we generate the synthetic dataset by optimizing random noise such that the synthetic data match the layer-wise statistical distribution of the original images. We evaluate our method on image classification datasets (CIFAR10, ImageNet) and show that synthetic data can be used in place of the original CIFAR10/ImageNet data for training networks from scratch, producing comparable classification performance. Further, to analyze the visual privacy provided by our method, we use Image Quality Metrics and show a high degree of visual dissimilarity between the original and synthetic images. Moreover, we show that our proposed method preserves data privacy under various privacy-leakage attacks, including Gradient Matching Attack, Model Memorization Attack, and GAN-based Attack.
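    The core statistics-matching step can be sketched with forward hooks: optimize a batch of noise so that its activation statistics at every BN layer match that layer's recorded running statistics. The tiny network and loss weighting below are stand-ins, and the class-wise bookkeeping of the paper is omitted.

        # Optimize random noise to match BN running statistics (simplified).
        import torch
        import torch.nn as nn

        net = nn.Sequential(
            nn.Conv2d(3, 8, 3, padding=1), nn.BatchNorm2d(8), nn.ReLU(),
            nn.Conv2d(8, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU(),
        ).eval()  # pretrained in the real setting; running stats are the targets

        stats_loss = []

        def bn_hook(module, inputs, output):
            a = inputs[0]
            mu = a.mean(dim=(0, 2, 3))
            var = a.var(dim=(0, 2, 3), unbiased=False)
            stats_loss.append(((mu - module.running_mean) ** 2).sum()
                              + ((var - module.running_var) ** 2).sum())

        for m in net.modules():
            if isinstance(m, nn.BatchNorm2d):
                m.register_forward_hook(bn_hook)

        x = torch.randn(16, 3, 32, 32, requires_grad=True)  # the synthetic batch
        opt = torch.optim.Adam([x], lr=0.05)
        for step in range(100):
            stats_loss.clear()
            net(x)
            loss = sum(stats_loss)
            opt.zero_grad(); loss.backward(); opt.step()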
    AfroLM: A Self-Active Learning-based Multilingual Pretrained Language Model for 23 African Languages. (arXiv:2211.03263v2 [cs.CL] UPDATED)
    In recent years, multilingual pre-trained language models have gained prominence due to their remarkable performance on numerous downstream Natural Language Processing tasks (NLP). However, pre-training these large multilingual language models requires a lot of training data, which is not available for African Languages. Active learning is a semi-supervised learning algorithm, in which a model consistently and dynamically learns to identify the most beneficial samples to train itself on, in order to achieve better optimization and performance on downstream tasks. Furthermore, active learning effectively and practically addresses real-world data scarcity. Despite all its benefits, active learning, in the context of NLP and especially multilingual language models pretraining, has received little consideration. In this paper, we present AfroLM, a multilingual language model pretrained from scratch on 23 African languages (the largest effort to date) using our novel self-active learning framework. Pretrained on a dataset significantly (14x) smaller than existing baselines, AfroLM outperforms many multilingual pretrained language models (AfriBERTa, XLMR-base, mBERT) on various NLP downstream tasks (NER, text classification, and sentiment analysis). Additional out-of-domain sentiment analysis experiments show that \textbf{AfroLM} is able to generalize well across various domains. We release the code source, and our datasets used in our framework at https://github.com/bonaventuredossou/MLM_AL.
    Skill-Based Reinforcement Learning with Intrinsic Reward Matching. (arXiv:2210.07426v3 [cs.LG] UPDATED)
    While unsupervised skill discovery has shown promise in autonomously acquiring behavioral primitives, there is still a large methodological disconnect between task-agnostic skill pretraining and downstream, task-aware finetuning. We present Intrinsic Reward Matching (IRM), which unifies these two phases of learning via the $\textit{skill discriminator}$, a pretraining model component often discarded during finetuning. Conventional approaches finetune pretrained agents directly at the policy level, often relying on expensive environment rollouts to empirically determine the optimal skill. However, often the most concise yet complete description of a task is the reward function itself, and skill learning methods learn an $\textit{intrinsic}$ reward function via the discriminator that corresponds to the skill policy. We propose to leverage the skill discriminator to $\textit{match}$ the intrinsic and downstream task rewards and determine the optimal skill for an unseen task without environment samples, consequently finetuning with greater sample-efficiency. Furthermore, we generalize IRM to sequence skills and solve more complex, long-horizon tasks. We demonstrate that IRM enables us to utilize pretrained skills far more effectively than previous skill selection methods on the Unsupervised Reinforcement Learning Benchmark and on challenging tabletop manipulation tasks.
    Statistical Optimality of Divide and Conquer Kernel-based Functional Linear Regression. (arXiv:2211.10968v2 [stat.ML] UPDATED)
    Previous analysis of regularized functional linear regression in a reproducing kernel Hilbert space (RKHS) typically requires the target function to be contained in this kernel space. This paper studies the convergence performance of divide-and-conquer estimators in the scenario where the target function does not necessarily reside in the underlying RKHS. As a decomposition-based scalable approach, the divide-and-conquer estimators of functional linear regression can substantially reduce the algorithmic complexities in time and memory. We develop an integral operator approach to establish sharp finite sample upper bounds for prediction with divide-and-conquer estimators under various regularity conditions of explanatory variables and target function. We also prove the asymptotic optimality of the derived rates by establishing the minimax lower bounds. Finally, we consider the convergence of noiseless estimators and show that the rates can be arbitrarily fast under mild conditions.
    Analysis of (sub-)Riemannian PDE-G-CNNs. (arXiv:2210.00935v3 [cs.LG] UPDATED)
    Group equivariant convolutional neural networks (G-CNNs) have been successfully applied in geometric deep learning. Typically, G-CNNs have the advantage over CNNs that they do not waste network capacity on training symmetries that should have been hard-coded in the network. The recently introduced framework of PDE-based G-CNNs (PDE-G-CNNs) generalises G-CNNs. PDE-G-CNNs have the core advantages that they simultaneously 1) reduce network complexity, 2) increase classification performance, and 3) provide geometric interpretability. Their implementations primarily consist of linear and morphological convolutions with kernels. In this paper we show that the previously suggested approximative morphological kernels do not always accurately approximate the exact kernels. More specifically, depending on the spatial anisotropy of the Riemannian metric, we argue that one must resort to sub-Riemannian approximations. We solve this problem by providing a new approximative kernel that works regardless of the anisotropy. We provide new theorems with better error estimates of the approximative kernels, and prove that they all carry the same reflectional symmetries as the exact ones. We test the effectiveness of multiple approximative kernels within the PDE-G-CNN framework on two datasets, and observe an improvement with the new approximative kernels. We report that the PDE-G-CNNs again allow for a considerable reduction of network complexity while having comparable or better performance than G-CNNs and CNNs on the two datasets. Moreover, PDE-G-CNNs have the advantage of better geometric interpretability over G-CNNs, as the morphological kernels are related to association fields from neurogeometry.
    Unveiling the Sampling Density in Non-Uniform Geometric Graphs. (arXiv:2210.08219v3 [cs.LG] UPDATED)
    A powerful framework for studying graphs is to consider them as geometric graphs: nodes are randomly sampled from an underlying metric space, and any pair of nodes is connected if their distance is less than a specified neighborhood radius. Currently, the literature mostly focuses on uniform sampling and constant neighborhood radius. However, real-world graphs are likely to be better represented by a model in which the sampling density and the neighborhood radius can both vary over the latent space. For instance, in a social network communities can be modeled as densely sampled areas, and hubs as nodes with larger neighborhood radius. In this work, we first perform a rigorous mathematical analysis of this (more general) class of models, including derivations of the resulting graph shift operators. The key insight is that graph shift operators should be corrected in order to avoid potential distortions introduced by the non-uniform sampling. Then, we develop methods to estimate the unknown sampling density in a self-supervised fashion. Finally, we present exemplary applications in which the learnt density is used to 1) correct the graph shift operator and improve performance on a variety of tasks, 2) improve pooling, and 3) extract knowledge from networks. Our experimental findings support our theory and provide strong evidence for our model.
    A Fully Time-domain Neural Model for Subband-based Speech Synthesizer. (arXiv:1810.05319v2 [eess.AS] CROSS LISTED)
    This paper introduces a deep neural network model for a subband-based speech synthesizer. The model benefits from the short bandwidth of the subband signals to reduce the complexity of the time-domain speech generator. We employ multi-level wavelet analysis/synthesis to decompose/reconstruct the signal into subbands in the time domain. Inspired by WaveNet, a convolutional neural network (CNN) model predicts subband speech signals fully in the time domain. Due to the short bandwidth of the subbands, a simple network architecture is enough to learn the simple patterns of the subbands accurately. In ground truth experiments with teacher forcing, the subband synthesizer outperforms the fullband model significantly in terms of both subjective and objective measures. In addition, by conditioning the model on the phoneme sequence using a pronunciation dictionary, we achieve a fully time-domain neural model for a subband-based text-to-speech (TTS) synthesizer, which is nearly end-to-end. The generated speech of the subband TTS shows quality comparable to the fullband one, with a lighter network architecture for each subband.
    Tracking Dataset IP Use in Deep Neural Networks. (arXiv:2211.13535v1 [cs.CR])
    Training highly performant deep neural networks (DNNs) typically requires the collection of a massive dataset and the use of powerful computing resources. Therefore, unauthorized redistribution of private pre-trained DNNs may cause severe economic loss for model owners. For protecting the ownership of DNN models, DNN watermarking schemes have been proposed that embed secret information in a DNN model and verify its presence to establish model ownership. However, existing DNN watermarking schemes compromise the model utility and are vulnerable to watermark removal attacks because the model is modified with a watermark. Alternatively, a new approach dubbed DEEPJUDGE was introduced to measure the similarity between a suspect model and a victim model without modifying the victim model. However, DEEPJUDGE is designed only to detect the case where a suspect model's architecture is the same as the victim model's. In this work, we propose a novel DNN fingerprinting technique dubbed DEEPTASTER to counter a new attack scenario in which a victim's data is stolen to build a suspect model. DEEPTASTER can effectively detect such data theft attacks even when a suspect model's architecture differs from the victim model's. To achieve this goal, DEEPTASTER generates a few adversarial images with perturbations, transforms them into the Fourier frequency domain, and uses the transformed images to identify the dataset used in a suspect model. The intuition is that these adversarial images can capture the characteristics of DNNs built on a specific dataset. We evaluated the detection accuracy of DEEPTASTER on three datasets with three model architectures under various attack scenarios, including transfer learning, pruning, fine-tuning, and data augmentation. Overall, DEEPTASTER achieves a balanced accuracy of 94.95%, which is significantly better than the 61.11% achieved by DEEPJUDGE in the same settings.
    Prompt Conditioned VAE: Enhancing Generative Replay for Lifelong Learning in Task-Oriented Dialogue. (arXiv:2210.07783v2 [cs.CL] UPDATED)
    Lifelong learning (LL) is vital for advanced task-oriented dialogue (ToD) systems. To address the catastrophic forgetting issue of LL, generative replay methods are widely employed to consolidate past knowledge with generated pseudo samples. However, most existing generative replay methods use only a single task-specific token to control their models. This scheme is usually not strong enough to constrain the generative model due to insufficient information involved. In this paper, we propose a novel method, prompt conditioned VAE for lifelong learning (PCLL), to enhance generative replay by incorporating tasks' statistics. PCLL captures task-specific distributions with a conditional variational autoencoder, conditioned on natural language prompts to guide the pseudo-sample generation. Moreover, it leverages a distillation process to further consolidate past knowledge by alleviating the noise in pseudo samples. Experiments on natural language understanding tasks of ToD systems demonstrate that PCLL significantly outperforms competitive baselines in building LL models.
    Learning Temporally Extended Skills in Continuous Domains as Symbolic Actions for Planning. (arXiv:2207.05018v2 [cs.LG] UPDATED)
    Problems which require both long-horizon planning and continuous control capabilities pose significant challenges to existing reinforcement learning agents. In this paper we introduce a novel hierarchical reinforcement learning agent which links temporally extended skills for continuous control with a forward model in a symbolic discrete abstraction of the environment's state for planning. We term our agent SEADS for Symbolic Effect-Aware Diverse Skills. We formulate an objective and corresponding algorithm which leads to unsupervised learning of a diverse set of skills through intrinsic motivation given a known state abstraction. The skills are jointly learned with the symbolic forward model which captures the effect of skill execution in the state abstraction. After training, we can leverage the skills as symbolic actions using the forward model for long-horizon planning and subsequently execute the plan using the learned continuous-action control skills. The proposed algorithm learns skills and forward models that can be used to solve complex tasks which require both continuous control and long-horizon planning capabilities with high success rate. It compares favorably with other flat and hierarchical reinforcement learning baseline agents and is successfully demonstrated with a real robot.
    Towards Good Practices for Missing Modality Robust Action Recognition. (arXiv:2211.13916v1 [cs.CV])
    Standard multi-modal models assume the use of the same modalities in the training and inference stages. However, in practice, the environment in which multi-modal models operate may not satisfy such an assumption. As such, their performance degrades drastically if any modality is missing at the inference stage. We ask: how can we train a model that is robust to missing modalities? This paper seeks a set of good practices for multi-modal action recognition, with a particular interest in circumstances where some modalities are not available at inference time. First, we study how to effectively regularize the model during training (e.g., data augmentation). Second, we investigate fusion methods for robustness to missing modalities: we find that transformer-based fusion shows better robustness to a missing modality than summation or concatenation. Third, we propose a simple modular network, ActionMAE, which learns missing-modality predictive coding by randomly dropping modality features and trying to reconstruct them from the remaining modality features. Coupling these good practices, we build a model that is not only effective in multi-modal action recognition but also robust to missing modalities. Our model achieves state-of-the-art results on multiple benchmarks and maintains competitive performance even in missing-modality scenarios. Codes are available at https://github.com/sangminwoo/ActionMAE.
    Is Reinforcement Learning (Not) for Natural Language Processing?: Benchmarks, Baselines, and Building Blocks for Natural Language Policy Optimization. (arXiv:2210.01241v2 [cs.CL] UPDATED)
    We tackle the problem of aligning pre-trained large language models (LMs) with human preferences. If we view text generation as a sequential decision-making problem, reinforcement learning (RL) appears to be a natural conceptual framework. However, using RL for LM-based generation faces empirical challenges, including training instability due to the combinatorial action space, as well as a lack of open-source libraries and benchmarks customized for LM alignment. Thus, a question arises in the research community: is RL a practical paradigm for NLP? To help answer this, we first introduce an open-source modular library, RL4LMs (Reinforcement Learning for Language Models), for optimizing language generators with RL. The library consists of on-policy RL algorithms that can be used to train any encoder or encoder-decoder LM in the HuggingFace library (Wolf et al. 2020) with an arbitrary reward function. Next, we present the GRUE (General Reinforced-language Understanding Evaluation) benchmark, a set of 6 language generation tasks which are supervised not by target strings, but by reward functions which capture automated measures of human preference. GRUE is the first leaderboard-style evaluation of RL algorithms for NLP tasks. Finally, we introduce an easy-to-use, performant RL algorithm, NLPO (Natural Language Policy Optimization), that learns to effectively reduce the combinatorial action space in language generation. We show 1) that RL techniques are generally better than supervised methods at aligning LMs to human preferences; and 2) that NLPO exhibits greater stability and performance than previous policy gradient methods (e.g., PPO (Schulman et al. 2017)), based on both automatic and human evaluation.
    AVCAffe: A Large Scale Audio-Visual Dataset of Cognitive Load and Affect for Remote Work. (arXiv:2205.06887v2 [cs.HC] UPDATED)
    We introduce AVCAffe, the first audio-visual dataset annotated with cognitive load and affect attributes. We record AVCAffe by simulating remote work scenarios over a video-conferencing platform, where subjects collaborate to complete a number of cognitively engaging tasks. AVCAffe is the largest originally collected (i.e., not scraped from the Internet) affective dataset in the English language. We recruit 106 participants from 18 different countries of origin, spanning an age range of 18 to 57 years, with a balanced male-female ratio. AVCAffe comprises a total of 108 hours of video, equivalent to more than 58,000 clips, along with task-based self-reported ground truth labels for arousal, valence, and cognitive load attributes such as mental demand, temporal demand, effort, and a few others. We believe AVCAffe will be a challenging benchmark for the deep learning research community, given the inherent difficulty of classifying affect and cognitive load in particular. Moreover, our dataset fills an existing timely gap by facilitating the creation of learning systems for better self-management of remote work meetings, and further study of hypotheses regarding the impact of remote work on cognitive load and affective states.
    Deep Learning on a Healthy Data Diet: Finding Important Examples for Fairness. (arXiv:2211.11109v2 [cs.CL] UPDATED)
    Data-driven predictive solutions predominant in commercial applications tend to suffer from biases and stereotypes, which raises equity concerns. Prediction models may discover, use, or amplify spurious correlations based on gender or other protected personal characteristics, thus discriminating against marginalized groups. Mitigating gender bias has become an important research focus in natural language processing (NLP) and is an area where annotated corpora are available. Data augmentation reduces gender bias by adding counterfactual examples to the training dataset. In this work, we show that some of the examples in the augmented dataset can be unimportant or even harmful for fairness. We hence propose a general method for pruning both the factual and counterfactual examples to maximize the model's fairness as measured by demographic parity, equality of opportunity, and equality of odds. The fairness achieved by our method surpasses that of data augmentation on three text classification datasets, using no more than half of the examples in the augmented dataset. Our experiments are conducted using models of varying sizes and pre-training settings.
    Asymptotic Properties for Bayesian Neural Network in Besov Space. (arXiv:2206.00241v3 [stat.ML] UPDATED)
    Neural networks have shown great predictive power when dealing with various unstructured data such as images and natural languages. The Bayesian neural network captures the uncertainty of prediction by placing a prior distribution on the model parameters and computing the posterior distribution. In this paper, we show that a Bayesian neural network using a spike-and-slab prior achieves consistency with a nearly minimax convergence rate when the true regression function is in a Besov space. Even when the smoothness of the regression function is unknown, the same posterior convergence rate holds; thus the spike-and-slab prior is adaptive to the smoothness of the regression function. We also consider the shrinkage prior, which is more feasible than other priors, and show that it attains the same convergence rate. In other words, we propose a practical Bayesian neural network with guaranteed asymptotic properties.
    Knowledge-Aware Federated Active Learning with Non-IID Data. (arXiv:2211.13579v1 [cs.LG])
    Federated learning enables multiple decentralized clients to learn collaboratively without sharing their local training data. However, the expensive annotation cost of acquiring data labels on local clients remains an obstacle to utilizing local data. In this paper, we propose a federated active learning paradigm to efficiently learn a global model with a limited annotation budget while protecting data privacy in a decentralized manner. The main challenge faced by federated active learning is the mismatch between the active sampling goal of the global model on the server and that of the asynchronous local clients. This becomes even more significant when data is distributed non-IID across local clients. To address this challenge, we propose Knowledge-Aware Federated Active Learning (KAFAL), which consists of Knowledge-Specialized Active Sampling (KSAS) and Knowledge-Compensatory Federated Update (KCFU). KSAS is a novel active sampling method tailored to the federated active learning problem. It deals with the mismatch challenge by sampling actively based on the discrepancies between local and global models, as sketched below. KSAS intensifies specialized knowledge in local clients, ensuring that the sampled data are informative for both the local clients and the global model. KCFU, in the meantime, deals with the client heterogeneity caused by limited data and non-IID data distributions: it compensates for each client's weak classes with the assistance of the global model. Extensive experiments and analyses show the superiority of KSAS over state-of-the-art active learning methods and the efficiency of KCFU under the federated active learning framework.
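    One plausible instantiation of the discrepancy-based sampling idea (not necessarily KSAS's exact scoring rule) is to score each unlabelled sample by the KL divergence between the local and global models' predictive distributions and request labels for the most-disagreed-upon samples:

```python
import numpy as np

def kl_div(p, q, eps=1e-12):
    """Row-wise KL divergence between two arrays of class distributions."""
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)

def discrepancy_sampling(local_probs, global_probs, budget):
    """Return indices of the `budget` samples where the local and global
    models disagree most, as candidates for annotation."""
    scores = kl_div(local_probs, global_probs)
    return np.argsort(-scores)[:budget]

# toy usage: 5 unlabelled samples, 3 classes, budget of 2
local_p = np.array([[.9, .05, .05], [.4, .3, .3], [.1, .8, .1],
                    [.3, .3, .4], [.2, .2, .6]])
global_p = np.array([[.3, .4, .3], [.4, .3, .3], [.1, .8, .1],
                     [.2, .6, .2], [.2, .2, .6]])
print(discrepancy_sampling(local_p, global_p, budget=2))  # -> [0 3]
```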
    COPER: Continuous Patient State Perceiver. (arXiv:2208.03196v2 [cs.LG] UPDATED)
    In electronic health records (EHRs), irregular time series (ITS) occur naturally due to patient health dynamics, reflected in irregular hospital visits, diseases/conditions, and the need to measure different vital signs at each visit. ITS present challenges for training machine learning algorithms, which are mostly built on the assumption of a coherent, fixed-dimensional feature space. In this paper, we propose a novel COntinuous patient state PERceiver model, called COPER, to cope with ITS in EHRs. COPER uses the Perceiver model and the concept of neural ordinary differential equations (ODEs) to learn the continuous-time dynamics of the patient state, i.e., continuity of the input space and continuity of the output space. The neural ODEs help COPER generate regular time series to feed to the Perceiver model, which has the capability to handle large-scale multi-modal inputs. To evaluate the performance of the proposed model, we use the in-hospital mortality prediction task on the MIMIC-III dataset and carefully design experiments to study irregularity. The results are compared with baselines, demonstrating the efficacy of the proposed model.
    Stress-Testing Point Cloud Registration on Automotive LiDAR. (arXiv:2204.07719v2 [cs.CV] UPDATED)
    Rigid Point Cloud Registration (PCR) algorithms aim to estimate the 6-DOF relative motion between two point clouds, which is important in various fields, including autonomous driving. Recent years have seen a significant improvement in global PCR algorithms, i.e. algorithms that can handle a large relative motion. This has been demonstrated in various scenarios, including indoor scenes, but has only been minimally tested in the Automotive setting, where point clouds are produced by vehicle-mounted LiDAR sensors. In this work, we aim to answer questions that are important for automotive applications, including: which of the new algorithms is the most accurate, and which is fastest? How transferable are deep-learning approaches, e.g. what happens when you train a network with data from Boston, and run it in a vehicle in Singapore? How small can the overlap between point clouds be before the algorithms start to deteriorate? To what extent are the algorithms rotation invariant? Our results are at times surprising. When comparing robust parameter estimation methods for registration, we find that the fastest and most accurate is not one of the newest approaches. Instead, it is a modern variant of the well known RANSAC technique. We also suggest a new outlier filtering method, Grid-Prioritized Filtering (GPF), to further improve it. An additional contribution of this work is an algorithm for selecting challenging sets of frame-pairs from automotive LiDAR datasets. This enables meaningful benchmarking in the Automotive LiDAR setting, and can also improve training for learning algorithms.
    PyTAIL: Interactive and Incremental Learning of NLP Models with Human in the Loop for Online Data. (arXiv:2211.13786v1 [cs.CL])
    Online data streams make training machine learning models hard because of distribution shift and new patterns emerging over time. For natural language processing (NLP) tasks that utilize a collection of features based on lexicons and rules, it is important to adapt these features to the changing data. To address this challenge we introduce PyTAIL, a Python library which enables a human-in-the-loop approach to actively train NLP models. PyTAIL enhances generic active learning, which only suggests new instances to label, by also suggesting new features like rules and lexicons to label. Furthermore, PyTAIL is flexible enough for users to accept, reject, or update rules and lexicons as the model is being trained. We simulate the performance of PyTAIL on existing social media benchmark datasets for text classification and compare various active learning strategies on these benchmarks; the model closes the gap with as few as 10% of the training data. Finally, we highlight the importance of tracking the evaluation metric on the remaining data (which is not yet merged with active learning) alongside the test dataset. This highlights the effectiveness of the model in accurately annotating the remaining dataset, which is especially suitable for batch processing of large unlabelled corpora. PyTAIL will be available at https://github.com/socialmediaie/pytail.
    Delving into Out-of-Distribution Detection with Vision-Language Representations. (arXiv:2211.13445v1 [cs.CV])
    Recognizing out-of-distribution (OOD) samples is critical for machine learning systems deployed in the open world. The vast majority of OOD detection methods are driven by a single modality (e.g., either vision or language), leaving the rich information in multi-modal representations untapped. Inspired by the recent success of vision-language pre-training, this paper enriches the landscape of OOD detection from a single-modal to a multi-modal regime. Particularly, we propose Maximum Concept Matching (MCM), a simple yet effective zero-shot OOD detection method based on aligning visual features with textual concepts. We contribute in-depth analysis and theoretical insights to understand the effectiveness of MCM. Extensive experiments demonstrate that MCM achieves superior performance on a wide variety of real-world tasks. MCM with vision-language features outperforms a common baseline with pure visual features on a hard OOD task with semantically similar classes by 13.1% (AUROC). Code is available at https://github.com/deeplearning-wisc/MCM.
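    The MCM score itself admits a compact sketch. Assuming precomputed, L2-normalised CLIP-style embeddings for the test image and for each in-distribution class prompt ("concept"), the score is the maximum softmax probability over temperature-scaled cosine similarities, and a low maximum flags the input as OOD. The temperature value and the random embeddings below are placeholders, not the paper's settings.

```python
import numpy as np

def mcm_score(image_feat, concept_feats, temperature=0.01):
    """
    Maximum Concept Matching: softmax over cosine similarities between
    a unit-norm image embedding and unit-norm concept embeddings; the
    maximum probability serves as the ID-ness score.
    """
    logits = concept_feats @ image_feat / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return probs.max()

def is_ood(image_feat, concept_feats, threshold, temperature=0.01):
    """Flag the input as OOD when no concept matches confidently."""
    return mcm_score(image_feat, concept_feats, temperature) < threshold

# toy usage with random unit vectors standing in for CLIP embeddings
rng = np.random.default_rng(0)
img = rng.normal(size=512); img /= np.linalg.norm(img)
concepts = rng.normal(size=(10, 512))
concepts /= np.linalg.norm(concepts, axis=1, keepdims=True)
print(mcm_score(img, concepts))
```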
    Balanced Product of Calibrated Experts for Long-Tailed Recognition. (arXiv:2206.05260v2 [cs.CV] UPDATED)
    Many real-world recognition problems are characterized by long-tailed label distributions. These distributions make representation learning highly challenging due to limited generalization over the tail classes. If the test distribution differs from the training distribution, e.g. uniform versus long-tailed, the problem of the distribution-shift needs to be addressed. A recent line of work proposes learning multiple diverse experts to tackle this issue. Ensemble diversity is encouraged by various techniques, e.g. by specializing different experts on the head and the tail classes. In this work, we take an analytical approach, and extend the notion of logit adjustment to ensembles to form a Balanced Product of Experts (BalPoE). BalPoE generalizes several previous approaches, and combines a family of experts with different test-time target distributions. We show how to properly define these distributions and combine the experts in order to achieve unbiased predictions, by proving that the ensemble is Fisher-consistent for minimizing the balanced error. Our theoretical analysis shows that our balanced ensemble requires calibrated experts, which we achieve in practice using mixup. We conduct extensive experiments and our method obtains new state-of-the-art results on three long-tailed datasets: CIFAR-100-LT, ImageNet-LT and iNaturalist-2018. Our code will be released upon paper acceptance.
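    As a hedged sketch of how logit adjustment extends from a single model to an ensemble, the snippet below shifts each expert's logits by a multiple of the log class prior, so that experts aimed at different test-time target distributions are debiased before averaging. It illustrates only the combination rule; the paper's training objective, Fisher-consistency proof, and mixup-based calibration are not reproduced, and the lambda values are hypothetical.

```python
import numpy as np

def balanced_ensemble_logits(expert_logits, class_prior, lambdas):
    """
    Combine experts with different target distributions: expert k's
    logits are shifted by -lambda_k * log(prior), removing (part of)
    its bias towards head classes, and the adjusted logits are averaged.
    lambda = 0 keeps the long-tailed bias; lambda = 1 targets a uniform
    distribution; lambda > 1 over-corrects towards the tail.
    """
    log_prior = np.log(class_prior)
    adjusted = [z - lam * log_prior for z, lam in zip(expert_logits, lambdas)]
    return np.mean(adjusted, axis=0)

# toy usage: 3 experts over 4 classes with a long-tailed prior
rng = np.random.default_rng(0)
prior = np.array([0.70, 0.20, 0.07, 0.03])
experts = [rng.normal(size=4) for _ in range(3)]
print(balanced_ensemble_logits(experts, prior, lambdas=[0.0, 1.0, 2.0]))
```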
    Interpretable by Design: Learning Predictors by Composing Interpretable Queries. (arXiv:2207.00938v2 [cs.CV] UPDATED)
    There is a growing concern about typically opaque decision-making with high-performance machine learning algorithms. Providing an explanation of the reasoning process in domain-specific terms can be crucial for adoption in risk-sensitive domains such as healthcare. We argue that machine learning algorithms should be interpretable by design and that the language in which these interpretations are expressed should be domain- and task-dependent. Consequently, we base our model's prediction on a family of user-defined and task-specific binary functions of the data, each having a clear interpretation to the end-user. We then minimize the expected number of queries needed for accurate prediction on any given input. As the solution is generally intractable, following prior work, we choose the queries sequentially based on information gain. However, in contrast to previous work, we need not assume the queries are conditionally independent. Instead, we leverage a stochastic generative model (VAE) and an MCMC algorithm (Unadjusted Langevin) to select the most informative query about the input based on previous query-answers. This enables the online determination of a query chain of whatever depth is required to resolve prediction ambiguities. Finally, experiments on vision and NLP tasks demonstrate the efficacy of our approach and its superiority over post-hoc explanations.
    Regret Bounds for Information-Directed Reinforcement Learning. (arXiv:2206.04640v2 [cs.LG] UPDATED)
    Information-directed sampling (IDS) has revealed its potential as a data-efficient algorithm for reinforcement learning (RL). However, theoretical understanding of IDS for Markov Decision Processes (MDPs) is still limited. We develop novel information-theoretic tools to bound the information ratio and the cumulative information gain about the learning target. Our theoretical results shed light on the importance of choosing the learning target so that practitioners can balance computation and regret bounds. As a consequence, we derive prior-free Bayesian regret bounds for vanilla-IDS, which learns the whole environment under tabular finite-horizon MDPs. In addition, we propose a computationally efficient regularized-IDS that maximizes an additive form rather than the ratio form and show that it enjoys the same regret bound as vanilla-IDS. With the aid of rate-distortion theory, we improve the regret bound by learning a surrogate, less informative environment. Furthermore, we extend our analysis to linear MDPs and prove similar regret bounds for Thompson sampling as a by-product.
    WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents. (arXiv:2207.01206v3 [cs.CL] UPDATED)
    Existing benchmarks for grounding language in interactive environments either lack real-world linguistic elements, or prove difficult to scale up due to substantial human involvement in the collection of data or feedback signals. To bridge this gap, we develop WebShop -- a simulated e-commerce website environment with $1.18$ million real-world products and $12,087$ crowd-sourced text instructions. Given a text instruction specifying a product requirement, an agent needs to navigate multiple types of webpages and issue diverse actions to find, customize, and purchase an item. WebShop provides several challenges for language grounding including understanding compositional instructions, query (re-)formulation, comprehending and acting on noisy text in webpages, and performing strategic exploration. We collect over $1,600$ human demonstrations for the task, and train and evaluate a diverse range of agents using reinforcement learning, imitation learning, and pre-trained image and language models. Our best model achieves a task success rate of $29\%$, which outperforms rule-based heuristics ($9.6\%$) but is far lower than human expert performance ($59\%$). We also analyze agent and human trajectories and ablate various model components to provide insights for developing future agents with stronger language understanding and decision making abilities. Finally, we show that agents trained on WebShop exhibit non-trivial sim-to-real transfer when evaluated on amazon.com and ebay.com, indicating the potential value of WebShop in developing practical web-based agents that can operate in the wild.
    Identifying Incorrect Annotations in Multi-Label Classification Data. (arXiv:2211.13895v1 [cs.LG])
    In multi-label classification, each example in a dataset may be annotated as belonging to one or more classes (or none of the classes). Example applications include image (or document) tagging where each possible tag either applies to a particular image (or document) or not. With many possible classes to consider, data annotators are likely to make errors when labeling such data in practice. Here we consider algorithms for finding mislabeled examples in multi-label classification datasets. We propose an extension of the Confident Learning framework to this setting, as well as a label quality score that ranks examples with label errors much higher than those which are correctly labeled. Both approaches can utilize any trained classifier. After demonstrating that our methodology empirically outperforms other algorithms for label error detection, we apply our approach to discover many label errors in the CelebA image tagging dataset.
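    A minimal version of such a label quality score is easy to write down. The sketch below uses per-class self-confidence, the probability a trained classifier assigns to each given binary annotation, averaged over classes; the paper's exact score and pooling may differ, and any trained multi-label classifier can supply pred_probs.

```python
import numpy as np

def label_quality_scores(pred_probs, given_labels):
    """
    Per-example label quality for multi-label data: for each class take
    the model's confidence that the given annotation is right (p if the
    tag was applied, 1 - p if not), then average over classes. Low
    scores flag examples whose annotations disagree with the model.
    """
    per_class = np.where(given_labels == 1, pred_probs, 1.0 - pred_probs)
    return per_class.mean(axis=1)

# toy usage: 3 examples, 4 possible tags; the third looks mislabeled
probs = np.array([[.95, .02, .90, .10],
                  [.50, .50, .50, .50],
                  [.05, .90, .10, .80]])
labels = np.array([[1, 0, 1, 0],
                   [1, 0, 0, 1],
                   [1, 0, 1, 0]])
print(np.argsort(label_quality_scores(probs, labels)))  # worst first: [2 1 0]
```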
    CLiMB: A Continual Learning Benchmark for Vision-and-Language Tasks. (arXiv:2206.09059v2 [cs.CL] UPDATED)
    Current state-of-the-art vision-and-language models are evaluated on tasks either individually or in a multi-task setting, overlooking the challenges of continually learning (CL) tasks as they arrive. Existing CL benchmarks have facilitated research on task adaptation and mitigating "catastrophic forgetting", but are limited to vision-only and language-only tasks. We present CLiMB, a benchmark to study the challenge of learning multimodal tasks in a CL setting, and to systematically evaluate how upstream continual learning can rapidly generalize to new multimodal and unimodal tasks. CLiMB includes implementations of several CL algorithms and a modified Vision-Language Transformer (ViLT) model that can be deployed on both multimodal and unimodal tasks. We find that common CL methods can help mitigate forgetting during multimodal task learning, but do not enable cross-task knowledge transfer. We envision that CLiMB will facilitate research on a new class of CL algorithms for this challenging multimodal setting.
    Operator Splitting Value Iteration. (arXiv:2211.13937v1 [cs.LG])
    We introduce new planning and reinforcement learning algorithms for discounted MDPs that utilize an approximate model of the environment to accelerate the convergence of the value function. Inspired by the splitting approach in numerical linear algebra, we introduce Operator Splitting Value Iteration (OS-VI) for both policy evaluation and control problems. OS-VI achieves a much faster convergence rate when the model is accurate enough. We also introduce a sample-based version of the algorithm called OS-Dyna. Unlike the traditional Dyna architecture, OS-Dyna still converges to the correct value function in the presence of model approximation error.
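    For policy evaluation, the splitting update can be sketched in a few lines: each iteration solves the Bellman equation exactly under the approximate model, with the reward corrected by the mismatch between the true and approximate kernels, so the fixed point is the true value function. This reflects our reading of the splitting idea; the toy Markov reward process and the perturbed model are illustrative only.

```python
import numpy as np

def os_policy_evaluation(r, P, P_hat, gamma, iters=100):
    """
    Operator-splitting policy evaluation:
        V_{k+1} = (I - gamma * P_hat)^{-1} (r + gamma * (P - P_hat) @ V_k)
    The fixed point satisfies (I - gamma * P) V = r, i.e. the true value
    function; with P_hat = 0 this reduces to ordinary value iteration,
    and the closer P_hat is to P, the faster the iteration converges.
    """
    V = np.zeros(len(r))
    A = np.eye(len(r)) - gamma * P_hat
    for _ in range(iters):
        V = np.linalg.solve(A, r + gamma * (P - P_hat) @ V)
    return V

# toy usage: random 5-state Markov reward process, slightly wrong model
rng = np.random.default_rng(0)
P = rng.random((5, 5)); P /= P.sum(axis=1, keepdims=True)
P_hat = P + 0.01 * rng.random((5, 5))
P_hat /= P_hat.sum(axis=1, keepdims=True)
r = rng.random(5)
V = os_policy_evaluation(r, P, P_hat, gamma=0.95)
print(np.abs(V - np.linalg.solve(np.eye(5) - 0.95 * P, r)).max())  # ~0
```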
    Enhanced compound-protein binding affinity prediction by representing protein multimodal information via a coevolutionary strategy. (arXiv:2204.12586v3 [q-bio.BM] UPDATED)
    Due to the lack of a method to efficiently represent the multimodal information of a protein, including its structure and sequence information, predicting compound-protein binding affinity (CPA) still suffers from low accuracy when applying machine learning methods. To overcome this limitation, in a novel end-to-end architecture (named FeatNN), we develop a coevolutionary strategy to jointly represent the structure and sequence features of proteins and ultimately optimize the mathematical models for predicting CPA. Furthermore, from a data-driven perspective, we propose a rational method that can utilize both high- and low-quality databases to optimize the accuracy and generalization ability of FeatNN in CPA prediction tasks. Notably, we visually interpret the feature interaction process between sequence and structure in the rationally designed architecture. As a result, FeatNN considerably outperforms the state-of-the-art (SOTA) baseline in virtual drug screening tasks, indicating the feasibility of this approach for practical use. FeatNN provides an outstanding method for higher CPA prediction accuracy and better generalization ability by efficiently representing the multimodal information of proteins via a coevolutionary strategy.
    Towards Practical Control of Singular Values of Convolutional Layers. (arXiv:2211.13771v1 [cs.LG])
    In general, convolutional neural networks (CNNs) are easy to train, but their essential properties, such as generalization error and adversarial robustness, are hard to control. Recent research demonstrated that singular values of convolutional layers significantly affect such elusive properties and offered several methods for controlling them. Nevertheless, these methods present an intractable computational challenge or resort to coarse approximations. In this paper, we offer a principled approach to alleviating constraints of the prior art at the expense of an insignificant reduction in layer expressivity. Our method is based on the tensor-train decomposition; it retains control over the actual singular values of convolutional mappings while providing structurally sparse and hardware-friendly representation. We demonstrate the improved properties of modern CNNs with our method and analyze its impact on the model performance, calibration, and adversarial robustness. The source code is available at: https://github.com/WhiteTeaDragon/practical_svd_conv
    Offline Reinforcement Learning with Differentiable Function Approximation is Provably Efficient. (arXiv:2210.00750v2 [cs.LG] UPDATED)
    Offline reinforcement learning, which aims at optimizing sequential decision-making strategies with historical data, has been extensively applied in real-life applications. State-of-the-art algorithms usually leverage powerful function approximators (e.g. neural networks) to alleviate the sample complexity hurdle for better empirical performance. Despite the successes, a more systematic understanding of the statistical complexity of function approximation remains lacking. Towards bridging the gap, we take a step by considering offline reinforcement learning with differentiable function class approximation (DFA). This function class naturally incorporates a wide range of models with nonlinear/nonconvex structures. Most importantly, we show that offline RL with differentiable function approximation is provably efficient by analyzing the pessimistic fitted Q-learning (PFQL) algorithm, and our results provide the theoretical basis for understanding a variety of practical heuristics that rely on Fitted Q-Iteration style design. In addition, we further improve our guarantee with a tighter instance-dependent characterization. We hope our work draws interest in studying reinforcement learning with differentiable function approximation beyond the scope of current research.
    Revisiting Active Sets for Gaussian Process Decoders. (arXiv:2209.04636v2 [stat.ML] UPDATED)
    Decoders built on Gaussian processes (GPs) are enticing due to the marginalisation over the non-linear function space. Such models (also known as GP-LVMs) are often expensive and notoriously difficult to train in practice, but can be scaled using variational inference and inducing points. In this paper, we revisit active set approximations. We develop a new stochastic estimate of the log-marginal likelihood based on recently discovered links to cross-validation, and propose a computationally efficient approximation thereof. We demonstrate that the resulting stochastic active sets (SAS) approximation significantly improves the robustness of GP decoder training while reducing computational cost. The SAS-GP obtains more structure in the latent space, scales to many datapoints and learns better representations than variational autoencoders, which is rarely the case for GP decoders.
    Dikaios: Privacy Auditing of Algorithmic Fairness via Attribute Inference Attacks. (arXiv:2202.02242v2 [cs.CR] UPDATED)
    Machine learning (ML) models have been deployed for high-stakes applications. Due to class imbalance in the sensitive attribute observed in the datasets, ML models are unfair to minority subgroups identified by a sensitive attribute, such as race and sex. In-processing fairness algorithms ensure that model predictions are independent of the sensitive attribute. Furthermore, ML models are vulnerable to attribute inference attacks, in which an adversary can identify the values of a sensitive attribute by exploiting distinguishable model predictions. Despite privacy and fairness being important pillars of trustworthy ML, the privacy risk introduced by fairness algorithms with respect to attribute leakage has not been studied. We identify attribute inference attacks as an effective measure for auditing blackbox fairness algorithms, enabling model builders to account for privacy and fairness in the model design. We propose Dikaios, a privacy auditing tool for fairness algorithms that leverages a new, effective attribute inference attack accounting for the class imbalance in sensitive attributes through an adaptive prediction threshold. We evaluated Dikaios by performing a privacy audit of two in-processing fairness algorithms over five datasets. We show that our attribute inference attacks with an adaptive prediction threshold significantly outperform prior attacks. We also highlight the limitations of in-processing fairness algorithms in ensuring indistinguishable predictions across different values of sensitive attributes: the attribute privacy risk of these in-processing fairness schemes is highly variable according to the proportion of the sensitive attributes in the dataset. This unpredictable effect of fairness mechanisms on the attribute privacy risk is an important limitation on their utilization, which has to be accounted for by the model builder.
    Nonlinear MCMC for Bayesian Machine Learning. (arXiv:2202.05621v2 [stat.ML] UPDATED)
    We explore the application of a nonlinear MCMC technique first introduced in [1] to problems in Bayesian machine learning. We provide a convergence guarantee in total variation that uses novel results for long-time convergence and large-particle ("propagation of chaos") convergence. We apply this nonlinear MCMC technique to sampling problems including a Bayesian neural network on CIFAR10.
    SCouT: Synthetic Counterfactuals via Spatiotemporal Transformers for Actionable Healthcare. (arXiv:2207.04208v2 [cs.AI] UPDATED)
    The Synthetic Control method has pioneered a class of powerful data-driven techniques to estimate the counterfactual reality of a unit from donor units. At its core, the technique involves a linear model fitted on the pre-intervention period that combines donor outcomes to yield the counterfactual. However, linearly combining spatial information at each time instance using time-agnostic weights fails to capture important inter-unit and intra-unit temporal contexts and complex nonlinear dynamics of real data. We instead propose an approach to use local spatiotemporal information before the onset of the intervention as a promising way to estimate the counterfactual sequence. To this end, we suggest a Transformer model that leverages particular positional embeddings, a modified decoder attention mask, and a novel pre-training task to perform spatiotemporal sequence-to-sequence modeling. Our experiments on synthetic data demonstrate the efficacy of our method in the typical small donor pool setting and its robustness against noise. We also generate actionable healthcare insights at the population and patient levels by simulating a state-wide public health policy to evaluate its effectiveness, an in silico trial for asthma medications to support randomized controlled trials, and a medical intervention for patients with Friedreich's ataxia to improve clinical decision-making and promote personalized therapy.
    How important are activation functions in regression and classification? A survey, performance comparison, and future directions. (arXiv:2209.02681v4 [cs.LG] UPDATED)
    Inspired by biological neurons, activation functions play an essential part in the learning process of any artificial neural network commonly used in many real-world problems. Various activation functions have been proposed in the literature for classification as well as regression tasks. In this work, we survey the activation functions that have been employed in the past as well as the current state-of-the-art. In particular, we present various developments in activation functions over the years and the advantages as well as disadvantages or limitations of these activation functions. We also discuss classical (fixed) activation functions, including rectifier units, and adaptive activation functions. In addition to presenting a taxonomy of activation functions based on characterization, a taxonomy of activation functions based on applications is also presented. To this end, a systematic comparison of various fixed and adaptive activation functions is performed for classification datasets such as MNIST, CIFAR-10, and CIFAR-100. In recent years, a physics-informed machine learning framework has emerged for solving problems related to scientific computations. For this purpose, we also discuss various requirements for activation functions that have been used in the physics-informed machine learning framework. Furthermore, various comparisons are made among different fixed and adaptive activation functions using various machine learning libraries such as TensorFlow, PyTorch, and JAX.
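    One of the adaptive schemes discussed in this line of work, a globally trainable activation slope of the form tanh(n * a * x) with a learnable scale a, fits in a few lines of PyTorch. The sketch below is a generic illustration rather than a reproduction of any specific experiment in the survey; the value of n and the toy network are arbitrary.

```python
import torch
import torch.nn as nn

class AdaptiveTanh(nn.Module):
    """Adaptive activation tanh(n * a * x): the fixed factor n and the
    trainable scale a (initialised so that n * a = 1) let the effective
    slope of the activation adapt during training."""
    def __init__(self, n: float = 10.0):
        super().__init__()
        self.n = n
        self.a = nn.Parameter(torch.tensor(1.0 / n))

    def forward(self, x):
        return torch.tanh(self.n * self.a * x)

# toy usage inside a small regression network
net = nn.Sequential(nn.Linear(1, 32), AdaptiveTanh(), nn.Linear(32, 1))
x = torch.linspace(-1.0, 1.0, 64).unsqueeze(1)
print(net(x).shape)  # torch.Size([64, 1])
```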
    Trust Your $\nabla$: Gradient-based Intervention Targeting for Causal Discovery. (arXiv:2211.13715v1 [stat.ML])
    Inferring causal structure from data is a challenging task of fundamental importance in science. Observational data are often insufficient to identify a system's causal structure uniquely. While conducting interventions (i.e., experiments) can improve the identifiability, such samples are usually challenging and expensive to obtain. Hence, experimental design approaches for causal discovery aim to minimize the number of interventions by estimating the most informative intervention target. In this work, we propose a novel Gradient-based Intervention Targeting method, abbreviated GIT, that 'trusts' the gradient estimator of a gradient-based causal discovery framework to provide signals for the intervention acquisition function. We provide extensive experiments in simulated and real-world datasets and demonstrate that GIT performs on par with competitive baselines, surpassing them in the low-data regime.
    Spherical Message Passing for 3D Graph Networks. (arXiv:2102.05013v5 [cs.LG] UPDATED)
    We consider representation learning of 3D molecular graphs in which each atom is associated with a spatial position in 3D. This is an under-explored area of research, and a principled message passing framework is currently lacking. In this work, we conduct analyses in the spherical coordinate system (SCS) for the complete identification of 3D graph structures. Based on such observations, we propose spherical message passing (SMP) as a novel and powerful scheme for 3D molecular learning. SMP dramatically reduces training complexity, enabling it to perform efficiently on large-scale molecules. In addition, SMP is capable of distinguishing almost all molecular structures, and the corner cases it cannot distinguish are unlikely to occur in practice. Based on meaningful physically-based representations of 3D information, we further propose SphereNet for 3D molecular learning. Experimental results demonstrate that the use of meaningful 3D information in SphereNet leads to significant performance improvements in prediction tasks. Our results also demonstrate the advantages of SphereNet in terms of capability, efficiency, and scalability. Our code is publicly available as part of the DIG library (https://github.com/divelab/DIG).
    Prototypical Fine-tuning: Towards Robust Performance Under Varying Data Sizes. (arXiv:2211.13638v1 [cs.CL])
    In this paper, we move towards combining large parametric models with non-parametric prototypical networks. We propose prototypical fine-tuning, a novel prototypical framework for fine-tuning pretrained language models (LMs), which automatically learns a bias to improve predictive performance for varying data sizes, especially in low-resource settings. Our prototypical fine-tuning approach can automatically adjust the model capacity according to the number of data points and the model's inherent attributes. Moreover, we propose four principles for effective prototype fine-tuning towards the optimal solution. Experimental results across various datasets show that our work achieves significant performance improvements under various low-resource settings, as well as comparable and usually better performance in high-resource scenarios.
    DKM: Dense Kernelized Feature Matching for Geometry Estimation. (arXiv:2202.00667v3 [cs.CV] UPDATED)
    Feature matching is a challenging computer vision task that involves finding correspondences between two images of a 3D scene. In this paper we consider the dense approach instead of the more common sparse paradigm, thus striving to find all correspondences. Perhaps counter-intuitively, dense methods have previously shown inferior performance to their sparse and semi-sparse counterparts for estimation of two-view geometry. This changes with our novel dense method, which outperforms both dense and sparse methods on geometry estimation. The novelty is threefold: First, we propose a kernel regression global matcher. Secondly, we propose warp refinement through stacked feature maps and depthwise convolution kernels. Thirdly, we propose learning dense confidence through consistent depth and a balanced sampling approach for dense confidence maps. Through extensive experiments we confirm that our proposed dense method, \textbf{D}ense \textbf{K}ernelized Feature \textbf{M}atching, sets a new state-of-the-art on multiple geometry estimation benchmarks. In particular, we achieve an improvement on MegaDepth-1500 of +4.9 and +8.9 AUC$@5^{\circ}$ compared to the best previous sparse method and dense method respectively. Our code is provided at https://github.com/Parskatt/dkm
    Learning Symmetric Rules with SATNet. (arXiv:2206.13998v2 [cs.AI] UPDATED)
    SATNet is a differentiable constraint solver with a custom backpropagation algorithm, which can be used as a layer in a deep-learning system. It is a promising proposal for bridging deep learning and logical reasoning. In fact, SATNet has been successfully applied to learn, among others, the rules of a complex logical puzzle, such as Sudoku, just from input and output pairs where inputs are given as images. In this paper, we show how to improve the learning of SATNet by exploiting symmetries in the target rules of a given but unknown logical puzzle or more generally a logical formula. We present SymSATNet, a variant of SATNet that translates the given symmetries of the target rules to a condition on the parameters of SATNet and requires that the parameters should have a particular parametric form that guarantees the condition. The requirement dramatically reduces the number of parameters to learn for the rules with enough symmetries, and makes the parameter learning of SymSATNet much easier than that of SATNet. We also describe a technique for automatically discovering symmetries of the target rules from examples. Our experiments with Sudoku and Rubik's cube show the substantial improvement of SymSATNet over the baseline SATNet.
    The Design Space of E(3)-Equivariant Atom-Centered Interatomic Potentials. (arXiv:2205.06643v2 [stat.ML] UPDATED)
    The rapid progress of machine learning interatomic potentials over the past couple of years produced a number of new architectures. Particularly notable among these are the Atomic Cluster Expansion (ACE), which unified many of the earlier ideas around atom density-based descriptors, and Neural Equivariant Interatomic Potentials (NequIP), a message passing neural network with equivariant features that showed state of the art accuracy. In this work, we construct a mathematical framework that unifies these models: ACE is generalised so that it can be recast as one layer of a multi-layer architecture. From another point of view, the linearised version of NequIP is understood as a particular sparsification of a much larger polynomial model. Our framework also provides a practical tool for systematically probing different choices in the unified design space. We demonstrate this by an ablation study of NequIP via a set of experiments looking at in- and out-of-domain accuracy and smooth extrapolation very far from the training data, and shed some light on which design choices are critical for achieving high accuracy. Finally, we present BOTNet (Body-Ordered-Tensor-Network), a much-simplified version of NequIP, which has an interpretable architecture and maintains accuracy on benchmark datasets.
    Efficient Zero-shot Visual Search via Target and Context-aware Transformer. (arXiv:2211.13470v1 [cs.CV])
    Visual search is a ubiquitous challenge in natural vision, including daily tasks such as finding a friend in a crowd or searching for a car in a parking lot. Humans rely heavily on relevant target features to perform goal-directed visual search. Meanwhile, context is critically important for locating a target object in complex scenes, as it helps narrow down the search area and makes the search process more efficient. However, few works have combined both target and context information in visual search computational models. Here we propose a zero-shot deep learning architecture, TCT (Target and Context-aware Transformer), that modulates self-attention in the Vision Transformer with target- and context-relevant information to enable human-like zero-shot visual search performance. Target modulation is computed as patch-wise local relevance between the target and search images, whereas contextual modulation is applied in a global fashion. We conduct visual search experiments on TCT and other competitive visual search models on three natural scene datasets with varying levels of difficulty. TCT demonstrates human-like performance in terms of search efficiency and outperforms the SOTA models on challenging visual search tasks. Importantly, TCT generalizes well across datasets with novel objects without retraining or fine-tuning. Furthermore, we also introduce a new dataset to benchmark models for invariant visual search under incongruent contexts. TCT manages to search flexibly via target and context modulation, even under incongruent contexts.
    Improving dermatology classifiers across populations using images generated by large diffusion models. (arXiv:2211.13352v1 [eess.IV])
    Dermatological classification algorithms developed without sufficiently diverse training data may generalize poorly across populations. While intentional data collection and annotation offer the best means for improving representation, new computational approaches for generating training data may also aid in mitigating the effects of sampling bias. In this paper, we show that DALL$\cdot$E 2, a large-scale text-to-image diffusion model, can produce photorealistic images of skin disease across skin types. Using the Fitzpatrick 17k dataset as a benchmark, we demonstrate that augmenting training data with DALL$\cdot$E 2-generated synthetic images improves classification of skin disease overall and especially for underrepresented groups.
    Turning the Tables: Biased, Imbalanced, Dynamic Tabular Datasets for ML Evaluation. (arXiv:2211.13358v1 [cs.LG])
    Evaluating new techniques on realistic datasets plays a crucial role in the development of ML research and its broader adoption by practitioners. In recent years, there has been a significant increase of publicly available unstructured data resources for computer vision and NLP tasks. However, tabular data -- which is prevalent in many high-stakes domains -- has been lagging behind. To bridge this gap, we present Bank Account Fraud (BAF), the first publicly available, privacy-preserving, large-scale, realistic suite of tabular datasets. The suite was generated by applying state-of-the-art tabular data generation techniques on an anonymized, real-world bank account opening fraud detection dataset. This setting carries a set of challenges that are commonplace in real-world applications, including temporal dynamics and significant class imbalance. Additionally, to allow practitioners to stress test both performance and fairness of ML methods, each dataset variant of BAF contains specific types of data bias. With this resource, we aim to provide the research community with a more realistic, complete, and robust test bed to evaluate novel and existing methods.
    Data Provenance Inference in Machine Learning. (arXiv:2211.13416v1 [cs.LG])
    Unintended memorization of information at various granularities has garnered academic attention in recent years, e.g. membership inference and property inference. How to use this privacy leakage in reverse to facilitate real-world applications is a growing direction; current efforts include dataset ownership inference and user auditing. Viewing the data lifecycle and ML model production as a whole, we propose an inference process named Data Provenance Inference, which infers the generation, collection, or processing properties of the ML training data, to assist ML developers in locating training data gaps without maintaining strenuous metadata. We formally define data provenance and the data provenance inference task in ML training. We then propose a novel inference strategy combining embedded-space multiple instance classification and shadow learning. Comprehensive evaluations cover language, visual, and structured data in black-box and white-box settings, with diverse kinds of data provenance (i.e. business, county, movie, user). Our best inference accuracy reaches 98.96% on the white-box text model when "author" is the data provenance. The experimental results indicate that, in general, inference performance correlates positively with the amount of reference data available for inference and with the depth and number of parameters of the accessed layer. Furthermore, we give a post-hoc statistical analysis of the data provenance definition to explain when our proposed method works well.
    MP-GELU Bayesian Neural Networks: Moment Propagation by GELU Nonlinearity. (arXiv:2211.13402v1 [cs.LG])
    Bayesian neural networks (BNNs) have been an important framework in the study of uncertainty quantification. Deterministic variational inference, one of the inference methods, utilizes moment propagation to compute the predictive distributions and objective functions. Unfortunately, deriving the moments requires a computationally expensive Taylor expansion for nonlinear functions such as the rectified linear unit (ReLU) or the sigmoid function. Therefore, a new nonlinear function that realizes faster moment propagation than conventional functions is required. In this paper, we propose a novel nonlinear function named the moment-propagating Gaussian error linear unit (MP-GELU) that enables fast derivation of the first and second moments in BNNs. MP-GELU enables the analytical computation of moments by applying nonlinearity to the input statistics, thereby reducing the computationally expensive calculations required for nonlinear functions. In empirical experiments on regression tasks, we observed that the proposed MP-GELU provides higher prediction accuracy and better quality of uncertainty with faster execution than ReLU-based BNNs.
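    To make the moment-propagation idea tangible, the sketch below evaluates the closed-form first moment of GELU(X) = X * Phi(X) for Gaussian input, which follows from E[Phi(X)] = Phi(t) with t = mu / sqrt(1 + var) together with Stein's lemma, and checks it by Monte Carlo. This is a generic Gaussian identity rather than the paper's MP-GELU nonlinearity itself, and the second moment needed for full propagation is omitted.

```python
import numpy as np
from scipy.stats import norm

def gelu_mean(mu, var):
    """
    First moment of GELU(X) = X * Phi(X) for X ~ N(mu, var):
        E[X Phi(X)] = mu * Phi(t) + var * phi(t) / sqrt(1 + var),
    where t = mu / sqrt(1 + var) and Phi/phi are the standard normal
    CDF/PDF; derived via E[Phi(X)] = Phi(t) and Stein's lemma.
    """
    t = mu / np.sqrt(1.0 + var)
    return mu * norm.cdf(t) + var * norm.pdf(t) / np.sqrt(1.0 + var)

# Monte-Carlo sanity check of the closed form
rng = np.random.default_rng(0)
mu, var = 0.3, 0.5
x = rng.normal(mu, np.sqrt(var), size=1_000_000)
print(gelu_mean(mu, var), np.mean(x * norm.cdf(x)))  # should nearly agree
```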
    Differentially Private Image Classification from Features. (arXiv:2211.13403v1 [cs.LG])
    Leveraging transfer learning has recently been shown to be an effective strategy for training large models with Differential Privacy (DP). Moreover, somewhat surprisingly, recent works have found that privately training just the last layer of a pre-trained model provides the best utility with DP. While past studies largely rely on algorithms like DP-SGD for training large models, in the specific case of privately learning from features, we observe that computational burden is low enough to allow for more sophisticated optimization schemes, including second-order methods. To that end, we systematically explore the effect of design parameters such as loss function and optimization algorithm. We find that, while commonly used logistic regression performs better than linear regression in the non-private setting, the situation is reversed in the private setting. We find that linear regression is much more effective than logistic regression from both privacy and computational aspects, especially at stricter epsilon values ($\epsilon < 1$). On the optimization side, we also explore using Newton's method, and find that second-order information is quite helpful even with privacy, although the benefit significantly diminishes with stricter privacy guarantees. While both methods use second-order information, least squares is effective at lower epsilons while Newton's method is effective at larger epsilon values. To combine the benefits of both, we propose a novel algorithm called DP-FC, which leverages feature covariance instead of the Hessian of the logistic regression loss and performs well across all $\epsilon$ values we tried. With this, we obtain new SOTA results on ImageNet-1k, CIFAR-100 and CIFAR-10 across all values of $\epsilon$ typically considered. Most remarkably, on ImageNet-1K, we obtain top-1 accuracy of 88\% under (8, $8 * 10^{-7}$)-DP and 84.3\% under (0.1, $8 * 10^{-7}$)-DP.
    DALL-E-Bot: Introducing Web-Scale Diffusion Models to Robotics. (arXiv:2210.02438v2 [cs.RO] UPDATED)
    We introduce the first work to explore web-scale diffusion models for robotics. DALL-E-Bot enables a robot to rearrange objects in a scene, by first inferring a text description of those objects, then generating an image representing a natural, human-like arrangement of those objects, and finally physically arranging the objects according to that image. The significance is that we achieve this zero-shot using DALL-E, without needing any further data collection or training. Encouraging real-world results with human studies show that this is a promising direction for the future of web-scale robot learning. We also propose a list of recommendations to the text-to-image community, to align further developments of these models with applications to robotics.
    Lifting Weak Supervision To Structured Prediction. (arXiv:2211.13375v1 [cs.LG])
    Weak supervision (WS) is a rich set of techniques that produce pseudolabels by aggregating easily obtained but potentially noisy label estimates from a variety of sources. WS is theoretically well understood for binary classification, where simple approaches enable consistent estimation of pseudolabel noise rates. Using this result, it has been shown that downstream models trained on the pseudolabels have generalization guarantees nearly identical to those trained on clean labels. While this is exciting, users often wish to use WS for structured prediction, where the output space consists of more than a binary or multi-class label set: e.g. rankings, graphs, manifolds, and more. Do the favorable theoretical properties of WS for binary classification lift to this setting? We answer this question in the affirmative for a wide range of scenarios. For labels taking values in a finite metric space, we introduce techniques new to weak supervision based on pseudo-Euclidean embeddings and tensor decompositions, providing a nearly-consistent noise rate estimator. For labels in constant-curvature Riemannian manifolds, we introduce new invariants that also yield consistent noise rate estimation. In both cases, when using the resulting pseudolabels in concert with a flexible downstream model, we obtain generalization guarantees nearly identical to those for models trained on clean data. Several of our results, which can be viewed as robustness guarantees in structured prediction with noisy labels, may be of independent interest. Empirical evaluation validates our claims and shows the merits of the proposed method.
    Graph Contrastive Learning for Materials. (arXiv:2211.13408v1 [cs.LG])
    Recent work has shown the potential of graph neural networks to efficiently predict material properties, enabling high-throughput screening of materials. Training these models, however, often requires large quantities of labelled data, obtained via costly methods such as ab initio calculations or experimental evaluation. By leveraging a series of material-specific transformations, we introduce CrystalCLR, a framework for contrastive learning of representations with crystal graph neural networks. With the addition of a novel loss function, our framework is able to learn representations competitive with engineered fingerprinting methods. We also demonstrate that via model finetuning, contrastive pretraining can improve the performance of graph neural networks for prediction of material properties and significantly outperform traditional ML models that use engineered fingerprints. Lastly, we observe that CrystalCLR produces material representations that form clusters by compound class.
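    The paper's novel loss is not reproduced here; as background, the following is a minimal NumPy sketch of the standard NT-Xent contrastive objective that contrastive frameworks of this kind typically build on, where z1 and z2 hold embeddings of two augmented views of the same batch:

        import numpy as np

        def nt_xent(z1, z2, temperature=0.5):
            """Standard NT-Xent (SimCLR-style) contrastive loss for two views.

            z1, z2: (n, d) embeddings of two augmented views of the same batch;
            the i-th rows of z1 and z2 form the positive pair.
            """
            z = np.concatenate([z1, z2], axis=0)
            z = z / np.linalg.norm(z, axis=1, keepdims=True)
            sim = z @ z.T / temperature
            np.fill_diagonal(sim, -np.inf)          # exclude self-similarity
            n = z1.shape[0]
            targets = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
            logprob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
            return -logprob[np.arange(2 * n), targets].mean()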
    Reliability and Robustness analysis of Machine Learning based Phishing URL Detectors. (arXiv:2005.08454v3 [cs.CR] UPDATED)
    ML-based Phishing URL (MLPU) detectors serve as the first level of defence to protect users and organisations from becoming victims of phishing attacks. Lately, a few studies have launched successful adversarial attacks against specific MLPU detectors, raising questions about their practical reliability and usage. Nevertheless, the robustness of these systems has not been extensively investigated; their security vulnerabilities, in general, remain primarily unknown, which calls for testing the robustness of these systems. In this article, we propose a methodology to investigate the reliability and robustness of 50 representative state-of-the-art MLPU models. Firstly, we propose a cost-effective adversarial URL generator, URLBUG, which created an adversarial URL dataset. Subsequently, we reproduced 50 MLPU (traditional ML and deep learning) systems and recorded their baseline performance. Lastly, we tested the considered MLPU systems on the adversarial dataset and analyzed their robustness and reliability using box plots and heat maps. Our results showed that the generated adversarial URLs have valid syntax and can be registered at a median annual price of \$11.99, and that 13\% of the adversarial URLs were already registered, 63.94\% of which were used for malicious purposes. Moreover, the considered MLPU models' Matthews Correlation Coefficient (MCC) dropped from a median of 0.92 to 0.02 when tested against $Adv_\mathrm{data}$, indicating that the baseline MLPU models are unreliable in their current form. Further, our findings identified several security vulnerabilities of these systems and provided future directions for researchers to design dependable and secure MLPU systems.
    A Benchmark Environment Motivated by Industrial Control Problems. (arXiv:1709.09480v3 [cs.AI] UPDATED)
    In the research area of reinforcement learning (RL), novel and promising methods are frequently developed and introduced to the RL community. However, although many researchers are keen to apply their methods to real-world problems, implementing such methods in real industry environments is often a frustrating and tedious process. Generally, academic research groups have only limited access to real industrial data and applications. For this reason, new methods are usually developed, evaluated and compared using artificial software benchmarks. On the one hand, these benchmarks are designed to provide interpretable RL training scenarios and detailed insight into the learning process of the method at hand. On the other hand, they usually do not share much similarity with industrial real-world applications. For this reason, we used our industry experience to design a benchmark that bridges the gap between freely available, documented, and motivated artificial benchmarks and the properties of real industrial problems. The resulting industrial benchmark (IB) has been made publicly available to the RL community by publishing its Java and Python code, including an OpenAI Gym wrapper, on GitHub. In this paper we motivate and describe in detail the IB's dynamics and identify prototypic experimental settings that capture common situations in real-world industry control problems.
    Minimal Width for Universal Property of Deep RNN. (arXiv:2211.13866v1 [stat.ML])
    A recurrent neural network (RNN) is a widely used deep-learning network for dealing with sequential data. Imitating a dynamical system, an infinite-width RNN can approximate any open dynamical system on a compact domain. In general, deep networks with bounded widths are more effective than wide networks in practice; however, the universal approximation theorem for deep narrow structures has yet to be extensively studied. In this study, we prove the universality of deep narrow RNNs and show that the upper bound of the minimum width for universality can be independent of the length of the data. Specifically, we show that a deep RNN with ReLU activation can approximate any continuous function or $L^p$ function with widths $d_x+d_y+2$ and $\max\{d_x+1,d_y\}$, respectively, where the target function maps a finite sequence of vectors in $\mathbb{R}^{d_x}$ to a finite sequence of vectors in $\mathbb{R}^{d_y}$. We also compute the additional width required when the activation function is $\tanh$ or a more general function. In addition, we prove the universality of other recurrent networks, such as bidirectional RNNs. Bridging a multi-layer perceptron and an RNN, our theory and proof technique can be an initial step toward further research on deep RNNs.
    A Self-Attention Ansatz for Ab-initio Quantum Chemistry. (arXiv:2211.13672v1 [physics.chem-ph])
    We present a novel neural network architecture using self-attention, the Wavefunction Transformer (Psiformer), which can be used as an approximation (or Ansatz) for solving the many-electron Schr\"odinger equation, the fundamental equation for quantum chemistry and material science. This equation can be solved from first principles, requiring no external training data. In recent years, deep neural networks like the FermiNet and PauliNet have been used to significantly improve the accuracy of these first-principle calculations, but they lack an attention-like mechanism for gating interactions between electrons. Here we show that the Psiformer can be used as a drop-in replacement for these other neural networks, often dramatically improving the accuracy of the calculations. On larger molecules especially, the ground state energy can be improved by dozens of kcal/mol, a qualitative leap over previous methods. This demonstrates that self-attention networks can learn complex quantum mechanical correlations between electrons, and are a promising route to reaching unprecedented accuracy in chemical calculations on larger systems.
    Policy-Adaptive Estimator Selection for Off-Policy Evaluation. (arXiv:2211.13904v1 [cs.LG])
    Off-policy evaluation (OPE) aims to accurately evaluate the performance of counterfactual policies using only offline logged data. Although many estimators have been developed, there is no single estimator that dominates the others, because the estimators' accuracy can vary greatly depending on a given OPE task such as the evaluation policy, number of actions, and noise level. Thus, the data-driven estimator selection problem is becoming increasingly important and can have a significant impact on the accuracy of OPE. However, identifying the most accurate estimator using only the logged data is quite challenging because the ground-truth estimation accuracy of estimators is generally unavailable. This paper studies this challenging problem of estimator selection for OPE for the first time. In particular, we enable an estimator selection that is adaptive to a given OPE task, by appropriately subsampling available logged data and constructing pseudo policies useful for the underlying estimator selection task. Comprehensive experiments on both synthetic and real-world company data demonstrate that the proposed procedure substantially improves the estimator selection compared to a non-adaptive heuristic.
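    For concreteness, here are two of the classical estimators such a selection procedure must choose among, sketched in NumPy (a simplified illustration, not the paper's selection method):

        import numpy as np

        def ips_estimate(rewards, logging_probs, eval_probs):
            """Inverse propensity scoring (IPS), a basic OPE estimator.

            logging_probs[i]: probability the logging policy took action i;
            eval_probs[i]:    probability the evaluation policy would take it.
            """
            w = eval_probs / logging_probs
            return np.mean(w * rewards)

        def snips_estimate(rewards, logging_probs, eval_probs):
            """Self-normalized IPS, a lower-variance (but biased) alternative.
            Which of these is more accurate depends on the OPE task, which is
            exactly why data-driven estimator selection matters."""
            w = eval_probs / logging_probs
            return np.sum(w * rewards) / np.sum(w)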
    Far3Det: Towards Far-Field 3D Detection. (arXiv:2211.13858v1 [cs.CV])
    We focus on the task of far-field 3D detection (Far3Det) of objects beyond a certain distance from an observer, e.g., $>$50m. Far3Det is particularly important for autonomous vehicles (AVs) operating at highway speeds, which require detections of far-field obstacles to ensure sufficient braking distances. However, contemporary AV benchmarks such as nuScenes underemphasize this problem because they evaluate performance only up to a certain distance (50m). One reason is that obtaining far-field 3D annotations is difficult, particularly for lidar sensors that produce very few point returns for far-away objects. Indeed, we find that almost 50% of far-field objects (beyond 50m) contain zero lidar points. A second reason is that current metrics for 3D detection employ a "one-size-fits-all" philosophy, using the same tolerance thresholds for near and far objects, inconsistent with tolerances for both human vision and stereo disparities. Both factors lead to an incomplete analysis of the Far3Det task. For example, while conventional wisdom tells us that high-resolution RGB sensors should be vital for 3D detection of far-away objects, lidar-based methods still rank higher than their RGB counterparts on the current benchmark leaderboards. As a first step towards a Far3Det benchmark, we develop a method to find well-annotated scenes from the nuScenes dataset and derive a well-annotated far-field validation set. We also propose a Far3Det evaluation protocol and explore various 3D detection methods for Far3Det. Our results convincingly support the long-held conventional wisdom that high-resolution RGB improves 3D detection in the far-field. We further propose a simple yet effective method that fuses detections from RGB and lidar detectors based on non-maximum suppression, which remarkably outperforms state-of-the-art 3D detectors in the far-field.
    Confidence Interval Construction for Multivariate time series using Long Short Term Memory Network. (arXiv:2211.13915v1 [stat.ME])
    In this paper we propose a novel procedure for constructing confidence intervals for multivariate time series predictions made with a long short-term memory (LSTM) network. The construction uses a few novel block bootstrap techniques. We also propose an innovative block-length selection procedure for each of these schemes. Two novel benchmarks allow us to compare the confidence intervals constructed by the different bootstrap techniques. We illustrate the whole construction on the S\&P $500$ and Dow Jones Index datasets.
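    The paper's specific bootstrap schemes and block-length selection are not reproduced here; as a baseline illustration, a minimal moving-block-bootstrap resampler for a multivariate series (the function name and fixed block length are assumptions of this sketch):

        import numpy as np

        def moving_block_bootstrap(series, block_len, seed=None):
            """One moving-block-bootstrap resample of a (T, d) series.

            Blocks of consecutive rows are drawn with replacement and
            concatenated, preserving short-range serial dependence.
            Assumes block_len <= len(series).
            """
            rng = np.random.default_rng(seed)
            T = len(series)
            n_blocks = int(np.ceil(T / block_len))
            starts = rng.integers(0, T - block_len + 1, size=n_blocks)
            blocks = [series[s:s + block_len] for s in starts]
            return np.concatenate(blocks, axis=0)[:T]

        # Percentile intervals would then be read off a statistic computed on
        # many such resamples, e.g. 1000 calls with different seeds.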
    End-to-End Stochastic Optimization with Energy-Based Model. (arXiv:2211.13837v1 [cs.LG])
    Decision-focused learning (DFL) was recently proposed for stochastic optimization problems that involve unknown parameters. By integrating predictive modeling with an implicitly differentiable optimization layer, DFL has shown superior performance to the standard two-stage predict-then-optimize pipeline. However, most existing DFL methods are only applicable to convex problems or a subset of nonconvex problems that can be easily relaxed to convex ones. Further, they can be inefficient in training due to the requirement of solving and differentiating through the optimization problem in every training iteration. We propose SO-EBM, a general and efficient DFL method for stochastic optimization using energy-based models. Instead of relying on KKT conditions to induce an implicit optimization layer, SO-EBM explicitly parameterizes the original optimization problem using a differentiable optimization layer based on energy functions. To better approximate the optimization landscape, we propose a coupled training objective that uses a maximum likelihood loss to capture the optimum location and a distribution-based regularizer to capture the overall energy landscape. Finally, we propose an efficient training procedure for SO-EBM with a self-normalized importance sampler based on a Gaussian mixture proposal. We evaluate SO-EBM in three applications: power scheduling, COVID-19 resource allocation, and non-convex adversarial security game, demonstrating the effectiveness and efficiency of SO-EBM.
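    At the core of the sampler described above is self-normalized importance sampling; the following is a minimal sketch under the assumption of a generic proposal distribution q (function names are illustrative, and SO-EBM's full training loop is not shown):

        import numpy as np

        def snis_expectation(energy, proposal_sample, proposal_logpdf, f):
            """Self-normalized importance sampling under p(x) proportional
            to exp(-energy(x)).

            proposal_sample: (n, d) draws from the proposal q (e.g. a Gaussian
            mixture); proposal_logpdf: (n,) log-densities of those draws under
            q; f maps (n, d) samples to (n, k) values whose expectation under
            p we want.
            """
            log_w = -energy(proposal_sample) - proposal_logpdf
            log_w -= log_w.max()            # stabilize before exponentiating
            w = np.exp(log_w)
            w /= w.sum()
            return np.sum(w[:, None] * f(proposal_sample), axis=0)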
    Ladder Siamese Network: a Method and Insights for Multi-level Self-Supervised Learning. (arXiv:2211.13844v1 [cs.CV])
    Siamese-network-based self-supervised learning (SSL) suffers from slow convergence and instability in training. To alleviate this, we propose a framework to exploit intermediate self-supervision at each stage of deep nets, called the Ladder Siamese Network. Our self-supervised losses encourage the intermediate layers to be consistent across different augmentations of the same sample, which facilitates training progress and enhances the discriminative ability of the intermediate layers themselves. While some existing work has already utilized multi-level self-supervision in SSL, ours is different in that 1) we reveal its usefulness with non-contrastive Siamese frameworks from both theoretical and empirical viewpoints, and 2) ours improves image-level classification, instance-level detection, and pixel-level segmentation simultaneously. Experiments show that the proposed framework can improve BYOL baselines by 1.0 percentage points in ImageNet linear classification, 1.2 percentage points in COCO detection, and 3.1 percentage points in PASCAL VOC segmentation. In comparison with the state-of-the-art methods, our Ladder-based model achieves competitive and balanced performance across all tested benchmarks without large degradation in any single one.
    Galvatron: Efficient Transformer Training over Multiple GPUs Using Automatic Parallelism. (arXiv:2211.13878v1 [cs.LG])
    Transformer models have achieved state-of-the-art performance across various application domains and are gradually becoming the foundation of advanced large deep learning (DL) models. However, how to train these models efficiently over multiple GPUs remains challenging due to the large number of parallelism choices. Existing DL systems either rely on manual efforts to craft distributed training plans or apply parallelism combinations within a very limited search space. In this work, we propose Galvatron, a new system framework that incorporates multiple popular parallelism dimensions and automatically finds the most efficient hybrid parallelism strategy. To better explore such an extremely large search space, we 1) use a decision tree for decomposition and pruning based on reasonable intuitions, and then 2) design a dynamic programming search algorithm to generate the optimal plan. Evaluations on four representative Transformer workloads show that Galvatron can automatically perform distributed training under different GPU memory budgets. Across all evaluated scenarios, Galvatron consistently achieves higher system throughput than previous work with limited parallelism.
    Sequential Gradient Coding For Straggler Mitigation. (arXiv:2211.13802v1 [cs.LG])
    In distributed computing, slower nodes (stragglers) usually become a bottleneck. Gradient Coding (GC), introduced by Tandon et al., is an efficient technique that uses principles of error-correcting codes to distribute gradient computation in the presence of stragglers. In this paper, we consider the distributed computation of a sequence of gradients $\{g(1),g(2),\ldots,g(J)\}$, where processing of each gradient $g(t)$ starts in round-$t$ and finishes by round-$(t+T)$. Here $T\geq 0$ denotes a delay parameter. For the GC scheme, coding is only across computing nodes and this results in a solution where $T=0$. On the other hand, having $T>0$ allows for designing schemes which exploit the temporal dimension as well. In this work, we propose two schemes that demonstrate improved performance compared to GC. Our first scheme combines GC with selective repetition of previously unfinished tasks and achieves improved straggler mitigation. In our second scheme, which constitutes our main contribution, we apply GC to a subset of the tasks and repetition for the remainder of the tasks. We then multiplex these two classes of tasks across workers and rounds in an adaptive manner, based on past straggler patterns. Using theoretical analysis, we demonstrate that our second scheme achieves significant reduction in the computational load. In our experiments, we study a practical setting of concurrently training multiple neural networks over an AWS Lambda cluster involving 256 worker nodes, where our framework naturally applies. We demonstrate that the latter scheme can yield a 16\% improvement in runtime over the baseline GC scheme, in the presence of naturally occurring, non-simulated stragglers.
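    As a toy illustration of the repetition ingredient used by the schemes above (not the paper's coded constructions), the following simulates a master recovering a full gradient from replicated tasks despite stragglers:

        import numpy as np

        def repetition_gradient(grads, assignment, finish_times, deadline):
            """Toy simulation of repetition-based straggler mitigation.

            grads: list of k partial gradients; assignment[w] is the set of
            task indices replicated on worker w; finish_times[w] is when
            worker w finishes. The master recovers the full gradient once
            every task has been completed by a non-straggling worker.
            """
            done = {}
            for w, tasks in enumerate(assignment):
                if finish_times[w] <= deadline:      # worker is not a straggler
                    for t in tasks:
                        done.setdefault(t, grads[t])
            if len(done) < len(grads):
                return None                          # a task lives only on stragglers
            return sum(done.values())

        # k=4 tasks, each replicated on r=2 of 4 workers (cyclic placement);
        # worker 1 straggles but every task is still covered.
        assignment = [{0, 1}, {1, 2}, {2, 3}, {3, 0}]
        grads = [np.ones(3) * t for t in range(4)]
        g = repetition_gradient(grads, assignment, finish_times=[1, 9, 1, 1], deadline=5)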
    Combining Constructive and Perturbative Deep Learning Algorithms for the Capacitated Vehicle Routing Problem. (arXiv:2211.13922v1 [cs.LG])
    The Capacitated Vehicle Routing Problem is a well-known NP-hard problem that poses the challenge of finding the optimal route of a vehicle delivering products to multiple locations. Recently, new efforts have emerged to create constructive and perturbative heuristics to tackle this problem using Deep Learning. In this paper, we join these efforts to develop the Combined Deep Constructor and Perturbator, which combines two powerful constructive and perturbative Deep Learning-based heuristics, using attention mechanisms at their core. Furthermore, we improve the Attention Model-Dynamic for the Capacitated Vehicle Routing Problem by proposing a memory-efficient algorithm that reduces its memory complexity by a factor of the number of nodes. Our method shows promising results. It demonstrates a cost improvement in common datasets when compared against other multiple Deep Learning methods. It also obtains close results to the state-of-the art heuristics from the Operations Research field. Additionally, the proposed memory efficient algorithm for the Attention Model-Dynamic model enables its use in problem instances with more than 100 nodes.
    Learning-enhanced Nonlinear Model Predictive Control using Knowledge-based Neural Ordinary Differential Equations and Deep Ensembles. (arXiv:2211.13829v1 [eess.SY])
    Nonlinear model predictive control (MPC) is a flexible and increasingly popular framework used to synthesize feedback control strategies that can satisfy both state and control input constraints. In this framework, an optimization problem, subject to a set of dynamics constraints characterized by a nonlinear dynamics model, is solved at each time step. Despite its versatility, the performance of nonlinear MPC often depends on the accuracy of the dynamics model. In this work, we leverage deep learning tools, namely knowledge-based neural ordinary differential equations (KNODE) and deep ensembles, to improve the prediction accuracy of this model. In particular, we learn an ensemble of KNODE models, which we refer to as the KNODE ensemble, to obtain an accurate prediction of the true system dynamics. This learned model is then integrated into a novel learning-enhanced nonlinear MPC framework. We provide sufficient conditions that guarantee asymptotic stability of the closed-loop system and show that these conditions can be implemented in practice. We show that the KNODE ensemble provides more accurate predictions and illustrate the efficacy and closed-loop performance of the proposed nonlinear MPC framework using two case studies.
    SkillS: Adaptive Skill Sequencing for Efficient Temporally-Extended Exploration. (arXiv:2211.13743v1 [cs.LG])
    The ability to effectively reuse prior knowledge is a key requirement when building general and flexible Reinforcement Learning (RL) agents. Skill reuse is one of the most common approaches, but current methods have considerable limitations. For example, fine-tuning an existing policy frequently fails, as the policy can degrade rapidly early in training. In a similar vein, distillation of expert behavior can lead to poor results when given sub-optimal experts. We compare several common approaches for skill transfer on multiple domains including changes in task and system dynamics. We identify how existing methods can fail and introduce an alternative approach to mitigate these problems. Our approach learns to sequence existing temporally-extended skills for exploration but learns the final policy directly from the raw experience. This conceptual split enables rapid adaptation and thus efficient data collection, but without constraining the final solution. It significantly outperforms many classical methods across a suite of evaluation tasks, and we use a broad set of ablations to highlight the importance of different components of our method.
    Probabilistic Time Series Forecasting for Adaptive Monitoring in Edge Computing Environments. (arXiv:2211.13729v1 [cs.DC])
    With increasingly more computation being shifted to the edge of the network, monitoring of critical infrastructures, such as intermediate processing nodes in autonomous driving, is further complicated by the typically resource-constrained environments. In order to reduce the resource overhead that monitoring imposes on the network link, various methods have been discussed that either follow a filtering approach for data-emitting devices or conduct dynamic sampling based on employed prediction models. Still, existing methods mainly require adaptive monitoring on edge devices, which demands device reconfigurations, consumes additional resources, and limits the sophistication of the employed models. In this paper, we propose a sampling-based, cloud-located approach that internally utilizes probabilistic forecasts and hence provides a means of quantifying model uncertainties, which can be used for contextualized adaptation of sampling frequencies and consequently relieves constrained network resources. We evaluate our prototype implementation of the monitoring pipeline on a publicly available streaming dataset and demonstrate its positive impact on resource efficiency in a method comparison.
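    As a hedged sketch of the idea, the sampling frequency can be driven by the width of the probabilistic forecast's prediction interval; the mapping below from interval width to sampling interval is an illustrative assumption, not the paper's rule:

        import numpy as np

        def next_sampling_interval(pred_mean, pred_std, base_interval,
                                   min_interval, max_interval, z=1.96):
            """Widen the sampling interval when the probabilistic forecast is
            confident, shrink it when the prediction interval is wide."""
            width = 2 * z * pred_std                 # 95% interval width
            rel_uncertainty = width / (abs(pred_mean) + 1e-8)
            interval = base_interval / (1.0 + rel_uncertainty)
            return float(np.clip(interval, min_interval, max_interval))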
    Reduction Algorithms for Persistence Diagrams of Networks: CoralTDA and PrunIT. (arXiv:2211.13708v1 [cs.LG])
    Topological data analysis (TDA) delivers invaluable and complementary information on the intrinsic properties of data that is inaccessible to conventional methods. However, high computational costs remain the primary roadblock hindering the successful application of TDA in real-world studies, particularly with machine learning on large complex networks. Indeed, most modern networks such as citation, blockchain, and online social networks often have hundreds of thousands of vertices, making the application of existing TDA methods infeasible. To address this major TDA limitation, we develop two new, remarkably simple but effective algorithms to compute the exact persistence diagrams of large graphs. First, we prove that the $(k+1)$-core of a graph $\mathcal{G}$ suffices to compute its $k^{th}$ persistence diagram, $PD_k(\mathcal{G})$. Second, we introduce a pruning algorithm for graphs to compute their persistence diagrams by removing the dominated vertices. Our experiments on large networks show that our novel approach can achieve computational gains of up to 95%. The developed framework provides the first bridge between graph theory and TDA, with applications in machine learning on large complex networks. Our implementation is available at https://github.com/cakcora/PersistentHomologyWithCoralPrunit
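    The first result above translates directly into a preprocessing step; a minimal sketch using networkx (the persistence computation itself would then be run on the reduced graph with a TDA library):

        import networkx as nx

        def coral_reduction(G, k):
            """CoralTDA-style reduction: by the paper's first result, the
            (k+1)-core of G suffices to compute the k-th persistence diagram,
            so vertices outside it can be discarded beforehand."""
            return nx.k_core(G, k + 1)

        G = nx.erdos_renyi_graph(1000, 0.01, seed=0)
        H = coral_reduction(G, k=1)   # feed H, not G, to the TDA pipeline
        print(G.number_of_nodes(), "->", H.number_of_nodes())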
    On Pitfalls of Measuring Occlusion Robustness through Data Distortion. (arXiv:2211.13734v1 [cs.CV])
    Over the past years, the crucial role of data has largely been overshadowed by the field's focus on architectures and training procedures. We often cause changes to the data without being aware of their wider implications. In this paper we show that distorting images without accounting for the artefacts introduced leads to biased results when establishing occlusion robustness. To ensure models behave as expected in real-world scenarios, we need to rule out the impact that added artefacts have on evaluation. We propose a new approach, iOcclusion, as a fairer alternative for applications where the possible occluders are unknown.
    Responsible Active Learning via Human-in-the-loop Peer Study. (arXiv:2211.13587v1 [cs.LG])
    Active learning has been proposed to reduce data annotation efforts by manually labelling only representative data samples for training. Meanwhile, recent active learning applications have benefited greatly from cloud computing services that provide not only sufficient computational resources but also crowdsourcing frameworks that include many humans in the active learning loop. However, previous active learning methods that require passing large-scale unlabelled data to the cloud may potentially raise significant data privacy issues. To mitigate this risk, we propose a responsible active learning method, namely Peer Study Learning (PSL), to simultaneously preserve data privacy and improve model stability. Specifically, we first introduce a human-in-the-loop teacher-student architecture that isolates unlabelled data from the task learner (teacher) on the cloud side by maintaining an active learner (student) on the client side. During training, the task learner instructs the lightweight active learner, which then provides feedback on the active sampling criterion. To further enhance the active learner via large-scale unlabelled data, we introduce multiple peer students into the active learner, trained with a novel learning paradigm that includes In-Class Peer Study on labelled data and Out-of-Class Peer Study on unlabelled data. Lastly, we devise a discrepancy-based active sampling criterion, Peer Study Feedback, that exploits the variability of the peer students to select the most informative data and thereby improve model stability. Extensive experiments demonstrate the superiority of the proposed PSL over a wide range of active learning methods in both standard and sensitive protection settings.
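    A minimal sketch of a discrepancy-based acquisition score in the spirit of Peer Study Feedback, assuming the peer students' predictions are available as an array; the exact criterion in the paper may differ:

        import numpy as np

        def peer_disagreement_scores(peer_probs):
            """Variance of the peers' predicted class probabilities, averaged
            over classes, as a per-sample disagreement score.

            peer_probs: (n_peers, n_samples, n_classes).
            """
            return peer_probs.var(axis=0).mean(axis=-1)

        def select_batch(peer_probs, batch_size):
            scores = peer_disagreement_scores(peer_probs)
            return np.argsort(scores)[-batch_size:]   # most-disputed samples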
    Zeroth-Order Alternating Gradient Descent Ascent Algorithms for a Class of Nonconvex-Nonconcave Minimax Problems. (arXiv:2211.13668v1 [math.OC])
    In this paper, we consider a class of nonconvex-nonconcave minimax problems, i.e., NC-PL minimax problems, whose objective functions satisfy the Polyak-$\L$ojasiewicz (PL) condition with respect to the inner variable. We propose a zeroth-order alternating gradient descent ascent (ZO-AGDA) algorithm and a zeroth-order variance-reduced alternating gradient descent ascent (ZO-VRAGDA) algorithm for solving NC-PL minimax problems under the deterministic and the stochastic setting, respectively. The number of iterations needed by the ZO-AGDA and ZO-VRAGDA algorithms to obtain an $\varepsilon$-stationary point of an NC-PL minimax problem is upper bounded by $\mathcal{O}(\varepsilon^{-2})$ and $\mathcal{O}(\varepsilon^{-3})$, respectively. To the best of our knowledge, they are the first two zeroth-order algorithms with an iteration-complexity guarantee for solving NC-PL minimax problems.
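    The building block of such zeroth-order methods is a finite-difference gradient estimator; a minimal sketch of a two-point estimator and one alternating descent-ascent step (step sizes and the number of random directions are illustrative, not the paper's analyzed choices):

        import numpy as np

        def zo_grad(f, x, mu=1e-4, n_dirs=10, seed=None):
            """Two-point zeroth-order gradient estimator averaged over random
            Gaussian directions, using only function evaluations."""
            rng = np.random.default_rng(seed)
            g = np.zeros_like(x)
            for _ in range(n_dirs):
                u = rng.standard_normal(x.shape)
                g += (f(x + mu * u) - f(x)) / mu * u
            return g / n_dirs

        def agda_step(f, x, y, lr_x, lr_y):
            """One alternating descent (in x) / ascent (in y) step on f(x, y);
            a sketch of the scheme's shape, not the paper's exact algorithm."""
            x = x - lr_x * zo_grad(lambda v: f(v, y), x)
            y = y + lr_y * zo_grad(lambda v: f(x, v), y)
            return x, y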
    Question-type Identification for Academic Questions in Online Learning Platform. (arXiv:2211.13727v1 [cs.CL])
    Online learning platforms provide learning materials and answers to students' academic questions by experts, peers, or systems. This paper explores question-type identification as a step in content understanding for an online learning platform. The aim of the question-type identifier is to categorize question types based on their structure and complexity, using the question text, subject, and structural features. We have defined twelve question-type classes, including Multiple-Choice Question (MCQ), essay, and others. We have compiled an internal dataset of students' questions and used a combination of weak-supervision techniques and manual annotation. We then trained a BERT-based ensemble model on this dataset and evaluated this model on a separate human-labeled test set. Our experiments yielded an F1-score of 0.94 for MCQ binary classification and promising results for 12-class multilabel classification. We deployed the model in our online learning platform as a crucial enabler for content understanding to enhance the student learning experience.
    Sketch-Guided Text-to-Image Diffusion Models. (arXiv:2211.13752v1 [cs.CV])
    Text-to-image models have introduced a remarkable leap in the evolution of machine learning, demonstrating high-quality synthesis of images from a given text prompt. However, these powerful pretrained models still lack control handles that can guide spatial properties of the synthesized images. In this work, we introduce a universal approach to guide a pretrained text-to-image diffusion model with a spatial map from another domain (e.g., sketch) during inference time. Unlike previous works, our method does not require training a dedicated model or a specialized encoder for the task. Our key idea is to train a Latent Guidance Predictor (LGP) - a small, per-pixel, Multi-Layer Perceptron (MLP) that maps latent features of noisy images to spatial maps, where the deep features are extracted from the core Denoising Diffusion Probabilistic Model (DDPM) network. The LGP is trained only on a few thousand images and constitutes a differentiable guiding-map predictor, over which the loss is computed and propagated back to push the intermediate images to agree with the spatial map. The per-pixel training offers flexibility and locality, which allows the technique to perform well on out-of-domain sketches, including free-hand style drawings. We take a particular focus on the sketch-to-image translation task, revealing a robust and expressive way to generate images that follow the guidance of a sketch of arbitrary style or domain. Project page: sketch-guided-diffusion.github.io
    Federated Learning Hyper-Parameter Tuning from a System Perspective. (arXiv:2211.13656v1 [cs.LG])
    Federated learning (FL) is a distributed model training paradigm that preserves clients' data privacy. It has gained tremendous attention from both academia and industry. FL hyper-parameters (e.g., the number of selected clients and the number of training passes) significantly affect the training overhead in terms of computation time, transmission time, computation load, and transmission load. However, the current practice of manually selecting FL hyper-parameters imposes a heavy burden on FL practitioners because applications have different training preferences. In this paper, we propose FedTune, an automatic FL hyper-parameter tuning algorithm tailored to applications' diverse system requirements in FL training. FedTune iteratively adjusts FL hyper-parameters during FL training and can be easily integrated into existing FL systems. Through extensive evaluations of FedTune for diverse applications and FL aggregation algorithms, we show that FedTune is lightweight and effective, achieving 8.48%-26.75% system overhead reduction compared to using fixed FL hyper-parameters. This paper assists FL practitioners in designing high-performance FL training solutions. The source code of FedTune is available at https://github.com/DataSysTech/FedTune.
    Certified data-driven physics-informed greedy auto-encoder simulator. (arXiv:2211.13698v1 [cs.LG])
    A parametric adaptive greedy Latent Space Dynamics Identification (gLaSDI) framework is developed for accurate, efficient, and certified data-driven physics-informed greedy auto-encoder simulators of high-dimensional nonlinear dynamical systems. In the proposed framework, an auto-encoder and dynamics identification models are trained interactively to discover intrinsic and simple latent-space dynamics. To effectively explore the parameter space for optimal model performance, an adaptive greedy sampling algorithm integrated with a physics-informed error indicator is introduced to search for optimal training samples on the fly, outperforming the conventional predefined uniform sampling. Further, an efficient k-nearest neighbor convex interpolation scheme is employed to exploit local latent-space dynamics for improved predictability. Numerical results demonstrate that the proposed method achieves 121 to 2,658x speed-up with 1 to 5% relative errors for radial advection and 2D Burgers dynamical problems.
    To be or not to be stable, that is the question: understanding neural networks for inverse problems. (arXiv:2211.13692v1 [math.NA])
    The solution of linear inverse problems arising, for example, in signal and image processing is challenging, since ill-conditioning amplifies the noise in the data. Recently introduced deep-learning-based algorithms outperform the more traditional model-based approaches, but they typically suffer from instability with respect to data perturbations. In this paper, we theoretically analyse the trade-off between stability and accuracy of neural networks in the solution of linear inverse problems. Moreover, we propose different supervised and unsupervised solutions that increase network stability while maintaining good accuracy, by inheriting regularization from a model-based iterative scheme during network training. Extensive numerical experiments on image deblurring confirm the theoretical results and the effectiveness of the proposed networks in solving inverse problems with stability with respect to noise.
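    A classical example of such a model-based iterative scheme is the Landweber iteration, whose early-stopped iterates are regularized reconstructions; a minimal sketch for Gaussian deblurring (blur width, step size and iteration count are illustrative assumptions, not the paper's setup):

        import numpy as np
        from scipy.ndimage import gaussian_filter

        def landweber(y, tau=0.5, n_iters=50, sigma=2.0):
            """Landweber iteration x <- x + tau * A^T (y - A x) for deblurring,
            with A a Gaussian blur (self-adjoint, so A^T = A). Early stopping
            of this scheme acts as regularization, which is the behaviour the
            proposed networks inherit during training."""
            A = lambda x: gaussian_filter(x, sigma)
            x = np.zeros_like(y)
            for _ in range(n_iters):
                x = x + tau * A(y - A(x))   # tau < 2 ensures convergence here
            return x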
    Multitask Learning for Low Resource Spoken Language Understanding. (arXiv:2211.13703v1 [cs.CL])
    We explore the benefits that multitask learning offers to speech processing by training models on dual objectives combining automatic speech recognition with intent classification or sentiment classification. Our models, although of modest size, show improvements over models trained end-to-end on intent classification. We compare different settings to find the optimal disposition of each task module relative to the others. Finally, we study the performance of the models in low-resource scenarios by training the models with as few as one example per class. We show that multitask learning in these scenarios competes with a baseline model trained on text features and performs considerably better than a pipeline model. On sentiment classification, we match the performance of an end-to-end model with ten times as many parameters. We consider 4 tasks and 4 datasets in Dutch and English.
    End-to-end Wind Turbine Wake Modelling with Deep Graph Representation Learning. (arXiv:2211.13649v1 [cs.LG])
    Wind turbine wake modelling is of crucial importance to accurate resource assessment, to layout optimisation, and to the operational control of wind farms. This work proposes a surrogate model for the representation of wind turbine wakes based on a state-of-the-art graph representation learning method termed a graph neural network. The proposed end-to-end deep learning model operates directly on unstructured meshes and has been validated against high-fidelity data, demonstrating its ability to rapidly make accurate 3D flow field predictions for various inlet conditions and turbine yaw angles. The specific graph neural network model employed here is shown to generalise well to unseen data and is less sensitive to over-smoothing compared to common graph neural networks. A case study based upon a real world wind farm further demonstrates the capability of the proposed approach to predict farm scale power generation. Moreover, the proposed graph neural network framework is flexible and highly generic and as formulated here can be applied to any steady state computational fluid dynamics simulations on unstructured meshes.
    Immersive Neural Graphics Primitives. (arXiv:2211.13494v1 [cs.CV])
    Neural radiance field (NeRF), in particular its extension by instant neural graphics primitives, is a novel rendering method for view synthesis that uses real-world images to build photo-realistic immersive virtual scenes. Despite its potential, research on the combination of NeRF and virtual reality (VR) remains sparse. Currently, there is no integration into typical VR systems available, and the performance and suitability of NeRF implementations for VR have not been evaluated, for instance, for different scene complexities or screen resolutions. In this paper, we present and evaluate a NeRF-based framework that is capable of rendering scenes in immersive VR allowing users to freely move their heads to explore complex real-world scenes. We evaluate our framework by benchmarking three different NeRF scenes concerning their rendering performance at different scene complexities and resolutions. Utilizing super-resolution, our approach can yield a frame rate of 30 frames per second with a resolution of 1280x720 pixels per eye. We discuss potential applications of our framework and provide an open source implementation online.
    How to predict and optimise with asymmetric error metrics. (arXiv:2211.13586v1 [cs.LG])
    In this paper, we examine the predict-and-optimise problem, with specific reference to the third Technical Challenge of the IEEE Computational Intelligence Society. In this competition, entrants were asked to forecast building energy use and solar generation at six buildings and six solar installations, and then use their forecasts to optimise energy cost while scheduling classes and batteries over a month. We examine the possible effects of under-forecasting, over-forecasting, and asymmetric errors on the optimisation cost. We explore the different nature of the loss functions for the prediction and optimisation phases and propose adjusting the final forecasts for a better optimisation cost. We report that, while there is a positive correlation between the two, more appropriate loss functions can be used to optimise the costs associated with the final decisions.
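    One simple, hedged way to realize such an asymmetric adjustment is to train or post-process forecasts with a quantile (pinball) loss; a NumPy sketch, where the choice tau = 0.7 is purely illustrative:

        import numpy as np

        def pinball_loss(y_true, y_pred, tau):
            """Quantile (pinball) loss: tau > 0.5 penalizes under-forecasting
            more heavily than over-forecasting, biasing forecasts upward when
            the downstream optimisation cost is asymmetric."""
            diff = y_true - y_pred
            return np.mean(np.maximum(tau * diff, (tau - 1) * diff))

        # Fitting against tau = 0.7 instead of the symmetric tau = 0.5 shifts
        # forecasts toward over-prediction, one illustrative way to "adjust
        # the final forecasts" for asymmetric decision costs.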
    A Privacy-Preserving Outsourced Data Model in Cloud Environment. (arXiv:2211.13542v1 [cs.CR])
    Nowadays, more and more machine learning services, such as medical diagnosis, online fraud detection, and email spam filtering, are provided via cloud computing. The cloud service provider collects data from various owners to train or run machine learning systems in the cloud environment. However, multiple data owners may not fully trust a cloud platform operated by a third party. Data security and privacy problems are therefore among the critical hindrances to using machine learning tools, particularly with multiple data owners. In addition, unauthorized entities can observe the statistical input data and infer the machine learning model parameters. We therefore propose a privacy-preserving model that protects the privacy of the data without compromising machine learning efficiency. To protect the data of the owners, epsilon-differential privacy is used, and fog nodes are employed to address the bandwidth and latency limitations of the scheme. Noise produced by the epsilon-differentially-private mechanism is injected into the data at the data owner's site, protecting the owner's data before it leaves. Fog nodes collect the noise-added data from the data owners and then transfer it to the cloud platform for storage, computation, and classification tasks.
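    A minimal sketch of the owner-side noise injection using the standard epsilon-DP Laplace mechanism (the paper's exact mechanism and sensitivity analysis may differ; the sensitivity and epsilon values below are illustrative):

        import numpy as np

        def laplace_mechanism(value, sensitivity, epsilon, seed=None):
            """Standard epsilon-DP Laplace mechanism: add Laplace noise with
            scale sensitivity/epsilon, applied at the data-owner site before
            the record is handed to a fog node."""
            rng = np.random.default_rng(seed)
            scale = sensitivity / epsilon
            return value + rng.laplace(0.0, scale, size=np.shape(value))

        # Owner-side: perturb a record with L1 sensitivity 1 at epsilon = 0.5.
        noisy = laplace_mechanism(np.array([3.2, 7.1]), sensitivity=1.0, epsilon=0.5)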
    ACROBAT -- a multi-stain breast cancer histological whole-slide-image data set from routine diagnostics for computational pathology. (arXiv:2211.13621v1 [eess.IV])
    The analysis of FFPE tissue sections stained with haematoxylin and eosin (H&E) or immunohistochemistry (IHC) is an essential part of the pathologic assessment of surgically resected breast cancer specimens. IHC staining has been broadly adopted into diagnostic guidelines and routine workflows to manually assess status and scoring of several established biomarkers, including ER, PGR, HER2 and KI67. However, this is a task that can also be facilitated by computational pathology image analysis methods. The research in computational pathology has recently made numerous substantial advances, often based on publicly available whole slide image (WSI) data sets. However, the field is still considerably limited by the sparsity of public data sets. In particular, there are no large, high quality publicly available data sets with WSIs of matching IHC and H&E-stained tissue sections. Here, we publish the currently largest publicly available data set of WSIs of tissue sections from surgical resection specimens from female primary breast cancer patients with matched WSIs of corresponding H&E and IHC-stained tissue, consisting of 4,212 WSIs from 1,153 patients. The primary purpose of the data set was to facilitate the ACROBAT WSI registration challenge, aiming at accurately aligning H&E and IHC images. For research in the area of image registration, automatic quantitative feedback on registration algorithm performance remains available through the ACROBAT challenge website, based on more than 37,000 manually annotated landmark pairs from 13 annotators. Beyond registration, this data set has the potential to enable many different avenues of computational pathology research, including stain-guided learning, virtual staining, unsupervised pre-training, artefact detection and stain-independent models.
    Learning with Partial Labels from Semi-supervised Perspective. (arXiv:2211.13655v1 [cs.LG])
    Partial Label (PL) learning refers to the task of learning from partially labeled data, where each training instance is ambiguously equipped with a set of candidate labels but only one is valid. Advances in the recent deep PL learning literature have shown that deep learning paradigms, e.g., self-training, contrastive learning, or class activation values, can achieve promising performance. Inspired by the impressive success of deep Semi-Supervised (SS) learning, we transform the PL learning problem into an SS learning problem and propose a novel PL learning method, namely Partial Label learning with Semi-supervised Perspective (PLSP). Specifically, we first form the pseudo-labeled dataset by selecting a small number of reliable pseudo-labeled instances with high-confidence prediction scores and treating the remaining instances as pseudo-unlabeled ones. Then we design an SS learning objective, consisting of a supervised loss for pseudo-labeled instances and a semantic consistency regularization for pseudo-unlabeled instances. We further introduce a complementary regularization for those non-candidate labels to constrain the model predictions on them to be as small as possible. Empirical results demonstrate that PLSP significantly outperforms the existing PL baseline methods, especially on high ambiguity levels. Code available: https://github.com/changchunli/PLSP.
    Using Focal Loss to Fight Shallow Heuristics: An Empirical Analysis of Modulated Cross-Entropy in Natural Language Inference. (arXiv:2211.13331v1 [cs.CL])
    There is no such thing as a perfect dataset. In some datasets, deep neural networks discover underlying heuristics that allow them to take shortcuts in the learning process, resulting in poor generalization capability. Instead of using standard cross-entropy, we explore whether a modulated version of cross-entropy called focal loss can constrain the model so as not to use heuristics and improve generalization performance. Our experiments in natural language inference show that focal loss has a regularizing impact on the learning process, increasing accuracy on out-of-distribution data, but slightly decreasing performance on in-distribution data. Despite the improved out-of-distribution performance, we demonstrate the shortcomings of focal loss and its inferiority to methods such as unbiased focal loss and self-debiasing ensembles.
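    For reference, a minimal NumPy implementation of the focal loss under discussion:

        import numpy as np

        def focal_loss(probs, labels, gamma=2.0, eps=1e-12):
            """Focal loss: cross-entropy modulated by (1 - p_t)^gamma so that
            easy, confidently-classified examples (often exactly those solvable
            by shallow heuristics) contribute little to the gradient.

            probs: (n, c) predicted class probabilities; labels: (n,) int ids.
            """
            p_t = probs[np.arange(len(labels)), labels]
            return np.mean(-((1.0 - p_t) ** gamma) * np.log(p_t + eps))

        # gamma = 0 recovers plain cross-entropy; larger gamma down-weights
        # easy examples more aggressively.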
    An Algebraically Converging Stochastic Gradient Descent Algorithm for Global Optimization. (arXiv:2204.05923v2 [math.OC] UPDATED)
    We propose a new gradient descent algorithm with added stochastic terms, referred to here as ``AdaVar'', for finding the global optimizers of nonconvex optimization problems. A key component of the algorithm is the adaptive tuning of the randomness based on the value of the objective function; in the language of simulated annealing, the temperature is state-dependent. With this, we prove global convergence of the algorithm with an algebraic rate, both in probability and in the parameter space, improving on the classical rate obtained with simpler control of the noise term. The convergence proof is based on the actual discrete setup of the algorithm. We also present several numerical examples demonstrating the efficiency and robustness of the algorithm on reasonably complex objective functions.
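    A hedged sketch of the state-dependent-noise idea (the noise schedule sigma_of_f below is an illustrative assumption, not the paper's analyzed schedule):

        import numpy as np

        def noisy_gd_step(f, grad_f, x, lr, sigma_of_f, rng):
            """One gradient step with additive Gaussian noise whose scale
            depends on the current objective value, the state-dependent
            "temperature" idea: more exploration where the objective is high,
            less near good minima."""
            noise = rng.standard_normal(x.shape)
            return x - lr * grad_f(x) + sigma_of_f(f(x)) * np.sqrt(lr) * noise

        # Example schedule (an assumption of this sketch): explore more while
        # the objective value is large.
        sigma_of_f = lambda fx: 0.5 * np.tanh(max(fx, 0.0))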
    Learning Heterogeneous Interaction Strengths by Trajectory Prediction with Graph Neural Network. (arXiv:2208.13179v2 [cs.LG] UPDATED)
    Dynamical systems with interacting agents are universal in nature, commonly modeled by a graph of relationships between their constituents. Recently, various works have been presented to tackle the problem of inferring those relationships from the system trajectories via deep neural networks, but most of the studies assume binary or discrete types of interactions for simplicity. In the real world, the interaction kernels often involve continuous interaction strengths, which cannot be accurately approximated by discrete relations. In this work, we propose the relational attentive inference network (RAIN) to infer continuously weighted interaction graphs without any ground-truth interaction strengths. Our model employs a novel pairwise attention (PA) mechanism to refine the trajectory representations and a graph transformer to extract heterogeneous interaction weights for each pair of agents. We show that our RAIN model with the PA mechanism accurately infers continuous interaction strengths for simulated physical systems in an unsupervised manner. Further, RAIN with PA successfully predicts trajectories from motion capture data with an interpretable interaction graph, demonstrating the virtue of modeling unknown dynamics with continuous weights.
    Prosody-controllable spontaneous TTS with neural HMMs. (arXiv:2211.13533v1 [eess.AS])
    Spontaneous speech has many affective and pragmatic functions that are interesting and challenging to model in TTS (text-to-speech). However, the presence of reduced articulation, fillers, repetitions, and other disfluencies means that text and acoustics are less well aligned than in read speech. This is problematic for attention-based TTS. We propose a TTS architecture that is particularly suited to rapidly learning to speak from irregular and small datasets while also reproducing the diversity of expressive phenomena present in spontaneous speech. Specifically, we modify an existing neural HMM-based TTS system, which is capable of stable, monotonic alignments for spontaneous speech, and add utterance-level prosody control, so that the system can represent the wide range of natural variability in a spontaneous speech corpus. We objectively evaluate control accuracy and perform a subjective listening test comparing to a system without prosody control. To exemplify the power of combining mid-level prosody control and ecologically valid data for reproducing intricate spontaneous speech phenomena, we evaluate the system's capability of synthesizing two types of creaky phonation. Audio samples are available at https://hfkml.github.io/pc_nhmm_tts/
    Group SELFIES: A Robust Fragment-Based Molecular String Representation. (arXiv:2211.13322v1 [cs.LG])
    We introduce Group SELFIES, a molecular string representation that leverages group tokens to represent functional groups or entire substructures while maintaining chemical robustness guarantees. Molecular string representations, such as SMILES and SELFIES, serve as the basis for molecular generation and optimization in chemical language models, deep generative models, and evolutionary methods. While SMILES and SELFIES leverage atomic representations, Group SELFIES builds on top of the chemical robustness guarantees of SELFIES by enabling group tokens, thereby adding flexibility to the representation. Moreover, the group tokens in Group SELFIES can take advantage of inductive biases of molecular fragments that capture meaningful chemical motifs. The advantages of capturing chemical motifs and flexibility are demonstrated in our experiments, which show that Group SELFIES improves distribution learning of common molecular datasets. Further experiments also show that random sampling of Group SELFIES strings improves the quality of generated molecules compared to regular SELFIES strings. Our open-source implementation of Group SELFIES is available online, which we hope will aid future research in molecular generation and optimization.
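    Group SELFIES has its own open-source implementation; as a minimal illustration of the underlying robust string-representation idea, the base selfies package can be used as follows (assuming the selfies package is installed):

        import selfies as sf

        # Round-trip a molecule through the robust string representation:
        # every syntactically valid SELFIES string decodes to a valid molecule,
        # the guarantee that Group SELFIES extends with group tokens.
        selfies_str = sf.encoder("CC(=O)OC1=CC=CC=C1C(=O)O")  # aspirin SMILES
        tokens = list(sf.split_selfies(selfies_str))
        smiles_back = sf.decoder(selfies_str)
        print(len(tokens), smiles_back)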
    1st Workshop on Maritime Computer Vision (MaCVi) 2023: Challenge Results. (arXiv:2211.13508v1 [cs.CV])
    The 1$^{\text{st}}$ Workshop on Maritime Computer Vision (MaCVi) 2023 focused on maritime computer vision for Unmanned Aerial Vehicles (UAV) and Unmanned Surface Vehicles (USV), and organized several subchallenges in this domain: (i) UAV-based Maritime Object Detection, (ii) UAV-based Maritime Object Tracking, (iii) USV-based Maritime Obstacle Segmentation and (iv) USV-based Maritime Obstacle Detection. The subchallenges were based on the SeaDronesSee and MODS benchmarks. This report summarizes the main findings of the individual subchallenges and introduces a new benchmark, called SeaDronesSee Object Detection v2, which extends the previous benchmark by including more classes and footage. We provide statistical and qualitative analyses, and assess trends in the best-performing methodologies of over 130 submissions. The methods are summarized in the appendix. The datasets, evaluation code and the final competition leaderboard are publicly available at https://seadronessee.cs.uni-tuebingen.de/macvi.
    Learning Compact Features via In-Training Representation Alignment. (arXiv:2211.13332v1 [cs.LG])
    Deep neural networks (DNNs) for supervised learning can be viewed as a pipeline consisting of a feature extractor (i.e., the last hidden layer) and a linear classifier (i.e., the output layer) that are trained jointly with stochastic gradient descent (SGD) on a loss function (e.g., cross-entropy). In each iteration, the true gradient of the loss function is estimated using a mini-batch sampled from the training set, and the model parameters are then updated with the mini-batch gradient. Although the latter provides an unbiased estimate of the former, it is subject to substantial variance derived from the size and number of sampled mini-batches, leading to noisy and jumpy updates. To stabilize such undesirable variance in estimating the true gradients, we propose In-Training Representation Alignment (ITRA), which explicitly aligns the feature distributions of two different mini-batches with a matching loss during SGD training. We also provide a rigorous analysis of the desirable effects of the matching loss on feature representation learning: (1) extracting compact feature representations; (2) reducing over-adaptation to mini-batches via an adaptive weighting mechanism; and (3) accommodating multi-modal distributions. Finally, we conduct large-scale experiments on both image and text classification to demonstrate its superior performance over strong baselines.
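    A common choice of matching loss for aligning two mini-batch feature distributions is the kernel MMD; a minimal NumPy sketch, noting that ITRA's exact matching loss may differ:

        import numpy as np

        def rbf_mmd2(a, b, bandwidth=1.0):
            """Squared RBF-kernel MMD between two mini-batches of features,
            one illustrative way to penalize mismatch between the feature
            distributions of two mini-batches during training.

            a, b: (n, d) and (m, d) feature matrices.
            """
            def k(x, y):
                d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
                return np.exp(-d2 / (2 * bandwidth ** 2))
            return k(a, a).mean() + k(b, b).mean() - 2 * k(a, b).mean()

        # Illustrative combined objective for one training step:
        # total_loss = cross_entropy + lam * rbf_mmd2(feats_batch1, feats_batch2)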
    A Moment-Matching Approach to Testable Learning and a New Characterization of Rademacher Complexity. (arXiv:2211.13312v1 [cs.LG])
    A remarkable recent paper by Rubinfeld and Vasilyan (2022) initiated the study of \emph{testable learning}, where the goal is to replace hard-to-verify distributional assumptions (such as Gaussianity) with efficiently testable ones and to require that the learner succeed whenever the unknown distribution passes the corresponding test. In this model, they gave an efficient algorithm for learning halfspaces under testable assumptions that are provably satisfied by Gaussians. In this paper we give a powerful new approach for developing algorithms for testable learning using tools from moment matching and metric distances in probability. We obtain efficient testable learners for any concept class that admits low-degree \emph{sandwiching polynomials}, capturing most important examples for which we have ordinary agnostic learners. We recover the results of Rubinfeld and Vasilyan as a corollary of our techniques while achieving improved, near-optimal sample complexity bounds for a broad range of concept classes and distributions. Surprisingly, we show that the information-theoretic sample complexity of testable learning is tightly characterized by the Rademacher complexity of the concept class, one of the most well-studied measures in statistical learning theory. In particular, uniform convergence is necessary and sufficient for testable learning. This leads to a fundamental separation from (ordinary) distribution-specific agnostic learning, where uniform convergence is sufficient but not necessary.
    SciRepEval: A Multi-Format Benchmark for Scientific Document Representations. (arXiv:2211.13308v1 [cs.CL])
    Learned representations of scientific documents can serve as valuable input features for downstream tasks, without the need for further fine-tuning. However, existing benchmarks for evaluating these representations fail to capture the diversity of relevant tasks. In response, we introduce SciRepEval, the first comprehensive benchmark for training and evaluating scientific document representations. It includes 25 challenging and realistic tasks, 11 of which are new, across four formats: classification, regression, ranking and search. We then use the benchmark to study and improve the generalization ability of scientific document representation models. We show how state-of-the-art models struggle to generalize across task formats, and that simple multi-task training fails to improve them. However, a new approach that learns multiple embeddings per document, each tailored to a different format, can improve performance. We experiment with task-format-specific control codes and adapters in a multi-task setting and find that they outperform the existing single-embedding state-of-the-art by up to 1.5 points absolute.
    Data-Driven Offline Decision-Making via Invariant Representation Learning. (arXiv:2211.11349v2 [cs.LG] UPDATED)
    The goal in offline data-driven decision-making is to synthesize decisions that optimize a black-box utility function, using a previously-collected static dataset, with no active interaction. These problems appear in many forms: offline reinforcement learning (RL), where we must produce actions that optimize the long-term reward; bandits from logged data, where the goal is to determine the correct arm; and offline model-based optimization (MBO) problems, where we must find the optimal design given access to only a static dataset. A key challenge in all these settings is distributional shift: when we optimize with respect to the input into a model trained from offline data, it is easy to produce an out-of-distribution (OOD) input that appears erroneously good. In contrast to prior approaches that utilize pessimism or conservatism to tackle this problem, in this paper we formulate offline data-driven decision-making as domain adaptation, where the goal is to make accurate predictions for the value of optimized decisions ("target domain") when training only on the dataset ("source domain"). This perspective leads to invariant objective models (IOM), our approach for addressing distributional shift by enforcing invariance between the learned representations of the training dataset and of optimized decisions. In IOM, if the optimized decisions are too different from the training dataset, the representation is forced to lose much of the information that distinguishes good designs from bad ones, making all choices seem mediocre. Critically, when the optimizer is aware of this representational tradeoff, it should choose not to stray too far from the training distribution, leading to a natural trade-off between distributional shift and learning performance.
    BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning. (arXiv:2206.08657v2 [cs.CV] UPDATED)
    Vision-Language (VL) models with the Two-Tower architecture have dominated visual-language representation learning in recent years. Current VL models either use lightweight uni-modal encoders and learn to extract, align and fuse both modalities simultaneously in a deep cross-modal encoder, or feed the last-layer uni-modal representations from the deep pre-trained uni-modal encoders into the top cross-modal encoder. Both approaches potentially restrict vision-language representation learning and limit model performance. In this paper, we propose Bridge-Tower, which introduces multiple bridge layers that build a connection between the top layers of uni-modal encoders and each layer of the cross-modal encoder. This enables effective bottom-up cross-modal alignment and fusion between visual and textual representations of different semantic levels of pre-trained uni-modal encoders in the cross-modal encoder. Pre-trained with only 4M images, Bridge-Tower achieves state-of-the-art performance on various downstream vision-language tasks. In particular, on the VQAv2 test-std set, Bridge-Tower achieves an accuracy of 78.73%, outperforming the previous state-of-the-art model METER by 1.09% with the same pre-training data and almost negligible additional parameters and computational costs. Notably, when further scaling the model, Bridge-Tower achieves an accuracy of 81.15%, surpassing models that are pre-trained on orders-of-magnitude larger datasets. Code and checkpoints are available at \url{https://github.com/microsoft/BridgeTower}.
    Choreographer: Learning and Adapting Skills in Imagination. (arXiv:2211.13350v1 [cs.AI])
    Unsupervised skill learning aims to learn a rich repertoire of behaviors without external supervision, providing artificial agents with the ability to control and influence the environment. However, without appropriate knowledge and exploration, skills may provide control only over a restricted area of the environment, limiting their applicability. Furthermore, it is unclear how to leverage the learned skill behaviors for adapting to downstream tasks in a data-efficient manner. We present Choreographer, a model-based agent that exploits its world model to learn and adapt skills in imagination. Our method decouples the exploration and skill learning processes, being able to discover skills in the latent state space of the model. During adaptation, the agent uses a meta-controller to evaluate and adapt the learned skills efficiently by deploying them in parallel in imagination. Choreographer is able to learn skills both from offline data and by collecting data simultaneously with an exploration policy. The skills can be used to effectively adapt to downstream tasks, as we show in the URL benchmark, where we outperform previous approaches from both pixel and state inputs. The learned skills also explore the environment thoroughly, finding sparse rewards more frequently, as shown in goal-reaching tasks from the DMC Suite and Meta-World. Project website: https://skillchoreographer.github.io/
    Randomized K-FACs: Speeding up K-FAC with Randomized Numerical Linear Algebra. (arXiv:2206.15397v3 [cs.LG] UPDATED)
    K-FAC is a successful tractable implementation of Natural Gradient for Deep Learning, which nevertheless suffers from the requirement to compute the inverse of the Kronecker factors (through an eigen-decomposition). This can be very time-consuming (or even prohibitive) when these factors are large. In this paper, we theoretically show that, owing to the exponential-average construction paradigm of the Kronecker factors that is typically used, their eigen-spectrum must decay. We show numerically that in practice this decay is very rapid, leading to the idea that we could save substantial computation by only focusing on the first few eigen-modes when inverting the Kronecker factors. Importantly, the spectrum decay happens over a constant number of modes irrespective of the layer width. This allows us to reduce the time complexity of K-FAC from cubic to quadratic in layer width, partially closing the gap w.r.t. SENG (another practical Natural Gradient implementation for Deep Learning which scales linearly in width). Randomized Numerical Linear Algebra provides us with the necessary tools to do so. Numerical results show we obtain $\approx2.5\times$ reduction in per-epoch time and $\approx3.3\times$ reduction in time to target accuracy. We compare our proposed sped-up K-FAC versions with SENG, and observe that for CIFAR10 classification with VGG16_bn we perform on par with it.
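    The core computational trick can be sketched as follows: invert a damped Kronecker factor using only its top eigen-modes, obtained via randomized SVD. The damping scheme and rank choice here are illustrative assumptions:

```python
# Sketch: low-rank damped inverse of a symmetric PSD Kronecker factor.
import numpy as np
from sklearn.utils.extmath import randomized_svd

def lowrank_damped_inverse(A, rank=32, damping=1e-3):
    # For symmetric PSD A, the randomized SVD yields its top eigen-pairs.
    U, s, _ = randomized_svd(A, n_components=rank, random_state=0)
    # Invert exactly on the captured subspace; fall back to 1/damping elsewhere.
    inv_top = U @ np.diag(1.0 / (s + damping)) @ U.T
    residual = (np.eye(A.shape[0]) - U @ U.T) / damping
    return inv_top + residual  # approximates inv(A + damping * I)
```

    Since the spectrum decays over a constant number of modes, `rank` need not grow with layer width, which is where the cubic-to-quadratic saving comes from.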
    Optimal Weak to Strong Learning. (arXiv:2206.01563v4 [cs.LG] UPDATED)
    The classic AdaBoost algorithm converts a weak learner, that is, an algorithm producing a hypothesis slightly better than chance, into a strong learner achieving arbitrarily high accuracy when given enough training data. We present a new algorithm that constructs a strong learner from a weak learner but uses less training data than AdaBoost and all other weak-to-strong learners to achieve the same generalization bounds. A sample complexity lower bound shows that our new algorithm uses the minimum possible amount of training data and is thus optimal. Hence, this work settles the sample complexity of the classic problem of constructing a strong learner from a weak learner.
    Enhancing Targeted Attack Transferability via Diversified Weight Pruning. (arXiv:2208.08677v2 [cs.CV] UPDATED)
    Malicious attackers can generate targeted adversarial examples by imposing tiny noises, forcing neural networks to produce specific incorrect outputs. With cross-model transferability, network models remain vulnerable even in black-box settings. Recent studies have shown the effectiveness of ensemble-based methods in generating transferable adversarial examples. To further enhance transferability, model augmentation methods aim to produce more networks participating in the ensemble. However, existing model augmentation methods are only proven effective in untargeted attacks. In this work, we propose Diversified Weight Pruning (DWP), a novel model augmentation technique for generating transferable targeted attacks. DWP leverages the weight pruning method commonly used in model compression. Compared with prior work, DWP protects necessary connections and ensures the diversity of the pruned models simultaneously, which we show are crucial for targeted transferability. Experiments on the ImageNet-compatible dataset under various, more challenging scenarios confirm the effectiveness: transferring to adversarially trained models, non-CNN architectures, and Google Cloud Vision. The results show that our proposed DWP improves the targeted attack success rates by up to $10.1$%, $6.6$%, and $7.0$% on the combination of state-of-the-art methods, respectively. The source code will be made available after acceptance.
    Probabilistic Rank and Reward: A Scalable Model for Slate Recommendation. (arXiv:2208.06263v2 [cs.IR] UPDATED)
    We introduce Probabilistic Rank and Reward (PRR), a scalable probabilistic model for personalized slate recommendation. Our approach allows state-of-the-art estimation of the user interests in the ubiquitous scenario where the user interacts with at most one item from a slate of K items. We show that the probability of a slate being successful can be learned efficiently by combining the reward, whether the user successfully interacted with the slate, and the rank, the item that was selected within the slate. PRR outperforms competing approaches that use one signal or the other and is far more scalable to large action spaces. Moreover, PRR allows fast delivery of recommendations powered by maximum inner product search (MIPS), making it suitable in low latency domains such as computational advertising.
    Gradient Estimation with Discrete Stein Operators. (arXiv:2202.09497v4 [stat.ML] UPDATED)
    Gradient estimation -- approximating the gradient of an expectation with respect to the parameters of a distribution -- is central to the solution of many machine learning problems. However, when the distribution is discrete, most common gradient estimators suffer from excessive variance. To improve the quality of gradient estimation, we introduce a variance reduction technique based on Stein operators for discrete distributions. We then use this technique to build flexible control variates for the REINFORCE leave-one-out estimator. Our control variates can be adapted online to minimize variance and do not require extra evaluations of the target function. In benchmark generative modeling tasks such as training binary variational autoencoders, our gradient estimator achieves substantially lower variance than state-of-the-art estimators with the same number of function evaluations.
    Continual Learning for Peer-to-Peer Federated Learning: A Study on Automated Brain Metastasis Identification. (arXiv:2204.13591v3 [cs.LG] UPDATED)
    Due to data privacy constraints, data sharing among multiple centers is restricted. Continual learning, as one approach to peer-to-peer federated learning, can promote multicenter collaboration on deep learning algorithm development by sharing intermediate models instead of training data. This work aims to investigate the feasibility of continual learning for multicenter collaboration on an exemplary application of brain metastasis identification using DeepMedic. 920 contrast-enhanced T1 MRI volumes are split to simulate multicenter collaboration scenarios. A continual learning algorithm, synaptic intelligence (SI), is applied to preserve important model weights for training one center after another. In a bilateral collaboration scenario, continual learning with SI achieves a sensitivity of 0.917, and naive continual learning without SI achieves a sensitivity of 0.906, while two models trained solely on internal data without continual learning achieve sensitivities of only 0.853 and 0.831. In a seven-center multilateral collaboration scenario, the models trained on internal datasets (100 volumes each center) without continual learning obtain a mean sensitivity value of 0.699. With single-visit continual learning (i.e., the shared model visits each center only once during training), the sensitivity is improved to 0.788 and 0.849 without SI and with SI, respectively. With iterative continual learning (i.e., the shared model revisits each center multiple times during training), the sensitivity is further improved to 0.914, which is identical to the sensitivity using mixed data for training. Our experiments demonstrate that continual learning can improve brain metastasis identification performance for centers with limited data. This study demonstrates the feasibility of applying continual learning for peer-to-peer federated learning in multicenter collaboration.
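    For reference, SI adds a quadratic penalty that anchors parameters deemed important by previously visited centers; a minimal sketch (simplified from Zenke et al.'s formulation, with illustrative variable names, and omitting the online accumulation of the importance weights) follows:

```python
# Hypothetical sketch of the synaptic intelligence (SI) penalty.
import torch

def si_penalty(model, omega, theta_star, c=0.1):
    """omega: per-parameter importance weights accumulated at earlier centers;
    theta_star: parameter values after the previous center's training.
    Both are dicts keyed by parameter name."""
    loss = 0.0
    for name, p in model.named_parameters():
        loss = loss + (omega[name] * (p - theta_star[name]).pow(2)).sum()
    return c * loss

# total_loss = task_loss + si_penalty(model, omega, theta_star)
```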
    Go Beyond Point Pairs: A General and Accurate Sim2Real Object Pose Voting Method with Efficient Online Synthetic Training. (arXiv:2211.13398v1 [cs.CV])
    Object pose estimation is an important topic in 3D vision. Though most current state-of-the-art methods that train on real-world pose annotations achieve good results, the cost of such real-world training data is too high. In this paper, we propose a novel method for sim-to-real pose estimation, which is effective in both instance-level and category-level settings. The proposed method is based on the point-pair voting scheme from CPPF to vote for object centers, orientations, and scales. Unlike naive point pairs, to enrich the context provided by each voting unit, we introduce N-point tuples to fuse features from more than two points. Besides, a novel vote selection module is leveraged to discard `bad' votes. Experiments show that our proposed method greatly advances the performance in both instance-level and category-level scenarios. Our method further narrows the gap between sim-to-real and real-training methods by generating synthetic training data online efficiently, while all previous sim-to-real methods need to generate data offline because of their complex background synthesizing or photo-realistic rendering. Code repository: https://github.com/qq456cvb/BeyondPPF.
    Estimating Regression Predictive Distributions with Sample Networks. (arXiv:2211.13724v1 [cs.LG])
    Estimating the uncertainty in deep neural network predictions is crucial for many real-world applications. A common approach to model uncertainty is to choose a parametric distribution and fit the data to it using maximum likelihood estimation. The chosen parametric form can be a poor fit to the data-generating distribution, resulting in unreliable uncertainty estimates. In this work, we propose SampleNet, a flexible and scalable architecture for modeling uncertainty that avoids specifying a parametric form for the output distribution. SampleNets do so by defining an empirical distribution using samples that are learned with the Energy Score and regularized with the Sinkhorn Divergence. SampleNets are shown to fit a wide range of distributions well and to outperform baselines on large-scale real-world regression tasks.
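    The Energy Score for a sample-based predictive distribution has a simple Monte Carlo form; a sketch is below (the Sinkhorn regularizer is omitted, and the single-target interface is an assumption):

```python
# Sketch: Energy Score loss for m predicted samples against one target.
import torch

def energy_score(samples, target):
    """samples: (m, d) predicted samples for one input; target: (d,). Lower is better."""
    m = samples.shape[0]
    term1 = torch.norm(samples - target, dim=1).mean()          # E ||X - y||
    term2 = torch.cdist(samples, samples).sum() / (2 * m * m)   # 0.5 * E ||X - X'||
    return term1 - term2
```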
    Estimation of a Causal Directed Acyclic Graph Process using Non-Gaussianity. (arXiv:2211.13800v1 [cs.LG])
    Numerous approaches have been proposed to discover causal dependencies in machine learning and data mining; among them, the state-of-the-art VAR-LiNGAM (short for Vector Auto-Regressive Linear Non-Gaussian Acyclic Model) is a desirable approach to reveal both the instantaneous and time-lagged relationships. However, all the obtained VAR matrices need to be analyzed to infer the final causal graph, leading to a rise in the number of parameters. To address this issue, we propose the CGP-LiNGAM (short for Causal Graph Process-LiNGAM), which has significantly fewer model parameters and deals with only one causal graph for interpreting the causal relations by exploiting Graph Signal Processing (GSP).
    Extreme Acceleration of Graph Neural Network-based Prediction Models for Quantum Chemistry. (arXiv:2211.13853v1 [cs.LG])
    Molecular property calculations are the bedrock of chemical physics. High-fidelity \textit{ab initio} modeling techniques for computing the molecular properties can be prohibitively expensive, and motivate the development of machine-learning models that make the same predictions more efficiently. Training graph neural networks over large molecular databases introduces unique computational challenges such as the need to process millions of small graphs with variable size and support communication patterns that are distinct from learning over large graphs such as social networks. This paper demonstrates a novel hardware-software co-design approach to scale up the training of graph neural networks for molecular property prediction. We introduce an algorithm to coalesce the batches of molecular graphs into fixed size packs to eliminate redundant computation and memory associated with alternative padding techniques and improve throughput via minimizing communication. We demonstrate the effectiveness of our co-design approach by providing an implementation of a well-established molecular property prediction model on the Graphcore Intelligence Processing Units (IPU). We evaluate the training performance on multiple molecular graph databases with varying degrees of graph counts, sizes and sparsity. We demonstrate that such a co-design approach can reduce the training time of such molecular property prediction models from days to less than two hours, opening new possibilities for AI-driven scientific discovery.
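    The packing idea can be illustrated with a simple first-fit-decreasing heuristic over node counts; the real system likely uses a more refined strategy, so treat this as a sketch:

```python
# Sketch: greedily coalesce variable-size graphs into fixed-capacity packs.
def pack_graphs(num_nodes, capacity):
    """num_nodes: node count per graph; returns lists of graph indices whose
    total node count fits within `capacity`, reducing padding waste."""
    order = sorted(range(len(num_nodes)), key=lambda i: -num_nodes[i])
    packs, loads = [], []
    for i in order:
        for p, load in enumerate(loads):   # first pack with room wins
            if load + num_nodes[i] <= capacity:
                packs[p].append(i)
                loads[p] += num_nodes[i]
                break
        else:                              # no pack fits: open a new one
            packs.append([i])
            loads.append(num_nodes[i])
    return packs
```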
    PAC-Bayes Compression Bounds So Tight That They Can Explain Generalization. (arXiv:2211.13609v1 [cs.LG])
    While there has been progress in developing non-vacuous generalization bounds for deep neural networks, these bounds tend to be uninformative about why deep learning works. In this paper, we develop a compression approach based on quantizing neural network parameters in a linear subspace, profoundly improving on previous results to provide state-of-the-art generalization bounds on a variety of tasks, including transfer learning. We use these tight bounds to better understand the role of model size, equivariance, and the implicit biases of optimization, for generalization in deep learning. Notably, we find large models can be compressed to a much greater extent than previously known, encapsulating Occam's razor. We also argue for data-independent bounds in explaining generalization.
    One-Shot General Object Localization. (arXiv:2211.13392v1 [cs.CV])
    This paper presents a general one-shot object localization algorithm called OneLoc. Current one-shot object localization or detection methods either rely on a slow exhaustive feature matching process or lack the ability to generalize to novel objects. In contrast, our proposed OneLoc algorithm efficiently finds the object center and bounding box size by a special voting scheme. To keep our method scale-invariant, only unit center offset directions and relative sizes are estimated. A novel dense equalized voting module is proposed to better locate small texture-less objects. Experiments show that the proposed method achieves state-of-the-art overall performance on two datasets: OnePose dataset and LINEMOD dataset. In addition, our method can also achieve one-shot multi-instance detection and non-rigid object localization. Code repository: https://github.com/qq456cvb/OneLoc.
    LU decomposition and Toeplitz decomposition of a neural network. (arXiv:2211.13935v1 [cs.LG])
    It is well-known that any matrix $A$ has an LU decomposition. Less well-known is the fact that it has a 'Toeplitz decomposition' $A = T_1 T_2 \cdots T_r$ where $T_i$'s are Toeplitz matrices. We will prove that any continuous function $f : \mathbb{R}^n \to \mathbb{R}^m$ has an approximation to arbitrary accuracy by a neural network that takes the form $L_1 \sigma_1 U_1 \sigma_2 L_2 \sigma_3 U_2 \cdots L_r \sigma_{2r-1} U_r$, i.e., where the weight matrices alternate between lower and upper triangular matrices, $\sigma_i(x) := \sigma(x - b_i)$ for some bias vector $b_i$, and the activation $\sigma$ may be chosen to be essentially any uniformly continuous nonpolynomial function. The same result also holds with Toeplitz matrices, i.e., $f \approx T_1 \sigma_1 T_2 \sigma_2 \cdots \sigma_{r-1} T_r$ to arbitrary accuracy, and likewise for Hankel matrices. A consequence of our Toeplitz result is a fixed-width universal approximation theorem for convolutional neural networks, for which so far only arbitrary-width versions were known. Since our results apply in particular to the case when $f$ is a general neural network, we may regard them as LU and Toeplitz decompositions of a neural network. The practical implication of our results is that one may vastly reduce the number of weight parameters in a neural network without sacrificing its power of universal approximation. We will present several experiments on real data sets to show that imposing such structures on the weight matrices sharply reduces the number of training parameters with almost no noticeable effect on test accuracy.
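    Such structured networks are easy to instantiate by masking dense weight matrices; a small PyTorch sketch for the LU case (square layers for simplicity, an assumption) is:

```python
# Sketch: a network whose weights alternate between lower- and upper-triangular
# matrices, as in the LU decomposition result.
import torch
import torch.nn as nn

class TriangularLinear(nn.Module):
    def __init__(self, n, lower=True):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n, n) / n ** 0.5)
        self.bias = nn.Parameter(torch.zeros(n))
        tri = torch.tril if lower else torch.triu
        self.register_buffer("mask", tri(torch.ones(n, n)))

    def forward(self, x):
        # masking keeps only the triangular half of the parameters effective
        return x @ (self.weight * self.mask).T + self.bias

def lu_network(n, depth):
    layers = []
    for i in range(depth):
        layers += [TriangularLinear(n, lower=(i % 2 == 0)), nn.ReLU()]
    return nn.Sequential(*layers[:-1])  # drop the trailing nonlinearity
```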
    Tensor Decomposition of Large-scale Clinical EEGs Reveals Interpretable Patterns of Brain Physiology. (arXiv:2211.13793v1 [eess.SP])
    Identifying abnormal patterns in electroencephalography (EEG) remains the cornerstone of diagnosing several neurological diseases. The current clinical EEG review process relies heavily on expert visual review, which is unscalable and error-prone. In an effort to augment the expert review process, there is a significant interest in mining population-level EEG patterns using unsupervised approaches. Current approaches rely either on two-dimensional decompositions (e.g., principal and independent component analyses) or deep representation learning (e.g., auto-encoders, self-supervision). However, most approaches do not leverage the natural multi-dimensional structure of EEGs and lack interpretability. In this study, we propose a tensor decomposition approach using the canonical polyadic decomposition to discover a parsimonious set of population-level EEG patterns, retaining the natural multi-dimensional structure of EEGs (time x space x frequency). We then validate their clinical value using a cohort of patients including varying stages of cognitive impairment. Our results show that the discovered patterns reflect physiologically meaningful features and accurately classify the stages of cognitive impairment (healthy vs mild cognitive impairment vs Alzheimer's dementia) with substantially fewer features compared to classical and deep learning-based baselines. We conclude that the decomposition of population-level EEG tensors recovers expert-interpretable EEG patterns that can aid in the study of smaller specialized clinical cohorts.
    Meta-Learning for Automated Selection of Anomaly Detectors for Semi-Supervised Datasets. (arXiv:2211.13681v1 [cs.LG])
    In anomaly detection, a prominent task is to induce a model, learned solely from normal data, that identifies anomalies. Generally, one is interested in finding an anomaly detector that correctly identifies anomalies, i.e., data points that do not belong to the normal class, without raising too many false alarms. Which anomaly detector is best suited depends on the dataset at hand and thus needs to be tailored. The quality of an anomaly detector may be assessed via confusion-based metrics such as the Matthews correlation coefficient (MCC). However, since only normal data is available during training in a semi-supervised setting, such metrics are not accessible. To facilitate automated machine learning for anomaly detectors, we propose to employ meta-learning to predict MCC scores based on metrics that can be computed with normal data only. First promising results are obtained by considering the hypervolume and the false positive rate as meta-features.
    ML Interpretability: Simple Isn't Easy. (arXiv:2211.13617v1 [cs.LG])
    The interpretability of ML models is important, but it is not clear what it amounts to. So far, most philosophers have discussed the lack of interpretability of black-box models such as neural networks, and methods such as explainable AI that aim to make these models more transparent. The goal of this paper is to clarify the nature of interpretability by focussing on the other end of the 'interpretability spectrum'. The reasons why some models, linear models and decision trees, are highly interpretable will be examined, and also how more general models, MARS and GAM, retain some degree of interpretability. I find that while there is heterogeneity in how we gain interpretability, what interpretability is in particular cases can be explicated in a clear manner.
    Learning to Take a Break: Sustainable Optimization of Long-Term User Engagement. (arXiv:2211.13585v1 [cs.LG])
    Optimizing user engagement is a key goal for modern recommendation systems, but blindly pushing users towards increased consumption risks burn-out, churn, or even addictive habits. To promote digital well-being, most platforms now offer a service that periodically prompts users to take a break. These, however, must be set up manually, and so may be suboptimal for both users and the system. In this paper, we propose a framework for optimizing long-term engagement by learning individualized breaking policies. Using Lotka-Volterra dynamics, we model users as acting based on two balancing latent states: drive, and interest -- which must be conserved. We then give an efficient learning algorithm, provide theoretical guarantees, and empirically evaluate its performance on semi-synthetic data.
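    To make the dynamics concrete, here is a toy Euler simulation of two coupled latent states in a standard Lotka-Volterra predator-prey form; the coefficients and the exact parametrization are assumptions for illustration only, and the paper's model may differ:

```python
# Toy sketch: drive/interest dynamics in Lotka-Volterra form.
import numpy as np

def simulate(drive=1.0, interest=1.0, a=0.6, b=0.4, c=0.5, d=0.3,
             dt=0.01, steps=5000):
    traj = []
    for _ in range(steps):
        d_drive = a * drive - b * drive * interest          # engagement depletes drive
        d_interest = -c * interest + d * drive * interest   # drive feeds interest
        drive += dt * d_drive
        interest += dt * d_interest
        traj.append((drive, interest))
    return np.array(traj)  # oscillatory trajectories motivate scheduled breaks
```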
    Multi-Environment Pretraining Enables Transfer to Action Limited Datasets. (arXiv:2211.13337v1 [cs.LG])
    Using massive datasets to train large-scale models has emerged as a dominant approach for broad generalization in natural language and vision applications. In reinforcement learning, however, a key challenge is that available data of sequential decision making is often not annotated with actions - for example, videos of game-play are much more available than sequences of frames paired with their logged game controls. We propose to circumvent this challenge by combining large but sparsely-annotated datasets from a \emph{target} environment of interest with fully-annotated datasets from various other \emph{source} environments. Our method, Action Limited PreTraining (ALPT), leverages the generalization capabilities of inverse dynamics modelling (IDM) to label missing action data in the target environment. We show that utilizing even one additional environment dataset of labelled data during IDM pretraining gives rise to substantial improvements in generating action labels for unannotated sequences. We evaluate our method on benchmark game-playing environments and show that we can significantly improve game performance and generalization capability compared to other approaches, using annotated datasets equivalent to only $12$ minutes of gameplay. Highlighting the power of IDM, we show that these benefits remain even when target and source environments share no common actions.
    The intersection of machine learning with forecasting and optimisation: theory and applications. (arXiv:2211.13583v1 [cs.LG])
    Forecasting and optimisation are two major fields of operations research that are widely used in practice. These methods have contributed to each other's growth in several ways. However, the nature of the relationship between these two fields, and how to integrate them, has not been explored or understood enough. We advocate the integration of these two fields and explore several problems that require both forecasting and optimisation to deal with the uncertainties. We further investigate some of the methodologies that lie at the intersection of machine learning with prediction and optimisation to address real-world problems. Finally, we provide several research directions for those interested in working in this domain.  ( 2 min )
    Towards Interpretable Anomaly Detection via Invariant Rule Mining. (arXiv:2211.13577v1 [cs.LG])
    In the research area of anomaly detection, novel and promising methods are frequently developed. However, most existing studies, especially those leveraging deep neural networks, focus exclusively on the detection task and ignore the interpretability of the underlying models as well as their detection results. Yet anomaly interpretation, which aims to explain why specific data instances are identified as anomalies, is an equally (if not more) important task in many real-world applications. In this work, we pursue highly interpretable anomaly detection via invariant rule mining. Specifically, we leverage decision tree learning and association rule mining to automatically generate invariant rules that are consistently satisfied by the underlying data generation process. The generated invariant rules can provide explicit explanations of anomaly detection results and thus are extremely useful for subsequent decision-making. Furthermore, our empirical evaluation shows that the proposed method can also achieve comparable performance in terms of AUC and partial AUC with popular anomaly detection models on various benchmark datasets.  ( 2 min )
    GitHub Considered Harmful? Analyzing Open-Source Projects for the Automatic Generation of Cryptographic API Call Sequences. (arXiv:2211.13498v1 [cs.CR])
    GitHub is a popular data repository for code examples. It is being continuously used to train several AI-based tools to automatically generate code. However, the effectiveness of such tools in correctly demonstrating the usage of cryptographic APIs has not been thoroughly assessed. In this paper, we investigate the extent and severity of misuses, specifically caused by incorrect cryptographic API call sequences in GitHub. We also analyze the suitability of GitHub data to train a learning-based model to generate correct cryptographic API call sequences. For this, we manually extracted and analyzed the call sequences from GitHub. Using this data, we augmented an existing learning-based model called DeepAPI to create two security-specific models that generate cryptographic API call sequences for a given natural language (NL) description. Our results indicate that it is imperative to not neglect the misuses in API call sequences while using data sources like GitHub, to train models that generate code.  ( 2 min )
    Spatial Mixture-of-Experts. (arXiv:2211.13491v1 [cs.LG])
    Many data have an underlying dependence on spatial location; it may be weather on the Earth, a simulation on a mesh, or a registered image. Yet this feature is rarely taken advantage of, and violates common assumptions made by many neural network layers, such as translation equivariance. Further, many works that do incorporate locality fail to capture fine-grained structure. To address this, we introduce the Spatial Mixture-of-Experts (SMoE) layer, a sparsely-gated layer that learns spatial structure in the input domain and routes experts at a fine-grained level to utilize it. We also develop new techniques to train SMoEs, including a self-supervised routing loss and damping expert errors. Finally, we show strong results for SMoEs on numerous tasks, and set new state-of-the-art results for medium-range weather prediction and post-processing ensemble weather forecasts.  ( 2 min )
    Online Regularized Learning Algorithm for Functional Data. (arXiv:2211.13549v1 [stat.ML])
    In recent years, functional linear models have attracted growing attention in statistics and machine learning, with the aim of recovering the slope function or its functional predictor. This paper considers an online regularized learning algorithm for functional linear models in reproducing kernel Hilbert spaces. Convergence analyses of the excess prediction error and the estimation error are provided with polynomially decaying step-size and constant step-size, respectively. Fast convergence rates can be derived via a capacity-dependent analysis. By introducing an explicit regularization term, we uplift the saturation boundary of unregularized online learning algorithms when the step-size decays polynomially, and establish fast convergence rates of the estimation error without capacity assumptions. However, it remains an open problem to obtain capacity-independent convergence rates for the estimation error of the unregularized online learning algorithm with decaying step-size. We also show that the convergence rates of both the prediction error and the estimation error with constant step-size are competitive with those in the literature.  ( 2 min )
    CoMadOut -- A Robust Outlier Detection Algorithm based on CoMAD. (arXiv:2211.13314v1 [cs.LG])
    Unsupervised learning methods are well established in the area of anomaly detection and achieve state-of-the-art performance on outlier datasets. Outliers play a significant role, since they bear the potential to distort the predictions of a machine learning algorithm on a given dataset. Especially among PCA-based methods, outliers have an additional destructive potential regarding the result: they may not only distort the orientation and translation of the principal components but also make it more complicated to detect outliers. To address this problem, we propose the robust outlier detection algorithm CoMadOut, which satisfies two required properties: (1) being robust towards outliers and (2) detecting them. Our outlier detection method using coMAD-PCA defines, depending on its variant, an inlier region with a robust noise margin by measures of in-distribution (ID) and out-of-distribution (OOD). These measures allow distribution-based outlier scoring for each principal component, and thus an appropriate alignment of the decision boundary between normal and abnormal instances. Experiments comparing CoMadOut with traditional, deep, and other comparable robust outlier detection methods show that the performance of the introduced CoMadOut approach is competitive with well-established methods in terms of average precision (AP), recall, and area under the receiver operating characteristic curve (AUROC). In summary, our approach can be seen as a robust alternative for outlier detection tasks.  ( 2 min )
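    The coMAD construction replaces means and covariances with medians; a sketch under the usual element-wise-median definition (an assumption about the exact variant used) is:

```python
# Sketch: coMAD matrix and a robust PCA built on it.
import numpy as np

def comad_matrix(X):
    """X: (n_samples, n_features); comad[i, j] = med((x_i - med x_i)(x_j - med x_j))."""
    Z = X - np.median(X, axis=0)
    d = X.shape[1]
    C = np.empty((d, d))
    for i in range(d):
        for j in range(i, d):
            C[i, j] = C[j, i] = np.median(Z[:, i] * Z[:, j])
    return C

def comad_pca(X, k):
    vals, vecs = np.linalg.eigh(comad_matrix(X))
    return vecs[:, np.argsort(vals)[::-1][:k]]  # top-k robust components
```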
    Understanding Sample Generation Strategies for Learning Heuristic Functions in Classical Planning. (arXiv:2211.13316v1 [cs.AI])
    We study the problem of learning good heuristic functions for classical planning tasks with neural networks based on samples that are states with their cost-to-goal estimates. It is well known that the learned model quality depends on the training data quality. Our main goal is to understand better the influence of sample generation strategies on the performance of a greedy best-first heuristic search guided by a learned heuristic function. In a set of controlled experiments, we find that two main factors determine the quality of the learned heuristic: the regions of the state space included in the samples and the quality of the cost-to-goal estimates. Also, these two factors are interdependent: having perfect estimates of cost-to-goal is insufficient if an unrepresentative part of the state space is included in the sample set. Additionally, we study the effects of restricting samples to only include states that could be evaluated when solving a given task and the effects of adding samples with high-value estimates. Based on our findings, we propose practical strategies to improve the quality of learned heuristics: three strategies that aim to generate more representative states and two strategies that improve the cost-to-goal estimates. Our resulting neural network heuristic has higher coverage than a basic satisficing heuristic. Also, compared to a baseline learned heuristic, our best neural network heuristic almost doubles the mean coverage and can increase it for some domains by more than six times.  ( 2 min )
    Dual Graphs of Polyhedral Decompositions for the Detection of Adversarial Attacks. (arXiv:2211.13305v1 [cs.CV])
    Previous work has shown that a neural network with the rectified linear unit (ReLU) activation function leads to a convex polyhedral decomposition of the input space. These decompositions can be represented by a dual graph with vertices corresponding to polyhedra and edges corresponding to polyhedra sharing a facet, which is a subgraph of a Hamming graph. This paper illustrates how one can utilize the dual graph to detect and analyze adversarial attacks in the context of digital images. When an image passes through a network containing ReLU nodes, the firing or non-firing at a node can be encoded as a bit ($1$ for ReLU activation, $0$ for ReLU non-activation). The sequence of all bit activations identifies the image with a bit vector, which identifies it with a polyhedron in the decomposition and, in turn, identifies it with a vertex in the dual graph. We identify ReLU bits that are discriminators between non-adversarial and adversarial images and examine how well collections of these discriminators can ensemble vote to build an adversarial image detector. Specifically, we examine the similarities and differences of ReLU bit vectors for adversarial images, and their non-adversarial counterparts, using a pre-trained ResNet-50 architecture. While this paper focuses on adversarial digital images, ResNet-50 architecture, and the ReLU activation function, our methods extend to other network architectures, activation functions, and types of datasets.  ( 2 min )
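    Extracting the bit vector is straightforward with forward hooks; the sketch below records the firing pattern of every ReLU in a pre-trained ResNet-50 (the random input is a stand-in for a real image):

```python
# Sketch: encode an input by its ReLU firing pattern (its dual-graph vertex).
import torch
from torchvision.models import resnet50

model = resnet50(weights="IMAGENET1K_V1").eval()
bits = []

def record(module, inputs, output):
    bits.append((output > 0).flatten().to(torch.uint8))  # 1 = ReLU fired

hooks = [m.register_forward_hook(record)
         for m in model.modules() if isinstance(m, torch.nn.ReLU)]

with torch.no_grad():
    model(torch.randn(1, 3, 224, 224))  # stand-in for a (possibly adversarial) image
bit_vector = torch.cat(bits)            # compare against non-adversarial patterns
for h in hooks:
    h.remove()
```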
    Learning and Testing Latent-Tree Ising Models Efficiently. (arXiv:2211.13291v1 [cs.LG])
    We provide time- and sample-efficient algorithms for learning and testing latent-tree Ising models, i.e. Ising models that may only be observed at their leaf nodes. On the learning side, we obtain efficient algorithms for learning a tree-structured Ising model whose leaf node distribution is close in Total Variation Distance, improving on the results of prior work. On the testing side, we provide an efficient algorithm with fewer samples for testing whether two latent-tree Ising models have leaf-node distributions that are close or far in Total Variation distance. We obtain our algorithms by showing novel localization results for the total variation distance between the leaf-node distributions of tree-structured Ising models, in terms of their marginals on pairs of leaves.  ( 2 min )
    Supervised Hypergraph Reconstruction. (arXiv:2211.13343v1 [cs.SI])
    We study an issue commonly seen with graph data analysis: many real-world complex systems involving high-order interactions are best encoded by hypergraphs; however, their datasets often end up being published or studied only in the form of their projections (with dyadic edges). To understand this issue, we first establish a theoretical framework to characterize this issue's implications and worst-case scenarios. The analysis motivates our formulation of the new task, supervised hypergraph reconstruction: reconstructing a real-world hypergraph from its projected graph, with the help of some existing knowledge of the application domain. To reconstruct hypergraph data, we start by analyzing hyperedge distributions in the projection, based on which we create a framework containing two modules: (1) to handle the enormous search space of potential hyperedges, we design a sampling strategy with efficacy guarantees that significantly narrows the space to a smaller set of candidates; (2) to identify hyperedges from the candidates, we further design a hyperedge classifier in two well-working variants that capture structural features in the projection. Extensive experiments validate our claims, approach, and extensions. Remarkably, our approach outperforms all baselines by an order of magnitude in accuracy on hard datasets. Our code and data can be downloaded from bit.ly/SHyRe.  ( 2 min )
    Multiple Imputation with Neural Network Gaussian Process for High-dimensional Incomplete Data. (arXiv:2211.13297v1 [cs.LG])
    Missing data are ubiquitous in real world applications and, if not adequately handled, may lead to the loss of information and biased findings in downstream analysis. Particularly, high-dimensional incomplete data with a moderate sample size, such as analysis of multi-omics data, present daunting challenges. Imputation is arguably the most popular method for handling missing data, though existing imputation methods have a number of limitations. Single imputation methods such as matrix completion methods do not adequately account for imputation uncertainty and hence would yield improper statistical inference. In contrast, multiple imputation (MI) methods allow for proper inference but existing methods do not perform well in high-dimensional settings. Our work aims to address these significant methodological gaps, leveraging recent advances in neural network Gaussian process (NNGP) from a Bayesian viewpoint. We propose two NNGP-based MI methods, namely MI-NNGP, that can apply multiple imputations for missing values from a joint (posterior predictive) distribution. The MI-NNGP methods are shown to significantly outperform existing state-of-the-art methods on synthetic and real datasets, in terms of imputation error, statistical inference, robustness to missing rates, and computation costs, under three missing data mechanisms, MCAR, MAR, and MNAR.  ( 2 min )
    Lempel-Ziv Networks. (arXiv:2211.13250v1 [cs.LG])
    Sequence processing has long been a central area of machine learning research. Recurrent neural nets have been successful in processing sequences for a number of tasks; however, they are known to be both ineffective and computationally expensive when applied to very long sequences. Compression-based methods have demonstrated more robustness when processing such sequences -- in particular, an approach pairing the Lempel-Ziv Jaccard Distance (LZJD) with the k-Nearest Neighbor algorithm has shown promise on long sequence problems (up to $T=200,000,000$ steps) involving malware classification. Unfortunately, use of LZJD is limited to discrete domains. To extend the benefits of LZJD to a continuous domain, we investigate the effectiveness of a deep-learning analog of the algorithm, the Lempel-Ziv Network. While we achieve successful proof of concept, we are unable to improve meaningfully on the performance of a standard LSTM across a variety of datasets and sequence processing tasks. In addition to presenting this negative result, our work highlights the problem of sub-par baseline tuning in newer research areas.  ( 2 min )
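    For readers unfamiliar with LZJD, the discrete version is compact enough to sketch in full; production implementations add hashing and min-hash sketching on top of this:

```python
# Sketch: Lempel-Ziv Jaccard Distance between two byte sequences.
def lz_set(data: bytes):
    seen, start = set(), 0
    for end in range(1, len(data) + 1):
        sub = data[start:end]
        if sub not in seen:      # grow the phrase until it is novel, then reset
            seen.add(sub)
            start = end
    return seen

def lzjd(a: bytes, b: bytes):
    sa, sb = lz_set(a), lz_set(b)
    return 1.0 - len(sa & sb) / len(sa | sb)  # Jaccard distance of phrase sets

print(lzjd(b"abababab", b"abcabcabc"))
```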
DeltaNet: Conditional Medical Report Generation for COVID-19 Diagnosis. (arXiv:2211.13229v1 [eess.IV])
    Fast screening and diagnosis are critical in COVID-19 patient treatment. In addition to the gold standard RT-PCR, radiological imaging like X-ray and CT also works as an important means in patient screening and follow-up. However, due to the excessive number of patients, writing reports becomes a heavy burden for radiologists. To reduce the workload of radiologists, we propose DeltaNet to generate medical reports automatically. Different from typical image captioning approaches that generate reports with an encoder and a decoder, DeltaNet applies a conditional generation process. In particular, given a medical image, DeltaNet employs three steps to generate a report: 1) first retrieving related medical reports, i.e., the historical reports from the same or similar patients; 2) then comparing retrieved images and current image to find the differences; 3) finally generating a new report to accommodate identified differences based on the conditional report. We evaluate DeltaNet on a COVID-19 dataset, where DeltaNet outperforms state-of-the-art approaches. Besides COVID-19, the proposed DeltaNet can be applied to other diseases as well. We validate its generalization capabilities on the public IU-Xray and MIMIC-CXR datasets for chest-related diseases. Code is available at \url{https://github.com/LX-doctorAI1/DeltaNet}.  ( 2 min )
    Predicting Biomedical Interactions with Probabilistic Model Selection for Graph Neural Networks. (arXiv:2211.13231v1 [q-bio.QM])
    A biological system is a complex network of heterogeneous molecular entities and their interactions contributing to various biological characteristics of the system. However, current biological networks are noisy, sparse, and incomplete, limiting our ability to create a holistic view of the biological system and understand the biological phenomena. Experimental identification of such interactions is both time-consuming and expensive. With the recent advancements in high-throughput data generation and significant improvement in computational power, various computational methods have been developed to predict novel interactions in the noisy network. Recently, deep learning methods such as graph neural networks have shown their effectiveness in modeling graph-structured data and achieved good performance in biomedical interaction prediction. However, graph neural network-based methods require human expertise and experimentation to design the appropriate model complexity, which significantly impacts model performance. Furthermore, deep graph neural networks face overfitting problems and tend to be poorly calibrated, with high confidence on incorrect predictions. To address these challenges, we propose Bayesian model selection for graph convolutional networks to jointly infer the most plausible number of graph convolution layers (depth) warranted by the data and perform dropout regularization simultaneously. Experiments on four interaction datasets show that our proposed method achieves accurate and calibrated predictions. Our proposed method enables the graph convolutional networks to dynamically adapt their depths to accommodate an increasing number of interactions.  ( 2 min )
    Shapley Curves: A Smoothing Perspective. (arXiv:2211.13289v1 [stat.ML])
    Originating from cooperative game theory, Shapley values have become one of the most widely used measures for variable importance in applied Machine Learning. However, the statistical understanding of Shapley values is still limited. In this paper, we take a nonparametric (or smoothing) perspective by introducing Shapley curves as a local measure of variable importance. We propose two estimation strategies and derive the consistency and asymptotic normality both under independence and dependence among the features. This allows us to construct confidence intervals and conduct inference on the estimated Shapley curves. The asymptotic results are validated in extensive experiments. In an empirical application, we analyze which attributes drive the prices of vehicles.  ( 2 min )
    Distilling Knowledge from Self-Supervised Teacher by Embedding Graph Alignment. (arXiv:2211.13264v1 [cs.CV])
    Recent advances have indicated the strengths of self-supervised pre-training for improving representation learning on downstream tasks. Existing works often utilize self-supervised pre-trained models by fine-tuning on downstream tasks. However, fine-tuning does not generalize to the case when one needs to build a customized model architecture different from the self-supervised model. In this work, we formulate a new knowledge distillation framework to transfer the knowledge from self-supervised pre-trained models to any other student network by a novel approach named Embedding Graph Alignment. Specifically, inspired by the spirit of instance discrimination in self-supervised learning, we model the instance-instance relations by a graph formulation in the feature embedding space and distill the self-supervised teacher knowledge to a student network by aligning the teacher graph and the student graph. Our distillation scheme can be flexibly applied to transfer the self-supervised knowledge to enhance representation learning on various student networks. We demonstrate that our model outperforms multiple representative knowledge distillation methods on three benchmark datasets, including CIFAR100, STL10, and TinyImageNet. Code is here: https://github.com/yccm/EGA.  ( 2 min )
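    A minimal form of the graph-alignment idea is easy to sketch; the cosine-similarity edges and the L2 alignment loss here are assumptions, and the paper's exact formulation may differ:

```python
# Sketch: align teacher and student instance-relation graphs within a batch.
import torch
import torch.nn.functional as F

def relation_graph(feats):
    z = F.normalize(feats, dim=1)
    return z @ z.T                             # edges = pairwise cosine similarities

def ega_loss(teacher_feats, student_feats):
    g_teacher = relation_graph(teacher_feats)  # frozen self-supervised teacher
    g_student = relation_graph(student_feats)
    return F.mse_loss(g_student, g_teacher)
```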
    Actively Learning Costly Reward Functions for Reinforcement Learning. (arXiv:2211.13260v1 [cs.LG])
    Transfer of recent advances in deep reinforcement learning to real-world applications is hindered by high data demands and thus low efficiency and scalability. Through independent improvements of components such as replay buffers or more stable learning algorithms, and through massively distributed systems, training time could be reduced from several days to several hours for standard benchmark tasks. However, while rewards in simulated environments are well-defined and easy to compute, reward evaluation becomes the bottleneck in many real-world environments, e.g., in molecular optimization tasks, where computationally demanding simulations or even experiments are required to evaluate states and to quantify rewards. Therefore, training might become prohibitively expensive without an extensive amount of computational resources and time. We propose to alleviate this problem by replacing costly ground-truth rewards with rewards modeled by neural networks, counteracting non-stationarity of state and reward distributions during training with an active learning component. We demonstrate that using our proposed ACRL method (Actively learning Costly rewards for Reinforcement Learning), it is possible to train agents in complex real-world environments orders of magnitude faster. By enabling the application of reinforcement learning methods to new domains, we show that we can find interesting and non-trivial solutions to real-world optimization problems in chemistry, materials science and engineering.  ( 2 min )
    How do Cross-View and Cross-Modal Alignment Affect Representations in Contrastive Learning?. (arXiv:2211.13309v1 [cs.CV])
    Various state-of-the-art self-supervised visual representation learning approaches take advantage of data from multiple sensors by aligning the feature representations across views and/or modalities. In this work, we investigate how aligning representations affects the visual features obtained from cross-view and cross-modal contrastive learning on images and point clouds. On five real-world datasets and five tasks, we train and evaluate 108 models based on four pretraining variations. We find that cross-modal representation alignment discards complementary visual information, such as color and texture, and instead emphasizes redundant depth cues. The depth cues obtained from pretraining improve downstream depth prediction performance. Overall, cross-modal alignment also leads to more robust encoders than pre-training via cross-view alignment, especially on depth prediction, instance segmentation, and object detection.  ( 2 min )
    Representation Learning for Continuous Action Spaces is Beneficial for Efficient Policy Learning. (arXiv:2211.13257v1 [cs.LG])
    Deep reinforcement learning (DRL) breaks through the bottlenecks of traditional reinforcement learning (RL) with the help of the perception capability of deep learning and has been widely applied to real-world problems. Model-free RL, a class of efficient DRL methods, learns state representations simultaneously with the policy in an end-to-end manner when facing large-scale continuous state and action spaces. However, training such a large policy model requires a large number of trajectory samples and long training time. Moreover, the learned policy often fails to generalize to large-scale action spaces, especially continuous ones. To address this issue, in this paper we propose an efficient policy learning method in latent state and action spaces. More specifically, we extend the idea of state representations to action representations for better policy generalization capability. Meanwhile, we divide the whole learning task into learning the large-scale representation models in an unsupervised manner and learning the small-scale policy model in the RL manner. The small policy model facilitates policy learning while not sacrificing generalization and expressiveness, thanks to the large representation model. Finally, the effectiveness of the proposed method is demonstrated by MountainCar, CarRacing and Cheetah experiments.  ( 2 min )
    MEGAN: Multi-Explanation Graph Attention Network. (arXiv:2211.13236v1 [cs.LG])
    Explainable artificial intelligence (XAI) methods are expected to improve trust during human-AI interactions, provide tools for model analysis, and extend human understanding of complex problems. Explanation-supervised training makes it possible to improve explanation quality by training self-explaining XAI models on ground-truth or human-generated explanations. However, existing explanation methods have limited expressiveness and interoperability, since only single explanations in the form of node and edge importance are generated. To that end, we propose the novel multi-explanation graph attention network (MEGAN). Our fully differentiable, attention-based model features multiple explanation channels, which can be chosen independently of the task specifications. We first validate our model on a synthetic graph regression dataset. We show that for the special single-explanation case, our model significantly outperforms existing post-hoc and explanation-supervised baseline methods. Furthermore, we demonstrate significant advantages when using two explanations, both in quantitative explanation measures as well as in human interpretability. Finally, we demonstrate our model's capabilities on multiple real-world datasets. We find that our model produces sparse high-fidelity explanations consistent with human intuition about those tasks while matching state-of-the-art graph neural networks in predictive performance, indicating that explanations and accuracy are not necessarily a trade-off.  ( 2 min )
    RNTrajRec: Road Network Enhanced Trajectory Recovery with Spatial-Temporal Transformer. (arXiv:2211.13234v1 [cs.LG])
    GPS trajectories are the essential foundations for many trajectory-based applications, such as travel time estimation, traffic prediction and trajectory similarity measurement. Most applications require a large amount of high-sample-rate trajectories to achieve good performance. However, many real-life trajectories are collected at a low sample rate due to energy concerns or other constraints. We study the task of trajectory recovery in this paper as a means of increasing the sample rate of low-sample-rate trajectories. Currently, most existing works on trajectory recovery follow a sequence-to-sequence diagram, with an encoder to encode a trajectory and a decoder to recover real GPS points in the trajectory. However, these works ignore the topology of the road network and only use grid information or raw GPS points as input. Therefore, the encoder model is not able to capture rich spatial information of the GPS points along the trajectory, making the prediction less accurate and lacking in spatial consistency. In this paper, we propose a road network enhanced transformer-based framework, namely RNTrajRec, for trajectory recovery. RNTrajRec first uses a graph model, namely GridGNN, to learn the embedding features of each road segment. It next develops a Sub-Graph Generation module to represent each GPS point as a sub-graph structure of the road network around the GPS point. It then introduces a spatial-temporal transformer model, namely GPSFormer, to learn rich spatial and temporal features. It finally forwards the outputs of the encoder model into a multi-task decoder model to recover the missing GPS points. Extensive experiments based on three large-scale real-life trajectory datasets confirm the effectiveness of our approach.  ( 2 min )
    Relating Regularization and Generalization through the Intrinsic Dimension of Activations. (arXiv:2211.13239v1 [cs.LG])
    Given a pair of models with similar training set performance, it is natural to assume that the model that possesses simpler internal representations would exhibit better generalization. In this work, we provide empirical evidence for this intuition through an analysis of the intrinsic dimension (ID) of model activations, which can be thought of as the minimal number of factors of variation in the model's representation of the data. First, we show that common regularization techniques uniformly decrease the last-layer ID (LLID) of validation set activations for image classification models and show how this strongly affects generalization performance. We also investigate how excessive regularization decreases a model's ability to extract features from data in earlier layers, leading to a negative effect on validation accuracy even while LLID continues to decrease and training accuracy remains near-perfect. Finally, we examine the LLID over the course of training of models that exhibit grokking. We observe that well after training accuracy saturates, when models ``grok'' and validation accuracy suddenly improves from random to perfect, there is a co-occurring sudden drop in LLID, thus providing more insight into the dynamics of sudden generalization.  ( 2 min )
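    Intrinsic dimension of activations can be estimated in a few lines; below is a TwoNN-style maximum-likelihood sketch (the paper may use a different estimator, so this is illustrative):

```python
# Sketch: TwoNN intrinsic dimension estimate for a matrix of activations.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def two_nn_id(acts):
    """acts: (n_points, n_features) activations; returns an ID estimate."""
    dists, _ = NearestNeighbors(n_neighbors=3).fit(acts).kneighbors(acts)
    mu = dists[:, 2] / dists[:, 1]        # ratio of 2nd to 1st neighbor distance
    mu = mu[np.isfinite(mu) & (mu > 1)]   # guard against duplicate points
    return len(mu) / np.sum(np.log(mu))   # MLE of the intrinsic dimension
```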
    ProstAttention-Net: A deep attention model for prostate cancer segmentation by aggressiveness in MRI scans. (arXiv:2211.13238v1 [eess.IV])
    Multiparametric magnetic resonance imaging (mp-MRI) has shown excellent results in the detection of prostate cancer (PCa). However, characterizing prostate lesion aggressiveness from mp-MRI sequences is impossible in clinical practice, and biopsy remains the reference for determining the Gleason score (GS). In this work, we propose a novel end-to-end multi-class network that jointly segments the prostate gland and cancer lesions with GS group grading. After encoding the information in a latent space, the network is separated into two branches: 1) the first branch performs prostate segmentation, while 2) the second branch uses this zonal prior as an attention gate for the detection and grading of prostate lesions. The model was trained and validated with 5-fold cross-validation on a heterogeneous series of 219 MRI exams acquired on three different scanners prior to prostatectomy. In the free-response receiver operating characteristic (FROC) analysis for the detection of clinically significant lesions (defined as GS > 6), our model achieves 69.0% $\pm$14.5% sensitivity at 2.9 false positives per patient on the whole prostate and 70.8% $\pm$14.4% sensitivity at 1.5 false positives when considering the peripheral zone (PZ) only. Regarding the automatic GS group …  ( 2 min )
    Corn Yield Prediction based on Remotely Sensed Variables Using Variational Autoencoder and Multiple Instance Regression. (arXiv:2211.13286v1 [cs.CV])
    In the U.S., corn is the most produced crop and has been an essential part of the American diet. To meet the demands of supply chain management and regional food security, accurate and timely large-scale corn yield prediction is attracting growing attention in precision agriculture. Recently, remote sensing technology and machine learning methods have been widely explored for crop yield prediction. Currently, most county-level yield prediction models use county-level mean variables for prediction, discarding much detailed information. Moreover, inconsistent spatial resolution between crop areas and satellite sensors results in mixed pixels, which may decrease prediction accuracy. Only a few works have addressed the mixed-pixel problem in large-scale crop yield prediction. To address the information loss and mixed-pixel problems, we developed a variational autoencoder (VAE) based multiple instance regression (MIR) model for large-scale corn yield prediction. We use all unlabeled data to train a VAE and then use the well-trained VAE for anomaly detection. As a preprocessing step, anomaly detection can help MIR find a better representation of every bag than traditional MIR methods, and thus perform better in large-scale corn yield prediction. Our experiments showed that the variational autoencoder based multiple instance regression (VAEMIR) model outperformed all baseline methods in large-scale corn yield prediction. Though a suitable meta-parameter is required, VAEMIR shows excellent potential in feature learning and extraction for large-scale corn yield prediction.  ( 2 min )
    SEAT: Stable and Explainable Attention. (arXiv:2211.13290v1 [cs.CL])
    Currently, the attention mechanism has become a standard fixture in most state-of-the-art natural language processing (NLP) models, not only due to the outstanding performance it achieves, but also because it provides plausible innate explanations for the behaviors of neural architectures, which are notoriously difficult to analyze. However, recent studies show that attention is unstable against randomness and perturbations during training or testing, such as random seeds and slight perturbations of embedding vectors, which impedes it from being a faithful explanation tool. Thus, a natural question is whether we can find a substitute for the current attention that is more stable and keeps its most important characteristics for explanation and prediction. In this paper, to resolve the problem, we provide a first rigorous definition of such an alternative, namely SEAT (Stable and Explainable Attention). Specifically, a SEAT should have the following three properties: (1) its prediction distribution is enforced to be close to the distribution based on the vanilla attention; (2) its top-k indices have large overlap with those of the vanilla attention; (3) it is robust w.r.t. perturbations, i.e., any slight perturbation on the SEAT will not change the prediction distribution much, which implicitly indicates that it is stable to randomness and perturbations. Finally, through intensive experiments on various datasets, we compare our SEAT with other baseline methods using RNN, BiLSTM and BERT architectures, via six different evaluation metrics for model interpretation, stability and accuracy. Results show that SEAT is more stable against different perturbations and randomness while also keeping the explainability of attention, indicating it is a more faithful explanation. Moreover, compared with vanilla attention, there is almost no utility (accuracy) degradation for SEAT.  ( 3 min )
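    The first two SEAT properties can be checked directly. Below is a minimal numpy sketch of the closeness and top-k-overlap criteria; the attention vectors and prediction distributions are hypothetical placeholders, not the paper's models:

    ```python
    import numpy as np

    def top_k_overlap(a, b, k=5):
        """Fraction of shared indices among the top-k weights of two attention vectors."""
        ta, tb = np.argsort(a)[-k:], np.argsort(b)[-k:]
        return len(set(ta) & set(tb)) / k

    def total_variation(p, q):
        """Distance between two prediction distributions."""
        return 0.5 * np.abs(p - q).sum()

    rng = np.random.default_rng(0)
    attn = rng.dirichlet(np.ones(20))          # vanilla attention over 20 tokens
    seat = rng.dirichlet(np.ones(20))          # candidate SEAT vector
    p_vanilla, p_seat = np.array([0.7, 0.3]), np.array([0.65, 0.35])

    # For a valid SEAT, the first value should be small and the second near 1.
    print(total_variation(p_vanilla, p_seat))
    print(top_k_overlap(attn, seat, k=5))
    ```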
    Proceedings of the 4th International Workshop on Reading Music Systems. (arXiv:2211.13285v1 [cs.CV])
    The International Workshop on Reading Music Systems (WoRMS) is a workshop that tries to connect researchers who develop systems for reading music, such as in the field of Optical Music Recognition, with other researchers and practitioners that could benefit from such systems, like librarians or musicologists. The relevant topics of interest for the workshop include, but are not limited to: Music reading systems; Optical music recognition; Datasets and performance evaluation; Image processing on music scores; Writer identification; Authoring, editing, storing and presentation systems for music scores; Multi-modal systems; Novel input-methods for music to produce written music; Web-based Music Information Retrieval services; Applications and projects; Use-cases related to written music. These are the proceedings of the 4th International Workshop on Reading Music Systems, held online on Nov. 18th 2022.  ( 2 min )
  • Open

    Quantum tangent kernel. (arXiv:2111.02951v2 [quant-ph] UPDATED)
    The quantum kernel method is one of the key approaches to quantum machine learning, with the advantages that it does not require optimization and that it is theoretically simple. By virtue of these properties, several experimental demonstrations and discussions of potential advantages have been developed so far. However, as in classical machine learning, not all quantum machine learning models can be regarded as kernel methods. In this work, we explore a quantum machine learning model with a deep parameterized quantum circuit, aiming to go beyond the conventional quantum kernel method. In this case, the representation power and performance are expected to be enhanced, while the training process might be a bottleneck because of the barren plateaus issue. However, we find that the parameters of a deep enough quantum circuit do not move much from their initial values during training, allowing a first-order expansion with respect to the parameters. This behavior is similar to the neural tangent kernel in the classical literature, and such deep variational quantum machine learning can be described by another emergent kernel, the quantum tangent kernel. Numerical simulations show that the proposed quantum tangent kernel outperforms the conventional quantum kernel method on an ansatz-generated dataset. This work provides a new direction beyond the conventional quantum kernel method and explores the potential power of quantum machine learning with deep parameterized quantum circuits.
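    The tangent-kernel construction itself is easy to illustrate classically: freeze the parameters at initialization and take inner products of parameter gradients. A sketch with a toy model standing in for a deep circuit's expectation value (the model and its sizes are assumptions for illustration):

    ```python
    import numpy as np

    def model(theta, x):
        # Toy parameterized output standing in for a deep circuit's expectation value.
        return np.sum(np.sin(theta * x))

    def param_grad(theta, x, eps=1e-5):
        # Central finite-difference gradient of the output w.r.t. the parameters.
        g = np.zeros_like(theta)
        for i in range(len(theta)):
            e = np.zeros_like(theta)
            e[i] = eps
            g[i] = (model(theta + e, x) - model(theta - e, x)) / (2 * eps)
        return g

    def tangent_kernel(theta0, X):
        # K[i, j] = <grad f(x_i), grad f(x_j)>, all gradients taken at theta0.
        G = np.stack([param_grad(theta0, x) for x in X])
        return G @ G.T

    rng = np.random.default_rng(0)
    theta0 = rng.normal(size=8)    # parameters barely move from theta0 during training
    X = rng.normal(size=(5, 8))
    K = tangent_kernel(theta0, X)  # kernel matrix usable in standard kernel regression
    ```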
    An Ensemble-Based Deep Framework for Estimating Thermo-Chemical State Variables from Flamelet Generated Manifolds. (arXiv:2211.14098v1 [cs.LG])
    Complete computation of turbulent combustion flow involves two separate steps: mapping reaction kinetics to low-dimensional manifolds and looking up this approximate manifold during CFD run-time to estimate the thermo-chemical state variables. In our previous work, we showed that using a deep architecture to learn the two steps jointly, instead of separately, is 73% more accurate at estimating the source energy, a key state variable, compared to benchmarks, and can be integrated within a DNS turbulent combustion framework. In their natural form, such deep architectures do not allow for uncertainty quantification of the quantities of interest: the source energy and key species source terms. In this paper, we expand on such architectures, specifically ChemTab, by introducing deep ensembles to approximate the posterior distribution of the quantities of interest. We investigate two strategies for creating these ensemble models: one that keeps the flamelet origin information (Flamelets strategy) and one that ignores the origin and considers all the data independently (Points strategy). To train these models we used flamelet data generated by the GRI--Mech 3.0 methane mechanism, which consists of 53 chemical species and 325 reactions. Our results demonstrate that the Flamelets strategy is superior in terms of the absolute prediction error for the quantities of interest, but is reliant on the types of flamelets used to train the ensemble. The Points strategy is best at capturing the variability of the quantities of interest, independent of the flamelet types. We conclude that, overall, ChemTab Deep Ensembles allow for a more accurate representation of the source energy and key species source terms, compared to the model without these modifications.
    Exploiting Higher Order Smoothness in Derivative-free Optimization and Continuous Bandits. (arXiv:2006.07862v4 [cs.LG] UPDATED)
    We study the problem of zero-order optimization of a strongly convex function. The goal is to find the minimizer of the function by a sequential exploration of its values, under measurement noise. We study the impact of higher order smoothness properties of the function on the optimization error and on the cumulative regret. To solve this problem we consider a randomized approximation of the projected gradient descent algorithm. The gradient is estimated by a randomized procedure involving two function evaluations and a smoothing kernel. We derive upper bounds for this algorithm both in the constrained and unconstrained settings and prove minimax lower bounds for any sequential search method. Our results imply that the zero-order algorithm is nearly optimal in terms of sample complexity and the problem parameters. Based on this algorithm, we also propose an estimator of the minimum value of the function achieving almost sharp oracle behavior. We compare our results with the state-of-the-art, highlighting a number of key improvements.
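    The core estimator is simple to state in code. A sketch of the two-evaluation randomized gradient estimate, simplified to a uniform direction on the sphere (the paper additionally weights the evaluations with a smoothing kernel to exploit higher-order smoothness, and projects in the constrained case):

    ```python
    import numpy as np

    def two_point_gradient(f, x, h, rng):
        """Gradient estimate of f at x from two noisy function evaluations."""
        d = len(x)
        u = rng.normal(size=d)
        u /= np.linalg.norm(u)                # uniform random direction on the sphere
        return (d / (2 * h)) * (f(x + h * u) - f(x - h * u)) * u

    # Zero-order minimization of a strongly convex quadratic under measurement noise.
    rng = np.random.default_rng(0)
    f = lambda x: np.sum((x - 1.0) ** 2) + 0.01 * rng.normal()
    x = np.zeros(4)
    for t in range(1, 5001):
        x -= (0.5 / t) * two_point_gradient(f, x, h=0.05, rng=rng)
    print(x)                                  # approaches the minimizer at (1, 1, 1, 1)
    ```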
    A Non-Classical Parameterization for Density Estimation Using Sample Moments. (arXiv:2201.04786v4 [stat.ML] UPDATED)
    Moment methods are an important means of density estimation, but they are generally strongly dependent on the choice of feasible functions, which severely affects their performance. In this paper, which is a very preliminary version, we propose a non-classical parameterization for density estimation using sample moments, which does not require the choice of such functions. The parameterization is induced by the squared Hellinger distance, and its solution, which is proved to exist and to be unique subject to a simple prior condition that does not depend on the data, can be obtained by convex optimization. Simulation results show the performance of the proposed estimator in estimating multi-modal densities which are mixtures of different types of functions, with a comparison to the prevailing methods.
    Gradient Estimation with Discrete Stein Operators. (arXiv:2202.09497v4 [stat.ML] UPDATED)
    Gradient estimation -- approximating the gradient of an expectation with respect to the parameters of a distribution -- is central to the solution of many machine learning problems. However, when the distribution is discrete, most common gradient estimators suffer from excessive variance. To improve the quality of gradient estimation, we introduce a variance reduction technique based on Stein operators for discrete distributions. We then use this technique to build flexible control variates for the REINFORCE leave-one-out estimator. Our control variates can be adapted online to minimize variance and do not require extra evaluations of the target function. In benchmark generative modeling tasks such as training binary variational autoencoders, our gradient estimator achieves substantially lower variance than state-of-the-art estimators with the same number of function evaluations.
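    As a reference point, the REINFORCE leave-one-out (RLOO) estimator on which the proposed control variates build can be sketched in a few lines; the factorized Bernoulli model and toy objective here are illustrative assumptions:

    ```python
    import numpy as np

    def rloo_grad(theta, f, K, rng):
        """RLOO gradient of E[f(z)] for z ~ Bernoulli(sigmoid(theta)), per coordinate.

        Each sample's baseline is the mean reward of the other K-1 samples,
        reducing variance with no extra evaluations of f."""
        p = 1.0 / (1.0 + np.exp(-theta))
        z = rng.binomial(1, p, size=(K, len(theta)))
        r = np.array([f(zk) for zk in z])            # one evaluation per sample
        score = z - p                                # d/dtheta log p(z) for Bernoulli
        baselines = (r.sum() - r) / (K - 1)          # leave-one-out means
        return ((r - baselines)[:, None] * score).mean(axis=0)

    rng = np.random.default_rng(0)
    theta = np.zeros(10)
    f = lambda z: z.sum()                            # maximized as p -> 1
    for _ in range(500):
        theta += 0.1 * rloo_grad(theta, f, K=8, rng=rng)
    ```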
    Time delay estimation of traffic congestion propagation due to accidents based on statistical causality. (arXiv:2108.06717v3 [stat.ML] UPDATED)
    The accurate estimation of time delays is crucial in traffic congestion analysis, as this information can be used to address fundamental questions regarding the origin and propagation of traffic congestion. However, the exact measurement of time delays during congestion remains a challenge owing to the complex propagation process between roads and high uncertainty regarding future behavior. To overcome this challenge, we propose a novel time delay estimation method for the propagation of traffic congestion due to accidents using lag-specific transfer entropy (TE). The proposed method adopts Markov bootstrap techniques to quantify uncertainty in the time delay estimator. To the best of our knowledge, our proposed method is the first to estimate time delays based on causal relationships between adjacent roads. We validated the method's efficacy using simulated data, as well as real user trajectory data obtained from a major GPS navigation system in South Korea.
    Lifting Weak Supervision To Structured Prediction. (arXiv:2211.13375v1 [cs.LG])
    Weak supervision (WS) is a rich set of techniques that produce pseudolabels by aggregating easily obtained but potentially noisy label estimates from a variety of sources. WS is theoretically well understood for binary classification, where simple approaches enable consistent estimation of pseudolabel noise rates. Using this result, it has been shown that downstream models trained on the pseudolabels have generalization guarantees nearly identical to those trained on clean labels. While this is exciting, users often wish to use WS for structured prediction, where the output space consists of more than a binary or multi-class label set: e.g. rankings, graphs, manifolds, and more. Do the favorable theoretical properties of WS for binary classification lift to this setting? We answer this question in the affirmative for a wide range of scenarios. For labels taking values in a finite metric space, we introduce techniques new to weak supervision based on pseudo-Euclidean embeddings and tensor decompositions, providing a nearly-consistent noise rate estimator. For labels in constant-curvature Riemannian manifolds, we introduce new invariants that also yield consistent noise rate estimation. In both cases, when using the resulting pseudolabels in concert with a flexible downstream model, we obtain generalization guarantees nearly identical to those for models trained on clean data. Several of our results, which can be viewed as robustness guarantees in structured prediction with noisy labels, may be of independent interest. Empirical evaluation validates our claims and shows the merits of the proposed method.
    Analysis of Error Feedback in Federated Non-Convex Optimization with Biased Compression. (arXiv:2211.14292v1 [stat.ML])
    In federated learning (FL) systems, e.g., wireless networks, the communication cost between the clients and the central server can often be a bottleneck. To reduce the communication cost, the paradigm of communication compression has become a popular strategy in the literature. In this paper, we focus on biased gradient compression techniques in non-convex FL problems. In the classical setting of distributed learning, the method of error feedback (EF) is a common technique to remedy the downsides of biased gradient compression. In this work, we study a compressed FL scheme equipped with error feedback, named Fed-EF. We further propose two variants: Fed-EF-SGD and Fed-EF-AMS, depending on the choice of the global model optimizer. We provide a generic theoretical analysis, which shows that directly applying biased compression in FL leads to a non-vanishing bias in the convergence rate. The proposed Fed-EF is able to match the convergence rate of the full-precision FL counterparts under data heterogeneity with a linear speedup. Moreover, we develop a new analysis of the EF under partial client participation, which is an important scenario in FL. We prove that under partial participation, the convergence rate of Fed-EF exhibits an extra slow-down factor due to a so-called ``stale error compensation'' effect. A numerical study is conducted to justify the intuitive impact of stale error accumulation on the norm convergence of Fed-EF under partial participation. Finally, we also demonstrate that incorporating the two-way compression in Fed-EF does not change the convergence results. In summary, our work conducts a thorough analysis of the error feedback in federated non-convex optimization. Our analysis with partial client participation also provides insights on a theoretical limitation of the error feedback mechanism, and possible directions for improvements.
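    The error-feedback mechanism at the heart of Fed-EF can be sketched compactly; top-k is used here as a representative biased compressor, and the toy gradients are placeholders (Fed-EF-SGD and Fed-EF-AMS differ only in the server-side optimizer applied to the aggregate):

    ```python
    import numpy as np

    def top_k(v, k):
        """Biased compressor: keep only the k largest-magnitude entries."""
        out = np.zeros_like(v)
        idx = np.argsort(np.abs(v))[-k:]
        out[idx] = v[idx]
        return out

    def client_step(grad, error, k):
        """Compress the error-corrected gradient; store the residual for next round."""
        corrected = grad + error
        msg = top_k(corrected, k)
        return msg, corrected - msg

    rng = np.random.default_rng(0)
    errors = [np.zeros(100) for _ in range(4)]       # one error memory per client
    for rnd in range(10):
        msgs = []
        for c in range(4):
            g = rng.normal(size=100)                 # stand-in local stochastic gradient
            m, errors[c] = client_step(g, errors[c], k=10)
            msgs.append(m)
        update = np.mean(msgs, axis=0)               # server aggregates and steps
    ```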
    Generating 2D and 3D Master Faces for Dictionary Attacks with a Network-Assisted Latent Space Evolution. (arXiv:2211.13964v1 [cs.CR])
    A master face is a face image that passes face-based identity authentication for a high percentage of the population. These faces can be used to impersonate, with a high probability of success, any user, without having access to any user information. We optimize these faces for 2D and 3D face verification models, by using an evolutionary algorithm in the latent embedding space of the StyleGAN face generator. For 2D face verification, multiple evolutionary strategies are compared, and we propose a novel approach that employs a neural network to direct the search toward promising samples, without adding fitness evaluations. The results we present demonstrate that it is possible to obtain a considerable coverage of the identities in the LFW or RFW datasets with less than 10 master faces, for six leading deep face recognition systems. In 3D, we generate faces using the 2D StyleGAN2 generator and predict a 3D structure using a deep 3D face reconstruction network. When employing two different 3D face recognition systems, we are able to obtain a coverage of 40%-50%. Additionally, we present the generation of paired 2D RGB and 3D master faces, which simultaneously match 2D and 3D models with high impersonation rates.
    Inverse Solvability and Security with Applications to Federated Learning. (arXiv:2211.14115v1 [stat.ML])
    We introduce the concepts of inverse solvability and security for a generic linear forward model and demonstrate how they can be applied to models used in federated learning. We provide examples of such models which differ in the resulting inverse solvability and security as defined in this paper. We also show how the large number of users participating in a given iteration of federated learning can be leveraged to increase both solvability and security. Finally, we discuss possible extensions of the presented concepts including the nonlinear case.
    Dense Hebbian neural networks: a replica symmetric picture of unsupervised learning. (arXiv:2211.14067v1 [cond-mat.dis-nn])
    We consider dense associative neural networks trained without supervision and investigate their computational capabilities analytically, via a statistical-mechanics approach, and numerically, via Monte Carlo simulations. In particular, we obtain a phase diagram summarizing their performance as a function of control parameters such as the quality and quantity of the training dataset and the network storage, valid in the limit of large network size and structureless datasets. Moreover, we establish a bridge between macroscopic observables standardly used in statistical mechanics and the loss functions typically used in machine learning. As technical remarks, on the analytic side, we implement large deviations and stability analysis within Guerra's interpolation to tackle the non-Gaussian distributions involved in the post-synaptic potentials, while, on the computational side, we insert the Plefka approximation into the Monte Carlo scheme to speed up the evaluation of the synaptic tensors, overall obtaining a novel and broad approach to investigating neural networks in general.
    Shapley Curves: A Smoothing Perspective. (arXiv:2211.13289v1 [stat.ML])
    Originating from cooperative game theory, Shapley values have become one of the most widely used measures for variable importance in applied Machine Learning. However, the statistical understanding of Shapley values is still limited. In this paper, we take a nonparametric (or smoothing) perspective by introducing Shapley curves as a local measure of variable importance. We propose two estimation strategies and derive the consistency and asymptotic normality both under independence and dependence among the features. This allows us to construct confidence intervals and conduct inference on the estimated Shapley curves. The asymptotic results are validated in extensive experiments. In an empirical application, we analyze which attributes drive the prices of vehicles.
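    For intuition, the population quantity behind a Shapley curve is the ordinary Shapley value, evaluated as a function of the feature of interest. A Monte Carlo sketch of that value via random permutations (the value function below is a toy stand-in for the conditional-expectation games the paper studies):

    ```python
    import numpy as np

    def shapley_value(value_fn, d, j, n_perm, rng):
        """Monte Carlo Shapley value of feature j from random permutations."""
        total = 0.0
        for _ in range(n_perm):
            perm = rng.permutation(d)
            pos = list(perm).index(j)
            before = set(perm[:pos].tolist())
            total += value_fn(before | {j}) - value_fn(before)
        return total / n_perm

    # Toy additive model f(x) = 2*x0 + x1 at the point x = (1, 1, 1), baseline 0.
    x, base = np.array([1.0, 1.0, 1.0]), np.zeros(3)
    def value_fn(S):
        z = base.copy()
        for i in S:
            z[i] = x[i]
        return 2 * z[0] + z[1]              # stand-in for a conditional expectation

    rng = np.random.default_rng(0)
    print(shapley_value(value_fn, d=3, j=0, n_perm=2000, rng=rng))   # about 2.0
    ```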
    Doubly robust nearest neighbors in factor models. (arXiv:2211.14297v1 [stat.ML])
    In this technical note, we introduce an improved variant of nearest neighbors for counterfactual inference in panel data settings where multiple units are assigned multiple treatments over multiple time points, each sampled with constant probabilities. We call this estimator a doubly robust nearest neighbor estimator and provide a high probability non-asymptotic error bound for the mean parameter corresponding to each unit at each time. Our guarantee shows that the doubly robust estimator provides a (near-)quadratic improvement in the error compared to nearest neighbor estimators analyzed in prior work for these settings.
    Minimal Width for Universal Property of Deep RNN. (arXiv:2211.13866v1 [stat.ML])
    A recurrent neural network (RNN) is a widely used deep-learning network for dealing with sequential data. Imitating a dynamical system, an infinite-width RNN can approximate any open dynamical system in a compact domain. In general, deep networks with bounded widths are more effective than wide networks in practice; however, the universal approximation theorem for deep narrow structures has yet to be extensively studied. In this study, we prove the universality of deep narrow RNNs and show that the upper bound of the minimum width for universality can be independent of the length of the data. Specifically, we show that a deep RNN with ReLU activation can approximate any continuous function or $L^p$ function with the widths $d_x+d_y+2$ and $\max\{d_x+1,d_y\}$, respectively, where the target function maps a finite sequence of vectors in $\mathbb{R}^{d_x}$ to a finite sequence of vectors in $\mathbb{R}^{d_y}$. We also compute the additional width required if the activation function is $\tanh$ or a more general activation function. In addition, we prove the universality of other recurrent networks, such as bidirectional RNNs. Bridging a multi-layer perceptron and an RNN, our theory and proof technique can be an initial step toward further research on deep RNNs.
    Nonlinear MCMC for Bayesian Machine Learning. (arXiv:2202.05621v2 [stat.ML] UPDATED)
    We explore the application of a nonlinear MCMC technique first introduced in [1] to problems in Bayesian machine learning. We provide a convergence guarantee in total variation that uses novel results for long-time convergence and large-particle ("propagation of chaos") convergence. We apply this nonlinear MCMC technique to sampling problems including a Bayesian neural network on CIFAR10.
    PAC-Bayes Compression Bounds So Tight That They Can Explain Generalization. (arXiv:2211.13609v1 [cs.LG])
    While there has been progress in developing non-vacuous generalization bounds for deep neural networks, these bounds tend to be uninformative about why deep learning works. In this paper, we develop a compression approach based on quantizing neural network parameters in a linear subspace, profoundly improving on previous results to provide state-of-the-art generalization bounds on a variety of tasks, including transfer learning. We use these tight bounds to better understand the role of model size, equivariance, and the implicit biases of optimization, for generalization in deep learning. Notably, we find large models can be compressed to a much greater extent than previously known, encapsulating Occam's razor. We also argue for data-independent bounds in explaining generalization.
    Particle-based Variational Inference with Preconditioned Functional Gradient Flow. (arXiv:2211.13954v1 [stat.ML])
    Particle-based variational inference (VI) minimizes the KL divergence between model samples and the target posterior with gradient flow estimates. With the popularity of Stein variational gradient descent (SVGD), the focus of particle-based VI algorithms has been on the properties of functions in Reproducing Kernel Hilbert Space (RKHS) to approximate the gradient flow. However, the requirement of RKHS restricts the function class and algorithmic flexibility. This paper remedies the problem by proposing a general framework to obtain tractable functional gradient flow estimates. The functional gradient flow in our framework can be defined by a general functional regularization term that includes the RKHS norm as a special case. We use our framework to propose a new particle-based VI algorithm: preconditioned functional gradient flow (PFG). Compared with SVGD, the proposed method has several advantages: larger function class; greater scalability in large particle-size scenarios; better adaptation to ill-conditioned distributions; provable continuous-time convergence in KL divergence. Non-linear function classes such as neural networks can be incorporated to estimate the gradient flow. Both theory and experiments have shown the effectiveness of our framework.
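    For reference, the SVGD update that PFG generalizes fits in a few lines; the one-dimensional Gaussian target and kernel bandwidth below are illustrative choices:

    ```python
    import numpy as np

    def svgd_step(x, grad_logp, h=0.5, eps=0.1):
        """One SVGD update with an RBF kernel (the RKHS special case)."""
        n = len(x)
        diff = x[:, None] - x[None, :]        # diff[j, i] = x_j - x_i (1-D particles)
        K = np.exp(-diff ** 2 / (2 * h))      # kernel matrix k(x_j, x_i)
        gradK = -diff / h * K                 # d/dx_j k(x_j, x_i)
        phi = (K @ grad_logp(x) + gradK.sum(axis=0)) / n
        return x + eps * phi

    rng = np.random.default_rng(0)
    x = rng.uniform(-6, 6, size=50)           # initial particles
    grad_logp = lambda x: -x                  # score of a standard normal target
    for _ in range(300):
        x = svgd_step(x, grad_logp)
    print(x.mean(), x.std())                  # roughly 0 and 1
    ```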
    Toward Unlimited Self-Learning Monte Carlo with Annealing Process Using VAE's Implicit Isometricity. (arXiv:2211.14024v1 [stat.ML])
    Self-learning Monte Carlo (SLMC) methods have recently been proposed to accelerate Markov chain Monte Carlo (MCMC) methods using a machine learning model. With generative models having latent variables, SLMC methods realize efficient Monte Carlo updates with less autocorrelation. However, SLMC methods are difficult to apply directly to multimodal distributions for which training data are difficult to obtain. In this paper, we propose a novel SLMC method called the ``annealing VAE-SLMC'' to drastically expand the range of applications. Our VAE-SLMC utilizes a variational autoencoder (VAE) as a generative model to make efficient parallel proposals independent of any previous state by applying the theoretically derived implicit isometricity of the VAE. We combine an adaptive annealing process with the VAE-SLMC, making our method applicable to cases where obtaining unbiased training data is difficult in a practical sense due to slow mixing. We also propose a parallel annealing process and an exchange process between chains to make the annealing operation more precise and efficient. Experiments validate that our method can proficiently obtain unbiased samples from multiple multimodal toy distributions and practical multimodal posterior distributions, which is difficult to achieve with existing SLMC methods.
    Zeroth-Order Alternating Gradient Descent Ascent Algorithms for a Class of Nonconvex-Nonconcave Minimax Problems. (arXiv:2211.13668v1 [math.OC])
    In this paper, we consider a class of nonconvex-nonconcave minimax problems, i.e., NC-PL minimax problems, whose objective functions satisfy the Polyak-$\L$ojasiewicz (PL) condition with respect to the inner variable. We propose a zeroth-order alternating gradient descent ascent (ZO-AGDA) algorithm and a zeroth-order variance-reduced alternating gradient descent ascent (ZO-VRAGDA) algorithm for solving NC-PL minimax problems under the deterministic and the stochastic setting, respectively. The number of iterations needed to obtain an $\varepsilon$-stationary point with the ZO-AGDA and ZO-VRAGDA algorithms is upper bounded by $\mathcal{O}(\varepsilon^{-2})$ and $\mathcal{O}(\varepsilon^{-3})$, respectively. To the best of our knowledge, they are the first two zeroth-order algorithms with iteration-complexity guarantees for solving NC-PL minimax problems.
    Online Regularized Learning Algorithm for Functional Data. (arXiv:2211.13549v1 [stat.ML])
    In recent years, functional linear models have attracted growing attention in statistics and machine learning, with the aim of recovering the slope function or its functional predictor. This paper considers an online regularized learning algorithm for functional linear models in reproducing kernel Hilbert spaces. Convergence analyses of the excess prediction error and the estimation error are provided with polynomially decaying step-size and constant step-size, respectively. Fast convergence rates can be derived via a capacity-dependent analysis. By introducing an explicit regularization term, we lift the saturation boundary of unregularized online learning algorithms when the step-size decays polynomially, and establish fast convergence rates for the estimation error without a capacity assumption. However, it remains an open problem to obtain capacity-independent convergence rates for the estimation error of the unregularized online learning algorithm with decaying step-size. We also show that the convergence rates of both the prediction error and the estimation error with constant step-size are competitive with those in the literature.
    Regret Bounds for Information-Directed Reinforcement Learning. (arXiv:2206.04640v2 [cs.LG] UPDATED)
    Information-directed sampling (IDS) has revealed its potential as a data-efficient algorithm for reinforcement learning (RL). However, theoretical understanding of IDS for Markov Decision Processes (MDPs) is still limited. We develop novel information-theoretic tools to bound the information ratio and cumulative information gain about the learning target. Our theoretical results shed light on the importance of choosing the learning target such that the practitioners can balance the computation and regret bounds. As a consequence, we derive prior-free Bayesian regret bounds for vanilla-IDS which learns the whole environment under tabular finite-horizon MDPs. In addition, we propose a computationally-efficient regularized-IDS that maximizes an additive form rather than the ratio form and show that it enjoys the same regret bound as vanilla-IDS. With the aid of rate-distortion theory, we improve the regret bound by learning a surrogate, less informative environment. Furthermore, we extend our analysis to linear MDPs and prove similar regret bounds for Thompson sampling as a by-product.
    Trust Your $\nabla$: Gradient-based Intervention Targeting for Causal Discovery. (arXiv:2211.13715v1 [stat.ML])
    Inferring causal structure from data is a challenging task of fundamental importance in science. Observational data are often insufficient to identify a system's causal structure uniquely. While conducting interventions (i.e., experiments) can improve the identifiability, such samples are usually challenging and expensive to obtain. Hence, experimental design approaches for causal discovery aim to minimize the number of interventions by estimating the most informative intervention target. In this work, we propose a novel Gradient-based Intervention Targeting method, abbreviated GIT, that 'trusts' the gradient estimator of a gradient-based causal discovery framework to provide signals for the intervention acquisition function. We provide extensive experiments in simulated and real-world datasets and demonstrate that GIT performs on par with competitive baselines, surpassing them in the low-data regime.
    Optimal Weak to Strong Learning. (arXiv:2206.01563v4 [cs.LG] UPDATED)
    The classic algorithm AdaBoost converts a weak learner, that is, an algorithm producing a hypothesis slightly better than chance, into a strong learner, achieving arbitrarily high accuracy when given enough training data. We present a new algorithm that constructs a strong learner from a weak learner but uses less training data than AdaBoost and all other weak-to-strong learners to achieve the same generalization bounds. A sample complexity lower bound shows that our new algorithm uses the minimum possible amount of training data and is thus optimal. Hence, this work settles the sample complexity of the classic problem of constructing a strong learner from a weak learner.
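    For context, the classic AdaBoost conversion that the paper improves upon can be sketched with decision stumps as weak learners (this is the standard algorithm, not the paper's new sample-optimal one):

    ```python
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def adaboost(X, y, rounds=50):
        """Classic AdaBoost; labels y must be in {-1, +1}."""
        w = np.full(len(y), 1.0 / len(y))
        stumps, alphas = [], []
        for _ in range(rounds):
            stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
            pred = stump.predict(X)
            err = np.clip(w[pred != y].sum(), 1e-12, None)
            if err >= 0.5:                         # weak learner no better than chance
                break
            alpha = 0.5 * np.log((1 - err) / err)
            w *= np.exp(-alpha * y * pred)         # upweight the mistakes
            w /= w.sum()
            stumps.append(stump)
            alphas.append(alpha)
        return stumps, alphas

    def predict(stumps, alphas, X):
        return np.sign(sum(a * s.predict(X) for s, a in zip(stumps, alphas)))
    ```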
    Probabilistic Rank and Reward: A Scalable Model for Slate Recommendation. (arXiv:2208.06263v2 [cs.IR] UPDATED)
    We introduce Probabilistic Rank and Reward (PRR), a scalable probabilistic model for personalized slate recommendation. Our approach allows state-of-the-art estimation of the user interests in the ubiquitous scenario where the user interacts with at most one item from a slate of K items. We show that the probability of a slate being successful can be learned efficiently by combining the reward, whether the user successfully interacted with the slate, and the rank, the item that was selected within the slate. PRR outperforms competing approaches that use one signal or the other and is far more scalable to large action spaces. Moreover, PRR allows fast delivery of recommendations powered by maximum inner product search (MIPS), making it suitable in low latency domains such as computational advertising.
    Randomized K-FACs: Speeding up K-FAC with Randomized Numerical Linear Algebra. (arXiv:2206.15397v3 [cs.LG] UPDATED)
    K-FAC is a successful tractable implementation of Natural Gradient for Deep Learning, which nevertheless suffers from the requirement to compute the inverses of the Kronecker factors (through an eigendecomposition). This can be very time-consuming (or even prohibitive) when these factors are large. In this paper, we theoretically show that, owing to the exponential-average construction paradigm typically used for the Kronecker factors, their eigenspectrum must decay. We show numerically that in practice this decay is very rapid, leading to the idea that we could save substantial computation by focusing only on the first few eigenmodes when inverting the Kronecker factors. Importantly, the spectrum decay happens over a constant number of modes irrespective of the layer width. This allows us to reduce the time complexity of K-FAC from cubic to quadratic in layer width, partially closing the gap w.r.t. SENG (another practical Natural Gradient implementation for Deep Learning, which scales linearly in width). Randomized Numerical Linear Algebra provides us with the necessary tools to do so. Numerical results show we obtain an $\approx2.5\times$ reduction in per-epoch time and an $\approx3.3\times$ reduction in time to target accuracy. We compare our proposed sped-up K-FAC versions with SENG, and observe that for CIFAR10 classification with VGG16_bn we perform on par with it.
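    The resulting inversion routine is standard randomized numerical linear algebra. A sketch of a truncated, damped inverse of one Kronecker factor using a randomized range finder (the rank k and damping lam are assumed hyperparameters, not values from the paper):

    ```python
    import numpy as np

    def truncated_damped_inverse(A, k, lam, rng, oversample=10):
        """Approximate (A + lam*I)^{-1} for an SPD factor A from its top-k eigenmodes."""
        n = A.shape[0]
        Y = A @ rng.normal(size=(n, k + oversample))   # randomized range finder
        Q, _ = np.linalg.qr(Y)
        s, V = np.linalg.eigh(Q.T @ A @ Q)             # small (k+p) x (k+p) problem
        idx = np.argsort(s)[-k:]                       # keep the k largest eigenvalues
        U, s = Q @ V[:, idx], s[idx]
        # Exact inverse on the captured subspace; the fast-decaying tail is treated
        # as pure damping: (A + lam I)^{-1} ~ U diag(1/(s+lam)) U^T + (I - U U^T)/lam.
        return U @ np.diag(1.0 / (s + lam) - 1.0 / lam) @ U.T + np.eye(n) / lam
    ```

    Because only k + oversample matrix-vector products and one small eigenproblem are needed, the cost is quadratic rather than cubic in the layer width.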
    JAWS: Auditing Predictive Uncertainty Under Covariate Shift. (arXiv:2207.10716v2 [cs.LG] UPDATED)
    We propose \textbf{JAWS}, a series of wrapper methods for distribution-free uncertainty quantification tasks under covariate shift, centered on the core method \textbf{JAW}, the \textbf{JA}ckknife+ \textbf{W}eighted with data-dependent likelihood-ratio weights. JAWS also includes computationally efficient \textbf{A}pproximations of JAW using higher-order influence functions: \textbf{JAWA}. Theoretically, we show that JAW relaxes the jackknife+'s assumption of data exchangeability to achieve the same finite-sample coverage guarantee even under covariate shift. JAWA further approaches the JAW guarantee in the limit of the sample size or the influence function order under common regularity assumptions. Moreover, we propose a general approach to repurposing predictive interval-generating methods and their guarantees to the reverse task: estimating the probability that a prediction is erroneous, based on user-specified error criteria such as a safe or acceptable tolerance threshold around the true label. We then propose \textbf{JAW-E} and \textbf{JAWA-E} as the repurposed proposed methods for this \textbf{E}rror assessment task. Practically, JAWS outperform state-of-the-art predictive inference baselines in a variety of biased real world data sets for interval-generation and error-assessment predictive uncertainty auditing tasks.
    The Design Space of E(3)-Equivariant Atom-Centered Interatomic Potentials. (arXiv:2205.06643v2 [stat.ML] UPDATED)
    The rapid progress of machine learning interatomic potentials over the past couple of years produced a number of new architectures. Particularly notable among these are the Atomic Cluster Expansion (ACE), which unified many of the earlier ideas around atom density-based descriptors, and Neural Equivariant Interatomic Potentials (NequIP), a message passing neural network with equivariant features that showed state of the art accuracy. In this work, we construct a mathematical framework that unifies these models: ACE is generalised so that it can be recast as one layer of a multi-layer architecture. From another point of view, the linearised version of NequIP is understood as a particular sparsification of a much larger polynomial model. Our framework also provides a practical tool for systematically probing different choices in the unified design space. We demonstrate this by an ablation study of NequIP via a set of experiments looking at in- and out-of-domain accuracy and smooth extrapolation very far from the training data, and shed some light on which design choices are critical for achieving high accuracy. Finally, we present BOTNet (Body-Ordered-Tensor-Network), a much-simplified version of NequIP, which has an interpretable architecture and maintains accuracy on benchmark datasets.
    Offline Reinforcement Learning with Differentiable Function Approximation is Provably Efficient. (arXiv:2210.00750v2 [cs.LG] UPDATED)
    Offline reinforcement learning, which aims at optimizing sequential decision-making strategies with historical data, has been extensively applied in real-life applications. State-of-the-art algorithms usually leverage powerful function approximators (e.g. neural networks) to alleviate the sample complexity hurdle for better empirical performances. Despite the successes, a more systematic understanding of the statistical complexity for function approximation remains lacking. Towards bridging the gap, we take a step by considering offline reinforcement learning with differentiable function class approximation (DFA). This function class naturally incorporates a wide range of models with nonlinear/nonconvex structures. Most importantly, we show offline RL with differentiable function approximation is provably efficient by analyzing the pessimistic fitted Q-learning (PFQL) algorithm, and our results provide the theoretical basis for understanding a variety of practical heuristics that rely on Fitted Q-Iteration style design. In addition, we further improve our guarantee with a tighter instance-dependent characterization. We hope our work could draw interest in studying reinforcement learning with differentiable function approximation beyond the scope of current research.
    Asymptotic Properties for Bayesian Neural Network in Besov Space. (arXiv:2206.00241v3 [stat.ML] UPDATED)
    Neural networks have shown great predictive power when dealing with various unstructured data such as images and natural languages. The Bayesian neural network captures the uncertainty of prediction by putting a prior distribution on the parameters of the model and computing the posterior distribution. In this paper, we show that the Bayesian neural network using a spike-and-slab prior is consistent with a nearly minimax convergence rate when the true regression function is in the Besov space. Even when the smoothness of the regression function is unknown, the same posterior convergence rate holds, and thus the spike-and-slab prior is adaptive to the smoothness of the regression function. We also consider the shrinkage prior, which is more feasible than other priors, and show that it has the same convergence rate. In other words, we propose a practical Bayesian neural network with guaranteed asymptotic properties.
    A Note on Model-Free Reinforcement Learning with the Decision-Estimation Coefficient. (arXiv:2211.14250v1 [cs.LG])
    We consider the problem of interactive decision making, encompassing structured bandits and reinforcement learning with general function approximation. Recently, Foster et al. (2021) introduced the Decision-Estimation Coefficient, a measure of statistical complexity that lower bounds the optimal regret for interactive decision making, as well as a meta-algorithm, Estimation-to-Decisions, which achieves upper bounds in terms of the same quantity. Estimation-to-Decisions is a reduction, which lifts algorithms for (supervised) online estimation into algorithms for decision making. In this note, we show that by combining Estimation-to-Decisions with a specialized form of optimistic estimation introduced by Zhang (2022), it is possible to obtain guarantees that improve upon those of Foster et al. (2021) by accommodating more lenient notions of estimation error. We use this approach to derive regret bounds for model-free reinforcement learning with value function approximation.
    Bypass Exponential Time Preprocessing: Fast Neural Network Training via Weight-Data Correlation Preprocessing. (arXiv:2211.14227v1 [cs.LG])
    Over the last decade, deep neural networks have transformed our society, and they are already widely applied in various machine learning applications. State-of-the-art deep neural networks are becoming larger in size every year to deliver increasing model accuracy, and as a result, model training consumes substantial computing resources and will only consume more in the future. Using current training methods, in each iteration, to process a data point $x \in \mathbb{R}^d$ in a layer, we need to spend $\Theta(md)$ time to evaluate all the $m$ neurons in the layer. This means processing the entire layer takes $\Theta(nmd)$ time for $n$ data points. Recent work [Song, Yang and Zhang, NeurIPS 2021] reduces this time per iteration to $o(nmd)$, but requires exponential time to preprocess either the data or the neural network weights, making it unlikely to have practical usage. In this work, we present a new preprocessing method that simply stores the weight-data correlation in a tree data structure in order to quickly and dynamically detect which neurons fire at each iteration. Our method requires only $O(nmd)$ time in preprocessing and still achieves $o(nmd)$ time per iteration. We complement our new algorithm with a lower bound, proving that, assuming a popular conjecture from complexity theory, one could not substantially speed up our algorithm for dynamic detection of firing neurons.
    Latent Space Diffusion Models of Cryo-EM Structures. (arXiv:2211.14169v1 [q-bio.QM])
    Cryo-electron microscopy (cryo-EM) is unique among tools in structural biology in its ability to image large, dynamic protein complexes. Key to this ability is image processing algorithms for heterogeneous cryo-EM reconstruction, including recent deep learning-based approaches. The state-of-the-art method cryoDRGN uses a Variational Autoencoder (VAE) framework to learn a continuous distribution of protein structures from single particle cryo-EM imaging data. While cryoDRGN can model complex structural motions, the Gaussian prior distribution of the VAE fails to match the aggregate approximate posterior, which prevents generative sampling of structures especially for multi-modal distributions (e.g. compositional heterogeneity). Here, we train a diffusion model as an expressive, learnable prior in the cryoDRGN framework. Our approach learns a high-quality generative model over molecular conformations directly from cryo-EM imaging data. We show the ability to sample from the model on two synthetic and two real datasets, where samples accurately follow the data distribution unlike samples from the VAE prior distribution. We also demonstrate how the diffusion model prior can be leveraged for fast latent space traversal and interpolation between states of interest. By learning an accurate model of the data distribution, our method unlocks tools in generative modeling, sampling, and distribution analysis for heterogeneous cryo-EM ensembles.
    Multi-Class Uncertainty Calibration via Mutual Information Maximization-based Binning. (arXiv:2006.13092v7 [cs.LG] UPDATED)
    Post-hoc multi-class calibration is a common approach for providing high-quality confidence estimates of deep neural network predictions. Recent work has shown that widely used scaling methods underestimate their calibration error, while alternative Histogram Binning (HB) methods often fail to preserve classification accuracy. When classes have small prior probabilities, HB also faces the issue of severe sample-inefficiency after the conversion into K one-vs-rest class-wise calibration problems. The goal of this paper is to resolve the identified issues of HB in order to provide calibrated confidence estimates using only a small holdout calibration dataset for bin optimization while preserving multi-class ranking accuracy. From an information-theoretic perspective, we derive the I-Max concept for binning, which maximizes the mutual information between labels and quantized logits. This concept mitigates potential loss in ranking performance due to lossy quantization, and by disentangling the optimization of bin edges and representatives allows simultaneous improvement of ranking and calibration performance. To improve the sample efficiency and estimates from a small calibration set, we propose a shared class-wise (sCW) calibration strategy, sharing one calibrator among similar classes (e.g., with similar class priors) so that the training sets of their class-wise calibration problems can be merged to train the single calibrator. The combination of sCW and I-Max binning outperforms the state of the art calibration methods on various evaluation metrics across different benchmark datasets and models, using a small calibration set (e.g., 1k samples for ImageNet).
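    As a baseline for what I-Max improves, standard equal-mass histogram binning on a holdout set takes only a few lines (I-Max instead places the bin edges to maximize the label/logit mutual information):

    ```python
    import numpy as np

    def fit_histogram_binning(conf, correct, n_bins=15):
        """Equal-mass histogram binning from holdout confidences and 0/1 outcomes."""
        edges = np.quantile(conf, np.linspace(0, 1, n_bins + 1))
        edges[0], edges[-1] = 0.0, 1.0
        idx = np.clip(np.digitize(conf, edges) - 1, 0, n_bins - 1)
        reps = np.array([correct[idx == b].mean() if np.any(idx == b) else 0.0
                         for b in range(n_bins)])      # empirical accuracy per bin
        return edges, reps

    def calibrate(conf, edges, reps):
        idx = np.clip(np.digitize(conf, edges) - 1, 0, len(reps) - 1)
        return reps[idx]                               # calibrated confidences
    ```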
    A Survey of Learning Curves with Bad Behavior: or How More Data Need Not Lead to Better Performance. (arXiv:2211.14061v1 [cs.LG])
    Plotting a learner's generalization performance against the training set size results in a so-called learning curve. This tool, providing insight into the behavior of the learner, is also practically valuable for model selection, predicting the effect of more training data, and reducing the computational complexity of training. We set out to make the (ideal) learning curve concept precise and briefly discuss the aforementioned usages of such curves. The larger part of this survey's focus, however, is on learning curves showing that more data does not necessarily lead to better generalization performance, a result that seems surprising to many researchers in the field of artificial intelligence. We point out the significance of these findings and conclude our survey with an overview and discussion of open problems in this area that warrant further theoretical and empirical investigation.  ( 2 min )
    A System for Morphology-Task Generalization via Unified Representation and Behavior Distillation. (arXiv:2211.14296v1 [cs.LG])
    The rise of generalist large-scale models in natural language and vision has raised expectations that a massive data-driven approach could achieve broader generalization in other domains such as continuous control. In this work, we explore a method for learning a single policy that manipulates various forms of agents to solve various tasks by distilling a large amount of proficient behavioral data. In order to align the input-output (IO) interface across multiple tasks and diverse agent morphologies while preserving essential 3D geometric relations, we introduce the morphology-task graph, which treats observations, actions and goals/tasks in a unified graph representation. We also develop MxT-Bench for fast large-scale behavior generation, which supports procedural generation of diverse morphology-task combinations with a minimal blueprint and a hardware-accelerated simulator. Through efficient representation and architecture selection on MxT-Bench, we find that a morphology-task graph representation coupled with a Transformer architecture improves multi-task performance compared to other baselines, including recent discrete tokenization, and provides better prior knowledge for zero-shot transfer or sample efficiency in downstream multi-task imitation learning. Our work suggests that large diverse offline datasets, a unified IO representation, and policy representation and architecture selection through supervised learning form a promising approach for studying and advancing morphology-task generalization.
    Revisiting Active Sets for Gaussian Process Decoders. (arXiv:2209.04636v2 [stat.ML] UPDATED)
    Decoders built on Gaussian processes (GPs) are enticing due to the marginalisation over the non-linear function space. Such models (also known as GP-LVMs) are often expensive and notoriously difficult to train in practice, but can be scaled using variational inference and inducing points. In this paper, we revisit active set approximations. We develop a new stochastic estimate of the log-marginal likelihood based on recently discovered links to cross-validation, and propose a computationally efficient approximation thereof. We demonstrate that the resulting stochastic active sets (SAS) approximation significantly improves the robustness of GP decoder training while reducing computational cost. The SAS-GP obtains more structure in the latent space, scales to many datapoints and learns better representations than variational autoencoders, which is rarely the case for GP decoders.
    Operator Splitting Value Iteration. (arXiv:2211.13937v1 [cs.LG])
    We introduce new planning and reinforcement learning algorithms for discounted MDPs that utilize an approximate model of the environment to accelerate the convergence of the value function. Inspired by the splitting approach in numerical linear algebra, we introduce Operator Splitting Value Iteration (OS-VI) for both Policy Evaluation and Control problems. OS-VI achieves a much faster convergence rate when the model is accurate enough. We also introduce a sample-based version of the algorithm called OS-Dyna. Unlike the traditional Dyna architecture, OS-Dyna still converges to the correct value function in presence of model approximation error.
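    For policy evaluation, one natural instantiation of the splitting idea (an assumed form consistent with the abstract, not a verbatim transcription of the paper) solves exactly under the approximate model and corrects with the model mismatch:

    ```python
    import numpy as np

    def os_vi_evaluation(r, P, P_hat, gamma, iters=50):
        """Splitting iteration V <- (I - g*P_hat)^{-1} (r + g*(P - P_hat) V).

        With P_hat = 0 this reduces to ordinary value iteration; the more
        accurate P_hat is, the faster the iteration contracts."""
        n = len(r)
        V = np.zeros(n)
        A = np.linalg.inv(np.eye(n) - gamma * P_hat)   # cheap when P_hat is simple
        for _ in range(iters):
            V = A @ (r + gamma * (P - P_hat) @ V)
        return V
    ```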
    A Moment-Matching Approach to Testable Learning and a New Characterization of Rademacher Complexity. (arXiv:2211.13312v1 [cs.LG])
    A remarkable recent paper by Rubinfeld and Vasilyan (2022) initiated the study of \emph{testable learning}, where the goal is to replace hard-to-verify distributional assumptions (such as Gaussianity) with efficiently testable ones and to require that the learner succeed whenever the unknown distribution passes the corresponding test. In this model, they gave an efficient algorithm for learning halfspaces under testable assumptions that are provably satisfied by Gaussians. In this paper we give a powerful new approach for developing algorithms for testable learning using tools from moment matching and metric distances in probability. We obtain efficient testable learners for any concept class that admits low-degree \emph{sandwiching polynomials}, capturing most important examples for which we have ordinary agnostic learners. We recover the results of Rubinfeld and Vasilyan as a corollary of our techniques while achieving improved, near-optimal sample complexity bounds for a broad range of concept classes and distributions. Surprisingly, we show that the information-theoretic sample complexity of testable learning is tightly characterized by the Rademacher complexity of the concept class, one of the most well-studied measures in statistical learning theory. In particular, uniform convergence is necessary and sufficient for testable learning. This leads to a fundamental separation from (ordinary) distribution-specific agnostic learning, where uniform convergence is sufficient but not necessary.

  • Open

    Can a complex task (e.g., peg-in-hole) be divided among multiple agents?
    Hi, is it inappropriate to divide one task into subtasks and assign one agent to each subtask? In the case of a peg-in-hole task, agent 1 could be responsible for moving the robot toward the hole. Once agent 1 has succeeded at its task, agent 2 is activated for the peg task. What would be the cons of this approach? submitted by /u/Fun-Moose-3841  ( 23 min )
    OpenAI announces "text-davinci-003" upgrade to their InstructGPT (preference RL-finetuned GPT-3) models
    submitted by /u/gwern  ( 60 min )
    [Research] NeurIPS 2022 highlights: Towards a Standardised Performance Evaluation Protocol for Cooperative MARL
    Arxiv OpenReview Abstract: Multi-agent reinforcement learning (MARL) has emerged as a useful approach to solving decentralised decision-making problems at scale. Research in the field has been growing steadily with many breakthrough algorithms proposed in recent years. In this work, we take a closer look at this rapid development with a focus on evaluation methodologies employed across a large body of research in cooperative MARL. By conducting a detailed meta-analysis of prior work, spanning 75 papers accepted for publication from 2016 to 2022, we bring to light worrying trends that put into question the true rate of progress. We further consider these trends in a wider context and take inspiration from single-agent RL literature on similar issues with recommendations that remain applicable to MARL. Combining these recommendations, with novel insights from our analysis, we propose a standardised performance evaluation protocol for cooperative MARL. We argue that such a standard protocol, if widely adopted, would greatly improve the validity and credibility of future research, make replication and reproducibility easier, as well as improve the ability of the field to accurately gauge the rate of progress over time by being able to make sound comparisons across different works. Finally, we release our meta-analysis data publicly on our project website for future research on evaluation accompanied by our open-source evaluation tools repository. submitted by /u/The_Human-Animal  ( 65 min )
  • Open

    Three Digital Transformation Strategic Initiatives to Prioritize in 2023
    What is Digital Transformation?  ( 6 min )
    Most Recent AI Frameworks That Can Help Content Writers
    AI is often perceived as a solution to the content writing crisis, but it’s not quite that simple. The technology has been around for years…  ( 9 min )
    Advancements You Can Expect in 2023 in Artificial Intelligence Industry
    Artificial intelligence is a field that has grown from humble beginnings to become one of the top priorities for government, businesses…  ( 12 min )
    How we saved 60% of our monthly Azure Databricks cost
    Following these 4 quick tips can help you save big on your Azure Databricks monthly costs  ( 11 min )
    The Importance of AI in Web Development
    Artificial Intelligence (AI) in business impacts multiple fields, such as blockchain, education, website development, banking, data…  ( 12 min )
    Object Detection state-of-the-art methods using Deep Learning: Part 1
    Democracy at risk: the case for a new media.
    If we consider democracy as the act of “collectively deciding what’s best for us”, then it feels that we increasingly suck at working…  ( 23 min )
    Can We Form Relationships With AI Powered Robots?
    Python Workout — Sorting and formatting tuples
    No content preview
    Want to build a career in data science? Master these programming languages
    Data science has risen fast and remains a niche in huge demand. We have therefore compiled the top 5 programming languages you should…  ( 7 min )
  • Open

    How AI Understands Words (Text Embedding Explained)
    submitted by /u/OnlyProggingForFun [link] [comments]  ( 44 min )
    New and the largest AI Search Engine
    I just found a new website where you can search various AI websites for your needs. You can try https://www.creaitives.com/ submitted by /u/Icetanium_ [link] [comments]  ( 46 min )
    Tencent AI generated vocals in a song reach 100 million views, but I can’t find the song anywhere?
    Hey guys! I’m not sure if you’ve also heard about this, but apparently Tencent has created over a thousand songs with generated voices, and one of those tracks reached a hundred million views, but I can’t seem to find it anywhere. The song title translates to English as just “today”, but even after googling I can’t seem to find it. Do you guys have any links to it? It’s made using the Lingyin AI, if that helps. submitted by /u/ChipsAhoiMcCoy [link] [comments]  ( 45 min )
    AI hiring
    Hi community, I am creating a research journal on the use of AI in hiring. If you have experienced artificial intelligence during your hiring process or application, please feel free to fill out my survey. Thank you. submitted by /u/PurpleRelevant2146 [link] [comments]  ( 44 min )
    AI Dream 40 - When you get Lost in a Fractal Maze
    submitted by /u/LordPewPew777 [link] [comments]  ( 44 min )
    What are the best online discussion forums on artificial intelligence?
    submitted by /u/vivid_confused_hill [link] [comments]  ( 51 min )
    Why Amazon Alexa (and other voice assistants) aren’t making money
    submitted by /u/bendee983 [link] [comments]  ( 23 min )
  • Open

    [D] Difference between sparse and dense information retrieval
    I was looking at the BEIR dataset, and its leaderboard has two different pages, one for dense IR and one for sparse IR. I am curious what the difference is; I googled around but couldn't find anything conclusive. Is anyone familiar with the difference, or is there anywhere I can read about it? submitted by /u/itsyourboiirow [link] [comments]  ( 63 min )
    [D] Reporting model performance on unavailable dataset
    I am currently working on a solution to a problem which hasn't been touched since 2014. Their model was evaluated using a script and dataset from SemEval 2007, which I am unable to find a copy of, and the project itself was deprecated several years ago. More recent works on a related task have been done on an updated dataset which is readily available, and I am able to report performance on it. My current plan is to make a note that the scripts and datasets used in previous works are no longer available, and that the performances are not directly comparable. The datasets should be somewhat comparable, but not being able to see the dataset itself makes it impossible to know that for sure. I'm wondering if reviewers will dislike this, even though it seems like the only option really available. What is the best way to compare the performance of my model with past works? submitted by /u/chad_as [link] [comments]  ( 64 min )
    [D] NeurIPS Proceedings
    The proceedings of this year's NeurIPS are absent from the usual site. Has there been a change in policy since reviewing moved to OpenReview? On a different note, has anyone scraped the accepted paper PDFs? (I find it quite easy to pdfgrep for specific terms, or even references, across the whole set.) submitted by /u/coredump3d [link] [comments]  ( 62 min )
    [P] Stable Diffusion 2.0 and the Importance of Negative Prompts for Good Results (+ Colab Notebooks + Negative Embedding)
    I just published a blog post with many academic experiments on getting good results from Stable Diffusion 2.0, showing that negative prompts are the key with its new text encoder: https://minimaxir.com/2022/11/stable-diffusion-negative-prompt/ I also released Colab Notebooks to reproduce the workflow and use the negative embeddings yourself (links in comment due to antispam filters for too many URLs) submitted by /u/minimaxir [link] [comments]  ( 65 min )
    [D] In the exploratory phase of model building, how do you track versions while accommodating mistakes in the process?
    For example, I might track model versions in a table like this: https://preview.redd.it/8piw0vzvaq2a1.png?width=2042&format=png&auto=webp&s=c0e04cb3816ec21f9510727c7a6beb96bf71ad39 Then suppose that after adding a few rows to this table, I realize there was an error up until this point. Sometimes the error is so significant that it fully invalidates all the previous data (e.g. I was calculating the AUC incorrectly), but other times the error is smallish and I want to note it and still keep my results (e.g. the features are built in a slightly different way than I thought). How do you stay organized when tracking model versions? Although some errors are inevitable when starting a new project, any tips for actively discovering errors are also appreciated. submitted by /u/papayamaia [link] [comments]  ( 69 min )
    [D] Why is rho in RMSprop much smaller than beta_2 in Adam?
    As far as I understand, the hyperparameter rho in RMSprop is equivalent to Adam's beta_2, neglecting Adam's bias correction. In most implementations of RMSprop, rho is either 0.9 or 0.99, versus the default beta_2 of 0.999. This results in a much larger time constant for the smoothing in the case of the Adam optimizer. Intuitively, I would expect the defaults to be more or less the same; is there any particular reason the time constant in Adam is much larger? Any insights are much appreciated. submitted by /u/flxh13 [link] [comments]  ( 66 min )
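    A quick way to see the gap the question describes: an exponential moving average with decay b averages over roughly 1/(1-b) steps, so PyTorch's defaults give RMSprop a ~100-step window and Adam a ~1000-step window for the squared gradients. A small illustrative sketch (not a full answer to the "why"):

        import torch

        params = [torch.zeros(1, requires_grad=True)]
        # PyTorch defaults: RMSprop's alpha=0.99 plays the role of Adam's beta_2=0.999.
        rmsprop = torch.optim.RMSprop(params, alpha=0.99)
        adam = torch.optim.Adam(params, betas=(0.9, 0.999))

        for name, b in [("RMSprop alpha", 0.99), ("Adam beta_2", 0.999)]:
            print(f"{name} = {b}: ~{1 / (1 - b):.0f}-step averaging window")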
    [D] Training LLMs collaboratively
    Are there existing solutions/experiments to train LLMs collaboratively on distributed machines over the Internet? Something similar to the SETI@Home project. submitted by /u/dogonix [link] [comments]  ( 71 min )
    [R] [NeurIPS 2022] 3DOS: Towards 3D Open Set Learning - Benchmarking and Understanding Semantic Novelty Detection on Point Clouds
    NeurIPS: https://nips.cc/virtual/2022/poster/55764 Arxiv: https://arxiv.org/abs/2207.11554 Code and data: https://github.com/antoalli/3D_OS Abstract In recent years there has been significant progress in the field of 3D learning on classification, detection and segmentation problems. The vast majority of the existing studies focus on canonical closed-set conditions, neglecting the intrinsic open nature of the real world. This limits the abilities of robots and autonomous systems involved in safety-critical applications that require managing novel and unknown signals. In this context, exploiting 3D data can be a valuable asset since it provides rich information about the geometry of perceived objects and scenes. With this paper we provide the first broad study on 3D Open Set learning. We introduce 3DOS: a novel testbed for semantic novelty detection that considers several settings with increasing difficulties in terms of semantic (category) shift, and covers both in-domain (synthetic-to-synthetic, real-to-real) and cross-domain (synthetic-to-real) scenarios. Moreover, we investigate the related 2D Open Set literature to understand if and how its recent improvements are effective on 3D data. Our extensive benchmark positions several algorithms in the same coherent picture, revealing their strengths and limitations. The results of our analysis may serve as a reliable foothold for future tailored 3D Open Set methods. Schematic illustration of the OOD detection, semantic novelty detection and Open Set tasks on 3D data. 3D point clouds capture the complete object geometry, but miss the color, scale and object context which are naturally present in images. Things get worse at low resolution, where object details are lost. We propose 3DOS, the first benchmark for 3D Open Set learning, considering several settings with increasing levels of difficulty. It includes three main tracks: Synthetic, Real to Real, and Synthetic to Real. submitted by /u/antoalli [link] [comments]  ( 67 min )
  • Open

    Deploy an MLOps solution that hosts your model endpoints in AWS Lambda
    In 2019, Amazon co-founded the Climate Pledge. The pledge's goal is to achieve net-zero carbon by 2040, 10 years earlier than the Paris Agreement outlines. Companies that sign up commit to regular reporting, carbon elimination, and credible offsets. At the time of this writing, 377 companies have signed the Climate Pledge, […]  ( 10 min )
  • Open

    Google at NeurIPS 2022
    Posted by Cat Armato, Program Manager, Google This week marks the beginning of the 36th annual Conference on Neural Information Processing Systems (NeurIPS 2022), the biggest machine learning conference of the year, which is being held in New Orleans, LA. NeurIPS 2022 will be held in person with additional options for virtual attendees, and includes invited talks, demonstrations and presentations of some of the latest in machine learning research. This year, NeurIPS is also offering a new track, called Spotlight Papers, which will provide opportunities to highlight papers presented in prestigious journals that would otherwise not have been eligible for submission. Google is proud to be a Diamond-level sponsor of NeurIPS this year and will have a significant presence with more…  ( 104 min )
  • Open

    New Book: Synthetic Data – Generation and Applications
    Synthetic data is used more and more to augment real-life datasets. It enriches them and allows black-box systems to correctly classify observations or predict values that are well outside of training and validation sets. In addition, it helps in understanding decisions from obscure systems such as deep neural networks. Thus, it contributes to the development of… The post New Book: Synthetic Data – Generation and Applications appeared first on Data Science Central.  ( 20 min )
  • Open

    Conformal map between square and disk
    Conformal maps transform one region into another while preserving angles. You might solve a PDE, for example, by mapping it to a standard region, solving it there, then mapping the solution back to the original region. Some tasks are easier to do in a square and others in a disk, so it’s clearly useful to […] Conformal map between square and disk first appeared on John D. Cook.  ( 5 min )
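    The preview doesn't show the post's own construction, but one classical route, stated here as background rather than as the post's method, is the Schwarz–Christoffel integral, which maps the unit disk conformally onto a square:

        f(z) = \int_0^z \frac{dt}{\sqrt{1 - t^4}}

    The prevertices ±1 and ±i on the unit circle map to the square's corners, and angles are preserved everywhere except at those four points.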
  • Open

    NVIDIA Wins NeurIPS Awards for Research on Generative AI, Generalist AI Agents
    Two NVIDIA Research papers — one exploring diffusion-based generative AI models and another on training generalist AI agents — have been honored with NeurIPS 2022 Awards for their contributions to the field of AI and machine learning. These are among more than 60 talks, posters and workshops with NVIDIA authors being presented at the NeurIPS… Read article > The post NVIDIA Wins NeurIPS Awards for Research on Generative AI, Generalist AI Agents appeared first on NVIDIA Blog.  ( 6 min )
    MAP Once, Run Anywhere: MONAI Introduces Framework for Deploying Medical Imaging AI Apps
    Delivering AI-accelerated healthcare at scale will take thousands of neural networks working together to cover the breadth of human physiology, diseases and even hospital operations — a significant challenge in today’s smart hospital environment. MONAI, an open-source medical-imaging AI framework with more than 650,000 downloads, accelerated by NVIDIA, is making it easier to integrate these… Read article > The post MAP Once, Run Anywhere: MONAI Introduces Framework for Deploying Medical Imaging AI Apps appeared first on NVIDIA Blog.  ( 6 min )
    NVIDIA Partners With NHS Trusts to Deploy AI Platform in UK Hospitals
    A consortium of 10 National Health Service Trusts — the publicly funded healthcare system in England — is now deploying the MONAI-based AIDE platform across four of its hospitals, providing AI-enabled disease-detection tools to healthcare professionals serving 5 million patients a year. AIDE, short for AI Deployment Engine, is expected to be rolled out next… Read article > The post NVIDIA Partners With NHS Trusts to Deploy AI Platform in UK Hospitals appeared first on NVIDIA Blog.  ( 5 min )

  • Open

    Implementing Gradient Descent in PyTorch
    The gradient descent algorithm is one of the most popular techniques for training deep neural networks. It has many applications in fields such as computer vision, speech recognition, and natural language processing. While the idea of gradient descent has been around for decades, it’s only recently that it’s been applied to applications related to deep […] The post Implementing Gradient Descent in PyTorch appeared first on MachineLearningMastery.com.  ( 25 min )
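    For readers who want the gist before the full tutorial, a minimal gradient-descent step in PyTorch looks roughly like this (an illustrative sketch, not the post's own code):

        import torch

        # Fit w so that w*x matches y by repeatedly stepping against the gradient.
        x, y = torch.tensor(2.0), torch.tensor(4.0)
        w = torch.tensor(0.0, requires_grad=True)
        lr = 0.1
        for _ in range(50):
            loss = (w * x - y) ** 2
            loss.backward()                # populates w.grad
            with torch.no_grad():
                w -= lr * w.grad           # the gradient-descent update
            w.grad.zero_()
        print(w.item())                    # converges toward 2.0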

  • Open

    Training a Linear Regression Model in PyTorch
    Linear regression is a simple yet powerful technique for predicting the values of variables based on other variables. It is often used for modeling relationships between two or more continuous variables, such as the relationship between income and age, or the relationship between weight and height. Likewise, linear regression can be used to predict continuous […] The post Training a Linear Regression Model in PyTorch appeared first on MachineLearningMastery.com.  ( 24 min )
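    As a hedged sketch of the workflow such a tutorial covers (the post's own data and hyperparameters will differ), fitting y ≈ 2x + 1 with nn.Linear and SGD:

        import torch

        xs = torch.linspace(-1, 1, 20).unsqueeze(1)        # 20 training inputs
        ys = 2 * xs + 1 + 0.01 * torch.randn_like(xs)      # noisy targets

        model = torch.nn.Linear(1, 1)
        opt = torch.optim.SGD(model.parameters(), lr=0.1)
        loss_fn = torch.nn.MSELoss()

        for _ in range(200):
            opt.zero_grad()
            loss = loss_fn(model(xs), ys)
            loss.backward()
            opt.step()
        print(model.weight.item(), model.bias.item())      # near 2.0 and 1.0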
    Making Linear Predictions in PyTorch
    Linear regression is a statistical technique for estimating the relationship between two variables. A simple example of linear regression is to predict the height of someone based on the square root of the person’s weight (that’s what BMI is based on). To do this, we need to find the slope and intercept of the line. […] The post Making Linear Predictions in PyTorch appeared first on MachineLearningMastery.com.  ( 21 min )

  • Open

    Loading and Providing Datasets in PyTorch
    Structuring the data pipeline in a way that it can be effortlessly linked to your deep learning model is an important aspect of any deep learning-based system. PyTorch packs everything to do just that. While in the previous tutorial, we used simple datasets, we’ll need to work with larger datasets in real world scenarios in […] The post Loading and Providing Datasets in PyTorch appeared first on MachineLearningMastery.com.  ( 20 min )
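    The core of that linkage is PyTorch's Dataset/DataLoader pair; a minimal sketch, with toy tensors standing in for a real dataset:

        import torch
        from torch.utils.data import DataLoader, TensorDataset

        xs = torch.arange(10, dtype=torch.float32).unsqueeze(1)
        ys = 2 * xs
        loader = DataLoader(TensorDataset(xs, ys), batch_size=4, shuffle=True)

        for xb, yb in loader:              # batches feed straight into a model
            print(xb.shape, yb.shape)      # up to 4 samples per batch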

  • Open

    Using Dataset Classes in PyTorch
    In machine learning and deep learning problems, a lot of effort goes into preparing the data. Data is usually messy and needs to be preprocessed before it can be used for training a model. If the data is not prepared correctly, the model won’t be able to generalize well. Some of the common steps required […] The post Using Dataset Classes in PyTorch appeared first on MachineLearningMastery.com.  ( 21 min )
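    The contract a custom Dataset must satisfy is small: __len__ and __getitem__. A hedged sketch, with preprocessing (here just squaring) living inside __getitem__:

        import torch
        from torch.utils.data import Dataset

        class SquaresDataset(Dataset):
            def __init__(self, n):
                self.xs = torch.arange(n, dtype=torch.float32)

            def __len__(self):
                return len(self.xs)

            def __getitem__(self, idx):
                x = self.xs[idx]
                return x, x ** 2           # (input, target) pair

        ds = SquaresDataset(5)
        print(len(ds), ds[2])              # 5 (tensor(2.), tensor(4.))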

  • Open

    Calculating Derivatives in PyTorch
    Derivatives are one of the most fundamental concepts in calculus. They describe how changes in the variable inputs affect the function outputs. The objective of this article is to provide a high-level introduction to calculating derivatives in PyTorch for those who are new to the framework. PyTorch offers a convenient way to calculate derivatives for […] The post Calculating Derivatives in PyTorch appeared first on Machine Learning Mastery.  ( 20 min )
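    The convenience in question is autograd; for example, the derivative of y = x³ at x = 2 should be 3x² = 12:

        import torch

        x = torch.tensor(2.0, requires_grad=True)
        y = x ** 3
        y.backward()           # computes dy/dx and stores it in x.grad
        print(x.grad)          # tensor(12.)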

  • Open

    Two-Dimensional Tensors in Pytorch
    Two-dimensional tensors are analogous to two-dimensional matrices. Like a matrix, a two-dimensional tensor has rows and columns. Let’s take a gray-scale image as an example, which is a two-dimensional matrix of numeric values, commonly known as pixels. Ranging from ‘0’ to ‘255’, each number represents a pixel intensity value. Here, […] The post Two-Dimensional Tensors in Pytorch appeared first on Machine Learning Mastery.  ( 21 min )
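    A tiny illustration of the matrix analogy (a made-up 2×2 “image”, not the tutorial's example):

        import torch

        img = torch.tensor([[0, 128], [255, 64]], dtype=torch.uint8)
        print(img.ndim, img.shape)     # 2 torch.Size([2, 2])
        print(img[1, 0])               # row 1, column 0 -> tensor(255, ...)
        print(img[:, 1])               # the second column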

  • Open

    One-Dimensional Tensors in Pytorch
    PyTorch is an open-source deep learning framework based on the Python language. It allows you to build, train, and deploy deep learning models, offering a lot of versatility and efficiency. PyTorch is primarily focused on tensor operations, where a tensor can be a number, a matrix, or a multi-dimensional array. In this tutorial, we will perform some […] The post One-Dimensional Tensors in Pytorch appeared first on Machine Learning Mastery.  ( 22 min )
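    A few of the basic one-dimensional operations such a tutorial typically walks through (an illustrative sketch, not the post's code):

        import torch

        v = torch.tensor([1.0, 2.0, 3.0])
        print(v.dtype, v.shape)        # torch.float32 torch.Size([3])
        print(v + 1)                   # elementwise arithmetic
        print(v.dot(v))                # inner product -> tensor(14.)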

  • Open

    365 Data Science courses free until November 21
    Sponsored Post: The unlimited access initiative presents a risk-free way to break into data science. The online educational platform 365 Data Science launches the #21DaysFREE campaign and provides 100% free unlimited access to all content for three weeks. From November 1 to 21, you can take courses from renowned instructors and earn […] The post 365 Data Science courses free until November 21 appeared first on Machine Learning Mastery.  ( 15 min )

  • Open

    Attend the Data Science Symposium 2022, November 8 in Cincinnati
    Sponsored Post: Attend the Data Science Symposium 2022 on November 8. The Center for Business Analytics at the University of Cincinnati will present its annual Data Science Symposium 2022 on November 8. This all-day, in-person event will have three featured speakers and two tech talk tracks with four concurrent presentations in each track. The […] The post Attend the Data Science Symposium 2022, November 8 in Cincinnati appeared first on Machine Learning Mastery.  ( 10 min )

  • Open

    My family's unlikely homeschooling journey
    My husband Jeremy and I never intended to homeschool, and yet we have now, unexpectedly, committed to homeschooling long-term. Prior to the pandemic, we both worked full-time in careers that we loved and found meaningful, and we sent our daughter to a full-day Montessori school. Although I struggled with significant health issues, I felt unbelievably lucky and fulfilled in both my family life and my professional life. The pandemic upended my careful balance. Every family is different, with different needs, circumstances, and constraints, and what works for one may not work for others. My intention here is primarily to share the journey of my own (very privileged) family. Our unplanned introduction to homeschooling For the first year of the pandemic, most schools in California, where …  ( 7 min )

  • Open

    The Jupyter+git problem is now solved
    Jupyter notebooks don’t work with git by default. With nbdev2, the Jupyter+git problem has been totally solved. It provides a set of hooks which provide clean git diffs, solve most git conflicts automatically, and ensure that any remaining conflicts can be resolved entirely within the standard Jupyter notebook environment. To get started, follow the directions on Git-friendly Jupyter. Contents: The Jupyter+git problem · The solution (the nbdev2 git merge driver, the nbdev2 Jupyter save hook) · Background · The result · Postscript: other Jupyter+git tools (ReviewNB, an alternative solution: Jupytext, nbdime). The Jupyter+git problem: Jupyter notebooks are a powerful tool for scientists, engineers, technical writers, students, teachers, and more. They provide an ideal notebook environment for interact…  ( 7 min )
2022-12-28T00:51:15.495Z osmosfeed 1.15.1